Model Optimization - Profiling Collection, Analysis and Optimization Ideas

Performance optimization is a critical step when training models on Ascend NPUs. Profiling identifies performance bottlenecks so that training efficiency can be improved. This guide describes how to collect and analyze profiling data, covering the relevant configuration, tool usage, and methods for analyzing typical performance problems.

Profiling Collection Configuration and Description

VeOmni’s profiling configuration lives under the train.profile.* namespace and is defined by the ProfileConfig class in veomni/arguments/arguments_types.py.

Configuration Item Description

Configuration Item  Type  Default Value  Description
------------------  ----  -------------  -----------
enable              bool  False          Whether to enable profiling
start_step          int   1              The step to start profiling
end_step            int   2              The step to end profiling
trace_dir           str   "./trace"      Directory to save profiling traces
record_shapes       bool  True           Whether to record input tensor shapes
profile_memory      bool  True           Whether to profile memory usage
with_stack          bool  True           Whether to record stack traces
with_modules        bool  False          Whether to record module hierarchy in profiling traces
rank0_only          bool  True           Whether to profile only rank 0
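As a rough illustration of how the fields listed above fit together, the sketch below mirrors them in a standalone dataclass. `ProfileSettings` and `should_profile` are hypothetical names used only for illustration; they are not VeOmni's actual `ProfileConfig`. In particular, treating the window as half-open (`end_step` excluded) is an assumption, so check the real class for the exact semantics.

```python
from dataclasses import dataclass


@dataclass
class ProfileSettings:
    """Hypothetical stand-in mirroring the documented train.profile.* fields."""

    enable: bool = False
    start_step: int = 1
    end_step: int = 2
    trace_dir: str = "./trace"
    record_shapes: bool = True
    profile_memory: bool = True
    with_stack: bool = True
    with_modules: bool = False
    rank0_only: bool = True

    def __post_init__(self) -> None:
        # A window that ends before it starts can never capture a step.
        if self.end_step < self.start_step:
            raise ValueError("end_step must be >= start_step")


def should_profile(cfg: ProfileSettings, step: int, rank: int) -> bool:
    """Return True if this rank would collect a trace at this step.

    Assumption: the window is half-open, [start_step, end_step).
    """
    if not cfg.enable:
        return False
    if cfg.rank0_only and rank != 0:
        return False
    return cfg.start_step <= step < cfg.end_step


cfg = ProfileSettings(enable=True, start_step=5, end_step=6)
print([step for step in range(10) if should_profile(cfg, step, rank=0)])
```

Keeping the step and rank check in one small function makes it easy to see which steps a given configuration would capture before launching a long run.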

Configuration Items That May Affect Performance

The following configuration items affect training performance while profiling is active and should be set according to the scenario:

  • record_shapes: Recording tensor shapes increases profiling overhead

  • profile_memory: Enabling memory profiling adds additional overhead

  • with_stack: Recording stack traces significantly increases profiling overhead

  • rank0_only: When set to False, all ranks will be profiled, generating a large number of files and consuming significant disk space and time
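The list above can be turned into a quick pre-flight check. The sketch below is illustrative only: `overhead_warnings` is a hypothetical helper, not part of VeOmni, and it expects a dict of already-resolved values (it does not apply the documented defaults for missing keys).

```python
def overhead_warnings(profile: dict) -> list[str]:
    """List the enabled options that the section above flags as costly."""
    notes = {
        "record_shapes": "records tensor shapes",
        "profile_memory": "profiles memory usage",
        "with_stack": "records stack traces (significant overhead)",
    }
    warnings = [f"{key}: {why}" for key, why in notes.items() if profile.get(key)]
    # rank0_only is costly when it is disabled: every rank then writes traces.
    if profile.get("rank0_only") is False:
        warnings.append("rank0_only=False: every rank writes trace files")
    return warnings


settings = {"with_stack": True, "record_shapes": False, "rank0_only": False}
for line in overhead_warnings(settings):
    print(line)
```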

Typical Configuration Method

Add profiling configuration in the model’s YAML configuration file:

train:
    profile:
        enable: true
        start_step: 5
        end_step: 6
        record_shapes: true
        trace_dir: ./profiling
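To make the link between the nested YAML fragment above and the train.profile.* namespace concrete, the sketch below flattens the same settings into dotted keys. `flatten` is an illustrative helper, not a VeOmni function, and the dict is written by hand as an assumption about what a YAML loader would produce for that fragment.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested mappings into dotted keys such as train.profile.enable."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# The YAML fragment, written out as the nested dict a YAML loader would return.
config = {
    "train": {
        "profile": {
            "enable": True,
            "start_step": 5,
            "end_step": 6,
            "record_shapes": True,
            "trace_dir": "./profiling",
        }
    }
}

for key, value in flatten(config).items():
    print(f"{key} = {value}")
```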

Profiling Analysis Tool - MindStudio Insight

After configuring profiling, launch the training script to collect performance data. Results are written to the directory specified by trace_dir. MindStudio Insight is typically used for visual analysis of the profiling data.

Use MindStudio Insight’s visualization tools to inspect operator execution time, communication time, memory usage, and so on. For details, refer to the Ascend Tool Official Documentation.

Typical Performance Problem Analysis

1. Computational Bottleneck Analysis

Check NPU Utilization:

  • Use TensorBoard or MindStudio Insight to view operator execution time

  • Identify operators with long execution times and analyze their input shapes and data types to determine whether they are computational bottlenecks

  • Examine operator call stacks to identify redundant operations

  • Identify computationally intensive operations (such as attention, matmul)

  • Check for serialized operations that leave the NPU idle
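The "find the slowest operators" step above can also be done outside a GUI. The sketch below is a minimal example that assumes Chrome-trace-style events (complete events with `ph == "X"`, a `name`, and a `dur` in microseconds); whether your exported trace uses exactly these fields is an assumption to verify. The events here are made up for illustration, and `top_operators` is a hypothetical helper.

```python
from collections import defaultdict


def top_operators(events: list[dict], k: int = 3) -> list[tuple[str, float, int]]:
    """Return the k operators with the largest total duration.

    Each result is (name, total_duration_us, call_count).
    """
    total = defaultdict(float)
    calls = defaultdict(int)
    for event in events:
        # Only complete events ("X") carry a duration.
        if event.get("ph") != "X":
            continue
        total[event["name"]] += event.get("dur", 0.0)
        calls[event["name"]] += 1
    ranked = sorted(total, key=total.get, reverse=True)
    return [(name, total[name], calls[name]) for name in ranked[:k]]


# Made-up events; real ones would come from json.load() on an exported trace.
events = [
    {"ph": "X", "name": "matmul", "dur": 420.0},
    {"ph": "X", "name": "attention", "dur": 610.0},
    {"ph": "X", "name": "matmul", "dur": 380.0},
    {"ph": "X", "name": "add", "dur": 12.0},
    {"ph": "M", "name": "process_name"},
]

for name, dur, count in top_operators(events):
    print(f"{name:<10} total={dur:8.1f} us  calls={count}")
```

Ranking by total time rather than per-call time surfaces operators that are individually cheap but called very often, which is one common form of redundant work.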

2. Memory Bottleneck Analysis

Memory Usage Analysis:

  • Use TensorBoard or MindStudio Insight to view memory usage

  • Identify steps with high memory usage and analyze their memory allocation and deallocation

  • Determine whether memory rearrangement occurs
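As a small illustration of locating the steps with high memory usage, the sketch below takes per-step allocated-memory samples and reports the peak and the largest step-to-step increase. The sample values and the `summarize_memory` helper are hypothetical; in practice the numbers would be read from the memory view of the profiling tool.

```python
def summarize_memory(samples: list[tuple[int, float]]) -> dict:
    """Summarize (step, allocated_gib) samples ordered by step.

    Requires at least two samples so that a step-to-step change exists.
    """
    peak_step, peak = max(samples, key=lambda s: s[1])
    # The largest increase between consecutive samples shows where allocation grows.
    jumps = [(cur[0], cur[1] - prev[1]) for prev, cur in zip(samples, samples[1:])]
    jump_step, jump = max(jumps, key=lambda j: j[1])
    return {
        "peak_step": peak_step,
        "peak_gib": peak,
        "jump_step": jump_step,
        "jump_gib": jump,
    }


# Made-up samples for illustration.
samples = [(1, 10.0), (2, 10.5), (3, 18.5), (4, 18.0)]
print(summarize_memory(samples))
```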

3. Multi-Machine Multi-Card Communication Bottleneck Analysis

In Distributed Training:

  • Use MindStudio Insight to view the multi-card communication overview, analyzing computation, communication, and idle time for each card

  • Find cards with long communication times and analyze their communication matrices to identify slow cards and links

  • Check the time consumption of collective communications such as all-reduce and all-gather

  • Analyze whether NPU idle waiting is caused by communication
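The per-card comparison described above amounts to ranking cards by how much of their time goes to communication. The sketch below shows that arithmetic on made-up per-rank timings; `communication_share` is a hypothetical helper, and real numbers would be read from the multi-card overview of the analysis tool.

```python
def communication_share(times: dict[int, dict[str, float]]) -> list[tuple[int, float]]:
    """Return (rank, communication share of total time), highest share first."""
    shares = []
    for rank, t in times.items():
        total = t["compute"] + t["communication"] + t["idle"]
        shares.append((rank, t["communication"] / total))
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Made-up per-rank timings in milliseconds.
times = {
    0: {"compute": 700.0, "communication": 200.0, "idle": 100.0},
    1: {"compute": 600.0, "communication": 300.0, "idle": 100.0},
    2: {"compute": 800.0, "communication": 150.0, "idle": 50.0},
}

for rank, share in communication_share(times):
    print(f"rank {rank}: {share:.0%} of step time in communication")
```

A high communication share does not by itself prove that the card or its link is slow; it can also mean the card is waiting on a slower peer, which is why the communication matrix is the next thing to inspect.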

4. Data Loading Bottleneck Analysis

CPU Activity Analysis:

  • View data preprocessing time

  • Check if the dataloader is a bottleneck

  • Analyze the overlap between data loading and computation
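A quick host-side check for a dataloader bottleneck is to time how long the loop blocks waiting for the next batch compared with how long the step itself takes. The sketch below uses only the standard library; `measure_data_wait` and `slow_loader` are hypothetical stand-ins, not VeOmni APIs. Note that accelerator kernels launch asynchronously, so host-side timings like these only approximate device time.

```python
import time


def measure_data_wait(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    wait = compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n, delay=0.02):
    # Stand-in for a dataloader whose preprocessing takes `delay` seconds per batch.
    for i in range(n):
        time.sleep(delay)
        yield i


wait, compute = measure_data_wait(slow_loader(5), lambda batch: sum(range(1000)))
share = wait / (wait + compute)
print(f"data wait: {wait:.3f}s, compute: {compute:.3f}s, wait share: {share:.0%}")
```

If the wait share stays high once training is in steady state, data loading is not overlapping with computation and the dataloader is a likely bottleneck.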