Model Optimization - Profiling Collection, Analysis and Optimization Ideas

Performance optimization is a critical step when training models on Ascend NPUs. Profiling identifies performance bottlenecks so that training efficiency can be improved. This guide explains how to collect and analyze profiling data, covering the relevant configuration, tool usage, and analysis methods for typical performance problems.

Profiling Collection Configuration and Description

VeOmni’s profiling configuration is located under the train.profile.* namespace, defined by the ProfileConfig class in veomni/arguments/arguments_types.py.

Configuration Item Description

Configuration Item	Type	Default Value	Description
enable	bool	False	Whether to enable profiling
start_step	int	1	The step to start profiling
end_step	int	2	The step to end profiling
trace_dir	str	"./trace"	Directory to save profiling traces
record_shapes	bool	True	Whether to record input tensor shapes
profile_memory	bool	True	Whether to profile memory usage
with_stack	bool	True	Whether to record stack traces
with_modules	bool	False	Whether to record module hierarchy in profiling traces
rank0_only	bool	True	Whether to profile only rank 0
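
As a quick reference, the table above can be restated as a dataclass. This is only an illustrative sketch: the class name ProfileConfigSketch is hypothetical, and the actual ProfileConfig in veomni/arguments/arguments_types.py may be declared differently. Only the field names and defaults are taken from the table.

```python
from dataclasses import asdict, dataclass


@dataclass
class ProfileConfigSketch:
    """Illustrative mirror of the table above; not the actual VeOmni class."""

    enable: bool = False          # whether to enable profiling
    start_step: int = 1           # step to start profiling
    end_step: int = 2             # step to end profiling
    trace_dir: str = "./trace"    # directory to save profiling traces
    record_shapes: bool = True    # record input tensor shapes
    profile_memory: bool = True   # profile memory usage
    with_stack: bool = True       # record stack traces
    with_modules: bool = False    # record module hierarchy
    rank0_only: bool = True       # profile only rank 0


# Defaults as listed in the table, and an instance with a few overrides.
defaults = asdict(ProfileConfigSketch())
overridden = ProfileConfigSketch(enable=True, start_step=5, end_step=6)
```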

Configuration Items That May Affect Performance

The following configuration items add overhead while profiling is active and should be set according to the scenario:

record_shapes: Recording tensor shapes increases profiling overhead
profile_memory: Enabling memory profiling adds additional overhead
with_stack: Recording stack traces significantly increases profiling overhead
rank0_only: When set to False, all ranks will be profiled, generating a large number of files and consuming significant disk space and time
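
The list above can be turned into a small pre-flight check. The sketch below is a hypothetical helper, not part of VeOmni: it takes the train.profile section as a plain dict, fills in the defaults from the configuration table, and returns a note for each option that adds overhead.

```python
# Defaults for the overhead-related options, copied from the configuration table.
OVERHEAD_DEFAULTS = {
    "record_shapes": True,
    "profile_memory": True,
    "with_stack": True,
    "rank0_only": True,
}


def overhead_notes(profile):
    """Return one note per overhead-related option that is active (hypothetical helper)."""
    cfg = {**OVERHEAD_DEFAULTS, **profile}
    notes = []
    if cfg["record_shapes"]:
        notes.append("record_shapes: recording tensor shapes increases overhead")
    if cfg["profile_memory"]:
        notes.append("profile_memory: memory profiling adds overhead")
    if cfg["with_stack"]:
        notes.append("with_stack: stack traces significantly increase overhead")
    if not cfg["rank0_only"]:
        notes.append("rank0_only=False: every rank writes traces (disk space and time)")
    return notes


# A lean configuration that switches the costly options off.
lean = {"record_shapes": False, "profile_memory": False, "with_stack": False}
lean_notes = overhead_notes(lean)
all_rank_notes = overhead_notes({"rank0_only": False})
```

With the lean settings no notes are returned, while profiling all ranks with the remaining defaults reports every item in the list.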

Typical Configuration Method

Add the profiling configuration to the model’s YAML configuration file:

train:
    profile:
        enable: true
        start_step: 5
        end_step: 6
        record_shapes: true
        trace_dir: ./profiling
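
A typo in a key or a reversed step window is easy to miss in YAML. The sketch below checks the parsed configuration (shown here as a plain dict rather than loaded from a file) against the key names in the configuration table. The function check_profile_section is hypothetical, and the rule that start_step must not exceed end_step is an assumption; VeOmni may validate these values differently or not at all.

```python
# Key names taken from the configuration table.
KNOWN_KEYS = {
    "enable", "start_step", "end_step", "trace_dir", "record_shapes",
    "profile_memory", "with_stack", "with_modules", "rank0_only",
}


def check_profile_section(config):
    """Return the train.profile section after basic sanity checks (hypothetical helper)."""
    profile = config.get("train", {}).get("profile", {})
    unknown = sorted(set(profile) - KNOWN_KEYS)
    if unknown:
        raise KeyError(f"unknown train.profile keys: {unknown}")
    # Assumption: the profiling window must not be reversed.
    start = profile.get("start_step", 1)
    end = profile.get("end_step", 2)
    if start > end:
        raise ValueError(f"start_step ({start}) is greater than end_step ({end})")
    return profile


# The YAML example above, as it would look once parsed into a dict.
parsed = {
    "train": {
        "profile": {
            "enable": True,
            "start_step": 5,
            "end_step": 6,
            "record_shapes": True,
            "trace_dir": "./profiling",
        }
    }
}
checked = check_profile_section(parsed)
```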

Profiling Analysis Tool - MindStudio Insight

After configuring profiling, start the training script to begin collecting performance data. Results are written to the directory specified by trace_dir. MindStudio Insight is typically used for visual analysis of the profiling data.

Use MindStudio Insight’s visualization tools to inspect operator execution time, communication time, memory usage, and so on. For details, refer to the Ascend Tool Official Documentation.

Typical Performance Problem Analysis

1. Computational Bottleneck Analysis

Check NPU Utilization:

Use TensorBoard or MindStudio Insight to view operator execution time
Identify operators with long execution times, analyze their input shapes and types to determine if they are computational bottlenecks
Examine operator call stacks to identify redundant operations
Identify computationally intensive operations (such as attention, matmul)
Check for serialized operations that leave the NPU idle
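
Outside the GUI tools, a rough ranking of operators by total time can be computed from the trace directly. The sketch below assumes the events are available in Chrome trace event format, where complete events have "ph" set to "X" and a "dur" field in microseconds. Whether the files written to trace_dir use this layout is an assumption, and the sample events and operator names are made up for illustration.

```python
from collections import defaultdict


def total_time_by_name(trace_events, top_n=5):
    """Sum the duration of complete ("X") events per name, largest first."""
    totals = defaultdict(float)
    for event in trace_events:
        if event.get("ph") == "X":  # only complete events carry a duration
            totals[event["name"]] += event.get("dur", 0.0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical events; a real trace would be read with json.load from a file.
sample_events = [
    {"ph": "X", "name": "matmul", "ts": 0, "dur": 120.0},
    {"ph": "X", "name": "attention", "ts": 130, "dur": 300.0},
    {"ph": "X", "name": "matmul", "ts": 450, "dur": 80.0},
    {"ph": "M", "name": "process_name"},  # metadata event, ignored
]
ranking = total_time_by_name(sample_events)
```

Sorting by total rather than per-call time surfaces operators that are individually cheap but called often, which is one way redundant operations show up.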

2. Memory Bottleneck Analysis

Memory Usage Analysis:

Use TensorBoard or MindStudio Insight to view memory usage
Identify steps with high memory usage, analyze memory allocation and deallocation
Determine whether memory rearrangement occurs
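
To make the allocation and deallocation analysis concrete, the sketch below replays a sequence of memory events and reports the peak usage and the step where it occurs. The event format, a (step, delta_bytes) pair with positive values for allocations and negative values for frees, is a hypothetical simplification and not the format of any particular profiler output.

```python
def peak_allocation(events):
    """Replay (step, delta_bytes) events; return (peak_bytes, peak_step, final_bytes)."""
    current = 0
    peak = 0
    peak_step = None
    for step, delta in events:
        current += delta  # positive = allocation, negative = deallocation
        if current > peak:
            peak, peak_step = current, step
    return peak, peak_step, current


# Made-up events covering two training steps.
sample_events = [
    (5, 4096), (5, 2048), (5, -4096),
    (6, 8192), (6, -2048), (6, -8192),
]
peak_bytes, peak_step, final_bytes = peak_allocation(sample_events)
```

A final value that stays above zero after the profiled window points at allocations that were never released within it.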

3. Multi-Machine Multi-Card Communication Bottleneck Analysis

In Distributed Training:

Use MindStudio Insight to view the multi-card communication overview, analyzing computation, communication, and idle time for each card
Find cards with long communication times, analyze their communication matrices to identify slow cards and links
Check the time consumption of collective communications such as all-reduce and all-gather
Analyze whether NPU idle waiting is caused by communication
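
The per-card comparison above can be approximated with a few lines of code once the computation, communication, and idle times per rank are known. In the sketch below the input dict and its numbers are hypothetical; in practice the values would be read off the communication overview in MindStudio Insight or extracted from the traces.

```python
def time_shares(per_rank_us):
    """Return the compute / communication / idle fraction of each rank's time."""
    shares = {}
    for rank, times in per_rank_us.items():
        total = times["compute"] + times["communication"] + times["idle"]
        shares[rank] = {
            key: times[key] / total
            for key in ("compute", "communication", "idle")
        }
    return shares


def ranks_by_communication(per_rank_us):
    """List ranks from longest to shortest communication time."""
    return sorted(
        per_rank_us,
        key=lambda rank: per_rank_us[rank]["communication"],
        reverse=True,
    )


# Made-up per-rank timings in microseconds.
per_rank_us = {
    0: {"compute": 700.0, "communication": 200.0, "idle": 100.0},
    1: {"compute": 600.0, "communication": 350.0, "idle": 50.0},
}
shares = time_shares(per_rank_us)
order = ranks_by_communication(per_rank_us)
```

The ranking only shows where to look first; deciding which card or link is actually slow still requires the communication matrix.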

4. Data Loading Bottleneck Analysis

CPU Activity Analysis:

View data preprocessing time
Check if the dataloader is a bottleneck
Analyze the overlap between data loading and computation
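
A quick way to check whether the dataloader is a bottleneck, independent of the profiler, is to time how long each iteration waits for its batch. The sketch below wraps any iterable; timed_batches and data_wait_fraction are hypothetical helpers, and the clock is injectable so the measurement can be tested deterministically. Note that if train_step only launches asynchronous device work, the total time may under-count computation unless the step synchronizes.

```python
import time


def timed_batches(loader, clock=time.perf_counter):
    """Yield (batch, seconds spent waiting for that batch)."""
    iterator = iter(loader)
    while True:
        start = clock()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, clock() - start


def data_wait_fraction(loader, train_step, clock=time.perf_counter):
    """Fraction of wall time spent waiting on the loader while training."""
    waited_total = 0.0
    begin = clock()
    for batch, waited in timed_batches(loader, clock):
        waited_total += waited
        train_step(batch)
    total = clock() - begin
    return waited_total / total if total > 0 else 0.0


# Usage with stand-ins for a real dataloader and training step.
seen = []
fraction = data_wait_fraction([[1, 2], [3, 4]], seen.append)
```

A fraction close to zero suggests data loading overlaps well with computation; a large fraction suggests the training loop is waiting on data.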
