Model Optimization - Profiling Collection, Analysis and Optimization Ideas#
Performance optimization is a critical step when training models on Ascend NPUs. Profiling identifies performance bottlenecks so that training efficiency can be improved. This guide describes how to collect and analyze profiling data, covering the relevant configuration, tool usage, and analysis methods for typical performance problems.
Profiling Collection Configuration and Description#
VeOmni’s profiling configuration is located under the train.profile.* namespace and is defined by the ProfileConfig class in veomni/arguments/arguments_types.py.
Configuration Item Description#
| Configuration Item | Type | Default Value | Description |
|---|---|---|---|
| enable | bool | False | Whether to enable profiling |
| start_step | int | 1 | The step to start profiling |
| end_step | int | 2 | The step to end profiling |
| trace_dir | str | "./trace" | Directory to save profiling traces |
| record_shapes | bool | True | Whether to record input tensor shapes |
| profile_memory | bool | True | Whether to profile memory usage |
| with_stack | bool | True | Whether to record stack traces |
| with_modules | bool | False | Whether to record module hierarchy in profiling traces |
| rank0_only | bool | True | Whether to profile only rank 0 |
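For orientation, the options listed above can be mirrored as a small Python dataclass. This is an illustrative sketch built only from the field names and defaults documented here; `ProfileConfigSketch` is a hypothetical name, not the actual `ProfileConfig` class in veomni/arguments/arguments_types.py, whose definition may differ.

```python
from dataclasses import dataclass, asdict


@dataclass
class ProfileConfigSketch:
    """Hypothetical mirror of the train.profile.* options (not VeOmni's real class)."""

    enable: bool = False          # whether to enable profiling
    start_step: int = 1           # step to start profiling
    end_step: int = 2             # step to end profiling
    trace_dir: str = "./trace"    # directory to save profiling traces
    record_shapes: bool = True    # record input tensor shapes
    profile_memory: bool = True   # profile memory usage
    with_stack: bool = True       # record stack traces
    with_modules: bool = False    # record module hierarchy
    rank0_only: bool = True       # profile only rank 0


# Print the documented defaults as a plain dictionary.
defaults = asdict(ProfileConfigSketch())
print(defaults)
```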
Configuration Items That May Affect Performance#
The following configuration items will impact training performance and need to be set according to the scenario:
record_shapes: Recording tensor shapes increases profiling overhead
profile_memory: Enabling memory profiling adds additional overhead
with_stack: Recording stack traces significantly increases profiling overhead
rank0_only: When set to False, all ranks will be profiled, generating a large number of files and consuming significant disk space and time
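One way to reason about these trade-offs is to keep two presets: a light one for timing-sensitive runs and a detailed one for debugging. The sketch below is only an illustration; `profile_overrides` is a hypothetical helper, and which options to switch off is a judgment call rather than a VeOmni recommendation.

```python
def profile_overrides(detailed: bool, all_ranks: bool = False) -> dict:
    """Return train.profile.* values for a light or a detailed collection run.

    Hypothetical helper: the keys come from the configuration items above,
    but the presets themselves are assumptions made for illustration.
    """
    return {
        "enable": True,
        "record_shapes": detailed,     # shape recording adds overhead
        "profile_memory": detailed,    # memory profiling adds overhead
        "with_stack": detailed,        # stack traces add the most overhead
        "rank0_only": not all_ranks,   # profiling all ranks costs disk and time
    }


light = profile_overrides(detailed=False)
full = profile_overrides(detailed=True, all_ranks=True)
print(light)
print(full)
```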
Typical Configuration Method#
Add profiling configuration in the model’s YAML configuration file:
train:
  profile:
    enable: true
    start_step: 5
    end_step: 6
    record_shapes: true
    trace_dir: ./profiling
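Before launching a long job, it can help to sanity-check the profile section. The sketch below assumes the YAML has already been loaded into a nested dictionary; `check_profile_section` is a hypothetical helper, and the checks are assumptions (for example, it only rejects `end_step < start_step`, because whether the end step is inclusive is not stated here).

```python
def check_profile_section(profile: dict) -> list:
    """Return a list of human-readable problems found in a train.profile dict."""
    problems = []
    start = profile.get("start_step", 1)   # documented default
    end = profile.get("end_step", 2)       # documented default
    if not isinstance(profile.get("enable", False), bool):
        problems.append("enable must be a boolean")
    if start < 0:
        problems.append("start_step must not be negative")
    if end < start:
        problems.append("end_step must not be smaller than start_step")
    if not profile.get("trace_dir", "./trace"):
        problems.append("trace_dir must not be empty")
    return problems


# Same values as the YAML example, written as the dictionary a loader would produce.
config = {
    "train": {
        "profile": {
            "enable": True,
            "start_step": 5,
            "end_step": 6,
            "record_shapes": True,
            "trace_dir": "./profiling",
        }
    }
}
problems = check_profile_section(config["train"]["profile"])
print(problems or "profile section looks consistent")
```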
Profiling Analysis Tool - MindStudio Insight#
After configuring collection, start the training script to begin collecting performance data. Results are written to the directory specified by trace_dir. MindStudio Insight is typically used for visual analysis of the profiling data.
Use MindStudio Insight’s visualization tools for performance analysis, viewing operator execution time, communication time, memory usage, etc. For details, refer to the Ascend Tool Official Documentation.
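Before opening the results in a viewer, a quick look at the output directory confirms that collection produced files and shows how much disk space they take. This is a generic sketch using only the standard library; `summarize_trace_dir` is a hypothetical helper, it makes no assumption about the file names the profiler writes, and `./profiling` is simply the directory from the YAML example.

```python
from pathlib import Path


def summarize_trace_dir(trace_dir: str) -> dict:
    """Count files and total bytes under trace_dir (recursively)."""
    root = Path(trace_dir)
    if not root.is_dir():
        return {"exists": False, "files": 0, "total_bytes": 0}
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "exists": True,
        "files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


print(summarize_trace_dir("./profiling"))
```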
Typical Performance Problem Analysis#
1. Computational Bottleneck Analysis#
Check NPU Utilization:
Use TensorBoard or MindStudio Insight to view operator execution time
Identify operators with long execution times, analyze their input shapes and types to determine if they are computational bottlenecks
Examine operator call stacks to identify redundant operations
Identify computationally intensive operations (such as attention, matmul)
Check for operations that execute serially and leave the NPU idle
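The checks above can be approximated offline once operator timings have been exported from the viewer. The sketch below assumes a simple, hypothetical record format: `(operator_name, duration_us)` pairs for ranking and `(start_us, end_us)` intervals for idle gaps. The operator names and numbers are made-up sample data, not measurements.

```python
from collections import defaultdict


def top_operators(events, k=3):
    """Rank operators by total time; returns (name, total_us, calls, share)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, duration_us in events:
        totals[name] += duration_us
        counts[name] += 1
    grand_total = sum(totals.values()) or 1.0
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, total, counts[name], total / grand_total) for name, total in ranked[:k]]


def idle_gaps(intervals, min_gap_us=0.0):
    """Return (gap_start, gap_end) pairs where no operator was running."""
    gaps = []
    last_end = None
    for start, end in sorted(intervals):
        if last_end is not None and start - last_end > min_gap_us:
            gaps.append((last_end, start))
        last_end = end if last_end is None else max(last_end, end)
    return gaps


# Made-up sample data for illustration only.
events = [("MatMul", 120.0), ("FlashAttention", 300.0), ("MatMul", 80.0), ("Cast", 10.0)]
ranked = top_operators(events, k=2)
gaps = idle_gaps([(0, 10), (15, 20), (18, 30), (50, 60)], min_gap_us=4)
print(ranked)
print(gaps)
```

Operators that dominate the ranking are candidates for closer inspection of shapes and dtypes, while long gaps point at host-side stalls or serialized work.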
2. Memory Bottleneck Analysis#
Memory Usage Analysis:
Use TensorBoard or MindStudio Insight to view memory usage
Identify steps with high memory usage, analyze memory allocation and deallocation
Determine if memory rearrangement exists
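To make the allocation and deallocation analysis concrete, the sketch below replays a list of memory events and reports the peak. The event format `(step, delta_bytes)` is a hypothetical simplification of what a memory view shows (positive for allocation, negative for release), and the numbers are invented for illustration.

```python
def peak_memory(events):
    """Replay (step, delta_bytes) events in order.

    Returns (peak_bytes, step_at_peak, bytes_still_allocated_at_end).
    """
    current = 0
    peak = 0
    peak_step = None
    for step, delta in events:
        current += delta
        if current > peak:
            peak, peak_step = current, step
    return peak, peak_step, current


# Invented sample: allocations and frees across training steps 5 and 6.
sample = [(5, 4096), (5, 8192), (5, -4096), (6, 16384), (6, -8192), (6, -16384)]
peak, step, leftover = peak_memory(sample)
print(f"peak={peak} bytes at step {step}, still allocated at end={leftover}")
```

A non-zero leftover value that grows from step to step would suggest tensors being retained longer than intended.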
3. Multi-Machine Multi-Card Communication Bottleneck Analysis#
In Distributed Training:
Use MindStudio Insight to view the multi-card communication overview, analyzing computation, communication, and idle time for each card
Find cards with long communication times, analyze their communication matrices to identify slow cards and links
Check the time consumption of collective communications such as all-reduce and all-gather
Analyze whether NPU idle waiting is caused by communication
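The per-card comparison can be sketched as a small outlier check over a computation/communication/idle breakdown. The input format, the `communication_outliers` name, and the 20% tolerance are all assumptions for illustration. The sketch flags deviations in both directions, because ranks that wait on a slower peer may show longer communication time than the slow rank itself.

```python
from statistics import median


def communication_outliers(per_rank, tolerance=0.2):
    """Flag ranks whose communication time deviates from the median.

    per_rank maps rank -> {"compute": s, "communication": s, "idle": s}.
    Returns rank -> {"share": comm/total, "deviation": relative to median}.
    """
    med = median(t["communication"] for t in per_rank.values())
    outliers = {}
    for rank, t in per_rank.items():
        total = t["compute"] + t["communication"] + t["idle"]
        deviation = (t["communication"] - med) / med if med else 0.0
        if abs(deviation) > tolerance:
            share = t["communication"] / total if total else 0.0
            outliers[rank] = {"share": share, "deviation": deviation}
    return outliers


# Invented per-rank timings in seconds for one profiled step.
timings = {
    0: {"compute": 2.4, "communication": 1.0, "idle": 0.1},
    1: {"compute": 2.3, "communication": 1.1, "idle": 0.1},
    2: {"compute": 2.4, "communication": 1.0, "idle": 0.1},
    3: {"compute": 3.0, "communication": 0.4, "idle": 0.1},
}
outliers = communication_outliers(timings)
print(outliers)
```

Flagged ranks are the ones whose communication matrix and links are worth inspecting first.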
4. Data Loading Bottleneck Analysis#
CPU Activity Analysis:
View data preprocessing time
Check if the dataloader is a bottleneck
Analyze the overlap between data loading and computation
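A lightweight way to check whether the dataloader is a bottleneck, independent of any profiler, is to time how long the training loop waits for each batch. The sketch below wraps an arbitrary iterable; `TimedLoader` and `fake_batches` are hypothetical names, and the sleeps stand in for real preprocessing and training work. Note that these are host-side timings, so with asynchronous device execution they describe the wait on the loader rather than device utilization.

```python
import time


class TimedLoader:
    """Wrap any iterable of batches and record time spent waiting for each one."""

    def __init__(self, loader):
        self.loader = loader
        self.wait_seconds = 0.0
        self.batches = 0

    def __iter__(self):
        iterator = iter(self.loader)
        while True:
            started = time.perf_counter()
            try:
                batch = next(iterator)
            except StopIteration:
                return
            self.wait_seconds += time.perf_counter() - started
            self.batches += 1
            yield batch


def fake_batches(n, delay):
    """Stand-in data source: each batch takes `delay` seconds to prepare."""
    for i in range(n):
        time.sleep(delay)
        yield i


loader = TimedLoader(fake_batches(3, 0.01))
loop_started = time.perf_counter()
for batch in loader:
    time.sleep(0.005)  # stand-in for the training step
total = time.perf_counter() - loop_started
print(f"batches={loader.batches}, data wait share={loader.wait_seconds / total:.0%}")
```

A high wait share suggests that data loading is not overlapping with computation and that the input pipeline deserves attention before operator-level tuning.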