FAQ: Common Issues and Solutions for Ascend NPU
This document addresses frequently asked questions and common issues encountered when using VeOmni with Ascend NPUs.
Q: How to resolve memory fragmentation issues on NPU?
A: Set the multi-stream memory reuse environment variable
# Enable NPU multi-stream memory reuse
export MULTI_STREAM_MEMORY_REUSE=2
This enables the NPU’s multi-stream memory reuse feature, which reduces memory fragmentation and improves utilization. Recommended value: 2.
Note: This environment variable is already set by default in train.sh.
Q: How to configure multi-node training?
A: Modify environment variables in train.sh
Below is a 2-node example (adjust according to your cluster size):
# Number of nodes (2 nodes in this example)
NNODES=${NNODES:=2}
# Current node rank (0 to 1 for 2 nodes - must be different for each machine)
NODE_RANK=${NODE_RANK:=0}
# Master node address (IP address - must be the same across all machines)
MASTER_ADDR=${MASTER_ADDR:=192.168.1.100}
# Master node port (default works for most cases)
MASTER_PORT=${MASTER_PORT:=12345}
# Number of NPUs per node (A2: max 8, A3: max 16)
NPROC_PER_NODE=${NPROC_PER_NODE:=8}
Configuration Location: These parameters are in train.sh (lines 9-37).
Parameter Explanations:
NNODES: Total number of nodes in your cluster
NODE_RANK: Unique identifier for each node (0 to NNODES-1)
MASTER_ADDR: IP address of the master node (same for all machines)
MASTER_PORT: Communication port (default: 12345)
NPROC_PER_NODE: NPUs per node (A2: max 8, A3: max 16)
Important Notes:
All nodes must communicate via MASTER_ADDR:MASTER_PORT
All nodes need the same configuration files and data paths
Ensure network connectivity between all nodes
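The constraints listed above can be checked before launching. The following is a minimal sketch, not part of train.sh: the helper name check_node_config is hypothetical, and it only verifies that NODE_RANK lies in the range 0 to NNODES-1 and prints the total process count (NNODES × NPROC_PER_NODE).

```shell
# Hypothetical helper (not part of VeOmni or train.sh):
# sanity-check the multi-node settings before launching.
# Usage: check_node_config NNODES NODE_RANK NPROC_PER_NODE
check_node_config() {
  local nnodes=$1
  local node_rank=$2
  local nproc=$3
  # NODE_RANK must lie in the range 0 .. NNODES-1
  if [ "$node_rank" -lt 0 ] || [ "$node_rank" -ge "$nnodes" ]; then
    echo "NODE_RANK=$node_rank is outside 0..$((nnodes - 1))" >&2
    return 1
  fi
  # Total number of processes across the whole cluster
  echo $((nnodes * nproc))
}

# Example: the 2-node, 8-NPU configuration shown above
check_node_config 2 0 8
```

Running the same check on every machine with its own NODE_RANK would catch a rank that falls outside the valid range, though it cannot detect two machines that were given the same rank.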
Q: How to resolve “‘liger_kernel’ is not supported on Ascend NPU” error?
A: Set model.ops_implementation parameters in YAML
Configure the operators in your YAML configuration file:
model:
  ops_implementation:
    # Attention implementation selection
    attn_implementation: "flash_attention_2" # Options: "eager", "sdpa", "flash_attention_2", "flash_attention_3", "flash_attention_4", "native-sparse"
    # MoE implementation selection
    moe_implementation: "fused_npu" # Options: "eager", "fused_npu"
Configuration Location: These parameters are in your YAML configuration file (see arguments.md lines 127-135).
NPU Optimized Operators: VeOmni automatically detects the NPU environment and uses optimized operators:
npu_group_gemm: MoE GroupGEMM operator (npu_group_gemm.py:1-114)
npu_rms_norm: RMS normalization operator (npu_fused_operator.py:20-26)
npu_rotary_mul: RoPE positional encoding operator (npu_fused_operator.py:28-52)
Note: On NPU, optimal operator implementations are selected automatically. For attention, SDPA or CANN built-in operators are used; for MoE, npu_group_gemm is used.
Q: How to resolve “‘global batch size’ should be a multiple of 8/16/32” error?
A: Ensure proper batch size configuration
Make sure the global batch size meets the multiple requirement:
global_batch_size = micro_batch_size × data_parallel_size × gradient_accumulation_steps
Important Notes:
If global_batch_size is not set, the system automatically calculates it as micro_batch_size × dp_size
Ensure global_batch_size is evenly divisible by all parallel dimensions
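The arithmetic in the formula above can be checked by hand before editing the configuration. This is a minimal sketch with hypothetical helper names (global_batch_size, is_multiple_of) that are not part of VeOmni; the example values 2, 8, 4 and the factor 16 are illustrative assumptions, so substitute the multiple named in your own error message.

```shell
# Hypothetical helpers (not part of VeOmni) illustrating the batch-size arithmetic.

# global_batch_size MICRO_BATCH_SIZE DATA_PARALLEL_SIZE GRAD_ACCUM_STEPS
global_batch_size() {
  echo $(($1 * $2 * $3))
}

# is_multiple_of VALUE FACTOR: succeeds when VALUE is a multiple of FACTOR
is_multiple_of() {
  [ $(($1 % $2)) -eq 0 ]
}

# Illustrative values: micro batch 2, data-parallel size 8, 4 accumulation steps
gbs=$(global_batch_size 2 8 4)
if is_multiple_of "$gbs" 16; then
  echo "global_batch_size=$gbs is a multiple of 16"
else
  echo "global_batch_size=$gbs is NOT a multiple of 16" >&2
fi
```

If the check fails, adjusting gradient_accumulation_steps is usually the least invasive way to reach a valid multiple, since it changes neither the per-device memory footprint nor the parallel layout.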
Q: How to set NPU device visibility?
A: Use ASCEND_RT_VISIBLE_DEVICES environment variable
# Make only NPUs 0,1,2,3 visible
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
Automatic Detection: If not set, the system automatically detects all available NPU devices:
# Automatically detect number of available NPUs
NPROC_PER_NODE=$(ls -l /dev/davinci* | grep -v "davinci_manager" | wc -l)
Note: Similar to CUDA’s CUDA_VISIBLE_DEVICES, this controls which NPU devices are visible to processes.
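When visibility is restricted, the number of processes launched per node should not exceed the number of visible devices. The following is a minimal sketch of keeping the two consistent; the helper count_visible_devices is hypothetical, and whether train.sh derives NPROC_PER_NODE this way is an assumption not confirmed by this document.

```shell
# Hypothetical helper (not part of train.sh): count the IDs in a
# comma-separated device list such as "0,1,2,3".
count_visible_devices() {
  if [ -z "$1" ]; then
    # Empty or unset list: report zero so the caller can fall back
    echo 0
    return
  fi
  # Split on commas and print the number of fields
  echo "$1" | awk -F',' '{ print NF }'
}

# Make only NPUs 0,1,2,3 visible, then match the process count to them
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
NPROC_PER_NODE=$(count_visible_devices "$ASCEND_RT_VISIBLE_DEVICES")
echo "NPROC_PER_NODE=$NPROC_PER_NODE"
```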
Q: How to resolve Transformers version incompatibility issues?
A: Use compatible Transformers versions
Ensure you’re using a compatible Transformers version:
# Check current Transformers version
python -c "import transformers; print(transformers.__version__)"
# Install using uv
uv sync --locked --extra npu --extra audio --group dev
# Or install using pip
pip install transformers==5.2.0
Version Recommendations:
VeOmni pins Transformers 5.2.0 (see pyproject.toml). Other v5 minor versions may work but are not exercised in CI.
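The recommendation above can be turned into a quick pre-flight check. This is a sketch only: the helper classify_transformers_version and its three labels are hypothetical, and the pattern match is deliberately coarse (it looks at the version string prefix and does not parse pre-release suffixes).

```shell
# Hypothetical helper (not part of VeOmni): classify a Transformers
# version string against the pinned release 5.2.0.
classify_transformers_version() {
  case "$1" in
    5.2.0) echo "pinned" ;;         # the version VeOmni pins
    5.*)   echo "untested-v5" ;;    # may work, but not exercised in CI
    *)     echo "other-major" ;;    # outside the v5 series
  esac
}

# Example with a literal version string
classify_transformers_version "5.2.0"
```

To check the installed package, pass the output of the version command shown earlier, for example classify_transformers_version "$(python -c 'import transformers; print(transformers.__version__)')".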