FAQ: Common Issues and Solutions for Ascend NPU#

This document addresses frequently asked questions and common issues encountered when using VeOmni with Ascend NPUs.

Q: How to resolve memory fragmentation issues on NPU?#

A: Set the multi-stream memory reuse environment variable#

# Enable NPU multi-stream memory reuse
export MULTI_STREAM_MEMORY_REUSE=2

This enables the NPU’s multi-stream memory reuse feature, which reduces memory fragmentation and improves utilization. Recommended value: 2.

Note: This environment variable is already set by default in train.sh

Q: How to configure multi-node training?#

A: Modify environment variables in train.sh#

Below is a 2-node example (adjust according to your cluster size):

# Number of nodes (2 nodes in this example)
NNODES=${NNODES:=2}
# Current node rank (0 to 1 for 2 nodes - must be different for each machine)
NODE_RANK=${NODE_RANK:=0}
# Master node address (IP address - must be the same across all machines)
MASTER_ADDR=${MASTER_ADDR:=192.168.1.100}
# Master node port (default works for most cases)
MASTER_PORT=${MASTER_PORT:=12345}
# Number of NPUs per node (A2: max 8, A3: max 16)
NPROC_PER_NODE=${NPROC_PER_NODE:=8}

Configuration Location: These parameters are in train.sh (lines 9-37).

Parameter Explanations:

  • NNODES: Total number of nodes in your cluster

  • NODE_RANK: Unique identifier for each node (0 to NNODES-1)

  • MASTER_ADDR: IP address of the master node (same for all machines)

  • MASTER_PORT: Communication port (default: 12345)

  • NPROC_PER_NODE: NPUs per node (A2: max 8, A3: max 16)

Important Notes:

  • All nodes must communicate via MASTER_ADDR:MASTER_PORT

  • All nodes need the same configuration files and data paths

  • Ensure network connectivity between all nodes

Q: How to resolve “‘liger_kernel’ is not supported on Ascend NPU” error?#

A: Set model.ops_implementation parameters in YAML#

Configure the operators in your YAML configuration file:

model:
  ops_implementation:
    # Attention implementation selection
    attn_implementation: "flash_attention_2"  # Options: "eager", "sdpa", "flash_attention_2", "flash_attention_3", "flash_attention_4", "native-sparse"
    # MoE implementation selection
    moe_implementation: "fused_npu"    # Options: "eager", "fused_npu"

Configuration Location: These parameters are in your YAML configuration file (see arguments.md lines 127-135).

NPU Optimized Operators: VeOmni automatically detects NPU environment and uses optimized operators:

  • npu_group_gemm: MoE GroupGEMM operator (npu_group_gemm.py:1-114)

  • npu_rms_norm: RMS normalization operator (npu_fused_operator.py:20-26)

  • npu_rotary_mul: RoPE positional encoding operator (npu_fused_operator.py:28-52)

Note: NPU automatically selects optimal operator implementations. For Attention, it uses SDPA or CANN built-in operators; for MoE, it uses npu_group_gemm.

Q: How to resolve “‘global batch size’ should be a multiple of 8/16/32” error?#

A: Ensure proper batch size configuration#

Make sure the global batch size meets the multiple requirement:

global_batch_size = micro_batch_size × data_parallel_size × gradient_accumulation_steps

Important Notes:

  • If global_batch_size is not set, the system automatically calculates it as micro_batch_size × dp_size

  • Ensure global_batch_size can be divided by all parallel dimensions

Q: How to set NPU device visibility?#

A: Use ASCEND_RT_VISIBLE_DEVICES environment variable#

# Make only NPUs 0,1,2,3 visible
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

Automatic Detection: If not set, the system automatically detects all available NPU devices:

# Automatically detect number of available NPUs
NPROC_PER_NODE=$(ls -l /dev/davinci* | grep -v "davinci_manager" | wc -l)

Note: Similar to CUDA’s CUDA_VISIBLE_DEVICES, this controls which NPU devices are visible to processes.

Q: How to resolve Transformers version incompatibility issues?#

A: Use compatible Transformers versions#

Ensure you’re using a compatible Transformers version:

# Check current Transformers version
python -c "import transformers; print(transformers.__version__)"

# Install using uv
uv sync --locked --extra npu --extra audio --group dev

# Or install using pip
pip install transformers==5.2.0

Version Recommendations:

  • VeOmni pins Transformers 5.2.0 (see pyproject.toml). Other v5 minor versions may work but are not exercised in CI.