Typical Usage: Qwen3-VL 8B Training on Ascend NPU#

This document provides a complete step-by-step guide for training the Qwen3-VL 8B model on Ascend NPUs. Follow these instructions carefully to ensure a successful training experience.

Prerequisites#

  • Ascend NPU environment with CANN 9.0.0 installed

  • VeOmni framework installed (see Installation section)

  • Sufficient storage space for dataset and model weights

Step 1: Download Dataset#

We’ll use the COCO2017 dataset and ShareGPT4V annotations for training. Follow these steps to prepare the dataset:

# Download COCO2017 dataset
wget https://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip

# Download ShareGPT4V annotations
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json

Step 2: Preprocess Dataset#

Modify the annotation file to match the expected format for VeOmni:

import json
with open('sharegpt4v_instruct_gpt4-vision_cap100k.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
filtered_data = []
for item in data:
    if item.get('image', '').startswith('coco'):
        new_item = item.copy()
        image_path = new_item.pop('image')
        # Update the image path to point to your downloaded COCO dataset
        new_item['images'] = [f'./train2017/{image_path.split("/")[-1]}']
        filtered_data.append(new_item)
with open('sharegpt4v_instruct_gpt4-vision_cap100k_coco.json', 'w', encoding='utf-8') as f:
    json.dump(filtered_data, f, ensure_ascii=False, indent=4)

Step 3: Download Pre-trained Model#

Download the Qwen3-VL-8B-Instruct model weights:

python3 scripts/download_hf_model.py \
    --repo_id Qwen/Qwen3-VL-8B-Instruct \
    --local_dir ./Qwen3-VL-8B-Instruct

Step 4: Configure Training#

VeOmni uses YAML configuration files for training. You can directly modify the configuration file at configs/multimodal/qwen3_vl/qwen3_vl_dense.yaml to adjust parameters like batch size, learning rate, and other hyperparameters according to your needs.

NPU-Friendly Operator Configurations#

Important: The default configuration files are optimized for GPU environments. When running on Ascend NPUs, we recommend adding the following minimal NPU-friendly operator configurations to your model’s YAML file. These settings optimize operator implementations specifically for NPU hardware and are generally applicable to most models in the repository:

model:
  ops_implementation:
    attn_implementation: flash_attention_2
    moe_implementation: fused_npu
    cross_entropy_loss_implementation: npu
    rms_norm_implementation: npu
    rotary_pos_emb_implementation: npu
    swiglu_mlp_implementation: eager           # no NPU backend available
    load_balancing_loss_implementation: eager  # triton-ascend not exposed as `triton`
    # Qwen3.5 GatedDeltaNet trio — only meaningful for Qwen3.5 MoE models;
    # pin to eager on NPU (no NPU kernel) and set train.dyn_bsz=False:
    rms_norm_gated_implementation: eager
    causal_conv1d_implementation: eager
    chunk_gated_delta_rule_implementation: eager

These configurations specify the optimal implementation for each operator type when running on NPUs:

  • Use NPU-optimized implementations where available (rms_norm_implementation: npu)

  • Fall back to compatible implementations for operations without NPU support (eager)

  • Configure specialized settings for model-specific components (Qwen3.5 GatedDeltaNet)

Note: Some models with structurally-incompatible kernels (e.g., Wan rope_apply, Qwen2-VL multimodal RoPE) already include these NPU-friendly configurations in their default YAML files.

Step 5: Start Training#

Run the training command with the appropriate parameters:

bash train.sh tasks/train_vlm.py configs/multimodal/qwen3_vl/qwen3_vl_dense.yaml \
    --model.model_path ./Qwen3-VL-8B-Instruct \
    --data.train_path ./sharegpt4v_instruct_gpt4-vision_cap100k_coco.json \
    --data.dataloader.type native \
    --data.datasets_type iterable \
    --data.source_name sharegpt4v_sft \

Step 6: Checkpoint Configuration#

VeOmni automatically saves checkpoints during training. You can configure the checkpoint behavior in the YAML configuration file:

train:
  checkpoint:
    output_dir: ./checkpoints
    save_steps: 1000
    save_epochs: 1
    save_hf_weights: true

Key checkpoint configuration parameters:

  • output_dir: Directory to save checkpoints

  • save_steps: Number of steps between checkpoint saves

  • save_epochs: Number of epochs between checkpoint saves

  • save_hf_weights: Whether to save Hugging Face model weights in addition to the VeOmni checkpoint format (only in the last checkpoint directory)