Qwen3 VL training guide

Download dataset

Download the COCO2017 dataset and the annotation JSON file sharegpt4v_instruct_gpt4-vision_cap100k.json.

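Filter sharegpt4v_instruct_gpt4-vision_cap100k.json so that it keeps only the COCO entries, and move each entry's image path from the image field into an images list: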

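import json

# Load the original ShareGPT4V annotation file.
with open('sharegpt4v_instruct_gpt4-vision_cap100k.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Keep only entries whose image path starts with 'coco', and store that
# path as a one-element list under the 'images' key.
filtered_data = []
for item in data:
    if item.get('image', '').startswith('coco'):
        new_item = item.copy()
        image_path = new_item.pop('image')
        new_item['images'] = [image_path]
        filtered_data.append(new_item)

# Write the filtered annotations; this file is passed as --data.train_path.
with open('sharegpt4v_instruct_gpt4-vision_cap100k_coco.json', 'w', encoding='utf-8') as f:
    json.dump(filtered_data, f, ensure_ascii=False, indent=4)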
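
Before training, it can help to confirm that the filtered entries point at images that actually exist on disk. The sketch below is not part of the guide's tooling: summarize_filtered and its image_root argument are hypothetical names, and it assumes the paths in the images lists are relative to a directory that contains the COCO folder.

```python
import os
from collections import Counter


def summarize_filtered(items, image_root):
    """Return (entry_count, missing_paths, prefix_counts) for filtered entries.

    image_root is a hypothetical argument: the directory that the relative
    image paths are assumed to be resolved against.
    """
    missing = []
    prefixes = Counter()
    for item in items:
        for rel_path in item.get("images", []):
            # Count the first path component of each image path.
            prefixes[rel_path.split("/", 1)[0]] += 1
            # Record paths that do not resolve to a file under image_root.
            if not os.path.isfile(os.path.join(image_root, rel_path)):
                missing.append(rel_path)
    return len(items), missing, prefixes
```

To use it, load sharegpt4v_instruct_gpt4-vision_cap100k_coco.json with json.load and pass the resulting list together with the dataset directory. A non-empty list of missing paths usually means the dataset was unpacked somewhere other than where the annotation paths expect it.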

Download Qwen3 VL model

Qwen3-VL-8B

python3 scripts/download_hf_model.py \
    --repo_id Qwen/Qwen3-VL-8B-Instruct \
    --local_dir .

Qwen3-VL-30B

python3 scripts/download_hf_model.py \
    --repo_id Qwen/Qwen3-VL-30B-A3B-Instruct \
    --local_dir .
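
After either download finishes, a quick look at the checkpoint directory can catch an interrupted or misplaced download before training starts. This is a minimal sketch under stated assumptions: check_model_dir is a hypothetical helper, and it assumes the checkpoint follows the usual Hugging Face layout with a config.json and one or more *.safetensors weight files at the top level.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found in a downloaded checkpoint directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: usual Hugging Face layout (config.json plus safetensors weights).
    if not (root / "config.json").is_file():
        problems.append("config.json not found")
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems
```

For example, check_model_dir("./Qwen3-VL-8B-Instruct") returns an empty list when both checks pass. The directory name here matches the --model.model_path value used in the training commands.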

Start training on GPU/NPU

Qwen3-VL-8B

bash train.sh tasks/train_vlm.py configs/multimodal/qwen3_vl/qwen3_vl_dense.yaml \
    --model.model_path ./Qwen3-VL-8B-Instruct \
    --data.train_path ./sharegpt4v_instruct_gpt4-vision_cap100k_coco.json \
    --data.dataloader.type native \
    --data.datasets_type iterable \
    --data.source_name sharegpt4v_sft \
    --data.dataloader.num_workers 8 \
    --train.micro_batch_size 3

Qwen3-VL-30B

bash train.sh tasks/train_vlm.py configs/multimodal/qwen3_vl/qwen3_vl_moe.yaml \
    --model.model_path ./Qwen3-VL-30B-A3B-Instruct \
    --data.train_path ./sharegpt4v_instruct_gpt4-vision_cap100k_coco.json \
    --data.dataloader.type native \
    --data.datasets_type iterable \
    --data.source_name sharegpt4v_sft \
    --data.dataloader.num_workers 8 \
    --train.micro_batch_size 2
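
The dotted flags in the commands above (for example --data.dataloader.type) look like overrides for nested keys in the YAML config. That reading is an assumption and is not confirmed by this guide. Under that assumption, the following sketch shows how such flags could map onto a nested structure; overrides_to_nested is a hypothetical helper, not the launcher's actual parser.

```python
def overrides_to_nested(args):
    """Turn ['--a.b', '1', ...] into {'a': {'b': '1'}}; values stay strings."""
    nested = {}
    # Walk the list two items at a time: a flag followed by its value.
    for flag, value in zip(args[0::2], args[1::2]):
        keys = flag.lstrip("-").split(".")
        node = nested
        # Create intermediate dictionaries for every key except the last.
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return nested
```

Called with the data and train flags from the 30B command, it groups type and num_workers under data.dataloader and places micro_batch_size under train, which is a convenient way to see which YAML section each flag would touch.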