Qwen3 DPO training guide
This guide walks through Direct Preference Optimization (DPO) training of Qwen3-0.6B on the trl-lib/ultrafeedback_binarized dataset.
Config: configs/text/qwen3_dpo.yaml
Step 1 — Prepare the dataset
python scripts/download_hf_data.py \
    --repo_id trl-lib/ultrafeedback_binarized \
    --local_dir ./ultrafeedback_binarized \
    --allow_patterns "*.parquet"
This downloads all train and test Parquet files from the Hub into ./ultrafeedback_binarized/.
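Before moving on, it can help to confirm that both splits actually arrived. The following is a minimal sketch, not part of the repository: `group_parquet_by_split` is a hypothetical helper, and it assumes the split name (`train` or `test`) appears either at the start of each file name or as a directory name under the download folder.

```python
from collections import defaultdict
from pathlib import Path


def group_parquet_by_split(root):
    """Group the *.parquet files found under `root` by dataset split."""
    root = Path(root)
    if not root.is_dir():
        return {}

    groups = defaultdict(list)
    for path in sorted(root.rglob("*.parquet")):
        rel = path.relative_to(root)
        # Assumption: a file belongs to a split if its name starts with the
        # split name or one of its parent directories is named after it.
        split = "other"
        for candidate in ("train", "test"):
            if rel.name.startswith(candidate) or candidate in rel.parent.parts:
                split = candidate
                break
        groups[split].append(path)
    return dict(groups)


if __name__ == "__main__":
    found = group_parquet_by_split("./ultrafeedback_binarized")
    for split, files in found.items():
        print(f"{split}: {len(files)} file(s)")
```

If either `train` or `test` is missing from the printed summary, re-running the download command is probably the first thing to try.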
Step 2 — Prepare the model
python scripts/download_hf_model.py \
    --repo_id Qwen/Qwen3-0.6B \
    --local_dir .
The script appends the model name to --local_dir, saving to ./Qwen3-0.6B.
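The path rule described above can be sketched as follows. This illustrates the stated behaviour only and is not the script's actual code: `resolve_model_dir` is a hypothetical name, and it assumes the "model name" is the last component of the repo ID.

```python
from pathlib import Path


def resolve_model_dir(repo_id: str, local_dir: str) -> Path:
    """Return the directory the model is expected to land in."""
    # Assumption: the model name is the part of the repo ID after the
    # organisation prefix, e.g. "Qwen/Qwen3-0.6B" -> "Qwen3-0.6B".
    model_name = repo_id.rstrip("/").split("/")[-1]
    return Path(local_dir) / model_name


if __name__ == "__main__":
    print(resolve_model_dir("Qwen/Qwen3-0.6B", "."))
```

Under that assumption, passing `--local_dir .` yields `Qwen3-0.6B` relative to the current directory, which matches the `./Qwen3-0.6B` location mentioned above.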
Step 3 — Start DPO training
bash train.sh tasks/train_text_dpo.py configs/text/qwen3_dpo.yaml
Key config values are defined in configs/text/qwen3_dpo.yaml. For the full DPO argument reference, see arguments.md (DPOConfig).
Step 4 — Monitor training
Training metrics (DPO loss, chosen/rejected rewards, reward accuracy, grad norm) are printed at every step and logged to Weights & Biases.
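To make the logged quantities easier to interpret, here is a small pure-Python sketch of how DPO loss, rewards, and reward accuracy are commonly computed from per-sequence log-probabilities. It assumes the standard sigmoid DPO loss and an illustrative `beta` of 0.1; the loss variant and `beta` actually used are whatever the config specifies, and `dpo_metrics` is a hypothetical helper, not a function from this repository.

```python
import math


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute batch-averaged DPO metrics from summed sequence log-probs."""
    losses, chosen_rewards, rejected_rewards, correct = [], [], [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Implicit rewards: scaled log-ratio between policy and reference.
        chosen = beta * (pc - rc)
        rejected = beta * (pr - rr)
        margin = chosen - rejected
        # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
        loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        losses.append(loss)
        chosen_rewards.append(chosen)
        rejected_rewards.append(rejected)
        correct.append(1.0 if chosen > rejected else 0.0)

    n = len(losses)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/accuracy": sum(correct) / n,
    }


if __name__ == "__main__":
    print(dpo_metrics([-10.0], [-15.0], [-12.0], [-14.0]))
```

Under this formulation the loss starts near log 2 (about 0.693) when the policy equals the reference, and a healthy run usually shows the loss falling while reward accuracy rises above 0.5. The metric key names in the returned dictionary are illustrative and may differ from the names that appear in the logs.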
Checkpoints
Checkpoints are saved under train.checkpoint.output_dir every save_steps steps.
With save_hf_weights: true, a HuggingFace-compatible checkpoint is also written:
Qwen3-0.6B-dpo-ultrafeedback/
└── checkpoints/
    └── global_step_200/
        ├── ...          ← DCP distributed checkpoint
        └── hf_ckpt/     ← HuggingFace safetensors
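Given the layout above, the most recent HuggingFace-compatible checkpoint can be located with a short script. This is a sketch under stated assumptions: `latest_hf_checkpoint` is a hypothetical helper, and it assumes step directories are named `global_step_<N>` inside a `checkpoints/` folder directly under the output directory, as the tree suggests.

```python
import re
from pathlib import Path

# Assumption: step directories follow the "global_step_<N>" naming shown above.
STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_hf_checkpoint(output_dir):
    """Return the hf_ckpt directory of the highest step, or None if absent."""
    ckpt_root = Path(output_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        return None

    candidates = []
    for child in ckpt_root.iterdir():
        match = STEP_DIR.match(child.name)
        hf_dir = child / "hf_ckpt"
        # Skip steps that only contain the distributed (DCP) checkpoint.
        if match and hf_dir.is_dir():
            candidates.append((int(match.group(1)), hf_dir))

    if not candidates:
        return None
    # Compare step numbers numerically so step 200 sorts after step 90.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_hf_checkpoint("Qwen3-0.6B-dpo-ultrafeedback"))
```

The returned directory should be loadable with the usual HuggingFace tooling, assuming it contains the config and tokenizer files alongside the safetensors weights.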