Checkpoint Conversion#
This guide explains how to convert VeOmni’s Distributed Checkpoint (DCP) format to HuggingFace format using the merge_dcp_to_hf.py script.
Overview#
The merge_dcp_to_hf.py script provides memory-efficient conversion from PyTorch Distributed Checkpoint (DCP) format to HuggingFace format. It processes checkpoints shard-by-shard to minimize memory usage, making it suitable for large models.
Usage#
Basic Usage#
python scripts/merge_dcp_to_hf.py --load-dir <DCP_CHECKPOINT_PATH>
This will create a HuggingFace format checkpoint in <DCP_CHECKPOINT_PATH>/hf_ckpt.
Advanced Usage#
python scripts/merge_dcp_to_hf.py \
--load-dir <DCP_CHECKPOINT_PATH> \
--save-dir <OUTPUT_PATH> \
--model-assets-dir <MODEL_CONFIG_PATH> \
--shard-size 2000000000
Command-Line Arguments#
| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| --load-dir | str | Yes | - | Directory containing the DCP checkpoint |
| --save-dir | str | No | <load-dir>/hf_ckpt | Output directory for the HuggingFace format checkpoint |
| --model-assets-dir | str | No | None | Directory containing model config and processor (e.g., tokenizer) |
| --shard-size | int | No | 2000000000 | Maximum shard size in bytes (2 GB) |
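To make the defaults concrete, here is a minimal argparse sketch that mirrors the documented arguments. It is an illustration only, not the script's actual parser: build_parser and resolve_save_dir are hypothetical names, and the fallback to <load-dir>/hf_ckpt is inferred from the Basic Usage description.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments (not the real one)."""
    parser = argparse.ArgumentParser(
        description="Convert a DCP checkpoint to HuggingFace format"
    )
    parser.add_argument("--load-dir", type=str, required=True,
                        help="Directory containing the DCP checkpoint")
    parser.add_argument("--save-dir", type=str, default=None,
                        help="Output directory for the HuggingFace format checkpoint")
    parser.add_argument("--model-assets-dir", type=str, default=None,
                        help="Directory containing model config and processor")
    parser.add_argument("--shard-size", type=int, default=2_000_000_000,
                        help="Maximum shard size in bytes")
    return parser


def resolve_save_dir(args: argparse.Namespace) -> str:
    """Assumed fallback: write to <load-dir>/hf_ckpt when --save-dir is omitted."""
    if args.save_dir is not None:
        return args.save_dir
    return os.path.join(args.load_dir, "hf_ckpt")


# Parse an explicit argument list so the example does not depend on sys.argv.
example_args = build_parser().parse_args(
    ["--load-dir", "checkpoints/my_model/dcp_checkpoint"]
)
print(resolve_save_dir(example_args))
```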
Examples#
Convert DCP Checkpoint to HuggingFace Format#
python scripts/merge_dcp_to_hf.py \
--load-dir checkpoints/my_model/dcp_checkpoint
Output will be saved to: checkpoints/my_model/dcp_checkpoint/hf_ckpt
Convert with Custom Output Directory#
python scripts/merge_dcp_to_hf.py \
--load-dir checkpoints/my_model/dcp_checkpoint \
--save-dir hf_models/my_model
Include Model Assets (Config & Tokenizer)#
python scripts/merge_dcp_to_hf.py \
--load-dir checkpoints/my_model/dcp_checkpoint \
--save-dir hf_models/my_model \
--model-assets-dir pretrained_models/qwen3-8b
This will copy the model configuration and tokenizer from pretrained_models/qwen3-8b to the output directory.
Customize Shard Size#
python scripts/merge_dcp_to_hf.py \
--load-dir checkpoints/my_model/dcp_checkpoint \
--shard-size 5000000000
This sets the maximum shard size to 5GB instead of the default 2GB.
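As a rough guide for picking a shard size, the sketch below estimates a lower bound on the number of output files. It assumes weights are saved as bfloat16 (2 bytes per parameter) and that a tensor is never split across shards, so the real count may be higher. estimate_min_shards is a hypothetical helper, not part of the script.

```python
def estimate_min_shards(
    num_params: int,
    shard_size: int = 2_000_000_000,
    bytes_per_param: int = 2,  # bfloat16 stores each value in 2 bytes
) -> int:
    """Lower bound on the number of shards for a model of num_params parameters."""
    total_bytes = num_params * bytes_per_param
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-total_bytes // shard_size))


# An 8-billion-parameter model is about 16 GB in bfloat16.
print(estimate_min_shards(8_000_000_000))                            # 8
print(estimate_min_shards(8_000_000_000, shard_size=5_000_000_000))  # 4
```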
Output Format#
The script generates a HuggingFace-compatible checkpoint with the following structure:
output_directory/
├── model.safetensors # Single file (if total size < shard_size)
├── config.json # Model config (if --model-assets-dir provided)
└── tokenizer.json # Tokenizer (if --model-assets-dir provided)
Or for sharded checkpoints:
output_directory/
├── model-00001-of-00005.safetensors
├── model-00002-of-00005.safetensors
├── model-00003-of-00005.safetensors
├── model-00004-of-00005.safetensors
├── model-00005-of-00005.safetensors
├── model.safetensors.index.json # Weight mapping index
├── config.json # Model config (if --model-assets-dir provided)
└── tokenizer.json # Tokenizer (if --model-assets-dir provided)
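To check a sharded output, you can read the index file and count how many tensors landed in each shard. This sketch assumes the index follows the usual HuggingFace layout, where a "weight_map" object maps each tensor name to a shard filename. tensors_per_shard is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def tensors_per_shard(checkpoint_dir: str) -> Counter:
    """Count tensors assigned to each shard file according to the index."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    with index_path.open() as f:
        index = json.load(f)
    # Assumed layout: {"metadata": {...}, "weight_map": {tensor_name: shard_file}}
    return Counter(index["weight_map"].values())
```

For example, `tensors_per_shard("hf_models/my_model")` would return a Counter keyed by names such as model-00001-of-00005.safetensors.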
Key Conversion Details#
Weight Name Mapping#
The script automatically converts DCP key names to HuggingFace format:
model.model.* → model.* (removes the first “model.” prefix)
model.lm_head.weight → lm_head.weight
Other model.* keys → “model.” prefix stripped, with a warning
Non-model weights (keys not starting with model.) are filtered out.
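The mapping rules above can be sketched as a single function. This illustrates the documented behavior and is not the script's actual code: convert_key is a hypothetical name, and the warning text is made up.

```python
import warnings
from typing import Optional

PREFIX = "model."


def convert_key(dcp_key: str) -> Optional[str]:
    """Map a DCP key to its HuggingFace name, or None if it is not a model weight."""
    if not dcp_key.startswith(PREFIX):
        # Non-model entries (for example optimizer state) are filtered out.
        return None
    stripped = dcp_key[len(PREFIX):]
    if stripped.startswith(PREFIX) or stripped == "lm_head.weight":
        # model.model.* -> model.*  and  model.lm_head.weight -> lm_head.weight
        return stripped
    # Any other model.* key: strip the prefix, but flag it for review.
    warnings.warn(f"Unexpected key layout, stripping prefix: {dcp_key}")
    return stripped


print(convert_key("model.model.layers.0.mlp.up_proj.weight"))  # model.layers.0.mlp.up_proj.weight
print(convert_key("model.lm_head.weight"))                     # lm_head.weight
print(convert_key("optimizer.state.step"))                     # None
```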
Data Type Conversion#
By default, all weights are converted to bfloat16 format. This can be customized in the code by modifying the save_dtype parameter in save_model_weights().
Memory Efficiency#
The script uses a shard-by-shard processing approach:
Analyzes checkpoint metadata to plan sharding
Loads only one shard’s worth of weights at a time
Converts and saves the shard
Frees memory before processing the next shard
This approach allows conversion of very large models without requiring all weights to fit in memory simultaneously.
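The planning step can be pictured as a greedy packing pass over tensor sizes taken from the checkpoint metadata. The sketch below is one plausible way to do it, assuming tensors are kept whole and assigned in order. It is not confirmed to be the script's exact algorithm, and plan_shards is a hypothetical name.

```python
from typing import Dict, List


def plan_shards(tensor_sizes: Dict[str, int], shard_size: int) -> List[List[str]]:
    """Greedily group tensor names so each group stays within shard_size bytes.

    A tensor larger than shard_size still gets a shard of its own.
    """
    shards: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in tensor_sizes.items():
        # Close the current shard if adding this tensor would overflow it.
        if current and current_bytes + nbytes > shard_size:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        shards.append(current)
    return shards


sizes = {"a": 6, "b": 5, "c": 4, "d": 12}
print(plan_shards(sizes, shard_size=10))  # [['a'], ['b', 'c'], ['d']]
```

With a plan like this, only the tensors of one group need to be loaded, converted, and written before their memory is released, so peak usage is bounded by roughly one shard rather than the whole model.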
Troubleshooting#
No Model Weights Found#
If you see the warning “No model weights found!”, check:
The checkpoint path is correct
The checkpoint contains keys starting with model.
The checkpoint was saved in DCP format
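A quick way to test the first and last points is to look at the directory layout. The sketch below assumes the checkpoint was written by PyTorch's default filesystem writer, which typically produces a .metadata file alongside one or more *.distcp data files. looks_like_dcp_checkpoint is a hypothetical helper and does not inspect the keys themselves.

```python
from pathlib import Path


def looks_like_dcp_checkpoint(checkpoint_dir: str) -> bool:
    """Heuristic check for a DCP directory: .metadata plus at least one .distcp file."""
    path = Path(checkpoint_dir)
    has_metadata = (path / ".metadata").is_file()
    has_data = any(path.glob("*.distcp"))
    return has_metadata and has_data
```

If this returns False for the path passed to --load-dir, the path may point at a parent directory or at a checkpoint saved in another format.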
Out of Memory#
If you encounter OOM errors:
Reduce the --shard-size to a smaller value
Ensure no other processes are consuming GPU/CPU memory
Consider using a machine with more RAM
Missing Model Assets#
If model config or tokenizer is missing from the output:
Ensure --model-assets-dir points to a valid HuggingFace model directory
Check that the directory contains config.json and tokenizer files
Verify the model type is supported by HuggingFace’s AutoConfig and AutoProcessor
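The first two checks can be automated before running the conversion. This is a minimal sketch: missing_assets is a hypothetical helper, and the list of tokenizer filenames is an assumption covering common HuggingFace layouts, not an exhaustive list.

```python
from pathlib import Path
from typing import List

# Assumed set of common tokenizer filenames; a given model may use others.
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "tokenizer.model",
    "vocab.json",
)


def missing_assets(assets_dir: str) -> List[str]:
    """Return a list of asset kinds that appear to be missing from assets_dir."""
    path = Path(assets_dir)
    problems: List[str] = []
    if not (path / "config.json").is_file():
        problems.append("config.json")
    if not any((path / name).is_file() for name in TOKENIZER_FILES):
        problems.append("tokenizer files")
    return problems
```

An empty list from `missing_assets("pretrained_models/qwen3-8b")` suggests the directory is usable as --model-assets-dir. It does not confirm that the model type itself is supported.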
See Also#
Basic Modules - Understanding VeOmni’s checkpoint saving
Arguments - Checkpoint-related training arguments