Arguments API Reference#
Training arguments use nested dataclasses defined in veomni.arguments.arguments_types.
The root config VeOmniArguments assembles three top-level groups — model, data, and train —
each of which contains further nested sub-configs.
Example YAML structure:
    train:
      wandb:
        enable: true
        project: VeOmni
      accelerator:
        fsdp_config:
          fsdp_mode: fsdp1
          full_shard: true
      checkpoint:
        manager: dcp
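To make the nesting concrete, here is a minimal Python sketch of how a nested mapping like the YAML example could be turned into nested dataclasses. It is not VeOmni's code: the `*Sketch` classes and the `build` helper are hypothetical stand-ins that carry only the fields shown in the example, and their default values are placeholders, not VeOmni's defaults.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


# Hypothetical stand-ins. The real classes live in
# veomni.arguments.arguments_types and have many more fields.
@dataclass
class WandbSketch:
    enable: bool = False  # placeholder default
    project: str = ""     # placeholder default


@dataclass
class FSDPSketch:
    fsdp_mode: str = ""       # placeholder default
    full_shard: bool = False  # placeholder default


@dataclass
class AcceleratorSketch:
    fsdp_config: FSDPSketch = field(default_factory=FSDPSketch)


@dataclass
class CheckpointSketch:
    manager: str = ""  # placeholder default


@dataclass
class TrainSketch:
    wandb: WandbSketch = field(default_factory=WandbSketch)
    accelerator: AcceleratorSketch = field(default_factory=AcceleratorSketch)
    checkpoint: CheckpointSketch = field(default_factory=CheckpointSketch)


def build(cls, values: Dict[str, Any]):
    """Recursively build a dataclass from a nested dict, rejecting unknown keys."""
    known = {f.name: f for f in fields(cls)}
    kwargs = {}
    for key, value in values.items():
        if key not in known:
            raise KeyError(f"{cls.__name__} has no field {key!r}")
        field_type = known[key].type
        if is_dataclass(field_type) and isinstance(value, dict):
            kwargs[key] = build(field_type, value)  # descend into the sub-config
        else:
            kwargs[key] = value
    return cls(**kwargs)


# The `train` group of the YAML example, written as a plain dict.
train = build(
    TrainSketch,
    {
        "wandb": {"enable": True, "project": "VeOmni"},
        "accelerator": {"fsdp_config": {"fsdp_mode": "fsdp1", "full_shard": True}},
        "checkpoint": {"manager": "dcp"},
    },
)
```

With this layout, `train.accelerator.fsdp_config.fsdp_mode` corresponds to the dotted path `train.accelerator.fsdp_config.fsdp_mode` used throughout this reference.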
Configuration#
Top-level configuration that assembles all argument groups.
VeOmniArguments: root config (model + data + train)
VeOmniVLMArguments: VLM extension of VeOmniArguments
Model#
Model architecture, paths, and multimodal encoder / decoder setup.
ModelArguments: model.*
OpsImplementationConfig: model.ops_implementation.*
VLM Extensions#
VLMMModelArguments: extends ModelArguments with encoder data-balancing options
Data#
Dataset paths, tokenization, and batching configuration.
DataArguments: data.*
DataloaderConfig: data.dataloader.*
VLM Extensions#
VLMMDataArguments: extends DataArguments with multimodal configs (mm_configs)
Training#
Training loop, optimizer, parallelism, checkpointing, profiling, and logging.
TrainingArguments: train.*
OptimizerConfig: train.optimizer.*
WandbConfig: train.wandb.*
ProfileConfig: train.profile.*
GradientCheckpointingConfig: train.gradient_checkpointing.*
AcceleratorConfig: train.accelerator.*
FSDPConfig: train.accelerator.fsdp_config.*
MixedPrecisionConfig: train.accelerator.fsdp_config.mixed_precision.*
OffloadConfig: train.accelerator.offload_config.*
CheckpointConfig: train.checkpoint.*
VLM Extensions#
VLMTrainingArguments: extends TrainingArguments with ViT / audio freeze and learning-rate options
DPO#
DPO-specific hyperparameters, accessed via dpo_config.*.
Root config: VeOmniDPOArguments (extends VeOmniArguments).
DPOConfig: dpo_config.*
Inference#
Standalone inference configuration.
InferArguments
Detailed Reference#
VeOmniArguments#
Root config — assembles model, data, and train.
| Field | Type | Default | Description |
|---|---|---|---|
| model |  | — | Model configuration |
| data |  | — | Data configuration |
| train |  | — | Training configuration |
ModelArguments#
model.* — Model architecture, paths, and multimodal encoder / decoder setup.
| Field | Type | Default | Description |
|---|---|---|---|
| config_path |  |  | Path to the model HuggingFace config (e.g. |
| model_path |  |  | Path to the pre-trained model weights. If unset, random init is used. |
| tokenizer_path |  |  | Path to the tokenizer. Defaults to |
| safetensor_idx_path |  |  | Path to |
| foundation |  |  | Foundation model extra config. |
| encoders |  |  | Multimodal encoder configs keyed by modality ( |
| decoders |  |  | Multimodal decoder configs keyed by modality ( |
| input_encoder |  |  | Whether to use the encoder or decoder to encode input images. |
| output_encoder |  |  | Whether to use the encoder or decoder to encode output images. |
| encode_target |  |  | Whether to encode training targets with decoder (diffusion only). |
| basic_modules |  |  | Additional modules beyond |
| ops_implementation |  | — | Attention / MoE kernel configuration. |
OpsImplementationConfig#
model.ops_implementation.* — Attention and MoE kernel implementation.
| Field | Type | Default | Description |
|---|---|---|---|
| attn_implementation |  |  | Attention implementation to use. |
| moe_implementation |  |  | MoE implementation: |
DataArguments#
data.* — Dataset paths, tokenization, and batching.
| Field | Type | Default | Description |
|---|---|---|---|
| train_path |  | Required | Path of the training dataset. Use comma to separate multiple datasets. |
| eval_path |  |  | Path of the evaluation dataset. |
| train_size |  |  | Number of tokens for training (used to compute steps under dynamic batch). |
| train_sample |  |  | Number of samples for training (used to compute steps under non-dynamic batch). |
| data_type |  |  | Type of the training data. |
| datasets_type |  |  |  |
| multisource_datasets_type |  |  | Dataset type for multisource training. |
| source_name |  |  | Dataset name. Loaded from multisource YAML if multisource is enabled. |
| dyn_bsz_buffer_size |  |  | Buffer size for dynamic batch size. |
| text_keys |  |  | Key to retrieve text from data. Auto-resolved: |
| chat_template |  |  | Chat template name. |
| max_seq_len |  |  | Maximum sequence length. |
| silent_exception |  |  | Whether to ignore exceptions when loading data. |
| dataloader |  | — | DataLoader construction parameters. |
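Since `train_path` accepts several datasets separated by commas, a loader has to split that string before opening anything. The sketch below shows one plausible way to do it. `split_train_paths` is a hypothetical helper, and trimming whitespace and dropping empty entries are assumptions of this example, not documented VeOmni behavior.

```python
from typing import List


def split_train_paths(train_path: str) -> List[str]:
    """Split a comma-separated `train_path` value into individual dataset paths.

    Assumption of this sketch: surrounding whitespace is trimmed and empty
    entries (for example from a trailing comma) are ignored.
    """
    candidates = [part.strip() for part in train_path.split(",")]
    return [part for part in candidates if part]


paths = split_train_paths("data/a.jsonl, data/b.jsonl")
```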
DataloaderConfig#
data.dataloader.* — DataLoader construction parameters.
| Field | Type | Default | Description |
|---|---|---|---|
| type |  |  | Type of the dataloader. |
| num_workers |  |  | Number of workers for data loading. |
| prefetch_factor |  |  | Number of batches loaded in advance per worker. |
| drop_last |  |  | Whether to drop the last incomplete batch. |
| pin_memory |  |  | Whether to pin memory for the dataloader. |
TrainingArguments#
train.* — Top-level training configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| train_architecture |  |  | Parameter update strategy: |
| dyn_bsz |  |  | Enable dynamic batch size for padding-free training. |
| micro_batch_size |  |  | Number of samples per iteration on each device. |
| global_batch_size |  |  | Global batch size. If |
| num_train_epochs |  |  | Number of training epochs. |
| pad_to_length |  |  | Pad packed sequences to a fixed length (requires |
| bsz_warmup_ratio |  |  | Ratio of batch size warmup steps. |
| bsz_warmup_init_mbtoken |  |  | Initial number of tokens in a batch during warmup. |
| init_device |  |  | Device for model weight initialization. Use |
| broadcast_model_weights_from_rank0 |  |  | Only rank 0 reads weights from disk; other ranks receive via broadcast. |
| enable_full_determinism |  |  | Enable full determinism (bitwise alignment). |
| enable_batch_invariant_mode |  |  | Enable batch invariant mode. |
| empty_cache_steps |  |  | Steps between two |
| gc_steps |  |  | Steps between two |
| eval_steps |  |  | Steps between evaluations. |
| eval_epochs |  |  | Epochs between evaluations. |
| seed |  |  | Random seed. |
| enable_compile |  |  | Enable |
| max_steps |  |  | Max training steps per epoch (debug only). |
| optimizer |  | — | Optimizer and learning-rate schedule. |
| wandb |  | — | Weights & Biases logging. |
| profile |  | — | Torch profiler settings. |
| gradient_checkpointing |  | — | Gradient checkpointing settings. |
| accelerator |  | — | Parallelism and distributed-training topology. |
| checkpoint |  | — | Checkpoint saving and loading. |
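The relationship between `micro_batch_size` and `global_batch_size` is easiest to see as arithmetic. The sketch below uses the convention common to most data-parallel trainers (global batch = micro batch × data-parallel ranks × gradient-accumulation steps). Whether VeOmni applies exactly this rule is an assumption here, and `dp_size`, `grad_accum_steps`, and `steps_per_epoch` are hypothetical names introduced for illustration.

```python
import math


def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp_size: int) -> int:
    """Gradient-accumulation steps needed to reach the global batch size.

    Assumes global_batch_size = micro_batch_size * dp_size * accumulation steps.
    """
    samples_per_micro_step = micro_batch_size * dp_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of micro_batch_size * dp_size"
        )
    return global_batch_size // samples_per_micro_step


def steps_per_epoch(train_sample: int, global_batch_size: int) -> int:
    """Optimizer steps per epoch for fixed-size batches (last partial batch kept)."""
    return math.ceil(train_sample / global_batch_size)


accum = grad_accum_steps(global_batch_size=64, micro_batch_size=2, dp_size=8)
steps = steps_per_epoch(train_sample=1000, global_batch_size=64)
```

Here 2 samples on each of 8 ranks give 16 samples per micro step, so 4 accumulation steps reach a global batch of 64, and 1000 samples take 16 optimizer steps per epoch.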
OptimizerConfig#
train.optimizer.* — Optimizer and learning-rate schedule.
| Field | Type | Default | Description |
|---|---|---|---|
| type |  |  | Optimizer type. |
| lr |  |  | Maximum / default learning rate. |
| lr_min |  |  | Minimum learning rate. |
| lr_start |  |  | Starting learning rate for warmup. |
| lr_warmup_ratio |  |  | Ratio of learning rate warmup steps. |
| lr_decay_style |  |  | Learning rate scheduler ( |
| lr_decay_ratio |  |  | Ratio of learning rate decay steps. |
| weight_decay |  |  | L2 regularization strength. |
| no_decay_modules |  |  | Modules excluded from weight decay (e.g. |
| no_decay_params |  |  | Parameters excluded from weight decay (e.g. |
| max_grad_norm |  |  | Gradient clipping norm. |
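The learning-rate fields combine into a warmup-then-decay schedule. The following is a sketch under stated assumptions, not VeOmni's scheduler: the style names `"constant"`, `"linear"`, and `"cosine"` are illustrative guesses, `lr_warmup_ratio` and `lr_decay_ratio` are read as fractions of the total step count, and decay is assumed to finish at `total_steps * lr_decay_ratio`.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    lr: float,
    lr_min: float = 0.0,
    lr_start: float = 0.0,
    lr_warmup_ratio: float = 0.0,
    lr_decay_ratio: float = 1.0,
    lr_decay_style: str = "cosine",  # assumed style name, see lead-in
) -> float:
    """Learning rate at `step` for a linear-warmup, then-decay schedule."""
    warmup_steps = int(total_steps * lr_warmup_ratio)
    decay_end = int(total_steps * lr_decay_ratio)

    # Linear warmup from lr_start to lr.
    if step < warmup_steps:
        return lr_start + (lr - lr_start) * step / warmup_steps

    if lr_decay_style == "constant":
        return lr

    # Fraction of the decay phase completed, clamped to [0, 1].
    span = max(decay_end - warmup_steps, 1)
    progress = min((step - warmup_steps) / span, 1.0)

    if lr_decay_style == "linear":
        return lr - (lr - lr_min) * progress
    if lr_decay_style == "cosine":
        return lr_min + 0.5 * (lr - lr_min) * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown lr_decay_style: {lr_decay_style!r}")


# Peak learning rate is reached right after a 10% warmup.
peak = lr_at(10, 100, lr=1e-3, lr_min=1e-5, lr_warmup_ratio=0.1)
```

Under these assumptions the rate starts at `lr_start`, reaches `lr` when warmup ends, and settles at `lr_min` once the decay phase is over.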
WandbConfig#
train.wandb.* — Weights & Biases logging.
| Field | Type | Default | Description |
|---|---|---|---|
| enable |  |  | Enable W&B logging. |
| project |  |  | W&B project name. |
| name |  |  | W&B experiment name. |
| id |  |  | W&B run ID for resuming a previous run. |
ProfileConfig#
train.profile.* — Torch profiler settings.
| Field | Type | Default | Description |
|---|---|---|---|
| enable |  |  | Enable profiling. |
| start_step |  |  | Start step for profiling. |
| end_step |  |  | End step for profiling. |
| trace_dir |  |  | Directory to save profiling traces. |
| record_shapes |  |  | Record input tensor shapes. |
| profile_memory |  |  | Record memory usage. |
| with_stack |  |  | Record stack traces. |
| rank0_only |  |  | Profile rank 0 only. |
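The `start_step`, `end_step`, and `rank0_only` fields together define which steps on which ranks get profiled. A minimal sketch of that gating logic follows. `should_profile` is a hypothetical helper, and treating the window as half-open (`start_step` included, `end_step` excluded) is an assumption of this example.

```python
def should_profile(
    step: int,
    rank: int,
    enable: bool,
    start_step: int,
    end_step: int,
    rank0_only: bool,
) -> bool:
    """Return True if this rank should record a profiler trace at `step`."""
    if not enable:
        return False
    if rank0_only and rank != 0:
        return False
    # Assumed half-open window: [start_step, end_step).
    return start_step <= step < end_step


active = [s for s in range(6) if should_profile(s, 0, True, 2, 4, True)]
```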
GradientCheckpointingConfig#
train.gradient_checkpointing.* — Activation recomputation settings.
| Field | Type | Default | Description |
|---|---|---|---|
| enable |  |  | Enable gradient checkpointing. |
| debug |  |  | Enable checkpoint debugging. |
| enable_reentrant |  |  | Use reentrant gradient checkpointing. |
AcceleratorConfig#
train.accelerator.* — Parallelism and distributed-training topology.
| Field | Type | Default | Description |
|---|---|---|---|
| dp_replicate_size |  |  | Data parallel replicate size. |
| dp_shard_size |  |  | Data parallel shard degree. |
| tp_size |  |  | Tensor parallel size. |
| ep_size |  |  | Expert parallel size. |
| ep_outside |  |  | Expert parallelism outside in EP-FSDP. |
| pp_size |  |  | Pipeline parallel size. |
| ulysses_size |  |  | Ulysses sequence parallel size. |
| enable_async |  |  | Enable async Ulysses. |
| cp_size |  |  | Ring-attention context parallel size. |
| fsdp_config |  | — | FSDP sharding configuration. |
| offload_config |  | — | Activation offload settings. |
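Parallel sizes have to multiply out to the number of launched processes, so a quick sanity check can catch a mismatched topology before a job starts. The sketch below assumes the world size is the product of the data-parallel (replicate × shard), tensor, pipeline, Ulysses, and context-parallel sizes, and that expert parallelism is carved out of the data-parallel ranks and so adds no factor of its own. Both points, along with the defaults of 1 and the `check_world_size` helper itself, are assumptions of this example, not documented VeOmni rules.

```python
from math import prod


def check_world_size(
    world_size: int,
    *,
    dp_replicate_size: int = 1,  # defaults of 1 are placeholders for this sketch
    dp_shard_size: int = 1,
    tp_size: int = 1,
    pp_size: int = 1,
    ulysses_size: int = 1,
    cp_size: int = 1,
) -> int:
    """Validate the parallel topology and return the total data-parallel size."""
    sizes = {
        "dp_replicate_size": dp_replicate_size,
        "dp_shard_size": dp_shard_size,
        "tp_size": tp_size,
        "pp_size": pp_size,
        "ulysses_size": ulysses_size,
        "cp_size": cp_size,
    }
    for name, size in sizes.items():
        if size < 1:
            raise ValueError(f"{name} must be >= 1, got {size}")
    required = prod(sizes.values())
    if required != world_size:
        raise ValueError(
            f"parallel sizes multiply to {required}, but world size is {world_size}"
        )
    return dp_replicate_size * dp_shard_size


dp_size = check_world_size(16, dp_shard_size=4, tp_size=2, ulysses_size=2)
```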
FSDPConfig#
train.accelerator.fsdp_config.* — FSDP sharding configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| fsdp_mode |  |  | Data parallel mode. |
| reshard_after_forward |  |  | Reshard after forward (FSDP2). |
| reshard_after_backward |  |  | Reshard after backward (FSDP2). |
| full_shard |  |  | Enable full sharding (equivalent to ZeRO-3). |
| forward_prefetch |  |  | Enable forward prefetch (FSDP1). |
| offload |  |  | Enable CPU offload (FSDP1 only). |
| mixed_precision |  | — | Mixed precision configuration. |
MixedPrecisionConfig#
train.accelerator.fsdp_config.mixed_precision.* — Mixed precision configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| enable |  |  | Enable mixed precision training. |
| param_dtype |  |  | Dtype for the unsharded parameter (DDP, FSDP1). |
| reduce_dtype |  |  | Dtype for gradient reduction (i.e. reduce-scatter or all-reduce) (DDP, FSDP1). |
| buffer_dtype |  |  | Dtype for the buffer (DDP, FSDP1). |
| output_dtype |  |  | Dtype for casting floating-point forward outputs (DDP, FSDP1). |
| cast_forward_inputs |  |  | Enable mixed precision cast forward inputs (FSDP2). |
OffloadConfig#
train.accelerator.offload_config.* — Activation offload settings.
| Field | Type | Default | Description |
|---|---|---|---|
| enable_activation |  |  | Enable activation offload to CPU. |
| activation_gpu_limit |  |  | GB of activations allowed to remain on GPU. |
CheckpointConfig#
train.checkpoint.* — Checkpoint saving and loading.
| Field | Type | Default | Description |
|---|---|---|---|
| output_dir |  |  | Path to save model checkpoints. |
| manager |  |  | Checkpoint manager. |
| save_async |  |  | Save checkpoints asynchronously. |
| load_path |  |  | Path to checkpoint for resuming training. Use |
| save_steps |  |  | Steps between checkpoint saves. |
| save_epochs |  |  | Epochs between checkpoint saves. |
| hf_save_steps |  |  | Steps between HuggingFace weight saves. |
| hf_save_epochs |  |  | Epochs between HuggingFace weight saves. |
| save_hf_weights |  |  | Save HuggingFace-format weights to the last checkpoint directory. |
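Interval fields such as `save_steps` and `save_epochs` are typically interpreted as "save every N". The sketch below illustrates that reading. `is_save_point` is a hypothetical helper, and treating an interval of 0 as "disabled" is an assumption of this example, not documented behavior.

```python
def is_save_point(counter: int, interval: int) -> bool:
    """Return True when `counter` (a 1-based step or epoch count) hits the interval.

    Assumption of this sketch: an interval of 0 or less disables saving.
    """
    return interval > 0 and counter > 0 and counter % interval == 0


# With save_steps = 500, checkpoints land on steps 500, 1000, 1500, ...
save_steps = 500
saved_at = [step for step in range(1, 1501) if is_save_point(step, save_steps)]
```

The same check applies to `save_epochs`, `hf_save_steps`, and `hf_save_epochs` if they follow the same convention.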
InferArguments#
Standalone inference configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| model_path |  | Required | Path to the pre-trained model. |
| tokenizer_path |  |  | Path to the tokenizer. Defaults to |
| seed |  |  | Random seed. |
| do_sample |  |  | Enable sampling in decoding. |
| temperature |  |  | Sampling temperature. |
| top_p |  |  | Nucleus sampling top-p value. |
| max_tokens |  |  | Maximum tokens to generate. |
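For readers unfamiliar with how `do_sample`, `temperature`, and `top_p` interact, here is a dependency-free sketch of temperature scaling followed by nucleus (top-p) sampling over a single logit vector. It illustrates the general technique only. `sample_token` is a hypothetical function, and the actual decoding is presumably delegated to the underlying generation library.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(
    logits: Sequence[float],
    do_sample: bool = True,
    temperature: float = 1.0,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token id from `logits`."""
    # Greedy decoding when sampling is disabled.
    if not do_sample or temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()

    # Temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break

    # Sample from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


greedy = sample_token([1.0, 3.0, 2.0], do_sample=False)
```

Lower `temperature` sharpens the distribution and lower `top_p` shrinks the candidate set, so both push sampling toward the greedy choice.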
VLM Extensions#
Additional fields for Vision-Language Model training, defined in veomni.trainer.vlm_trainer.
VLMTrainingArguments#
Extends TrainingArguments with ViT / audio tower controls.
| Field | Type | Default | Description |
|---|---|---|---|
| freeze_vit |  |  | Freeze ViT parameters. |
| freeze_audio_tower |  |  | Freeze audio tower parameters. |
| vit_lr |  |  | Maximum learning rate for ViT parameters. |
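Freeze flags and a separate ViT learning rate usually translate into optimizer parameter groups. The sketch below works on parameter names only, so it runs without a deep-learning framework. The `visual.` and `audio_tower.` name prefixes and the `plan_param_groups` helper are hypothetical, as real prefixes depend on the model implementation.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def plan_param_groups(
    param_names: Sequence[str],
    lr: float,
    vit_lr: Optional[float] = None,
    freeze_vit: bool = False,
    freeze_audio_tower: bool = False,
    vit_prefix: str = "visual.",         # hypothetical prefix
    audio_prefix: str = "audio_tower.",  # hypothetical prefix
) -> Tuple[List[Dict], List[str]]:
    """Split parameter names into optimizer groups and a frozen list."""
    frozen: List[str] = []
    vit: List[str] = []
    rest: List[str] = []
    for name in param_names:
        if name.startswith(vit_prefix):
            (frozen if freeze_vit else vit).append(name)
        elif name.startswith(audio_prefix) and freeze_audio_tower:
            frozen.append(name)
        else:
            rest.append(name)

    groups = [{"params": rest, "lr": lr}]
    if vit:
        # ViT parameters fall back to the main lr when vit_lr is unset.
        groups.append({"params": vit, "lr": lr if vit_lr is None else vit_lr})
    return groups, frozen


names = ["visual.patch_embed.weight", "audio_tower.conv.weight", "lm_head.weight"]
groups, frozen = plan_param_groups(names, lr=1e-5, vit_lr=1e-6, freeze_audio_tower=True)
```

In a real trainer the frozen names would have `requires_grad` turned off, and the groups would be passed to the optimizer constructor.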
VLMMModelArguments#
Extends ModelArguments with encoder data-balancing options.
| Field | Type | Default | Description |
|---|---|---|---|
| encoder_data_balance |  |  | Enable encoder data balancing (e.g. for Qwen3-VL). |
| encoder_data_balance_sorting_algo |  |  | Sorting algorithm for encoder data balancing. |
VLMMDataArguments#
Extends DataArguments with multimodal input configs.
| Field | Type | Default | Description |
|---|---|---|---|
| mm_configs |  |  | Multimodal input configuration. |
DPO Reference#
DPOConfig#
dpo_config.* — Direct Preference Optimization hyperparameters.
| Field | Type | Default | Description |
|---|---|---|---|
| beta |  |  | KL penalty coefficient. Controls deviation from the reference model. |
| label_smoothing |  |  | Label smoothing for DPO loss. Non-zero values assume noisy preference labels. |
| reference_free |  |  | If |
| loss_type |  |  | DPO loss variant: |
| average_log_prob |  |  | If |
| refer_model_precision |  |  | dtype used to load the frozen reference model. |
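To show how `beta`, `label_smoothing`, and `reference_free` enter the objective, here is a pure-Python sketch of the standard sigmoid DPO loss for a single preference pair, with the conservative label-smoothing variant. It follows the published formulation, not VeOmni's code. The function names are hypothetical, and reading `reference_free` as "drop the reference log-ratio" is an assumption based on common DPO implementations.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    *,
    beta: float,
    label_smoothing: float = 0.0,
    reference_free: bool = False,
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    # Assumed meaning of reference_free: ignore the reference model entirely.
    ref_logratio = 0.0 if reference_free else ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_logratio - ref_logratio)
    # label_smoothing is the assumed probability that a preference label is flipped.
    return (
        -(1.0 - label_smoothing) * log_sigmoid(logits)
        - label_smoothing * log_sigmoid(-logits)
    )


# A policy identical to the reference gives logits of 0, so the loss is log(2).
baseline = dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0, beta=0.1)
```

A larger `beta` scales the log-ratio gap more aggressively, which penalizes deviation from the reference model more strongly for the same preference margin.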