Arguments API Reference#
Training arguments use nested dataclasses defined in veomni.arguments.arguments_types.
The root config VeOmniArguments assembles three top-level groups — model, data, and train —
each of which contains further nested sub-configs.
Example YAML structure:

    train:
      wandb:
        enable: true
        project: VeOmni
      accelerator:
        fsdp_config:
          fsdp_mode: fsdp2
      init_device: meta
      checkpoint:
        manager: dcp
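
To make the nesting concrete, the sketch below shows one way a nested mapping like the example above could be turned into nested dataclasses. It is an illustration only, not VeOmni's parser: the three classes are hypothetical, trimmed stand-ins that borrow field names from this page, their defaults are made up, and `from_dict` is a helper invented for this sketch.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class WandbConfig:  # hypothetical stand-in; defaults are illustrative
    enable: bool = False
    project: str = "demo"


@dataclass
class CheckpointConfig:  # hypothetical stand-in
    manager: str = "dcp"


@dataclass
class TrainingArguments:  # hypothetical stand-in with three fields only
    init_device: str = "meta"
    wandb: WandbConfig = field(default_factory=WandbConfig)
    checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)


def from_dict(cls, data):
    """Recursively build a dataclass from a nested dict (illustrative helper)."""
    kwargs = {}
    for f in fields(cls):
        if f.name not in data:
            continue  # key absent: keep the dataclass default
        value = data[f.name]
        # A nested sub-config is a field whose type is itself a dataclass.
        if is_dataclass(f.type) and isinstance(value, dict):
            value = from_dict(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)


# The dict mirrors part of the YAML example (as if already loaded from the file).
train = from_dict(
    TrainingArguments,
    {"wandb": {"enable": True, "project": "VeOmni"}, "checkpoint": {"manager": "dcp"}},
)
print(train.wandb.project, train.checkpoint.manager, train.init_device)
```

Each level of YAML nesting corresponds to one dataclass, so a dotted path such as train.wandb.project reads the same way in the config file and in code.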
Configuration#
Top-level configuration that assembles all argument groups.
- VeOmniArguments: root config (model + data + train)
- VeOmniVLMArguments: VLM extension of VeOmniArguments
Model#
Model architecture, paths, and multimodal encoder / decoder setup.
- ModelArguments: model.*
- OpsImplementationConfig: model.ops_implementation.*
VLM Extensions#
- VLMMModelArguments: extends ModelArguments with encoder data-balancing options
Data#
Dataset paths, tokenization, and batching configuration.
- DataArguments: data.*
- DataloaderConfig: data.dataloader.*
VLM Extensions#
- VLMMDataArguments: extends DataArguments with multimodal configs (mm_configs)
Training#
Training loop, optimizer, parallelism, checkpointing, profiling, and logging.
- TrainingArguments: train.*
- OptimizerConfig: train.optimizer.*
- WandbConfig: train.wandb.*
- ProfileConfig: train.profile.*
- GradientCheckpointingConfig: train.gradient_checkpointing.*
- AcceleratorConfig: train.accelerator.*
- FSDPConfig: train.accelerator.fsdp_config.*
- MixedPrecisionConfig: train.accelerator.fsdp_config.mixed_precision.*
- OffloadConfig: train.accelerator.offload_config.*
- CheckpointConfig: train.checkpoint.*
VLM Extensions#
- VLMTrainingArguments: extends TrainingArguments with ViT / audio freeze & learning-rate options
DPO#
DPO-specific hyperparameters, accessed via dpo_config.*.
Root config: VeOmniDPOArguments (extends VeOmniArguments).
- DPOConfig: dpo_config.*
Inference#
Standalone inference configuration.
InferArguments
Detailed Reference#
VeOmniArguments#
Root config — assembles model, data, and train.
| Field | Type | Default | Description |
|---|---|---|---|
| model | | — | Model configuration |
| data | | — | Data configuration |
| train | | — | Training configuration |
ModelArguments#
model.* — Model architecture, paths, and multimodal encoder / decoder setup.
| Field | Type | Default | Description |
|---|---|---|---|
| config_path | | | Path to the model HuggingFace config (e.g. |
| model_path | | | Path to the pre-trained model weights. If unset, random init is used. |
| tokenizer_path | | | Path to the tokenizer. Defaults to |
| safetensor_idx_path | | | Path to |
| foundation | | | Foundation model extra config. |
| encoders | | | Multimodal encoder configs keyed by modality ( |
| decoders | | | Multimodal decoder configs keyed by modality ( |
| input_encoder | | | Whether to use the encoder or decoder to encode input images. |
| output_encoder | | | Whether to use the encoder or decoder to encode output images. |
| encode_target | | | Whether to encode training targets with decoder (diffusion only). |
| basic_modules | | | Additional modules beyond |
| ops_implementation | | — | Attention / MoE kernel configuration. |
OpsImplementationConfig#
model.ops_implementation.* — Attention, MoE, and fused kernel implementation.
Each *_implementation field selects the kernel backend for that operation.
The type is str (not Literal) so third-party backends can be registered
without modifying the config class.
Defaults are GPU-optimal (Liger / Triton / fused_triton). On Ascend NPU
these defaults raise; NPU users must set every field explicitly to an
NPU-supported value ("npu", "chunk_loss", "fused_npu", "triton" for
load-balancing loss via triton-ascend) or to "eager" when the op has no
NPU backend (e.g. swiglu_mlp_implementation, DeepSeek-V3 / Qwen2-VL
multimodal RoPE).
NPU validation runs at two points:
- Config-parse time (OpsImplementationConfig.__post_init__) for the six general-purpose ops (moe, cross_entropy_loss, rms_norm, swiglu_mlp, rotary_pos_emb, load_balancing_loss). Errors fire immediately with a model-agnostic allow-list.
- OpSlot-bind time (KERNEL_REGISTRY.resolve via the kernel's HardwareRequirement) for Qwen3.5-only ops (rms_norm_gated, causal_conv1d, chunk_gated_delta_rule). Validating these at config parse would force every NPU user to override them even when training non-Qwen3.5 models, so the check fires only when Qwen3.5's patched modeling is actually loaded. Qwen3.5 GatedDeltaNet has no NPU kernel today: varlen training (dyn_bsz=True, the default) is not supported on NPU, and non-varlen training works only with all three fields pinned to "eager".
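
The config-parse-time check described above can be sketched with a dataclass `__post_init__`. This is a rough illustration under stated assumptions, not VeOmni's code: `OpsConfigSketch`, the `on_npu` flag, the `"liger"` defaults, and the contents of `NPU_ALLOWED` are all hypothetical, and only two of the six ops are modeled. The second, bind-time check is not shown.

```python
from dataclasses import dataclass

# Hypothetical allow-list for this sketch only; the real per-op lists live in VeOmni.
NPU_ALLOWED = {
    "rms_norm_implementation": {"npu", "eager"},
    "swiglu_mlp_implementation": {"eager"},
}


@dataclass
class OpsConfigSketch:
    rms_norm_implementation: str = "liger"    # assumed GPU-oriented default
    swiglu_mlp_implementation: str = "liger"  # assumed GPU-oriented default
    on_npu: bool = False                      # stand-in for real device detection

    def __post_init__(self):
        # Runs as soon as the config object is built, before any model is loaded.
        if not self.on_npu:
            return
        for name, allowed in NPU_ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(
                    f"{name}={value!r} is not supported on NPU; "
                    f"choose one of {sorted(allowed)}"
                )


# GPU defaults pass; on NPU every field must be set to an allowed value.
OpsConfigSketch()
OpsConfigSketch(
    rms_norm_implementation="npu",
    swiglu_mlp_implementation="eager",
    on_npu=True,
)
```

Failing in `__post_init__` surfaces a bad backend choice at startup rather than partway through model construction, which is the benefit the first validation point aims for.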
| Field | Type | Default | Description |
|---|---|---|---|
| attn_implementation | | | Attention implementation to use. |
| moe_implementation | | | MoE experts forward implementation. |
| cross_entropy_loss_implementation | | | Cross-entropy loss. |
| rms_norm_implementation | | | RMSNorm. Known values: |
| swiglu_mlp_implementation | | | SwiGLU MLP. Known values: |
| rotary_pos_emb_implementation | | | Rotary pos emb. Known values: |
| load_balancing_loss_implementation | | | MoE load-balancing loss. |
| rms_norm_gated_implementation | | | Gated RMSNorm (Qwen3.5 GatedDeltaNet |
| causal_conv1d_implementation | | | Varlen depthwise causal conv1d (Qwen3.5 GatedDeltaNet pre-mixer). Known values: |
| chunk_gated_delta_rule_implementation | | | Chunk gated delta-rule kernel for Qwen3.5 linear attention. Known values: |
DataArguments#
data.* — Dataset paths, tokenization, and batching.
| Field | Type | Default | Description |
|---|---|---|---|
| train_path | | Required | Path of the training dataset. Use comma to separate multiple datasets. |
| eval_path | | | Path of the evaluation dataset. |
| train_size | | | Number of tokens for training (used to compute steps under dynamic batch). |
| train_sample | | | Number of samples for training (used to compute steps under non-dynamic batch). |
| data_type | | | Type of the training data. |
| datasets_type | | | |
| multisource_datasets_type | | | Dataset type for multisource training. |
| source_name | | | Dataset name. Loaded from multisource YAML if multisource is enabled. |
| dyn_bsz_buffer_size | | | Buffer size for dynamic batch size. |
| text_keys | | | Key to retrieve text from data. Auto-resolved: |
| chat_template | | | Chat template name. |
| max_seq_len | | | Maximum sequence length. |
| silent_exception | | | Whether to ignore exceptions when loading data. |
| dataloader | | — | DataLoader construction parameters. |
DataloaderConfig#
data.dataloader.* — DataLoader construction parameters.
| Field | Type | Default | Description |
|---|---|---|---|
| type | | | Type of the dataloader. |
| num_workers | | | Number of workers for data loading. |
| prefetch_factor | | | Number of batches loaded in advance per worker. |
| drop_last | | | Whether to drop the last incomplete batch. |
| pin_memory | | | Whether to pin memory for the dataloader. |
TrainingArguments#
train.* — Top-level training configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| dyn_bsz | bool | True | Enable dynamic batch size for padding-free training. |
| micro_batch_size | int | 1 | Number of samples per iteration on each device. |
| global_batch_size | Optional[int] | None | Global batch size. If None, uses micro_batch_size × dp_size. |
| num_train_epochs | int | 1 | Number of training epochs. |
| pad_to_length | bool | False | Pad packed sequences to a fixed length (requires dyn_bsz). |
| bsz_warmup_ratio | float | 0 | Ratio of batch size warmup steps. |
| bsz_warmup_init_mbtoken | int | 200 | Initial number of tokens in a batch during warmup. |
| init_device | Literal["cpu", "cuda", "meta", "npu"] | "meta" | Device for model weight initialization. "meta" is required for FSDP2. |
| broadcast_model_weights_from_rank0 | bool | True | Only rank 0 reads weights from disk; other ranks receive via broadcast. |
| enable_full_determinism | bool | False | Enable full determinism (bitwise alignment). |
| enable_batch_invariant_mode | bool | False | Enable batch invariant mode. |
| empty_cache_steps | int | 500 | Steps between two torch.cuda.empty_cache() calls. |
| gc_steps | int | 500 | Steps between two gc.collect() calls. Automatic garbage collection is disabled if positive. |
| eval_steps | int | 0 | Steps between evaluations. 0 to disable. |
| eval_epochs | int | 1 | Epochs between evaluations. 0 to disable. |
| seed | int | 42 | Random seed. |
| enable_compile | bool | False | Enable torch.compile. |
| max_steps | Optional[int] | None | Max training steps per epoch (debug only). |
| optimizer | OptimizerConfig | — | Optimizer and learning-rate schedule. |
| wandb | WandbConfig | — | Weights & Biases logging. |
| profile | ProfileConfig | — | Torch profiler settings. |
| gradient_checkpointing | GradientCheckpointingConfig | — | Gradient checkpointing settings. |
| accelerator | AcceleratorConfig | — | Parallelism and distributed-training topology. |
| checkpoint | CheckpointConfig | — | Checkpoint saving and loading. |
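
The global_batch_size fallback in the table above can be written out as a short sketch. The function names are hypothetical, `dp_size` is taken as a plain integer argument, and the gradient-accumulation step count is an assumption added for illustration; this page does not say how VeOmni derives it.

```python
from typing import Optional


def resolve_global_batch_size(
    micro_batch_size: int, dp_size: int, global_batch_size: Optional[int] = None
) -> int:
    """Apply the documented fallback: None means micro_batch_size * dp_size."""
    if global_batch_size is None:
        return micro_batch_size * dp_size
    return global_batch_size


def grad_accum_steps(
    micro_batch_size: int, dp_size: int, global_batch_size: Optional[int] = None
) -> int:
    """Assumed relation: micro-batches each rank processes per optimizer step."""
    effective = resolve_global_batch_size(micro_batch_size, dp_size, global_batch_size)
    per_pass = micro_batch_size * dp_size
    if effective % per_pass != 0:
        raise ValueError("global batch size must be a multiple of micro_batch_size * dp_size")
    return effective // per_pass


print(resolve_global_batch_size(1, 8))   # fallback: 1 * 8
print(grad_accum_steps(2, 8, 64))        # 64 / (2 * 8)
```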
OptimizerConfig#
train.optimizer.* — Optimizer and learning-rate schedule.
| Field | Type | Default | Description |
|---|---|---|---|
| type | | | Optimizer type. |
| lr | | | Maximum / default learning rate. |
| lr_min | | | Minimum learning rate. |
| lr_start | | | Starting learning rate for warmup. |
| lr_warmup_ratio | | | Ratio of learning rate warmup steps. |
| lr_decay_style | | | Learning rate scheduler ( |
| lr_decay_ratio | | | Ratio of learning rate decay steps. |
| weight_decay | | | L2 regularization strength. |
| no_decay_modules | | | Modules excluded from weight decay (e.g. |
| no_decay_params | | | Parameters excluded from weight decay (e.g. |
| max_grad_norm | | | Gradient clipping norm. |
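
To show how lr, lr_min, lr_start, and the two ratio fields could interact, here is a generic warmup-plus-cosine schedule. It is a sketch of a common scheme, not VeOmni's scheduler: the cosine shape and the reading of lr_decay_ratio as "the fraction of total steps by which decay finishes" are both assumptions, and `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(step, total_steps, lr, lr_min=0.0, lr_start=0.0,
               lr_warmup_ratio=0.0, lr_decay_ratio=1.0):
    """Linear warmup from lr_start to lr, then cosine decay down to lr_min."""
    warmup_steps = int(total_steps * lr_warmup_ratio)
    decay_end = int(total_steps * lr_decay_ratio)  # assumed meaning of lr_decay_ratio
    if step < warmup_steps:
        # Warmup phase: interpolate linearly between lr_start and lr.
        return lr_start + (lr - lr_start) * step / warmup_steps
    if step >= decay_end:
        # Past the decay window: hold at the floor.
        return lr_min
    progress = (step - warmup_steps) / max(1, decay_end - warmup_steps)
    return lr_min + 0.5 * (lr - lr_min) * (1.0 + math.cos(math.pi * progress))


# Sample a 100-step run with a 25-step warmup.
for s in (0, 12, 25, 60, 100):
    print(s, lr_at_step(s, 100, lr=1e-3, lr_min=1e-5, lr_warmup_ratio=0.25))
```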
WandbConfig#
train.wandb.* — Weights & Biases logging.
| Field | Type | Default | Description |
|---|---|---|---|
| enable | | | Enable W&B logging. |
| project | | | W&B project name. |
| name | | | W&B experiment name. |
| id | | | W&B run ID for resuming a previous run. |
ProfileConfig#
train.profile.* — Torch profiler settings.
| Field | Type | Default | Description |
|---|---|---|---|
| enable | | | Enable profiling. |
| start_step | | | Start step for profiling. |
| end_step | | | End step for profiling. |
| trace_dir | | | Directory to save profiling traces. |
| record_shapes | | | Record input tensor shapes. |
| profile_memory | | | Record memory usage. |
| with_stack | | | Record stack traces. |
| rank0_only | | | Profile rank 0 only. |
GradientCheckpointingConfig#
train.gradient_checkpointing.* — Activation recomputation settings.
| Field | Type | Default | Description |
|---|---|---|---|
| enable | | | Enable gradient checkpointing. |
| debug | | | Enable checkpoint debugging. |
| enable_reentrant | | | Use reentrant gradient checkpointing. |
AcceleratorConfig#
train.accelerator.* — Parallelism and distributed-training topology.
| Field | Type | Default | Description |
|---|---|---|---|
| dp_replicate_size | | | Data parallel replicate size. |
| dp_shard_size | | | Data parallel shard degree. |
| tp_size | | | Tensor parallel size. |
| ep_size | | | Expert parallel size. |
| ep_outside | | | Expert parallelism outside in EP-FSDP. |
| pp_size | | | Pipeline parallel size. |
| ulysses_size | | | Ulysses sequence parallel size. |
| enable_async | | | Enable async Ulysses. |
| cp_size | | | Ring-attention context parallel size. |
| fsdp_config | | — | FSDP sharding configuration. |
| offload_config | | — | Activation offload settings. |
FSDPConfig#
train.accelerator.fsdp_config.* — FSDP sharding configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| fsdp_mode | | | Data parallel mode. |
| reshard_after_forward | | | Reshard after forward (FSDP2). |
| reshard_after_backward | | | Reshard after backward (FSDP2). |
| forward_prefetch | | | Enable forward prefetch. |
| offload | | | Enable CPU offload. |
| max_load_broadcast_size | | | Maximum size (in GB) of parameters broadcast from rank 0 while loading weights (FSDP2). Parameters exceeding this threshold are chunked according to the parallel plan before broadcasting. |
| mixed_precision | | — | Mixed precision configuration. |
MixedPrecisionConfig#
train.accelerator.fsdp_config.mixed_precision.* — Mixed precision configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| enable | | | Enable mixed precision training. |
| param_dtype | | | Dtype for the unsharded parameter. |
| reduce_dtype | | | Dtype for gradient reduction (i.e. reduce-scatter or all-reduce). |
| output_dtype | | | Dtype for casting floating-point forward outputs (FSDP2). |
| cast_forward_inputs | | | Enable mixed precision cast forward inputs (FSDP2). |
OffloadConfig#
train.accelerator.offload_config.* — Activation offload settings.
| Field | Type | Default | Description |
|---|---|---|---|
| enable_activation | | | Enable activation offload to CPU. |
| activation_gpu_limit | | | GB of activations allowed to remain on GPU. |
CheckpointConfig#
train.checkpoint.* — Checkpoint saving and loading.
| Field | Type | Default | Description |
|---|---|---|---|
| output_dir | | | Path to save model checkpoints. |
| manager | | | Checkpoint manager. |
| save_async | | | Save checkpoints asynchronously. |
| load_path | | | Path to checkpoint for resuming training. Use |
| save_steps | | | Steps between checkpoint saves. |
| save_epochs | | | Epochs between checkpoint saves. |
| hf_save_steps | | | Steps between HuggingFace weight saves. |
| hf_save_epochs | | | Epochs between HuggingFace weight saves. |
| save_hf_weights | | | Save HuggingFace-format weights to the last checkpoint directory. |
InferArguments#
Standalone inference configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| model_path | | Required | Path to the pre-trained model. |
| tokenizer_path | | | Path to the tokenizer. Defaults to |
| seed | | | Random seed. |
| do_sample | | | Enable sampling in decoding. |
| temperature | | | Sampling temperature. |
| top_p | | | Nucleus sampling top-p value. |
| max_tokens | | | Maximum tokens to generate. |
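
For readers unfamiliar with how temperature and top_p shape decoding, the sketch below applies both to a small list of logits in plain Python. It illustrates the general technique (temperature scaling followed by nucleus filtering), not VeOmni's inference code; `top_p_filter` is a hypothetical helper, and temperature is assumed to be positive.

```python
import math


def top_p_filter(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} for the nucleus kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
nucleus = top_p_filter(logits, temperature=1.0, top_p=0.7)
print(sorted(nucleus))  # indices of the tokens that remain candidates
```

With do_sample disabled, decoding would normally just pick the highest-probability token, and neither setting would matter.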
VLM Extensions#
Additional fields for Vision-Language Model training, defined in veomni.trainer.vlm_trainer.
VLMTrainingArguments#
Extends TrainingArguments with ViT / audio tower controls.
| Field | Type | Default | Description |
|---|---|---|---|
| freeze_vit | | | Freeze ViT parameters. |
| freeze_audio_tower | | | Freeze audio tower parameters. |
| vit_lr | | | Maximum learning rate for ViT parameters. |
VLMMModelArguments#
Extends ModelArguments with encoder data-balancing options.
| Field | Type | Default | Description |
|---|---|---|---|
| encoder_data_balance | | | Enable encoder data balancing (e.g. for Qwen3-VL). |
| encoder_data_balance_sorting_algo | | | Sorting algorithm for encoder data balancing. |
VLMMDataArguments#
Extends DataArguments with multimodal input configs.
| Field | Type | Default | Description |
|---|---|---|---|
| mm_configs | | | Multimodal input configuration. |
DPO Reference#
DPOConfig#
dpo_config.* — Direct Preference Optimization hyperparameters.
| Field | Type | Default | Description |
|---|---|---|---|
| beta | | | KL penalty coefficient. Controls deviation from the reference model. |
| label_smoothing | | | Label smoothing for DPO loss. Non-zero values assume noisy preference labels. |
| reference_free | | | If |
| loss_type | | | DPO loss variant: |
| average_log_prob | | | If |
| refer_model_precision | | | dtype used to load the frozen reference model. |
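
The roles of beta, label_smoothing, and reference_free are easiest to see in the standard sigmoid DPO loss, sketched below for a single preference pair. This is the textbook formulation, not necessarily the variant VeOmni implements: `dpo_loss` is a hypothetical function, the default beta of 0.1 is an assumption (the default is not shown in the table above), and reference_free is assumed to mean that the reference log-ratio is treated as zero.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp=0.0, ref_rejected_logp=0.0,
             beta=0.1, label_smoothing=0.0, reference_free=False):
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    # Assumed meaning of reference_free: ignore the reference model entirely.
    ref_logratio = 0.0 if reference_free else ref_chosen_logp - ref_rejected_logp
    logits = policy_logratio - ref_logratio
    # label_smoothing mixes in the loss for the flipped preference label.
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * logits)
            - label_smoothing * log_sigmoid(-beta * logits))


# A policy that prefers the chosen answer more than the reference does gets a lower loss.
better = dpo_loss(-1.0, -5.0, ref_chosen_logp=-2.0, ref_rejected_logp=-2.0)
worse = dpo_loss(-5.0, -1.0, ref_chosen_logp=-2.0, ref_rejected_logp=-2.0)
print(better < worse)
```

A larger beta penalizes departures from the reference model's preferences more sharply, which matches the "KL penalty coefficient" description in the table.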