Wan2.1-T2V Training Guide#
This guide covers LoRA fine-tuning of Wan2.1-T2V-1.3B using VeOmni, including dataset preparation, multi-GPU training with Ulysses Sequence Parallelism (SP), and inference with trained adapters.
1. Environment Setup#
uv sync --extra gpu --dev
source .venv/bin/activate
For inference, install the video I/O backend:
pip install imageio imageio-ffmpeg
2. Download Model#
python3 scripts/download_hf_model.py \
--repo_id Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--local_dir ./Wan2.1-T2V-1.3B-Diffusers
3. Prepare Dataset#
VeOmni supports two training workflows:
Workflow |
|
Description |
|---|---|---|
Offline (recommended) |
|
Pre-embed videos once; re-use embeddings across epochs. Saves GPU memory during training. |
Online |
|
Embed videos on-the-fly each step. Requires the VAE + text encoder to stay on GPU throughout training. |
3.1 Download the Tom and Jerry dataset#
This guide uses Wild-Heart/Tom-and-Jerry-VideoGeneration-Dataset (~6 000 video clips, 540×360, 6 s each).
python3 scripts/download_hf_model.py \
--repo_id Wild-Heart/Tom-and-Jerry-VideoGeneration-Dataset \
--repo_type dataset \
--local_dir ./Tom-and-Jerry-VideoGeneration-Dataset
The downloaded directory has the following structure:
Tom-and-Jerry-VideoGeneration-Dataset/
├── captions.txt # one caption per line
├── videos.txt # one relative video path per line (mirrors captions.txt)
└── videos/ # video files
3.2 Convert to VeOmni Parquet format#
The conversion script reads captions.txt and videos.txt, loads each video as raw bytes, and writes sharded Parquet files (0.parquet, 1.parquet, …) with columns prompt, video_bytes, and source.
python3 scripts/multimodal/convert_data/tom-and-jerry.py \
--dataset_path ./Tom-and-Jerry-VideoGeneration-Dataset \
--output_dir ./Tom-and-Jerry-VideoGeneration-Dataset-parquet
3.3 Offline Workflow (recommended)#
Step 1 – Run offline embedding (once)#
This step encodes every video with the VAE and every caption with the T5 text encoder, saving the embeddings as Parquet shards. It only needs to run once per dataset.
NPROC_PER_NODE=4 bash train.sh tasks/train_dit.py configs/dit/wan2.1_I2V_1.3B_lora.yaml \
--model.model_path ./Wan2.1-T2V-1.3B-Diffusers/transformer \
--model.condition_model_path ./Wan2.1-T2V-1.3B-Diffusers \
--data.train_path ./Tom-and-Jerry-VideoGeneration-Dataset-parquet \
--data.source_name Tom-and-Jerry-VideoGeneration-Dataset \
--data.offline_embedding_save_dir ./Tom-and-Jerry-VideoGeneration-Dataset_offline \
--train.training_task offline_embedding \
--train.global_batch_size 4 \
--train.accelerator.ulysses_size 1
The resulting Tom-and-Jerry-VideoGeneration-Dataset_offline/ directory contains rank_N_shard_M.parquet files. Each row stores two pickled tensors:
Column |
Shape |
Description |
|---|---|---|
|
|
VAE posterior parameters (mean + log-variance concatenated along the channel axis; |
|
|
T5 text embedding |
Step 2 – Train on the offline dataset#
SP_SIZE=2
NPROC_PER_NODE=8 # 4 DP replicas × SP_SIZE=2
bash train.sh tasks/train_dit.py configs/dit/wan2.1_I2V_1.3B_lora.yaml \
--model.model_path ./Wan2.1-T2V-1.3B-Diffusers/transformer \
--model.condition_model_path ./Wan2.1-T2V-1.3B-Diffusers \
--data.train_path ./Tom-and-Jerry-VideoGeneration-Dataset_offline \
--data.source_name Tom-and-Jerry-VideoGeneration-Dataset \
--train.training_task offline_training \
--train.global_batch_size 8 \
--train.micro_batch_size 1 \
--train.accelerator.ulysses_size ${SP_SIZE} \
--train.checkpoint.output_dir ./exp/Wan2.1-T2V-1.3B-Diffusers_lora \
--train.checkpoint.save_hf_weights true \
--train.checkpoint.save_epochs 5 \
--train.checkpoint.load_path auto \
--train.num_train_epochs 30 \
--train.wandb.enable false
3.4 Online Workflow#
Pass raw Parquet videos directly during training. The VAE and text encoder run each step.
NPROC_PER_NODE=4 bash train.sh tasks/train_dit.py configs/dit/wan2.1_I2V_1.3B_lora.yaml \
--model.model_path ./Wan2.1-T2V-1.3B-Diffusers/transformer \
--model.condition_model_path ./Wan2.1-T2V-1.3B-Diffusers \
--data.train_path ./Tom-and-Jerry-VideoGeneration-Dataset-parquet \
--data.source_name Tom-and-Jerry-VideoGeneration-Dataset \
--data.mm_configs.fps 24 \
--data.mm_configs.max_frames 81 \
--train.training_task online_training \
--train.global_batch_size 4 \
--train.micro_batch_size 1 \
--train.accelerator.ulysses_size 1 \
--train.checkpoint.output_dir ./exp/Wan2.1-T2V-1.3B-Diffusers_lora \
--train.checkpoint.save_hf_weights true \
--train.num_train_epochs 30
4. Training Configuration#
The default LoRA config (configs/dit_new/wan_lora.yaml) targets the attention and feed-forward projections:
model:
lora_config:
rank: 128
alpha: 64
lora_modules:
- to_q
- to_k
- to_v
- to_out.0
- ffn.net.0.proj
- ffn.net.2
Sequence Parallelism (SP)#
VeOmni supports Ulysses SP for long video sequences. SP splits the sequence dimension across GPUs within each data-parallel replica, reducing per-GPU memory while keeping training numerically equivalent to SP=1.
|
GPUs (with 4 DP replicas) |
|---|---|
1 |
4 |
2 |
8 |
Set --train.accelerator.ulysses_size to enable SP. The loss and gradient norms are aligned between SP=1 and SP=2 at equal DP sizes.
5. Checkpoint Output#
When --train.checkpoint.save_hf_weights true is set, each save produces a directory compatible with load_lora_adapter:
exp/Wan2.1-T2V-1.3B-Diffusers_lora/checkpoints/
└── global_step_200/
├── adapter_config.json
└── adapter_model.safetensors
6. Inference#
6.1 Base model (no LoRA)#
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video
model_id = "./Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = (
"Tom, the mischievous gray cat, is sprawled out on a vibrant red pillow, "
"his body relaxed and his eyes half-closed, as if he's just woken up or is "
"about to doze off. His white paws are stretched out in front of him, and his "
"tail is casually draped over the edge of the pillow."
)
negative_prompt = (
"Bright tones, overexposed, static, blurred details, subtitles, style, works, "
"paintings, images, static, overall gray, worst quality, low quality, JPEG "
"compression residue, ugly, incomplete, extra fingers, poorly drawn hands, "
"poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, "
"still picture, messy background, three legs, many people in the background, "
"walking backwards"
)
output = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
height=480,
width=832,
num_frames=81,
guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=15)
6.2 With trained LoRA adapter#
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video
model_id = "./Wan2.1-T2V-1.3B-Diffusers"
lora_dir = "./exp/Wan2.1-T2V-1.3B-Diffusers_lora/checkpoints/global_step_200"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.transformer.load_lora_adapter(lora_dir, prefix="base_model.model", adapter_name="wan_lora")
pipe.set_adapters("wan_lora", adapter_weights=1.0) # adjust strength between 0.5–1.0
prompt = "..."
negative_prompt = "..."
output = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
height=480,
width=832,
num_frames=81,
guidance_scale=5.0,
).frames[0]
export_to_video(output, "output_lora.mp4", fps=15)