Wan2.1-T2V Training Guide

This guide covers LoRA fine-tuning of Wan2.1-T2V-1.3B using VeOmni, including dataset preparation, multi-GPU training with Ulysses Sequence Parallelism (SP), and inference with trained adapters.


1. Environment Setup

uv sync --extra gpu --dev
source .venv/bin/activate

For inference, install the video I/O backend:

pip install imageio imageio-ffmpeg

2. Download Model

python3 scripts/download_hf_model.py \
    --repo_id Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --local_dir ./Wan2.1-T2V-1.3B-Diffusers

3. Prepare Dataset

VeOmni supports two training workflows:

| Workflow | training_task | Description |
|---|---|---|
| Offline (recommended) | offline_training | Pre-embed videos once; re-use embeddings across epochs. Saves GPU memory during training. |
| Online | online_training | Embed videos on-the-fly each step. Requires the VAE + text encoder to stay on GPU throughout training. |

3.1 Download the Tom and Jerry dataset

This guide uses Wild-Heart/Tom-and-Jerry-VideoGeneration-Dataset (~6 000 video clips, 540×360, 6 s each).

python3 scripts/download_hf_model.py \
    --repo_id Wild-Heart/Tom-and-Jerry-VideoGeneration-Dataset \
    --repo_type dataset \
    --local_dir ./Tom-and-Jerry-VideoGeneration-Dataset

The downloaded directory has the following structure:

Tom-and-Jerry-VideoGeneration-Dataset/
├── captions.txt   # one caption per line
├── videos.txt     # one relative video path per line (mirrors captions.txt)
└── videos/        # video files
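Because captions.txt and videos.txt are matched line by line, a quick sanity check before conversion can catch a truncated download. The sketch below is not part of VeOmni; the function name `check_dataset` is hypothetical, and it assumes only the directory layout shown above.

```python
from pathlib import Path


def check_dataset(root: str) -> list[tuple[str, str]]:
    """Pair each video path with its caption and fail on any mismatch."""
    root_dir = Path(root)
    # Blank lines are kept on purpose: dropping them would shift the pairing.
    captions = (root_dir / "captions.txt").read_text(encoding="utf-8").splitlines()
    videos = (root_dir / "videos.txt").read_text(encoding="utf-8").splitlines()
    if len(captions) != len(videos):
        raise ValueError(f"{len(captions)} captions vs {len(videos)} video paths")
    # Paths in videos.txt are relative to the dataset root.
    missing = [v for v in videos if not (root_dir / v.strip()).is_file()]
    if missing:
        raise FileNotFoundError(f"{len(missing)} listed videos not found, e.g. {missing[0]}")
    return [(v.strip(), c.strip()) for v, c in zip(videos, captions)]
```

Calling `check_dataset("./Tom-and-Jerry-VideoGeneration-Dataset")` returns the (video path, caption) pairs if everything lines up, and raises otherwise.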

3.2 Convert to VeOmni Parquet format

The conversion script reads captions.txt and videos.txt, loads each video as raw bytes, and writes sharded Parquet files (0.parquet, 1.parquet, …) with columns prompt, video_bytes, and source.

python3 scripts/multimodal/convert_data/tom-and-jerry.py \
    --dataset_path ./Tom-and-Jerry-VideoGeneration-Dataset \
    --output_dir   ./Tom-and-Jerry-VideoGeneration-Dataset-parquet

3.4 Online Workflow

Pass the raw-video Parquet files from step 3.2 directly to training. The VAE and text encoder then run at every step to embed each batch.

NPROC_PER_NODE=4 bash train.sh tasks/train_dit.py configs/dit/wan2.1_I2V_1.3B_lora.yaml \
    --model.model_path           ./Wan2.1-T2V-1.3B-Diffusers/transformer \
    --model.condition_model_path ./Wan2.1-T2V-1.3B-Diffusers \
    --data.train_path            ./Tom-and-Jerry-VideoGeneration-Dataset-parquet \
    --data.source_name           Tom-and-Jerry-VideoGeneration-Dataset \
    --data.mm_configs.fps        24 \
    --data.mm_configs.max_frames 81 \
    --train.training_task        online_training \
    --train.global_batch_size    4 \
    --train.micro_batch_size     1 \
    --train.accelerator.ulysses_size 1 \
    --train.checkpoint.output_dir ./exp/Wan2.1-T2V-1.3B-Diffusers_lora \
    --train.checkpoint.save_hf_weights true \
    --train.num_train_epochs 30

4. Training Configuration

The default LoRA config (configs/dit_new/wan_lora.yaml) targets the attention and feed-forward projections:

model:
  lora_config:
    rank: 128
    alpha: 64
    lora_modules:
      - to_q
      - to_k
      - to_v
      - to_out.0
      - ffn.net.0.proj
      - ffn.net.2
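As a rough guide to what rank and alpha mean here, the sketch below works through the arithmetic under the conventional LoRA formulation, where the low-rank update is scaled by alpha / rank. Whether VeOmni applies exactly this scaling is an assumption, and the 1536-wide projection is a hypothetical layer size chosen only to illustrate the parameter count.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Conventional LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (rank, in_features) and B has shape (out_features, rank).
    """
    return rank * (in_features + out_features)


scale = lora_scale(rank=128, alpha=64)
# Hypothetical 1536 -> 1536 projection, used only to show the arithmetic.
per_layer = lora_params(in_features=1536, out_features=1536, rank=128)
print(scale, per_layer)  # 0.5 393216
```

Under this convention, halving alpha at fixed rank halves the magnitude of the adapter's contribution, while raising rank increases the trainable parameter count linearly.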

Sequence Parallelism (SP)

VeOmni supports Ulysses SP for long video sequences. SP splits the sequence dimension across GPUs within each data-parallel replica, reducing per-GPU memory while keeping training numerically equivalent to SP=1.

| ulysses_size | GPUs (with 4 DP replicas) |
|---|---|
| 1 | 4 |
| 2 | 8 |

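Set --train.accelerator.ulysses_size to a value greater than 1 to enable SP. The total GPU count equals the number of DP replicas multiplied by ulysses_size. The loss and gradient norms are aligned between SP=1 and SP=2 at equal DP sizes.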
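The relationship between GPU count, ulysses_size, and batch sizes can be sketched as a small helper. The DP size calculation follows the table above. The gradient-accumulation formula (global batch size divided by micro batch size times DP size) is a common convention and an assumption here, not something confirmed for VeOmni. The function name is hypothetical.

```python
def parallel_layout(world_size: int, ulysses_size: int,
                    micro_batch_size: int, global_batch_size: int) -> dict:
    """Derive DP size and (assumed) gradient-accumulation steps."""
    if world_size % ulysses_size:
        raise ValueError("world_size must be divisible by ulysses_size")
    # Each DP replica spans ulysses_size GPUs that share one sequence.
    dp_size = world_size // ulysses_size
    samples_per_step = dp_size * micro_batch_size
    if global_batch_size % samples_per_step:
        raise ValueError("global_batch_size must be divisible by dp_size * micro_batch_size")
    # Assumption: remaining factor is made up by gradient accumulation.
    return {"dp_size": dp_size, "grad_accum_steps": global_batch_size // samples_per_step}


print(parallel_layout(world_size=4, ulysses_size=1, micro_batch_size=1, global_batch_size=4))
print(parallel_layout(world_size=8, ulysses_size=2, micro_batch_size=1, global_batch_size=4))
```

Both calls yield a DP size of 4, which is the "equal DP sizes" condition under which SP=1 and SP=2 runs are comparable.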


5. Checkpoint Output

When --train.checkpoint.save_hf_weights true is set, each save produces a directory compatible with load_lora_adapter:

exp/Wan2.1-T2V-1.3B-Diffusers_lora/checkpoints/
└── global_step_200/
    ├── adapter_config.json
    └── adapter_model.safetensors
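When several saves accumulate, it can be handy to pick the most recent complete adapter directory automatically. The helper below is a minimal sketch, not a VeOmni utility: the name `latest_adapter_dir` is hypothetical, and it assumes only the global_step_N naming and the two files shown above.

```python
import re
from pathlib import Path

REQUIRED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def latest_adapter_dir(checkpoints_dir: str) -> Path:
    """Return the global_step_N directory with the highest N that has both adapter files."""
    candidates = []
    for entry in Path(checkpoints_dir).iterdir():
        match = re.fullmatch(r"global_step_(\d+)", entry.name)
        # Skip partially written saves that lack either file.
        if match and all((entry / name).is_file() for name in REQUIRED_FILES):
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        raise FileNotFoundError(f"no complete adapter checkpoint under {checkpoints_dir}")
    # Compare step numbers numerically so step 1000 beats step 200.
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `latest_adapter_dir("./exp/Wan2.1-T2V-1.3B-Diffusers_lora/checkpoints")` would return the path to pass as the adapter directory at inference time.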

6. Inference

6.1 Base model (no LoRA)

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "./Wan2.1-T2V-1.3B-Diffusers"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    "Tom, the mischievous gray cat, is sprawled out on a vibrant red pillow, "
    "his body relaxed and his eyes half-closed, as if he's just woken up or is "
    "about to doze off. His white paws are stretched out in front of him, and his "
    "tail is casually draped over the edge of the pillow."
)
negative_prompt = (
    "Bright tones, overexposed, static, blurred details, subtitles, style, works, "
    "paintings, images, static, overall gray, worst quality, low quality, JPEG "
    "compression residue, ugly, incomplete, extra fingers, poorly drawn hands, "
    "poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, "
    "still picture, messy background, three legs, many people in the background, "
    "walking backwards"
)

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=15)
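The height, width, and num_frames arguments above determine how many tokens the transformer processes, which is what makes long videos memory-hungry. The sketch below estimates that count. The compression factors (4x temporal, 8x spatial) and the (1, 2, 2) patch size are assumptions about the Wan2.1 VAE and transformer, not values stated in this guide, and the function name is hypothetical.

```python
def wan_token_count(num_frames: int, height: int, width: int,
                    temporal_stride: int = 4, spatial_stride: int = 8,
                    patch: tuple[int, int, int] = (1, 2, 2)) -> int:
    """Estimate the transformer sequence length for one video (assumed strides)."""
    if (num_frames - 1) % temporal_stride:
        raise ValueError("num_frames should be a multiple of temporal_stride, plus 1")
    # The first frame is encoded alone; the rest are compressed in groups.
    latent_frames = (num_frames - 1) // temporal_stride + 1
    latent_h, latent_w = height // spatial_stride, width // spatial_stride
    pt, ph, pw = patch
    return (latent_frames // pt) * (latent_h // ph) * (latent_w // pw)


print(wan_token_count(num_frames=81, height=480, width=832))  # 32760
```

Under these assumptions an 81-frame 480x832 clip yields roughly 33k tokens per sample.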

6.2 With trained LoRA adapter

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "./Wan2.1-T2V-1.3B-Diffusers"
lora_dir = "./exp/Wan2.1-T2V-1.3B-Diffusers_lora/checkpoints/global_step_200"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

pipe.transformer.load_lora_adapter(lora_dir, prefix="base_model.model", adapter_name="wan_lora")
pipe.set_adapters("wan_lora", adapter_weights=1.0)  # adjust strength between 0.5–1.0

prompt = "..."
negative_prompt = "..."

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output_lora.mp4", fps=15)