Multimodal Data Processing
This guide explains how VeOmni processes image and video inputs for vision-language model training, including resolution control, frame sampling, and the dynamic video pixel budget introduced for Qwen3-VL.
Overview
The multimodal data pipeline lives in veomni/data/multimodal/ and handles:
- Preprocessing (preprocess.py): converts raw data samples into a unified conversation format via the Preprocessor Registry.
- Image processing (image_utils.py): loads images and resizes them to fit pixel budgets while preserving aspect ratio and ViT patch alignment.
- Video processing (video_utils.py): loads videos (via torchcodec), samples frames by FPS, resizes spatially, and optionally extracts audio.
- Transform (multimodal_transform.py): orchestrates the above into tokenized model inputs with proper masking.
All processing parameters are configured through the mm_configs section in your YAML config.
Image Resolution Control
Images are resized by smart_resize to satisfy three constraints simultaneously:
| Parameter | Description |
|---|---|
| image_min_pixels | Minimum total pixels (H x W) after resize |
| image_max_pixels | Maximum total pixels (H x W) after resize |
| scale_factor | Align H and W to multiples of this (e.g., 28 for Qwen-VL ViT with patch_size=14, merge_size=2) |
| max_ratio | Maximum allowed aspect ratio (max_dim / min_dim) |
The resize preserves aspect ratio: it scales H and W by the same factor, then rounds to scale_factor multiples.
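The constraints above can be sketched in a few lines of Python. This is a minimal illustration modeled on the resizing logic published in qwen-vl-utils, not VeOmni's actual image_utils.py; the function name smart_resize_sketch and its default values are assumptions chosen to match the example values in this guide.

```python
import math


def smart_resize_sketch(height, width, scale_factor=28,
                        min_pixels=3136, max_pixels=602112, max_ratio=200):
    """Return (new_h, new_w) aligned to scale_factor and within the pixel budget.

    Hypothetical helper for illustration only.
    """
    if max(height, width) / min(height, width) > max_ratio:
        raise ValueError("aspect ratio exceeds max_ratio")

    # Start from the nearest multiples of scale_factor (at least one unit).
    new_h = max(scale_factor, round(height / scale_factor) * scale_factor)
    new_w = max(scale_factor, round(width / scale_factor) * scale_factor)

    if new_h * new_w > max_pixels:
        # Shrink both sides by the same factor, rounding down so the cap holds.
        beta = math.sqrt(height * width / max_pixels)
        new_h = max(scale_factor,
                    math.floor(height / beta / scale_factor) * scale_factor)
        new_w = max(scale_factor,
                    math.floor(width / beta / scale_factor) * scale_factor)
    elif new_h * new_w < min_pixels:
        # Enlarge both sides by the same factor, rounding up so the floor holds.
        beta = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * beta / scale_factor) * scale_factor
        new_w = math.ceil(width * beta / scale_factor) * scale_factor
    return new_h, new_w
```

Under these assumptions, a 1080 x 1920 input is shrunk until H x W fits under image_max_pixels, and both sides remain multiples of 28.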
Example config:
data:
  mm_configs:
    image_max_pixels: 602112  # 28 * 28 * 768
    # image_min_pixels: 3136  # optional lower bound
Video Frame Sampling
Video processing has two stages: temporal sampling (selecting which frames to keep) and spatial resize (adjusting resolution per frame).
Temporal Sampling Parameters
| Parameter | Description | Default |
|---|---|---|
| fps | Target sampling FPS | 2.0 |
| min_frames | Minimum output frames | None |
| max_frames | Maximum output frames | None |
| frame_factor | Align frame count to multiples of this | None |
| frame_factor_remainder | Remainder for frame alignment (e.g., 1 for counts like 1, 5, 9, 13…) | 0 |
Frame sampling flow:
1. Compute target frame count: nframes = total_frames / video_fps * fps
2. Clamp to [min_frames, max_frames]
3. Align to frame_factor (round down)
4. Uniformly sample nframes indices from the video
5. Pad with last frame if nframes > total_frames
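The sampling flow can be sketched as a small pure-Python function. This is an illustration of the steps as listed, not the code in video_utils.py; the function name sample_frame_indices is hypothetical, and the handling of padding (sample every available frame, then repeat the last index) is an assumption about how the final step behaves.

```python
def sample_frame_indices(total_frames, video_fps, fps=2.0,
                         min_frames=None, max_frames=None,
                         frame_factor=None, frame_factor_remainder=0):
    """Return the list of frame indices to keep (hypothetical helper)."""
    # 1. Target frame count from video duration and sampling FPS.
    nframes = total_frames / video_fps * fps

    # 2. Clamp to [min_frames, max_frames] when the bounds are set.
    if min_frames is not None:
        nframes = max(nframes, min_frames)
    if max_frames is not None:
        nframes = min(nframes, max_frames)
    nframes = int(nframes)

    # 3. Round down to the form k * frame_factor + frame_factor_remainder.
    if frame_factor:
        r = frame_factor_remainder
        nframes = (nframes - r) // frame_factor * frame_factor + r
        nframes = max(nframes, r or frame_factor)
    nframes = max(nframes, 1)

    # 4. Uniformly spaced indices over the frames that actually exist.
    kept = min(nframes, total_frames)
    if kept == 1:
        indices = [0]
    else:
        step = (total_frames - 1) / (kept - 1)
        indices = [round(i * step) for i in range(kept)]

    # 5. Pad by repeating the last frame when the video is too short.
    indices += [total_frames - 1] * (nframes - kept)
    return indices
```

For example, a 10-second clip at 30 FPS (300 frames) sampled at fps=2.0 with max_frames=16 yields 16 indices spanning the first to the last frame.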
Spatial Resize Parameters
| Parameter | Description |
|---|---|
| video_min_pixels | Minimum per-frame pixels (H x W) |
| video_max_pixels | Maximum per-frame pixels (H x W) |
| scale_factor | Align H and W to multiples of this |
Example config:
data:
  mm_configs:
    video_max_pixels: 602112  # 28 * 28 * 768
    max_frames: 16
    fps: 2.0
Dynamic Video Pixel Budget (video_total_pixels)
Introduced for Qwen3-VL. This is a no-op for models that don’t set it.
Problem
With a fixed video_max_pixels, every frame gets the same maximum resolution regardless of how many frames there are. A 4-frame video and a 64-frame video would each have frames at the same resolution, causing long videos to produce far more visual tokens and potentially exceed the model’s context window.
Solution
video_total_pixels sets a total pixel budget across all frames. Before spatial resizing, the per-frame video_max_pixels is dynamically adjusted:
dynamic_max = video_total_pixels / nframes * temporal_merge_factor
dynamic_max = min(dynamic_max, video_max_pixels) # don't exceed original cap
dynamic_max = max(dynamic_max, video_min_pixels * 1.05) # don't go below minimum
This mirrors the official qwen-vl-utils logic: more frames -> lower per-frame resolution, keeping total visual tokens predictable.
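The three-line adjustment can be wrapped in a function to see how the per-frame cap responds to frame count. This is a sketch under stated assumptions: the function name dynamic_video_max_pixels is hypothetical, temporal_merge_factor=2 is assumed, and the result is truncated to an integer for readability.

```python
def dynamic_video_max_pixels(nframes, video_total_pixels,
                             video_max_pixels=602112, video_min_pixels=3136,
                             temporal_merge_factor=2):
    """Per-frame pixel cap after applying the total budget (hypothetical helper)."""
    if video_total_pixels is None:
        # No total budget configured: the fixed per-frame cap applies unchanged.
        return video_max_pixels

    dynamic_max = video_total_pixels / nframes * temporal_merge_factor
    dynamic_max = min(dynamic_max, video_max_pixels)         # don't exceed original cap
    dynamic_max = max(dynamic_max, video_min_pixels * 1.05)  # don't go below minimum
    return int(dynamic_max)


for n in (4, 16, 64):
    print(n, dynamic_video_max_pixels(n, video_total_pixels=2889523))
```

Short videos keep the full video_max_pixels cap; the cap only starts to fall once video_total_pixels / nframes * temporal_merge_factor drops below it.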
How to Set video_total_pixels
The formula is:
video_total_pixels = max_seq_len * (patch_size * merge_size)^2 * budget_ratio
Where:
- max_seq_len: your training sequence length (e.g., 4096)
- patch_size * merge_size: ViT spatial granularity (14 * 2 = 28 for Qwen-VL family)
- budget_ratio: fraction of context reserved for visual tokens (0.9 = 90%)
Example (max_seq_len=4096):
4096 * 28^2 * 0.9 = 4096 * 784 * 0.9 ≈ 2,890,138
The config examples in this guide use 2889523, which is marginally below this value.
For inference-scale contexts (128K tokens):
128000 * 784 * 0.9 ≈ 90,316,800
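The formula is simple enough to compute directly when choosing a value for a new sequence length. The helper below is a sketch; the name recommended_video_total_pixels is hypothetical, and rounding to the nearest integer is a choice made here for convenience.

```python
def recommended_video_total_pixels(max_seq_len, patch_size=14, merge_size=2,
                                   budget_ratio=0.9):
    """Total pixel budget for a given sequence length (hypothetical helper)."""
    # Each visual token covers a (patch_size * merge_size)^2 pixel area.
    pixels_per_token = (patch_size * merge_size) ** 2
    return round(max_seq_len * pixels_per_token * budget_ratio)


print(recommended_video_total_pixels(4096))
print(recommended_video_total_pixels(128000))
```

Lowering budget_ratio leaves more of the context for text tokens at the cost of video resolution.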
Config Example
data:
  max_seq_len: 4096
  mm_configs:
    image_max_pixels: 602112
    video_max_pixels: 602112
    video_total_pixels: 2889523  # dynamic per-frame budget
    max_frames: 16
    fps: 2.0
    use_audio_in_video: false
Behavior Summary
| Scenario | Without video_total_pixels | With video_total_pixels = 2889523 |
|---|---|---|
| 4 frames | Each frame up to 602112 px | Each frame up to min(602112, 2889523/4*2) ≈ 602112 px |
| 16 frames | Each frame up to 602112 px | Each frame up to min(602112, 2889523/16*2) ≈ 361190 px |
| 64 frames | Each frame up to 602112 px | Each frame up to min(602112, 2889523/64*2) ≈ 90297 px |
With video_total_pixels, longer videos automatically get lower per-frame resolution to stay within the token budget.
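To relate the per-frame caps to the context window, the visual token count can be estimated by inverting the budget formula. The sketch below rests on assumptions: one token per (patch_size * merge_size)^2 pixels per group of temporal_merge_factor frames, with temporal_merge_factor=2. The function name estimate_visual_tokens is hypothetical, and real counts depend on the actual resized dimensions.

```python
def estimate_visual_tokens(nframes, pixels_per_frame, patch_size=14,
                           merge_size=2, temporal_merge_factor=2):
    """Rough upper bound on visual tokens for a video (hypothetical helper)."""
    pixels_per_token = (patch_size * merge_size) ** 2
    return nframes / temporal_merge_factor * pixels_per_frame / pixels_per_token


# Per-frame caps with the budget enabled, taken from the table above.
for nframes, dynamic_max in [(4, 602112), (16, 361190), (64, 90297)]:
    fixed = estimate_visual_tokens(nframes, 602112)
    budgeted = estimate_visual_tokens(nframes, dynamic_max)
    print(f"{nframes} frames: fixed cap {fixed:.0f} tokens, "
          f"with budget {budgeted:.0f} tokens")
```

Under these assumptions, the 64-frame case drops from about 24,576 tokens to about 3,686, which is roughly 90% of a 4096-token context.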
Audio Processing
When use_audio_in_video: true, audio is extracted from video files and resampled to the target sample rate (default 16 kHz). Audio processing parameters:
| Parameter | Description | Default |
|---|---|---|
| use_audio_in_video | Whether to extract audio from videos | false |
| sample_rate | Target audio sample rate | 16000 |
Full mm_configs Reference
data:
  mm_configs:
    # Image
    image_min_pixels: 3136       # optional, min pixels per image
    image_max_pixels: 602112     # max pixels per image
    scale_factor: 28             # ViT patch alignment (patch_size * merge_size)
    max_ratio: 200               # max aspect ratio
    # Video - temporal
    fps: 2.0                     # target sampling FPS
    min_frames: 4                # optional, minimum frames
    max_frames: 16               # maximum frames
    frame_factor: 2              # align frame count to multiples of this
    frame_factor_remainder: 0    # remainder for frame alignment
    # Video - spatial
    video_min_pixels: 3136       # optional, min pixels per frame
    video_max_pixels: 602112     # max pixels per frame
    video_total_pixels: 2889523  # optional, total pixel budget (Qwen3-VL)
    # Audio
    use_audio_in_video: false
    sample_rate: 16000