Multimodal Data Processing

This guide explains how VeOmni processes image and video inputs for vision-language model training, including resolution control, frame sampling, and the dynamic video pixel budget introduced for Qwen3-VL.

Overview

The multimodal data pipeline lives in veomni/data/multimodal/ and handles:

Preprocessing (preprocess.py) — converts raw data samples into a unified conversation format via the Preprocessor Registry.
Image processing (image_utils.py) — loads images, resizes them to fit pixel budgets while preserving aspect ratio and ViT patch alignment.
Video processing (video_utils.py) — loads videos (via torchcodec), samples frames by FPS, resizes spatially, and optionally extracts audio.
Transform (multimodal_transform.py) — orchestrates the above into tokenized model inputs with proper masking.

All processing parameters are configured through the mm_configs section in your YAML config.

Image Resolution Control

Images are resized by smart_resize to satisfy all of the following constraints simultaneously:

Parameter	Description
`image_min_pixels`	Minimum total pixels (H x W) after resize
`image_max_pixels`	Maximum total pixels (H x W) after resize
`scale_factor`	Align H and W to multiples of this (e.g., 28 for Qwen-VL ViT with patch_size=14, merge_size=2)
`max_ratio`	Maximum allowed aspect ratio (max_dim / min_dim)

The resize scales H and W by the same factor, then rounds each to a multiple of scale_factor, so the aspect ratio is preserved up to rounding.
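The resize rule can be sketched in a few lines of Python. This is a minimal illustration modeled on the approach used by qwen-vl-utils, not a copy of VeOmni's smart_resize; the function name smart_resize_sketch and its argument names are assumptions made for this example.

```python
import math


def smart_resize_sketch(height, width, scale_factor=28,
                        min_pixels=3136, max_pixels=602112, max_ratio=200):
    """Return (new_h, new_w) aligned to scale_factor within the pixel budget.

    Hypothetical helper for illustration; the real smart_resize may differ.
    """
    if max(height, width) / min(height, width) > max_ratio:
        raise ValueError("aspect ratio exceeds max_ratio")

    # Start from the nearest aligned size (at least one scale_factor per side).
    new_h = max(scale_factor, round(height / scale_factor) * scale_factor)
    new_w = max(scale_factor, round(width / scale_factor) * scale_factor)

    if new_h * new_w > max_pixels:
        # Shrink both sides by the same factor; round down to stay under the cap.
        beta = math.sqrt(height * width / max_pixels)
        new_h = max(scale_factor, math.floor(height / beta / scale_factor) * scale_factor)
        new_w = max(scale_factor, math.floor(width / beta / scale_factor) * scale_factor)
    elif new_h * new_w < min_pixels:
        # Grow both sides by the same factor; round up to reach the minimum.
        beta = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * beta / scale_factor) * scale_factor
        new_w = math.ceil(width * beta / scale_factor) * scale_factor
    return new_h, new_w


print(smart_resize_sketch(1080, 1920))
```

With the defaults above, a 1080 x 1920 input is reduced to 560 x 1008: both sides are multiples of 28, and the product stays under 602112.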

Example config:

data:
  mm_configs:
    image_max_pixels: 602112  # 28 * 28 * 768
    # image_min_pixels: 3136  # optional lower bound

Video Frame Sampling

Video processing has two stages: temporal sampling (selecting which frames to keep) and spatial resize (adjusting resolution per frame).

Temporal Sampling Parameters

Parameter	Description	Default
`fps`	Target sampling FPS	2.0
`min_frames`	Minimum output frames	None
`max_frames`	Maximum output frames	None
`frame_factor`	Align frame count to multiples of this	None
`frame_factor_remainder`	Remainder for frame alignment (e.g., 1 with frame_factor=4 gives counts like 1, 5, 9, 13…)	0

Frame sampling flow:

1. Compute the target frame count: nframes = total_frames / video_fps * fps
2. Clamp nframes to [min_frames, max_frames]
3. Align nframes to frame_factor (round down)
4. Uniformly sample nframes indices from the video
5. Pad with the last frame if nframes > total_frames
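The sampling flow above can be sketched as a small index-selection function. This is an illustrative sketch only: the function name sample_frame_indices, the exact rounding of indices, and the handling of edge cases are assumptions, and the actual code in video_utils.py may differ.

```python
def sample_frame_indices(total_frames, video_fps, fps=2.0,
                         min_frames=None, max_frames=None,
                         frame_factor=None, frame_factor_remainder=0):
    """Hypothetical helper: return the list of frame indices to keep."""
    # 1. Target frame count from the video duration and the sampling FPS.
    nframes = total_frames / video_fps * fps

    # 2. Clamp to [min_frames, max_frames] when the bounds are set.
    if min_frames is not None:
        nframes = max(nframes, min_frames)
    if max_frames is not None:
        nframes = min(nframes, max_frames)
    nframes = max(int(nframes), 1)

    # 3. Round down to a count of the form k * frame_factor + remainder.
    if frame_factor:
        k = max((nframes - frame_factor_remainder) // frame_factor, 0)
        nframes = max(k * frame_factor + frame_factor_remainder, 1)

    # 4./5. Sample uniformly, or keep every frame and pad with the last one.
    last = total_frames - 1
    if nframes > total_frames:
        return list(range(total_frames)) + [last] * (nframes - total_frames)
    if nframes == 1:
        return [0]
    return [round(i * last / (nframes - 1)) for i in range(nframes)]


# A 10-second clip at 30 FPS, sampled at 2 FPS and capped at 16 frames.
indices = sample_frame_indices(300, 30.0, fps=2.0, max_frames=16, frame_factor=2)
print(len(indices), indices[0], indices[-1])
```

Note that in this sketch alignment runs after clamping, so rounding down can leave the count slightly below min_frames; whether the real pipeline guards against that is not stated here.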

Spatial Resize Parameters

Parameter	Description
`video_min_pixels`	Minimum per-frame pixels (H x W)
`video_max_pixels`	Maximum per-frame pixels (H x W)
`scale_factor`	Align H and W to multiples of this

Example config:

data:
  mm_configs:
    video_max_pixels: 602112  # 28 * 28 * 768
    max_frames: 16
    fps: 2.0

Dynamic Video Pixel Budget (`video_total_pixels`)

Introduced for Qwen3-VL. When video_total_pixels is not set, it has no effect and the fixed video_max_pixels cap applies to every frame.

Problem

With a fixed video_max_pixels, every frame gets the same maximum resolution regardless of how many frames there are. A 4-frame video and a 64-frame video would each have frames at the same resolution, causing long videos to produce far more visual tokens and potentially exceed the model’s context window.
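To make the problem concrete, the following back-of-the-envelope sketch estimates visual tokens under a fixed per-frame cap. It assumes one visual token per 28 x 28 pixel block and a temporal merge of 2 frames, which is consistent with the formulas later in this section; the function name is hypothetical, and real token counts depend on the model.

```python
def estimate_video_tokens(nframes, pixels_per_frame,
                          scale_factor=28, temporal_merge_factor=2):
    """Rough visual-token estimate (hypothetical helper for illustration)."""
    # Assumption: one token per scale_factor x scale_factor block of pixels.
    tokens_per_frame = pixels_per_frame // (scale_factor * scale_factor)
    # Assumption: frames are merged in groups of temporal_merge_factor.
    temporal_groups = max(nframes // temporal_merge_factor, 1)
    return temporal_groups * tokens_per_frame


for nframes in (4, 16, 64):
    print(nframes, estimate_video_tokens(nframes, 602112))
```

Under these assumptions, a 64-frame video with every frame at the 602112-pixel cap comes to roughly 24,576 tokens, well beyond a 4096-token training sequence, while a 4-frame video needs about 1,536.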

Solution

video_total_pixels sets a total pixel budget across all frames. Before spatial resizing, the per-frame video_max_pixels is dynamically adjusted:

dynamic_max = video_total_pixels / nframes * temporal_merge_factor
dynamic_max = min(dynamic_max, video_max_pixels)       # don't exceed original cap
dynamic_max = max(dynamic_max, video_min_pixels * 1.05) # don't go below minimum

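This mirrors the official qwen-vl-utils logic: the more frames are sampled, the lower the per-frame resolution, which keeps the total number of visual tokens predictable. Here nframes is the sampled frame count; the examples in Behavior Summary use temporal_merge_factor = 2.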
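The three-step adjustment above translates directly into Python. This is a sketch for experimenting with values, not VeOmni's implementation: the function name and default arguments are assumptions, and truncating the result to an integer is a choice made here for readability.

```python
def dynamic_video_max_pixels(nframes, video_total_pixels,
                             video_max_pixels=602112, video_min_pixels=3136,
                             temporal_merge_factor=2):
    """Per-frame pixel cap under a total budget (illustrative sketch)."""
    dynamic_max = video_total_pixels / nframes * temporal_merge_factor
    dynamic_max = min(dynamic_max, video_max_pixels)         # don't exceed the original cap
    dynamic_max = max(dynamic_max, video_min_pixels * 1.05)  # don't go below the minimum
    return int(dynamic_max)


for nframes in (4, 16, 64):
    print(nframes, dynamic_video_max_pixels(nframes, 2889523))
```

The resulting cap would then be passed to the spatial resize step in place of the fixed video_max_pixels.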

How to Set `video_total_pixels`

The formula is:

video_total_pixels = max_seq_len * (patch_size * merge_size)^2 * budget_ratio

Where:

max_seq_len: your training sequence length (e.g., 4096)
patch_size * merge_size: ViT spatial granularity (14 * 2 = 28 for Qwen-VL family)
budget_ratio: fraction of context reserved for visual tokens (0.9 = 90%)

Example (max_seq_len=4096):

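4096 * 28^2 * 0.9 = 4096 * 784 * 0.9 ≈ 2,890,138

The config examples in this guide use 2889523, which is slightly below this value and therefore marginally more conservative.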

For inference-scale contexts (128K tokens):

128000 * 784 * 0.9 = 90,316,800
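The formula is easy to wrap in a helper so the budget follows max_seq_len automatically. The sketch below is illustrative: the function name is hypothetical, and rounding to the nearest integer is an assumption rather than documented behavior.

```python
def video_total_pixels_budget(max_seq_len, patch_size=14, merge_size=2,
                              budget_ratio=0.9):
    """Total video pixel budget for a given sequence length (sketch)."""
    granularity = patch_size * merge_size  # 28 for the Qwen-VL family
    return round(max_seq_len * granularity ** 2 * budget_ratio)


print(video_total_pixels_budget(4096))
print(video_total_pixels_budget(128000))
```

Lowering budget_ratio leaves more of the context for text tokens at the cost of video resolution.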

Config Example

data:
  max_seq_len: 4096
  mm_configs:
    image_max_pixels: 602112
    video_max_pixels: 602112
    video_total_pixels: 2889523  # dynamic per-frame budget
    max_frames: 16
    fps: 2.0
    use_audio_in_video: false

Behavior Summary

Scenario	`video_total_pixels` absent	`video_total_pixels` present
4 frames	Each frame up to 602112 px	Each frame up to min(602112, 2889523/4*2) = 602112 px
16 frames	Each frame up to 602112 px	Each frame up to min(602112, 2889523/16*2) ≈ 361190 px
64 frames	Each frame up to 602112 px	Each frame up to min(602112, 2889523/64*2) ≈ 90297 px

With video_total_pixels, longer videos automatically get lower per-frame resolution to stay within the token budget.

Audio Processing

When use_audio_in_video is true, audio is extracted from video files and resampled to the target sample rate (default 16 kHz). Audio processing parameters:

Parameter	Description	Default
`use_audio_in_video`	Whether to extract audio from videos	false
`sample_rate`	Target audio sample rate	16000

Full `mm_configs` Reference

data:
  mm_configs:
    # Image
    image_min_pixels: 3136        # optional, min pixels per image
    image_max_pixels: 602112      # max pixels per image
    scale_factor: 28              # ViT patch alignment (patch_size * merge_size)
    max_ratio: 200                # max aspect ratio

    # Video - temporal
    fps: 2.0                      # target sampling FPS
    min_frames: 4                 # optional, minimum frames
    max_frames: 16                # maximum frames
    frame_factor: 2               # align frame count to multiples of this
    frame_factor_remainder: 0     # remainder for frame alignment

    # Video - spatial
    video_min_pixels: 3136        # optional, min pixels per frame
    video_max_pixels: 602112      # max pixels per frame
    video_total_pixels: 2889523   # optional, total pixel budget (Qwen3-VL)

    # Audio
    use_audio_in_video: false
    sample_rate: 16000
