VeOmni Flash Attention Custom Name Adapter (Transformers 5.x)

Problem Background

VeOmni uses custom attention implementation names:

  • veomni_flash_attention_2_with_sp

  • veomni_flash_attention_3_with_sp

  • veomni_flash_attention_4_with_sp

These names are registered in ALL_ATTENTION_FUNCTIONS and routed to VeOmni’s sequence-parallel (SP) aware attention wrapper.

With Transformers 5.x, model initialization and the flash-attention preload logic may still call transformers.modeling_flash_attention_utils._lazy_imports(...) with the configured implementation string. For non-native names, _lazy_imports falls back to hub-kernel loading and can fail with:

ValueError: Could not find the currently requested flash attention implementation at veomni_flash_attention_2_with_sp

even though VeOmni already registered the custom attention function.

Why This Happens

The failure path is:

  1. The model config keeps the VeOmni custom name in _attn_implementation.

  2. The Transformers flash preload code tries to resolve low-level flash kernels from the implementation string.

  3. Custom VeOmni names are not hub kernel identifiers.

  4. Hub fallback returns no valid kernel entry for this name.

  5. _lazy_imports raises before normal ALL_ATTENTION_FUNCTIONS dispatch takes effect.
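The ordering problem in the steps above can be illustrated with a toy model. This is a minimal sketch that does not import Transformers: attention_registry, toy_hub_lookup, toy_lazy_imports, and toy_model_init are hypothetical stand-ins, and only the error text is taken from the message quoted earlier.

```python
# Toy model of the failure path. Nothing here imports Transformers;
# every name below is a hypothetical stand-in.

# Stand-in for the attention-function registry (ALL_ATTENTION_FUNCTIONS).
attention_registry = {}


def veomni_sp_attention(*args, **kwargs):
    """Placeholder for VeOmni's SP-aware attention wrapper."""
    return "sp-attention-output"


# The custom name IS registered, mirroring what VeOmni does at import time.
attention_registry["veomni_flash_attention_2_with_sp"] = veomni_sp_attention

# Stand-in for the identifiers a hub-kernel loader could resolve.
# The VeOmni name is deliberately absent: it is not a hub identifier.
HUB_KERNELS = {}


def toy_hub_lookup(implementation):
    """Return a kernel for a hub identifier, or None if the name is unknown."""
    return HUB_KERNELS.get(implementation)


def toy_lazy_imports(implementation):
    """Mimic a preload step that resolves kernels from the name alone."""
    kernel = toy_hub_lookup(implementation)
    if kernel is None:
        raise ValueError(
            "Could not find the currently requested flash attention "
            f"implementation at {implementation}"
        )
    return kernel


def toy_model_init(implementation):
    """Preload runs first; registry dispatch is reached only if it succeeds."""
    toy_lazy_imports(implementation)  # steps 2 to 5: raises for custom names
    return attention_registry[implementation]  # not reached for custom names


try:
    toy_model_init("veomni_flash_attention_2_with_sp")
    outcome = "resolved"
except ValueError as exc:
    outcome = f"failed: {exc}"
```

In this sketch the registry entry exists but is never consulted, because the preload step raises first.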

Adapter Strategy Implemented

Instead of patching _lazy_imports directly, VeOmni patches:

transformers.integrations.hub_kernels.load_and_register_attn_kernel

and intercepts VeOmni custom names only. This compatibility adapter is applied only when transformers>=5.0.0.
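One way to express the version gate is sketched below. It is an illustration only and not VeOmni's actual check: parse_major, should_apply_adapter, and installed_transformers_version are hypothetical helper names, and the comparison relies on the assumption that ">=5.0.0" is equivalent to "major version is at least 5".

```python
import re
from importlib import metadata


def parse_major(version_string):
    """Return the leading integer of a version string such as '5.0.0.dev0'."""
    match = re.match(r"\s*(\d+)", version_string)
    return int(match.group(1)) if match else 0


def should_apply_adapter(version_string):
    """Hypothetical gate: apply the adapter only for Transformers >= 5.0.0."""
    return parse_major(version_string) >= 5


def installed_transformers_version():
    """Return the installed version string, or None if Transformers is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


version = installed_transformers_version()
apply_patch = version is not None and should_apply_adapter(version)
```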

For VeOmni names, the adapter returns a local kernel-like object exposing:

  • flash_attn_func

  • flash_attn_varlen_func

mapped to local FA2/FA3/FA4 backends:

  • veomni_flash_attention_2_with_sp -> flash_attn.flash_attn_func / flash_attn.flash_attn_varlen_func

  • veomni_flash_attention_3_with_sp -> flash_attn_interface.flash_attn_func / flash_attn_interface.flash_attn_varlen_func

  • veomni_flash_attention_4_with_sp -> flash_attn.cute.flash_attn_func / flash_attn.cute.flash_attn_varlen_func

For simplicity, paged VeOmni aliases (for example paged|veomni_flash_attention_2_with_sp) are not handled by this adapter.

All non-VeOmni implementations are delegated to the original Transformers loader unchanged.
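The interception and delegation described above can be sketched as a wrapper around the original loader. This is a simplified illustration, not the VeOmni source: make_adapter, VEOMNI_BACKENDS, and the fake loader and importer are hypothetical, and the loader signature is an assumption (extra arguments are simply forwarded). The module paths come from the mapping listed above.

```python
from types import SimpleNamespace

# Custom name -> local backend module, following the mapping listed above.
VEOMNI_BACKENDS = {
    "veomni_flash_attention_2_with_sp": "flash_attn",
    "veomni_flash_attention_3_with_sp": "flash_attn_interface",
    "veomni_flash_attention_4_with_sp": "flash_attn.cute",
}


def make_adapter(original_loader, import_module):
    """Wrap original_loader so that VeOmni names bypass hub-kernel lookup.

    import_module is injected (for example importlib.import_module) so the
    sketch can be exercised without any flash-attention package installed.
    """

    def patched_loader(attn_implementation, *args, **kwargs):
        module_path = VEOMNI_BACKENDS.get(attn_implementation)
        if module_path is None:
            # Not an exact VeOmni name (this includes paged aliases):
            # delegate to the original loader unchanged.
            return original_loader(attn_implementation, *args, **kwargs)
        backend = import_module(module_path)
        # Kernel-like object exposing the two expected entry points.
        return SimpleNamespace(
            flash_attn_func=backend.flash_attn_func,
            flash_attn_varlen_func=backend.flash_attn_varlen_func,
        )

    return patched_loader


# Demo with fakes standing in for the real loader and the real backends.
def fake_original_loader(name, *args, **kwargs):
    return f"hub:{name}"


def fake_import_module(path):
    return SimpleNamespace(
        flash_attn_func=lambda *a, **k: f"{path}.flash_attn_func",
        flash_attn_varlen_func=lambda *a, **k: f"{path}.flash_attn_varlen_func",
    )


loader = make_adapter(fake_original_loader, fake_import_module)
kernel = loader("veomni_flash_attention_4_with_sp")
delegated = loader("some-org/some-kernel")
paged = loader("paged|veomni_flash_attention_2_with_sp")
```

In a real patch the wrapper would replace the attribute on transformers.integrations.hub_kernels; the sketch keeps the wrapping step separate so the behavior can be checked in isolation.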

Design Goals

  • Keep VeOmni custom implementation names unchanged.

  • Keep existing VeOmni ALL_ATTENTION_FUNCTIONS.register(...) behavior unchanged.

  • Avoid hub-kernel lookup for VeOmni private names.

  • Minimize patch surface by touching a single integration point.

  • Fail fast with a clear ImportError when the required FA backend is missing.
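The fail-fast goal could look roughly like the following. This is a sketch under assumptions: load_backend is a hypothetical helper, the error wording is illustrative, and the module name used in the demo is deliberately nonexistent so the error path runs on any machine.

```python
import importlib

REQUIRED_FUNCTIONS = ("flash_attn_func", "flash_attn_varlen_func")


def load_backend(module_path, implementation):
    """Import a local FA backend, or raise an ImportError naming what is missing."""
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise ImportError(
            f"{implementation} requires the Python module '{module_path}', "
            "which could not be imported. Install the matching "
            "flash-attention backend."
        ) from exc
    missing = [name for name in REQUIRED_FUNCTIONS if not hasattr(module, name)]
    if missing:
        raise ImportError(
            f"'{module_path}' was imported but does not provide: "
            f"{', '.join(missing)}"
        )
    return module


# Demo: a module that does not exist triggers the clear error immediately.
try:
    load_backend("veomni_docs_nonexistent_backend", "veomni_flash_attention_4_with_sp")
    message = ""
except ImportError as exc:
    message = str(exc)
```

Raising at load time keeps the failure close to its cause, instead of surfacing later as a missing attribute inside the attention call.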

Expected Runtime Behavior

After import veomni:

  • VeOmni custom names remain registered in ALL_ATTENTION_FUNCTIONS.

  • _lazy_imports("veomni_flash_attention_2_with_sp") and _lazy_imports("veomni_flash_attention_4_with_sp") can resolve through the adapter.

  • No spurious “kernel hub name not found” error is raised for VeOmni custom names.

  • Paged VeOmni aliases are outside the adapter scope.

Notes

  • This adapter is a compatibility bridge for Transformers 5.x behavior around flash preload.

  • It does not change VeOmni SP attention semantics.

  • It does not require the kernels Python package for VeOmni custom names.

  • FA2 and FA3 have dedicated branches in _lazy_imports (in both Transformers v4 and v5) and are resolved directly without reaching the hub-kernel path. The adapter is therefore a no-op for those two in practice, but it is kept as a safeguard.

  • FA4 (veomni_flash_attention_4_with_sp) has no such branch in _lazy_imports and always falls through to the hub-kernel path in Transformers v5. The adapter is the critical component that makes FA4 usable on v5.

  • On Transformers v4, FA4 is supported via the VeOmni SP variant (veomni_flash_attention_4_with_sp). Instead of the string name, VeOmni passes a SimpleNamespace object (from _load_veomni_local_flash_kernel) directly to _lazy_imports, which v4 accepts in its kernels-fallback branch via getattr(). The bare flash_attention_4 name still requires Transformers v5; for Transformers v4, use attn_implementation="veomni_flash_attention_4_with_sp".

  • FA4 requires the flash-attn-cute package (flash_attn.cute). To install FA4:

    • Transformers v5: uv sync --extra gpu --extra fa4 --extra transformers5-exp --no-group transformers-stable

    • Transformers v4: uv sync --extra gpu --extra fa4
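The Transformers v4 note above says that a SimpleNamespace object is accepted where a string name would not be, because the fallback branch reads the functions with getattr(). The toy below illustrates that idea only: toy_v4_fallback and the fake functions are hypothetical stand-ins and do not reproduce the real _lazy_imports internals.

```python
from types import SimpleNamespace


def toy_v4_fallback(implementation):
    """Hypothetical fallback that reads the two functions via getattr()."""
    flash_attn_func = getattr(implementation, "flash_attn_func", None)
    flash_attn_varlen_func = getattr(implementation, "flash_attn_varlen_func", None)
    if flash_attn_func is None or flash_attn_varlen_func is None:
        raise ValueError(
            "Could not find the currently requested flash attention "
            f"implementation at {implementation}"
        )
    return flash_attn_func, flash_attn_varlen_func


def fake_flash_attn_func(q, k, v):
    """Placeholder for a dense FA4 entry point."""
    return "dense"


def fake_flash_attn_varlen_func(q, k, v):
    """Placeholder for a variable-length FA4 entry point."""
    return "varlen"


# A kernel-like object carries the functions as attributes, so getattr() works.
kernel_like = SimpleNamespace(
    flash_attn_func=fake_flash_attn_func,
    flash_attn_varlen_func=fake_flash_attn_varlen_func,
)
dense, varlen = toy_v4_fallback(kernel_like)

# A plain string has no such attributes, so the same branch rejects it.
try:
    toy_v4_fallback("veomni_flash_attention_4_with_sp")
    string_result = "resolved"
except ValueError:
    string_result = "rejected"
```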