Name: Gemma-4-31B-IT-NVFP4
Author: NVIDIA

Benchmarks · NVIDIA · Today

nvidia/Gemma-4-31B-IT-NVFP4 · Hugging Face

NVIDIA published benchmark or leaderboard evidence for Gemma-4-31B-IT-NVFP4.

Research · NVIDIA · 2d ago

EarlyTom: Early Token Compression Completes Fast Video Understanding

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency of processing massive numbers of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Compressing visual tokens inside the encoder, rather than only after it, therefore leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

Research · NVIDIA · 3d ago

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64x single-GPU speedup and over 1.52x eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69x and 2.27x speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

Research Papers (10)

EarlyTom: Early Token Compression Completes Fast Video Understanding

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

AVO: Agentic Variation Operators for Autonomous Evolutionary Search

Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
