Name: Gemini 3 Pro
Price: 20 USD
Availability: InStock
Rating: 60.9 (1 review)
Author: Google

Gemini 3 Pro by Google | AI Market Cap

HF Papers | Google | research | 3mo ago

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
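The abstract does not give the scoring formula, so the following is only a minimal sketch of what a group-based non-linear metric could look like. It assumes each group score is the fraction of correct answers raised to a power greater than one. The function names, the exponent, and the sample data are all hypothetical and are not taken from Video-MME-v2.

```python
from typing import Dict, List


def per_question_accuracy(groups: Dict[str, List[bool]]) -> float:
    """Conventional metric: fraction of all questions answered correctly."""
    flags = [flag for answers in groups.values() for flag in answers]
    return sum(flags) / len(flags)


def group_nonlinear_score(groups: Dict[str, List[bool]], power: float = 2.0) -> float:
    """Hypothetical group metric: mean over groups of (fraction correct) ** power.

    With power > 1, a partly correct group earns less than its linear share,
    so scattered or lucky answers are penalized. The exponent is an assumption.
    """
    scores = [(sum(answers) / len(answers)) ** power for answers in groups.values()]
    return sum(scores) / len(scores)


# Two made-up groups of four related questions each.
results = {
    "video_a": [True, True, True, True],    # consistent across the group
    "video_b": [True, False, True, False],  # fragmented correctness
}

linear = per_question_accuracy(results)
grouped = group_nonlinear_score(results)
print(f"per-question accuracy: {linear:.3f}")
print(f"group non-linear score: {grouped:.3f}")
```

Under these assumptions the fragmented group pulls the grouped score below the plain accuracy, which is the kind of penalty the abstract describes.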

#huggingface #daily-papers

HF Papers | Google | research | 3mo ago

Multimodal OCR: Parse Anything from Documents

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
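The abstract mentions an Elo leaderboard without describing how its ratings are computed. As a rough illustration only, the sketch below uses the standard Elo update for a single pairwise comparison. The K-factor of 32, the starting ratings, and the function names are assumptions, and the OCR Arena may use a different variant.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k is an assumed K-factor controlling the step size.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical systems start level; system A wins one comparison.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```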

#huggingface #daily-papers

Gemini 3 Pro

Similar Models

As generative AI tools continue to evolve, we believe it's more important than ever to know what's AI-generated and what isn't. That’s why @GoogleDeepMind launched SynthID in 2023—a technology that ad

Social & Blog Posts (2)

Research Papers (25)

Other

gemini-3-pro-preview - SWE-Bench Verified

Google DeepMind 🤝 @A24 We’re launching a research partnership with A24 to ensure the tools of the future are shaped by the creators who use them. Find out more → https://t.co/KN3HdGVjGS https://t.co/

Gemini 3 Pro - GAIA

Representation Distribution Matching for One-Step Visual Generation

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

Towards Automating Scientific Review with Google's Paper Assistant Tool

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

Structured Distillation of Web Agent Capabilities Enables Generalization

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

GPA: Learning GUI Process Automation from Demonstrations

VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Multimodal OCR: Parse Anything from Documents

StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Intent Laundering: AI Safety Datasets Are Not What They Seem
