Name: Gemini 3 Flash
Price: 20 USD
Availability: InStock
Author: Google

Gemini 3 Flash by Google | AI Market Cap

HF Papers · Google · research · 2w ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Vision-language models (VLMs) increasingly serve as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, producing skill-bank updates that help with future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% versus the offline uniform 8-frame baseline, while improving accuracy in most settings, e.g., by an average of +3.85% and a peak of +15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by 2.9% for Codex (GPT-5.5) and by 3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications: the cascade reduces a 1-hour streaming session from ~3,600 API uploads to only 5-20 calls, and self-evolution lets the agent adapt into a personalized assistant.
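The abstract does not spell out how the cascaded gate works, so the following is only a minimal sketch of the general idea of cascaded frame filtering, not the authors' implementation. It assumes a cheap first stage (mean pixel difference against the last novel frame) and an optional costlier second stage (a caller-supplied relevance score). The names `cascaded_gate`, `mean_abs_diff`, `change_threshold`, and `relevance_threshold`, along with the toy frames, are all hypothetical. The one-frame-per-second rate is inferred from the ~3,600 uploads per hour quoted above.

```python
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]  # toy stand-in: a flat list of pixel intensities in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Cheap change signal: mean absolute per-pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cascaded_gate(
    frames: Sequence[Frame],
    change_threshold: float = 0.05,
    relevance: Optional[Callable[[Frame], float]] = None,
    relevance_threshold: float = 0.5,
) -> List[int]:
    """Return indices of frames that survive both gates (hypothetical design)."""
    kept: List[int] = []
    reference: Optional[Frame] = None  # last frame that passed stage 1
    for idx, frame in enumerate(frames):
        # Stage 1 (cheap): drop frames nearly identical to the reference.
        if reference is not None and mean_abs_diff(frame, reference) < change_threshold:
            continue
        reference = frame
        # Stage 2 (costlier, optional): score only the stage-1 survivors.
        if relevance is not None and relevance(frame) < relevance_threshold:
            continue
        kept.append(idx)
    return kept


# One frame per second for an hour, with a toy scene change every 5 minutes.
frames = [[float((i // 300) % 2)] * 4 for i in range(3600)]
kept = cascaded_gate(frames)
print(len(kept), f"{1 - len(kept) / len(frames):.2%}")  # 12 99.67%
```

In this toy stream only the 12 scene changes are uploaded, which falls inside the 5-20 calls per hour range quoted in the abstract. Real savings would depend on the video content and on the thresholds chosen.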


#huggingface #daily-papers

Gemini 3 Flash

Similar Models

As generative AI tools continue to evolve, we believe it's more important than ever to know what's AI-generated and what isn't. That’s why @GoogleDeepMind launched SynthID in 2023—a technology that ad

Social & Blog Posts (2)

Research Papers (16)

Other

Gemini 3 Flash is now available on Ollama

Google DeepMind 🤝 @A24 We’re launching a research partnership with A24 to ensure the tools of the future are shaped by the creators who use them. Find out more → https://t.co/KN3HdGVjGS https://t.co/

gemini-3-flash-preview - SWE-Bench Verified

Gemini 3 Flash - GAIA

Representation Distribution Matching for One-Step Visual Generation

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Towards Automating Scientific Review with Google's Paper Assistant Tool

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
