Name: Gemini 3.1 Pro
Price: 20 USD
Availability: InStock
Rating: 67.5 (1 reviews)
Author: Google

Gemini 3.1 Pro by Google | AI Market Cap

HF PapersGoogleresearch4w ago

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

View Source

#huggingface#daily-papers

HF PapersGoogleresearch1mo ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Long-context large language models (LLMs)-for example, Gemini-3.1-Pro and Qwen-3.5-are widely used to empower many real-world applications, such as retrieval-augmented generation, autonomous agents, and AI assistants. However, security remains a major concern for their widespread deployment, with threats such as prompt injection and knowledge corruption. To quantify the security risks faced by LLMs under these threats, the research community has developed heuristic-based and optimization-based red-teaming methods. Optimization-based methods generally produce stronger attacks than heuristic attacks and thus provide a more rigorous assessment of LLM security risks. However, they are often resource-intensive, requiring significant computation and GPU memory, especially for long context scenarios. The resource-intensive nature poses a major obstacle for the community (especially academic researchers) to systematically evaluate the security risks of long-context LLMs and assess the effectiveness of defense strategies at scale. In this work, we propose FlashRT, the first framework to improve the efficiency (in terms of both computation and memory) for optimization-based prompt injection and knowledge corruption attacks under long-context LLMs. Through extensive evaluations, we find that FlashRT consistently delivers a 2x-7x speedup (e.g., reducing runtime from one hour to less than ten minutes) and a 2x-4x reduction in GPU memory consumption (e.g., reducing from 264.1 GB to 65.7 GB GPU memory for a 32K token context) compared to state-of-the-art baseline nanoGCG. FlashRT can be broadly applied to black-box optimization methods, such as TAP and AutoDAN. We hope FlashRT can serve as a red-teaming tool to enable systematic evaluation of long-context LLM security. The code is available at: https://github.com/Wang-Yanting/FlashRT

View Source

#huggingface#daily-papers

HF PapersGoogleresearch2mo ago

Building a Precise Video Language with Human-AI Oversight

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/

View Source

#huggingface#daily-papers

Gemini 3.1 Pro

Similar Models

Google DeepMind 🤝 @A24 We’re launching a research partnership with A24 to ensure the tools of the future are shaped by the creators who use them. Find out more → https://t.co/KN3HdGVjGS https://t.co/

Social & Blog Posts6

Research Papers22

Other

gemini-3.1-pro (openrouter) - GAIA

Our Robotics Accelerator has launched with 15 startups helping shape the future of physical AI in Europe. 🤖 This three-month program will connect them with access to our AI stack, Gemini Robotics mod

When millions of AI agents interact with each other, new collective behaviors can emerge. 🌐 Together with @schmidtsciences, @coop_ai, @ARIA_research and supported by @GoogleOrg, we’re launching a $10

Channeling the Fire Horse energy this week 🔥🐎 here’s what we launched: — Gemini 3.1 Pro, a step forward in core intelligence that helps you ta...

gemini-3.1-pro (openrouter) - GAIA

Google DeepMind 🤝 @A24 We’re launching a research partnership with A24 to ensure the tools of the future are shaped by the creators who use them. Find out more → https://t.co/KN3HdGVjGS https://t.co/

Our Robotics Accelerator has launched with 15 startups helping shape the future of physical AI in Europe. 🤖 This three-month program will connect them with access to our AI stack, Gemini Robotics mod

When millions of AI agents interact with each other, new collective behaviors can emerge. 🌐 Together with @schmidtsciences, @coop_ai, @ARIA_research and supported by @GoogleOrg, we’re launching a $10

In Sierra Leone, a surging student population is outpacing available teachers. Our latest research explores how AI can act as a partner to support educators in these environments – amplifying their re

Deep Research and Deep Research Max are our latest autonomous research agents powered by Gemini 3.1 Pro. They can safely navigate both the web and your custom data, like internal docs and specialized

Channeling the Fire Horse energy this week 🔥🐎 here’s what we launched: — Gemini 3.1 Pro, a step forward in core intelligence that helps you ta...

EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Physics-IQ Verified

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Building a Precise Video Language with Human-AI Oversight

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Gemini 3.1 Pro: A smarter model for your most complex tasks