Name: GPT-5.4
Price: 20 USD
Availability: InStock
Rating: 59.3 (1 review)
Author: OpenAI

GPT-5.4 by OpenAI | AI Market Cap

HF Papers | OpenAI | research | 1w ago

How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring

LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic exploration and make navigation more predictable. Starting from a strong baseline, Codex from OpenAI, we systematically inject varying granularities of structural annotations and measure their effects on localization, trajectory behavior, and run-to-run stability. Our study identifies what we call the deterministic anchoring effect: static structure helps less by making agents "smarter" and more by making their navigation disciplined and reproducible. Three observations support this finding: (1) Anchoring works: lightweight call/inheritance topology improves function-level localization (+2.2 pp Func@5) and shortens trajectories (-1.6 interaction rounds); (2) Anchoring is scale-sensitive: the optimal granularity and directionality depend on repository characteristics, where denser semantics show diminishing returns and hub-heavy projects benefit from inverse-only links that expose "who-calls-me" without forward edges; (3) Anchoring stabilizes: tags raise link-following rate from 0.15-0.18 to 0.21-0.24, roughly halve run-to-run variance, and improve single-run reliability (Pass@1 +3.4 pp) on medium-scale repositories, at the cost of roughly 10% more input tokens. These observations suggest practical guidelines: default to lightweight topology on medium projects, prune forward edges in large repositories, and reserve dense tags for implicit-dependency cases.
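To make the idea of "structural facts injected as plain-text comments" concrete, here is a minimal sketch, not the paper's tooling. It uses Python's standard `ast` module to build a call graph over top-level functions and writes it back as comments. The `# ANCHOR` tag format, the function names, and the `inverse_only` flag are all hypothetical, chosen only to illustrate forward edges versus "who-calls-me" links.

```python
import ast

# Hypothetical sample module used only for illustration.
SAMPLE = """def load(path):
    return path

def parse(path):
    return load(path)

def main():
    return parse("x")
"""


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls by name."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {name: set() for name in defined}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        for sub in ast.walk(node):
            # Only direct calls such as f(...) are resolved; attribute calls are skipped.
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                if sub.func.id in defined:
                    graph[node.name].add(sub.func.id)
    return graph


def invert(graph: dict[str, set[str]]) -> dict[str, set[str]]:
    """Turn caller -> callees into callee -> callers ("who-calls-me")."""
    callers: dict[str, set[str]] = {name: set() for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            callers[callee].add(caller)
    return callers


def annotate(source: str, inverse_only: bool = False) -> str:
    """Insert assumed-format '# ANCHOR' comments above each top-level def."""
    graph = build_call_graph(source)
    callers = invert(graph)
    lines = source.splitlines()
    funcs = [n for n in ast.parse(source).body if isinstance(n, ast.FunctionDef)]
    # Work bottom-up so earlier line numbers stay valid after each insertion.
    for node in sorted(funcs, key=lambda n: n.lineno, reverse=True):
        tags = []
        if not inverse_only and graph[node.name]:
            tags.append("# ANCHOR calls: " + ", ".join(sorted(graph[node.name])))
        if callers[node.name]:
            tags.append("# ANCHOR called-by: " + ", ".join(sorted(callers[node.name])))
        lines[node.lineno - 1:node.lineno - 1] = tags
    return "\n".join(lines)
```

Calling `annotate(SAMPLE)` emits both edge directions, while `annotate(SAMPLE, inverse_only=True)` keeps only the `called-by` tags, which loosely mirrors the pruning of forward edges that the abstract recommends for large, hub-heavy repositories. Because the output is derived purely from the syntax tree, it is identical on every run.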

#huggingface #daily-papers

HF Papers | Moonshot AI | research | 1mo ago

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
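The judge-versus-expert agreement figure κ_w is a weighted Cohen's kappa. The sketch below shows how such a statistic can be computed for ordinal scores on a 0-3 rubric. The abstract does not say which weighting scheme was used, so the linear weights here (with quadratic available via `power=2`) are an assumption, and the function name and sample ratings are hypothetical.

```python
def weighted_kappa(rater_a: list[int], rater_b: list[int],
                   num_levels: int = 4, power: int = 1) -> float:
    """Weighted Cohen's kappa for ordinal labels in range(num_levels).

    power=1 gives linear disagreement weights, power=2 quadratic.
    The weighting scheme is an assumption, not taken from the paper.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = num_levels

    # Observed joint proportions of (rating by A, rating by B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions, used for the chance-agreement baseline.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        # Disagreement penalty grows with the distance between the two scores.
        return (abs(i - j) / (k - 1)) ** power

    observed_penalty = sum(weight(i, j) * observed[i][j]
                           for i in range(k) for j in range(k))
    expected_penalty = sum(weight(i, j) * row[i] * col[j]
                           for i in range(k) for j in range(k))
    if expected_penalty == 0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - observed_penalty / expected_penalty
```

A value of 1 means the two raters agree on every item, and a value near 0 means agreement no better than chance given each rater's score distribution. Unlike plain accuracy, a near miss (2 versus 3) is penalized less than a large disagreement (0 versus 3).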

#huggingface #daily-papers

Introducing GeneBench-Pro

Social & Blog Posts (10)

Research Papers (17)

Other

Introducing GPT-5.4

We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment call

Introducing GPT‑5 for developers

GPT5.4 - GAIA

How ChatGPT adoption has expanded

HP Inc. launches Frontier strategic partnership with OpenAI

We’ve designed and built our first AI chip: Jalapeño. Designed from the ground up by OpenAI and brought to production with @Broadcom, Jalapeño is purpose-built for the LLM workloads powering ChatGPT,

OpenAI and Broadcom unveil LLM-optimized inference chip

GPT-5.4 helped drive a medicinal chemistry project from literature review to a validated experimental result. Paired with https://t.co/gcDaph8b2B’s Maria AI and specialized lab, the model proposed an

Earlier this month, an Erdős problem that had been open for 60 years was solved with help from GPT-5.4 Pro. What happens now that AI is getting good at math? OpenAI researchers @SebastienBubeck and @E

We’re expanding Trusted Access for Cyber with additional tiers for authenticated cybersecurity defenders. Customers in the highest tiers can request access to GPT-5.4-Cyber, a version of GPT-5.4 fine-

We're publishing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. We find that GPT-5.4 Thinking shows low ability to obscure its reasoning—suggesting CoT monitoring

WARP: Weight-Space Analysis for Recovering Training Data Portfolios

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring

Forecasting Future Behavior as a Learning Task

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Useful Memories Become Faulty When Continuously Updated by LLMs

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support
