Name: GPT-5.2
Price: 20 USD
Availability: InStock
Rating: 60.2 (1 review)
Author: OpenAI

GPT-5.2 by OpenAI | AI Market Cap

HF Papers | OpenAI | research | 1w ago

How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring

LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic exploration and make navigation more predictable. Starting from a strong baseline, Codex from OpenAI, we systematically inject varying granularities of structural annotations and measure their effects on localization, trajectory behavior, and run-to-run stability. Our study identifies what we call the deterministic anchoring effect: static structure helps less by making agents "smarter" and more by making their navigation disciplined and reproducible. Three observations support this finding: (1) Anchoring works: lightweight call/inheritance topology improves function-level localization (+2.2 pp Func@5) and shortens trajectories (-1.6 interaction rounds); (2) Anchoring is scale-sensitive: the optimal granularity and directionality depend on repository characteristics, where denser semantics show diminishing returns and hub-heavy projects benefit from inverse-only links that expose "who-calls-me" without forward edges; (3) Anchoring stabilizes: tags raise link-following rate from 0.15-0.18 to 0.21-0.24, roughly halve run-to-run variance, and improve single-run reliability (Pass@1 +3.4 pp) on medium-scale repositories, at the cost of roughly 10% more input tokens. These observations suggest practical guidelines: default to lightweight topology on medium projects, prune forward edges in large repositories, and reserve dense tags for implicit-dependency cases.
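The idea of injecting call topology as plain-text comments can be illustrated with a minimal sketch. This is not the paper's tooling: the `# [anchor]` tag format, the function names, and the restriction to top-level functions called by bare name are all assumptions made for illustration. It uses Python's standard `ast` module to derive forward ("calls") and inverse ("called-by") edges, and the `inverse_only` flag mimics the "who-calls-me" variant mentioned for hub-heavy projects.

```python
import ast

# Hypothetical sample module used only to demonstrate the sketch.
SAMPLE = """def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
"""

_FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def build_call_edges(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the bare names it calls directly."""
    edges: dict[str, set[str]] = {}
    for node in ast.parse(source).body:
        if isinstance(node, _FUNC_TYPES):
            callees = set()
            for sub in ast.walk(node):
                # Only plain-name calls like f(x); method calls are ignored here.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)
            edges[node.name] = callees
    return edges


def invert_edges(edges: dict[str, set[str]]) -> dict[str, set[str]]:
    """Build the inverse ("who-calls-me") map for functions defined in the file."""
    callers: dict[str, set[str]] = {name: set() for name in edges}
    for caller, callees in edges.items():
        for callee in callees:
            if callee in callers:
                callers[callee].add(caller)
    return callers


def annotate(source: str, inverse_only: bool = False) -> str:
    """Return the source with anchor comments inserted above each function."""
    edges = build_call_edges(source)
    callers = invert_edges(edges)
    lines = source.splitlines()
    funcs = [n for n in ast.parse(source).body if isinstance(n, _FUNC_TYPES)]
    # Insert from the bottom up so earlier line numbers stay valid.
    for node in sorted(funcs, key=lambda n: n.lineno, reverse=True):
        tags = []
        if not inverse_only:
            local = sorted(c for c in edges[node.name] if c in edges)
            tags.append(f"# [anchor] calls: {', '.join(local) or '-'}")
        who = sorted(callers[node.name])
        tags.append(f"# [anchor] called-by: {', '.join(who) or '-'}")
        lines[node.lineno - 1:node.lineno - 1] = tags
    return "\n".join(lines)


if __name__ == "__main__":
    print(annotate(SAMPLE))
```

Because the tags are ordinary comments, the annotated file still parses and runs unchanged; calling `annotate(SAMPLE, inverse_only=True)` emits only the called-by lines.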

#huggingface #daily-papers

arXiv | DeepSeek | ai | 3mo ago

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety-utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.

#cs.AI

GPT-5.2

Similar Models

Introducing GeneBench-Pro

Social & Blog Posts (6)

Research Papers (10)

Other

gpt-5.2-2025-12-11 - SWE-Bench Verified

gpt-5-2 - SWE-Bench Verified

We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment call

Introducing GPT-5.2

How ChatGPT adoption has expanded

Introducing GeneBench-Pro

HP Inc. launches Frontier strategic partnership with OpenAI

We’ve designed and built our first AI chip: Jalapeño. Designed from the ground up by OpenAI and brought to production with @Broadcom, Jalapeño is purpose-built for the LLM workloads powering ChatGPT,

OpenAI and Broadcom unveil LLM-optimized inference chip

WARP: Weight-Space Analysis for Recovering Training Data Portfolios

How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring

To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support

T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

AI Can Learn Scientific Taste

Introducing GPT‑5 for developers

Models | OpenAI API