Name: DeepSeek-V3
Rating: 76.9 (28000 reviews)
Author: DeepSeek

DeepSeek-V3 by DeepSeek | AI Market Cap

HF Papers · DeepSeek · research · 1w ago

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so tasks can continue beyond finite windows. That is safe only when dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and are the first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures the hidden-state cosine distance between the two runs. On Llama-3.1-70B, plan signal spikes to 0.453 one step after the plan, then falls 4.1x in a single action-observation step; on HotpotQA it falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state, and instead depend on the plan remaining in context. A layer-L32 probe detects this decay as a diagnostic, not as proof that it reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. It recovers +163% of the step+1 signal in-sample and +153% held out, while not meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, while probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load-bearing, but plan protection alone is not enough.
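The two measurement ideas in the abstract, replay pairing and strict stripping, can be illustrated with a minimal sketch. This is not the paper's code: the function names (`cosine_distance`, `plan_signal`, `strip_think_blocks`), the plain-list representation of hidden states, and the assumption that reasoning traces appear as literal `<think>...</think>` spans in each history turn are all hypothetical choices made for illustration.

```python
import math
import re
from typing import List, Sequence

# Assumption: reasoning traces are wrapped in literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def plan_signal(
    with_plan: Sequence[Sequence[float]],
    without_plan: Sequence[Sequence[float]],
) -> List[float]:
    """Per-step distance between paired hidden states.

    Hypothetical layout: element i of each argument is the hidden state
    at step i of the same trajectory, replayed with and without the plan.
    """
    if len(with_plan) != len(without_plan):
        raise ValueError("paired runs must cover the same steps")
    return [cosine_distance(h, g) for h, g in zip(with_plan, without_plan)]


def strip_think_blocks(history: List[str]) -> List[str]:
    """Remove every <think>...</think> span from each prior turn."""
    return [THINK_BLOCK.sub("", turn).strip() for turn in history]
```

In a setup like the one described, `strip_think_blocks` would be applied only to the history of the run without the plan, so that re-derived plan content in earlier reasoning traces does not leak into that condition, and `plan_signal` would then be computed over the paired hidden states of the two runs.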

#huggingface #daily-papers

DeepSeek-V3

Similar Models

Introducing DeepSeek-V3.1: our first step toward the agent era! 🚀 🧠 Hybrid inference: Think & Non-Think — one model, two modes ⚡️ Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs

Social & Blog Posts (5)

Research Papers (4)

Other

deepseek-v3.2-reasoner - SWE-Bench Verified

Models & Pricing

⚡️ Efficiency Gains 🤖 DSA achieves fine-grained sparse attention with minimal impact on output quality — boosting long-context performance & reducing compute cost. 📊 Benchmarks show V3.2-Exp perform

DeepSeek-V3 - LiveCodeBench

The Temperature Parameter

Context Caching is Available 2024/08/02

Information-Aware KV Cache Compression for Long Reasoning

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

DeepSeek-V3 is now available on Ollama

deepseek-v3 — LiveBench Scores
