Name: Qwen3 8B
Price: 0.117 USD
Availability: InStock
Rating: 31.5 (1 review)
Author: Qwen

Qwen3 8B by Qwen | AI Market Cap

HF Papers | Qwen | research | 1mo ago

ECHO: Terminal Agents Learn World Models for Free

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
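The hybrid objective described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `env_weight` coefficient, the mean-based normalization, and the omission of clipping or KL terms are all assumptions made for illustration. It shows only the core idea, that action tokens receive a group-relative policy-gradient term while environment observation tokens receive a cross-entropy term computed from the same per-token log-probabilities.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def echo_style_loss(rollouts, env_weight=0.1):
    """Hypothetical hybrid loss over one group of rollouts.

    Each rollout is a dict with:
      "logprobs":  per-token log-probabilities under the current policy
      "is_action": per-token flags (True = action token,
                   False = environment observation token)
      "reward":    scalar outcome-level reward
    `env_weight` is an assumed mixing coefficient, not a value from the paper.
    """
    advantages = group_relative_advantages([r["reward"] for r in rollouts])
    pg_terms, env_terms = [], []
    for rollout, adv in zip(rollouts, advantages):
        for logprob, is_action in zip(rollout["logprobs"], rollout["is_action"]):
            if is_action:
                # Policy-gradient surrogate on action tokens.
                pg_terms.append(-adv * logprob)
            else:
                # Cross-entropy (negative log-likelihood) on observation tokens.
                env_terms.append(-logprob)
    pg_loss = mean(pg_terms) if pg_terms else 0.0
    env_loss = mean(env_terms) if env_terms else 0.0
    return pg_loss + env_weight * env_loss, pg_loss, env_loss


# Two failed rollouts: identical rewards give zero advantages.
failed_group = [
    {"logprobs": [-0.5, -2.0, -1.0], "is_action": [True, False, False], "reward": 0.0},
    {"logprobs": [-0.3, -1.0], "is_action": [True, False], "reward": 0.0},
]
total, pg_loss, env_loss = echo_style_loss(failed_group, env_weight=0.5)
```

In this toy group every rollout fails, so the policy-gradient term vanishes while the observation term stays positive. This matches the abstract's point that failed rollouts still carry supervision through the environment's responses.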

#huggingface #daily-papers

HF Papers | Qwen | research | 2mo ago

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
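The offline workflow in this abstract (score SFT rollouts with the teacher once, then train without a live teacher server) can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `CachedRollout`, `precompute_teacher_cache`, `check_teacher_consistency`, and `per_token_signal` are hypothetical names, and the teacher-minus-student log-ratio is a commonly used distillation signal that the abstract does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedRollout:
    tokens: tuple
    teacher_logprobs: tuple  # one log-probability per token, computed once
    teacher_id: str          # which teacher produced these scores


def precompute_teacher_cache(sft_rollouts, teacher_logprob_fn, teacher_id):
    """Score every SFT rollout once; the teacher is not needed afterwards."""
    return [
        CachedRollout(tuple(tokens), tuple(teacher_logprob_fn(tokens)), teacher_id)
        for tokens in sft_rollouts
    ]


def check_teacher_consistency(cache, sft_teacher_id):
    """Teacher consistency: cached scores must come from the SFT teacher."""
    mismatched = {r.teacher_id for r in cache if r.teacher_id != sft_teacher_id}
    if mismatched:
        raise ValueError(
            f"cache scored by {sorted(mismatched)}, SFT used {sft_teacher_id!r}"
        )


def per_token_signal(student_logprobs, cached):
    """Assumed distillation signal: log p_teacher - log p_student per token."""
    if len(student_logprobs) != len(cached.teacher_logprobs):
        raise ValueError("student and cached teacher scores differ in length")
    return [t - s for s, t in zip(student_logprobs, cached.teacher_logprobs)]


def toy_teacher(tokens):
    """Stand-in for a real teacher model: a fixed log-probability per token."""
    return [-0.1 for _ in tokens]


cache = precompute_teacher_cache([[1, 2, 3]], toy_teacher, "teacher-A")
check_teacher_consistency(cache, "teacher-A")
signal = per_token_signal([-0.5, -0.1, -0.2], cache[0])
```

Recording the teacher identity alongside each cached score is one simple way to make the consistency condition checkable before training starts. A positive entry in `signal` marks a token the teacher rates more likely than the student does.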

#huggingface #daily-papers

Qwen3 8B

Similar Models

https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct - SWE-Bench Verified

Qwen3 8B is now available on Ollama

Research Papers (18)

Other

Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

ECHO: Terminal Agents Learn World Models for Free

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Synthetic Sandbox for Training Machine Learning Engineering Agents

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula
