gpt-oss-120b by OpenAI | AI Market Cap

gpt-oss-120b

#247 | Large Language Models | Open Weights

OpenAI

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...
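As a quick check on the sparsity these figures imply, the share of parameters active in one forward pass can be computed directly. This is a minimal sketch: the function name is illustrative, and only the two parameter counts come from the description above.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used in a single forward pass."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# Figures quoted in the model description: 117B total, 5.1B active.
fraction = active_fraction(117e9, 5.1e9)
print(f"{fraction:.1%} of parameters are active per forward pass")
```

With the quoted counts, roughly one parameter in 23 participates in each forward pass.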

Running this yourself: likely needs a high-memory cloud GPU.
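A rough lower bound on weight storage, estimated from the parameter count, suggests why a high-memory GPU is the likely requirement. This is a back-of-the-envelope sketch: the bit widths below are illustrative assumptions rather than the model's documented storage format, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 117e9  # parameter count quoted in the model description

# Hypothetical precisions, chosen only to show how the estimate scales.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

In a Mixture-of-Experts model, all expert weights typically need to be resident in memory even though only a fraction is active per pass, so the total parameter count is the relevant input here.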

Model updates refreshed 6h ago | Jun 23, 2026 | news + changelog


Archived

This model is still tracked for research and discovery, but it is excluded from default public rankings until it returns to active status.

Quality Score: 30.6

Arena ELO: ---

Parameters: 120B

Context: 131K


Downloads: 0

Likes: 0

Released: Aug 2025

Launches: 4 (high)

Benchmarks: 4 (high)

Open Source: 1 (medium)

Research: 5 (low)

General: 3 (low)

What Changed Recently

Recent launch, pricing, benchmark, and API signals linked to this model or its provider.

Launches | OpenAI | 6d ago

Introducing LifeSciBench

Social & Blog Posts (9)

Blog | OpenAI | announcement | general

Research Papers (4)

HF Papers | Qwen | research | 2w ago

Other

ollama-library | OpenAI | open source | Today

gpt-oss-120b is now available on Ollama

Similar Models

77.8

Llama 4 Maverick (#77)

qwen3-235b-a22b-instruct-2507 (#225)

Launches | OpenAI | 6d ago

We’re sharing new research on a method for anticipating how models may behave in real-world use before release: simulating deployment with recent, de-identified user requests and studying candidate model responses. https://t.co/7RJzBfNniQ

Launches | OpenAI | 1w ago

Predicting model behavior before release by simulating deployment

Launches | OpenAI | 1w ago

Introducing the OpenAI Partner Network

Benchmarks | Today

gpt-oss-120b - SWE-Bench Verified

SWE-Bench Verified resolved rate: 26.0

Yesterday

Samsung Electronics brings ChatGPT and Codex to employees

#openai #blog #rss

X/Twitter | @OpenAI | OpenAI | research | 4d ago

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure. That’s the idea behind our new research on training models to be broadly and persistently beneficial.

#openai #twitter #x

Blog | OpenAI | announcement | general | 5d ago

Improving health intelligence in ChatGPT

#openai #blog #rss

X/Twitter | @OpenAI | OpenAI | benchmark | 5d ago

Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. Developed with 173 scientists from biotechnology and pharmaceutical research, LifeSciBench includes 750 expert-authored tasks across seven biological research https://t.co/JDkKWcnL9F

#openai #twitter #x

Blog | OpenAI | launch | 6d ago

Introducing LifeSciBench

#openai #blog #rss

X/Twitter | @OpenAI | OpenAI | launch | 6d ago

We’re sharing new research on a method for anticipating how models may behave in real-world use before release: simulating deployment with recent, de-identified user requests and studying candidate model responses. https://t.co/7RJzBfNniQ

#openai #twitter #x

X/Twitter | @OpenAI | OpenAI | announcement | general | 6d ago

Let’s talk about evals. We’re always looking for better ways to measure and forecast model progress, especially as benchmarks get saturated or gamed. @tejalpatwardhan, who leads our frontier evals team, spoke to @andrewmayne about why evals matter and what models need to be https://t.co/Q3oRCuNxYB

#openai #twitter #x

Blog | OpenAI | launch | 1w ago

Predicting model behavior before release by simulating deployment

#openai #blog #rss

Blog | OpenAI | launch | 1w ago

Introducing the OpenAI Partner Network

#openai #blog #rss

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.

#huggingface #daily-papers

HF Papers | Anthropic | research | 1mo ago

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

#huggingface #daily-papers

HF Papers | OpenAI | research | 1mo ago

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost-performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
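The per-instance selection step described in this abstract can be illustrated with a small sketch. Everything here is hypothetical: the `Judge` signature, the `select_best` helper, and the word-overlap `toy_judge` are stand-ins for illustration and are not taken from the RaguTeam code, where an LLM judge makes the choice.

```python
from typing import Callable, Sequence

# Hypothetical signature: a judge maps (question, candidate answer) to a score.
Judge = Callable[[str, str], float]


def select_best(question: str, candidates: Sequence[str], judge: Judge) -> str:
    """Return the candidate the judge scores highest for this instance."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda answer: judge(question, answer))


def toy_judge(question: str, answer: str) -> float:
    """Stand-in judge: counts answer words that also appear in the question."""
    question_words = set(question.lower().split())
    return float(sum(word in question_words for word in answer.lower().split()))


# Each string stands for the output of a different ensemble member.
candidates = [
    "I am not sure.",
    "The capital of France is Paris.",
    "Paris.",
]
best = select_best("What is the capital of France", candidates, toy_judge)
print(best)
```

Because selection happens per instance, the ensemble can draw on a different member for each question, which is one way diversity across model families could pay off.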

#huggingface #daily-papers

HF Papers | OpenAI | research | 3mo ago

OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER-AI-Lab/OpenResearcher.
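To make the three primitives named in this abstract concrete, here is a toy in-memory version. It is a sketch under stated assumptions: the `OfflineBrowser` class, its method signatures, and the term-counting ranking are invented for illustration and do not reflect the OpenResearcher implementation or its retrieval method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OfflineBrowser:
    """Toy in-memory corpus exposing search, open, and find (hypothetical API)."""

    corpus: dict[str, str]
    current: str | None = None

    def search(self, query: str, k: int = 3) -> list[str]:
        """Rank document ids by how many query terms each document contains."""
        terms = query.lower().split()
        scored = [
            (sum(term in text.lower() for term in terms), doc_id)
            for doc_id, text in self.corpus.items()
        ]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [doc_id for _, doc_id in ranked[:k]]

    def open(self, doc_id: str) -> str:
        """Load a document and make it the target of later find calls."""
        text = self.corpus[doc_id]
        self.current = doc_id
        return text

    def find(self, pattern: str) -> list[str]:
        """Return lines of the open document that contain the pattern."""
        if self.current is None:
            raise RuntimeError("open a document before calling find")
        lines = self.corpus[self.current].splitlines()
        return [line for line in lines if pattern.lower() in line.lower()]


browser = OfflineBrowser({
    "doc-a": "Ollama runs models locally.\nIt exposes an HTTP API.",
    "doc-b": "SWE-Bench measures software engineering tasks.",
})
hits = browser.search("ollama api")
browser.open(hits[0])
matches = browser.find("http")
print(hits, matches)
```

Keeping the corpus local means every tool call is deterministic and loggable, which is the property the abstract credits for reproducibility.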

#huggingface #daily-papers

gpt-oss-120b is now available through the local Ollama runtime, with a 128K context window listed. It is one of OpenAI’s open-weight models, designed for powerful reasoning, agentic tasks, and versatile developer use cases.
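For readers who want to try the local runtime, the sketch below builds a request against Ollama's HTTP generate endpoint using only the standard library. The model tag `gpt-oss:120b` is an assumption (confirm it with `ollama list`), and `build_request` and `generate` are illustrative helper names; the code also assumes a server is listening on Ollama's default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gpt-oss:120b"  # assumed tag; confirm with `ollama list`


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize MoE routing in one sentence.")` would return the model's reply as a string once the model has been pulled and the server is running.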

#deployability #ollama #openai

swe-bench | Today

gpt-oss-120b - SWE-Bench Verified

SWE-Bench Verified resolved rate: 26.0

#benchmark #coding #agent

provider-benchmarks | OpenAI | 4d ago

Introducing GPT‑5 for developers

OpenAI, August 7, 2025, Product. The best model for coding and agentic tasks.

#benchmark #provider-reported #official

provider-benchmarks | OpenAI | 1mo ago

Models | OpenAI API


#benchmark #provider-reported #official