Skip to main content

Models Deploy Leaderboards Marketplace

Track, rank, and compare every AI model in the world.

Platform

Models
Deploy
Leaderboards
Compare
News
Marketplace
Workspace
Deployments
Discover Watchlists
Pricing

Categories

LLMs
Image Gen
Vision
Multimodal
Embeddings
Speech
Video
Code
Browser Agents
Specialized

Company

About
Roadmap
Contact
FAQ
Providers
API
Terms
Privacy

© 2026 AI Market Cap. All rights reserved.

Qwen3-235B by Alibaba | AI Market Cap

Qwen3-235B

#68Large Language ModelsOpen Weights

A

Alibaba

Qwen3-235B-A22B is Alibaba's most powerful open model. A 235B parameter MoE with 22B active parameters and seamless switching between thinking and non-thinking modes.

Running this yourself: likely needs a high-memory cloud gpu.

Model updates refreshed1h agoJun 24, 2026news + changelog

Website View Updates Start Free Trial

77.1

Quality Score

1392

Arena ELO

235B

Parameters

131K

Context

Similar Models

DeepSeek-R1#184

Discussion (0)

Sign in to join the discussion

Loading comments...

180.0M

Downloads

15.0K

Likes

Apr 2025

Released

Benchmarks

5

high

Open Source

1

medium

Research

6

low

What Changed Recently

Recent launch, pricing, benchmark, and API signals linked to this model or its provider.

Benchmarks2mo ago

Qwen3-235B-A22B - Arena-Hard-Auto

Arena-Hard-Auto official Gemini-2.5 judged score 58.4 with CI -1.9/2.1

Benchmarks2mo ago

Qwen3-235B-A22B - LiveCodeBench

LiveCodeBench pass@1 80.4 across 1055 tasks

Research Papers6

HF PapersAlibabaresearch1mo ago

Other

ollama-libraryAlibabaopen_sourceopen sourceToday

Qwen3-235B is now available on Ollama

75.6

Llama 4 Maverick#72

DeepSeek-V3.2#283

DeepSeek·Unknown

qwen3-235b-a22b-instruct-2507#168

Benchmarks2mo ago

https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct - SWE-Bench Verified

SWE-Bench Verified resolved rate 69.6

Benchmarks3mo ago

https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct - SWE-Bench Verified

SWE-Bench Verified resolved rate 69.6

Benchmarks10mo ago

Qwen3 - GAIA

GAIA score 44.2 from WA0824

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential `prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41times speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.

#huggingface#daily-papers

HF PapersAnthropicresearch1mo ago

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

#huggingface#daily-papers

HF PapersAlibabaresearch1mo ago

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the LM head is large) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round-robin manner, achieving a near-zero-bubble pipeline. To ensure training correctness and system efficiency, RoundPipe integrates a priority-aware transfer scheduling engine, a fine-grained distributed event-based synchronization protocol, and an automated layer partitioning algorithm. Evaluations on an 8times RTX 4090 server demonstrate that RoundPipe achieves 1.48--2.16times speedups over state-of-the-art baselines when fine-tuning 1.7B to 32B models. Remarkably, RoundPipe enables LoRA fine-tuning of the Qwen3-235B model with 31K sequence length on a single server. RoundPipe is publicly available as an open-source Python library with comprehensive documentation.

#huggingface#daily-papers

HF PapersDeepSeekresearch2mo ago

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.

#huggingface#daily-papers

arXivAlibabanlp3mo ago

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.

#cs.AI#cs.CL#cs.AI

HF PapersAlibabaresearch3mo ago

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.

#huggingface#daily-papers

Qwen3-235B is now available through local Ollama runtime. 40K context window listed. Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models.

#deployability#ollama#alibaba

arena-hard-auto2mo ago

Qwen3-235B-A22B - Arena-Hard-Auto

Arena-Hard-Auto official Gemini-2.5 judged score 58.4 with CI -1.9/2.1

#benchmark#arena-hard-auto#preference

livecodebench2mo ago

Qwen3-235B-A22B - LiveCodeBench

LiveCodeBench pass@1 80.4 across 1055 tasks

#benchmark#coding#livecodebench

swe-bench2mo ago

https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct - SWE-Bench Verified

SWE-Bench Verified resolved rate 69.6

#benchmark#coding#agent

swe-bench3mo ago

https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct - SWE-Bench Verified

SWE-Bench Verified resolved rate 69.6

#benchmark#coding#agent

gaia-benchmark10mo ago

Qwen3 - GAIA

GAIA score 44.2 from WA0824

#benchmark#agent#gaia