Qwen3-235B-A22B - Arena-Hard-Auto
Arena-Hard-Auto (official, Gemini-2.5 as judge): score 58.4, confidence interval -1.9/+2.1
Qwen
Qwen3-235B-A22B-Thinking-2507 is a high-performance, open-weight Mixture-of-Experts (MoE) language model optimized for complex reasoning tasks. It activates 22B of its 235B parameters per forward pass and natively supports up to 262,144...
Running this yourself: likely needs a high-memory cloud GPU.
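To put the hardware note in rough numbers, here is a minimal back-of-envelope sketch. It uses only the parameter counts from the description (235B total, 22B active) and assumes that all expert weights are held in memory, which is typical for MoE inference without offloading. The storage formats and byte widths (`bf16`, `int8`, `int4`) are illustrative assumptions, not something this listing states. The estimate covers weights only and leaves out the KV cache, activations, and runtime overhead.

```python
# Back-of-envelope weight-memory estimate for a Mixture-of-Experts model.
# Parameter counts come from the listing; the byte widths are assumptions.

TOTAL_PARAMS = 235_000_000_000   # all experts are assumed to be resident in memory
ACTIVE_PARAMS = 22_000_000_000   # parameters used per forward pass

# Assumed storage formats (hypothetical choices, not stated in the listing).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def active_fraction(active: int, total: int) -> float:
    """Share of the parameters that take part in a single forward pass."""
    return active / total


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, width):.1f} GB for weights")
    print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

Under these assumptions, the active parameter count mainly affects compute per token, while the total parameter count determines how much memory the weights occupy.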
Quality Score: 60.5
Arena ELO: 1367
Parameters: 235B
Context: 262K
Downloads: 0
Likes: 0
Released: Jul 2025
Benchmarks: 6
Open Source: 1
Research: 1
Recent launch, pricing, benchmark, and API signals linked to this model or its provider.
Arena-Hard-Auto (official, Gemini-2.5 as judge): score 58.4, confidence interval -1.9/+2.1
LiveCodeBench: pass@1 80.4 across 1055 tasks
SWE-Bench Verified: resolved rate 69.6
GAIA: score 44.2 from WA0824
We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
Qwen3 235B A22B Thinking 2507 is now available through the local Ollama runtime, which lists a 40K context window. Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models.