Qwen3.5 - GAIA
GAIA score 32.6 from TJ-0405
Qwen
Qwen3.5-9B is a multimodal foundation model from the Qwen3.5 family, designed to deliver strong reasoning, coding, and visual understanding in an efficient 9B-parameter architecture. It uses a unified vision-language design...
Running this yourself: a desktop GPU should be enough.
Quality Score: 53.6
Arena ELO: ---
Parameters: 9B
Context: ---
Downloads: 8.0M
Likes: 1.5K
Released: Feb 2026
Benchmarks: 5
Open Source: 1
Research: 1
Recent launch, pricing, benchmark, and API signals linked to this model or its provider.
GAIA score 32.6 from TJ-0405
SWE-Bench Verified resolved rate 69.6
GAIA score 44.2 from WA0824
Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.
Qwen3.5-9B is now available through the local Ollama runtime, with a 256K context window listed. Qwen 3.5 is a family of open-source multimodal models that delivers exceptional utility and performance.
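For readers who want to try the local route, the following is a minimal sketch of calling a locally running Ollama server over its HTTP generate endpoint. The model tag `qwen3.5:9b` is an assumption, not something this page states; check `ollama list` for the tag actually installed. The `num_ctx` default is an arbitrary illustrative value, not the listed 256K window.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Hypothetical model tag (assumption); replace with the tag shown by `ollama list`.
MODEL_TAG = "qwen3.5:9b"


def build_request(prompt: str, num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context length requested for this call
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, num_ctx: int = 8192) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(generate("Summarize what an Elo rating measures in one sentence."))
```

Larger `num_ctx` values increase memory use, so a desktop GPU may not be able to hold the full listed window.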