glm-5 - SWE-Bench Verified
SWE-Bench Verified resolved rate 72.8
Z.ai
Z.ai's flagship reasoning and coding model family for long-horizon agentic workflows.
Quality Score: 37.3
Arena ELO: ---
Parameters: Undisclosed
Context: 128K
Downloads: 0
Likes: 0
Released: Feb 2026
Benchmarks: 5
API: 4
Research: 4
General: 2
Recent launch, pricing, benchmark, and API signals linked to this model or its provider.
SWE-Bench Verified resolved rate 72.8
GAIA score 22.9 from Mozi3.5
We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.
While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.
As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
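The figures in the SKILLS abstract above can be sanity-checked with a little arithmetic. The sketch below assumes (this is not stated in the abstract) that each model condition was scored once on each of the 37 scenarios, so that every percentage corresponds to a whole number of passed scenarios. The helper names are hypothetical and used only for illustration.

```python
SCENARIOS = 37         # scenarios in the benchmark, per the abstract
MODEL_CONDITIONS = 5   # open-weight model conditions, per the abstract

# (with-skill pass rate in %, lift over baseline in percentage points),
# copied from the abstract; the fifth model condition is not named there.
REPORTED = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}


def to_count(percent: float, total: int = SCENARIOS) -> int:
    """Nearest whole number of scenarios for a percentage of `total`."""
    return round(percent / 100 * total)


def implied_counts(with_skill_pct: float, lift_pp: float) -> tuple:
    """Return (with-skill passes, baseline passes) implied by the figures.

    Assumption: one run per scenario, so rates are multiples of 1/37.
    """
    passed = to_count(with_skill_pct)
    return passed, passed - to_count(lift_pp)


total_runs = SCENARIOS * MODEL_CONDITIONS
print(f"scenario-runs: {total_runs}")

for model, (pct, lift) in REPORTED.items():
    passed, baseline = implied_counts(pct, lift)
    # Check that the implied count rounds back to the reported percentage.
    consistent = round(100 * passed / SCENARIOS, 1) == pct
    print(f"{model}: {passed}/{SCENARIOS} with skill, "
          f"{baseline}/{SCENARIOS} baseline, consistent={consistent}")
```

Under that assumption, 37 scenarios times 5 model conditions gives the 185 scenario-runs quoted in the abstract, and each reported rate maps back to a whole-scenario count.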
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
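The Full/Shared layer split described in the IndexCache abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the pattern string, `run_layers`, and `toy_indexer` are hypothetical names, the scores are made up, and "nearest Full layer" is assumed here to mean the nearest preceding one.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Positions of the k largest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    pattern: str,
    indexer: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Walk the layers in order.

    'F' (Full) layers run the indexer and cache its top-k indices;
    'S' (Shared) layers reuse the cache from the nearest preceding 'F' layer.
    Returns the indices used at each layer and the number of indexer calls.
    """
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be a Full layer")
    cached: List[int] = []
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for layer, kind in enumerate(pattern):
        if kind == "F":
            cached = top_k_indices(indexer(layer), k)
            indexer_calls += 1
        elif kind != "S":
            raise ValueError(f"unknown layer kind {kind!r}")
        per_layer.append(cached)
    return per_layer, indexer_calls


def toy_indexer(layer: int) -> List[float]:
    """Hypothetical relevance scores over 8 tokens that drift slowly with depth."""
    base = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]
    return [s + 0.01 * layer * (i % 2) for i, s in enumerate(base)]


# One Full layer followed by three Shared layers, repeated twice.
selected, calls = run_layers("FSSSFSSS", toy_indexer, k=3)
print(f"indexer ran on {calls} of {len(selected)} layers")
print(f"indices used at layer 2: {selected[2]}")
```

With one Full layer in every four, the indexer runs on 2 of 8 layers, which corresponds to the 75% reduction in indexer computation quoted in the abstract. How the real method chooses which layers stay Full (greedy search or distillation) is not modeled here.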
GLM-5 is now available through Ollama Cloud. 198K context window listed. A strong reasoning and agentic model from Z.ai with 744B total parameters (40B active), built for complex systems engineering and long-horizon tasks.
Z.ai documents using GLM models inside local agent tooling through the official coding plan.
Z.ai documents using GLM models inside local coding tools like Cline through the official coding endpoint.
Z.ai documents GLM deployment through its coding plan and local-tool workflow integrations for programming assistants.