DeepSeek-V4-Pro - GAIA
GAIA score 65.8 from XWork-MultiAgent
DeepSeek-V4-Pro is an open-weight LLM, listed under unsloth, with a 1,048,576-token context window.
Running this yourself: the model can likely run on your own machine.
Quality Score: not available
Arena ELO: not available
Parameters: Unknown
Context: 1M
Downloads: 859
Likes: 39
Released: Apr 2026
Benchmarks: 2
API: 1
Research: 1
Recent launch, pricing, benchmark, and API signals linked to this model or its provider.
GAIA score 65.8 from XWork-MultiAgent
DeepSeek published benchmark or leaderboard evidence for DeepSeek-V4-Pro.
DeepSeek-V4-Pro is now available through Ollama Cloud, with a 1M context window listed. DeepSeek-V4-Pro is a frontier Mixture-of-Experts model with a 1M-token context window and three reasoning modes.
Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
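To make the K-BrowseComp percentages concrete, the sketch below converts each reported accuracy into an approximate number of solved problems, using the split sizes given in the abstract (300 verified, 100 synthetic). The counts are inferred by rounding and are not stated in the abstract itself; the function and variable names are hypothetical and chosen only for illustration.

```python
def solved_count(accuracy_percent: float, total_problems: int) -> int:
    """Convert an accuracy percentage into the nearest whole number of solved problems."""
    return round(accuracy_percent * total_problems / 100)


# Split sizes as given in the abstract.
VERIFIED_SIZE = 300
SYNTHETIC_SIZE = 100

# Reported accuracies (percent); labels are illustrative, not from the paper.
reported = [
    ("frontier LLMs, lower bound (verified)", 30.00, VERIFIED_SIZE),
    ("frontier LLMs, upper bound (verified)", 45.67, VERIFIED_SIZE),
    ("Korean LLMs, lower bound (verified)", 0.00, VERIFIED_SIZE),
    ("Korean LLMs, upper bound (verified)", 10.33, VERIFIED_SIZE),
    ("strongest model (synthetic)", 26.00, SYNTHETIC_SIZE),
]

for label, accuracy, size in reported:
    # Inferred count: assumes each percentage is a rounded ratio over the split size.
    print(f"{label}: about {solved_count(accuracy, size)} of {size} problems")
```

Under this rounding assumption, the frontier range on the verified subset corresponds to roughly 90 to 137 of 300 problems, which shows that even the best result leaves more than half of the subset unsolved.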