xAI
Grok 4.20 Multi-Agent is a variant of xAI’s Grok 4.20 designed for collaborative, agent-based workflows. Multiple agents operate in parallel to conduct deep research, coordinate tool use, and synthesize information...
---
Quality Score
Arena ELO: 1250
Parameters: Undisclosed
Context: 2M
Downloads: 0
Likes: 0
Released: Mar 2026
Recent launch, pricing, benchmark, and API signals linked to this model or its provider.
Modes: Agent, TTS & STT
Realtime: $3.00 / hour
TTS: $4.20 / 1M characters

Imagine API: Turn ideas into reality with our image and video generation models.

Use case: Model
Chat: Grok 4.20
Coding: Grok 4.20
Images: Grok Imagine API
Videos: Grok Imagine API
Voice: Grok Voice API

Chat API: We strongly recommend all API callers use grok-4.20. Since the agent autonomously decides how many tools to call, costs scale with query complexity.
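The two listed rates ($3.00 per hour for Realtime, $4.20 per 1M characters for TTS) can be turned into a rough estimate as follows. This is a minimal sketch, not the provider's billing logic: the function names are hypothetical, and it assumes usage is prorated linearly with no minimum charge or rounding, which the listing does not confirm. Chat costs are not modeled because the listing gives no per-token price and says only that costs scale with query complexity.

```python
# Rates copied from the listing above.
REALTIME_USD_PER_HOUR = 3.00
TTS_USD_PER_MILLION_CHARS = 4.20


def realtime_cost(seconds: float) -> float:
    """Estimate Realtime cost, ASSUMING linear per-second proration."""
    return REALTIME_USD_PER_HOUR * seconds / 3600.0


def tts_cost(characters: int) -> float:
    """Estimate TTS cost, ASSUMING linear per-character proration."""
    return TTS_USD_PER_MILLION_CHARS * characters / 1_000_000


if __name__ == "__main__":
    # A 15-minute realtime session and 250,000 characters of speech.
    print(f"Realtime, 15 min: ${realtime_cost(15 * 60):.2f}")
    print(f"TTS, 250k chars:  ${tts_cost(250_000):.2f}")
```

Actual invoices may differ if billing uses coarser increments than this sketch assumes.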
benchmarks. Because our models push the frontier of AI capabilities, we are committed to mitigating Our approach to safety evaluations focuses on measuring specific safety-relevant behaviors relevant to our current evaluation methodology, results, and mitigations for these various behaviors.

With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations to attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models acting as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We utilize this controlled environment to train paired ground-truth models on identical architectures: M_{pure}, which learns robust spatial-relational reasoning and M_{spur}, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts, highlighting a critical gap in capturing true cross-modal synergy and misrepresenting how multimodal models actually make decisions.
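The paired-model idea in the abstract (a faithful explainer must report distinct reasoning pathways for M_{pure} and M_{spur}) can be illustrated with a toy check. This is only a sketch under stated assumptions: the abstract does not say how divergence is measured, so the cosine-similarity comparison, the 0.5 threshold, the function names, and the attribution vectors below are all hypothetical and not taken from GridVQA-X.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length attribution vectors."""
    if len(a) != len(b):
        raise ValueError("attribution vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero attribution as dissimilar to anything
    return dot / (norm_a * norm_b)


def distinguishes(attr_pure, attr_spur, max_similarity=0.5):
    """True if the explainer's attributions for the two models diverge.

    max_similarity is an arbitrary illustrative threshold (an assumption).
    """
    return cosine_similarity(attr_pure, attr_spur) < max_similarity


if __name__ == "__main__":
    # Made-up attributions over four input features for the same sample.
    attr_pure = [0.7, 0.1, 0.1, 0.1]  # hypothetical explanation of M_pure
    attr_spur = [0.1, 0.1, 0.1, 0.7]  # hypothetical explanation of M_spur
    print(distinguishes(attr_pure, attr_spur))
```

An explainer that returns near-identical attributions for both models would fail this check, which is the kind of failure the abstract reports for widely used methods.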