Name: CodeLlama 70B
Rating: 59.9 (25000 reviews)
Author: Meta

CodeLlama 70B by Meta | AI Market Cap

HF Papers | Meta | research | 1mo ago

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
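The "analytic memory floor" and the "expected 4x weight-traffic reduction" mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is not the paper's code. It assumes the floor is simply the bytes streamed per decode step (weights plus KV cache) divided by peak memory bandwidth. The Qwen-2.5-7B shape (about 7.61B parameters, 28 layers, 4 KV heads, head dimension 128) and the L4 peak bandwidth (300 GB/s) are assumptions taken from public model cards and spec sheets, not from the abstract. The function names are hypothetical. Only the ms/step latencies are quoted from the abstract.

```python
# Sketch: analytic memory floor for batch-1 decode, plus realised speedups.
# Model shape and peak bandwidth are ASSUMPTIONS (public model card / spec
# sheet values), not numbers stated in the abstract.


def memory_floor_ms(n_params, n_layers, n_kv_heads, head_dim, ctx_len,
                    peak_bw_gb_s, bytes_per_elem=2):
    """Lower bound on one decode step: bytes streamed / peak bandwidth."""
    weight_bytes = n_params * bytes_per_elem
    # K and V tensors, per layer, per KV head, per cached token.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    seconds = (weight_bytes + kv_bytes) / (peak_bw_gb_s * 1e9)
    return seconds * 1e3


def realised_speedup(baseline_ms, variant_ms):
    """How much faster a variant is than the baseline (1.0 = no gain)."""
    return baseline_ms / variant_ms


# Assumed Qwen-2.5-7B shape in bf16 at ctx=2048 on an assumed 300 GB/s L4.
floor_l4_ms = memory_floor_ms(
    n_params=7.61e9, n_layers=28, n_kv_heads=4, head_dim=128,
    ctx_len=2048, peak_bw_gb_s=300.0,
)

# Latencies quoted in the abstract for L4 (ms per decode step).
BF16_BASELINE_MS = 62.32
QUANTISED_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}
speedups = {
    name: realised_speedup(BF16_BASELINE_MS, ms)
    for name, ms in QUANTISED_MS.items()
}

if __name__ == "__main__":
    print(f"assumed L4 memory floor: {floor_l4_ms:.2f} ms/step")
    for name, s in speedups.items():
        print(f"{name}: {s:.2f}x vs bf16 (ideal int4 weight traffic: ~4x)")
```

Under these assumptions the floor comes out near 51 ms/step, and the quoted latencies correspond to speedups of about 1.05x (bnb-nf4), 1.38x (AutoAWQ+Marlin) and 3.59x (GPTQ+ExLlamaV2) against the roughly 4x that a pure weight-traffic argument would predict. The floor value depends on the assumed parameter count and bandwidth, so it need not reproduce the paper's percentages exactly.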

#huggingface #daily-papers

CodeLlama 70B is now available on Ollama

Research Papers (10)

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

The Price of Anarchy in Disaggregated Inference

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Base Models Look Human To AI Detectors

Context Memorization for Efficient Long Context Generation

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency