Name: Llama 3.1 8B Instruct
Price: 0.02 USD
Availability: InStock
Author: Meta

Llama 3.1 8B Instruct by Meta | AI Market Cap

HF Papers | Meta | research | 1mo ago

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete.

We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains.

We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs.

The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
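The arithmetic behind the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's method: the abstract does not give the exact definition of its analytic memory floor, so the sketch takes the simplest reading (weights plus KV cache streamed once per step at peak bandwidth). The parameter count, layer count, KV-head count, head dimension and L4 peak bandwidth are assumed example values that do not come from the abstract. Only the L4 latencies are taken from the text above.

```python
# Sketch only. The model shape and peak bandwidth below are ASSUMED example
# values; they are not stated in the abstract.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Bytes of keys plus values (factor 2) cached for ctx_len positions."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def decode_memory_floor_ms(weight_bytes, kv_bytes, peak_bw_bytes_per_s):
    """Lower bound on one batch-1 decode step, in milliseconds, if every
    weight and KV byte is read exactly once at peak memory bandwidth."""
    return (weight_bytes + kv_bytes) / peak_bw_bytes_per_s * 1e3


def speedup(baseline_ms, variant_ms):
    """How many times faster the variant is than the baseline."""
    return baseline_ms / variant_ms


# Assumed shape for a 7B-class GQA model in bf16 (2 bytes per element).
ASSUMED_PARAMS = 7.6e9
ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM = 28, 4, 128
ASSUMED_L4_PEAK_BW = 300e9  # bytes per second, assumed datasheet-style figure

kv = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, ctx_len=2048)
floor_ms = decode_memory_floor_ms(ASSUMED_PARAMS * 2, kv, ASSUMED_L4_PEAK_BW)

# L4 per-step latencies quoted in the abstract.
L4_BF16_MS = 62.32
L4_QUANT_MS = {"bnb-nf4": 59.36, "AutoAWQ+Marlin": 45.24, "GPTQ+ExLlamaV2": 17.36}
speedups = {name: speedup(L4_BF16_MS, ms) for name, ms in L4_QUANT_MS.items()}

if __name__ == "__main__":
    print(f"assumed floor: {floor_ms:.2f} ms/step")
    for name, s in speedups.items():
        print(f"{name}: {s:.2f}x over bf16")
```

Dividing the quoted latencies shows the gap the abstract describes: against a nominal 4x reduction in weight traffic, the first two quantised paths recover well under 2x, and only GPTQ+ExLlamaV2 comes close.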

Llama 3.1 8B Instruct

Similar Models

Llama 3.1 - SWE-Bench Verified

Research Papers (20)

Other

Llama 3.1 - SWE-Bench Verified

llama-3.1 - GAIA

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

RoPE-Aware Bit Allocation for KV-Cache Quantization

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

The Price of Anarchy in Disaggregated Inference

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Base Models Look Human To AI Detectors

Context Memorization for Efficient Long Context Generation

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Large Language Models Align with the Human Brain during Creative Thinking

Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages