Name: Kimi K2.5
Price: 0.375 USD
Availability: InStock
Rating: 54.2 (1 reviews)
Author: Moonshot AI

Kimi K2.5 by Moonshot AI | AI Market Cap

HF Papers | Moonshot AI | research | 1mo ago

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
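The abstract reports judge-versus-expert agreement as a weighted kappa (κ_w) on the 0-3 rubric. As a rough illustration of how such a statistic can be computed, here is a minimal sketch of weighted Cohen's kappa. It is not the authors' scoring script: the abstract does not say whether linear or quadratic weights were used, so the `power` parameter, the function name `weighted_kappa`, and the toy ratings are all assumptions made for this example.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels=4, power=1):
    """Weighted Cohen's kappa for ordinal labels 0..num_levels-1.

    power=1 gives linear disagreement weights, power=2 quadratic.
    Which weighting the paper used is not stated; this is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    marg_a = Counter(rater_a)                  # marginal counts for rater A
    marg_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist = (num_levels - 1) ** power

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = abs(i - j) ** power / max_dist
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if exp_disagreement == 0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return 1.0 - obs_disagreement / exp_disagreement


# Toy ratings on the 0-3 rubric (hypothetical, not from the paper).
judge = [0, 1, 2, 3, 3, 2]
expert = [0, 1, 2, 3, 2, 2]
print(round(weighted_kappa(judge, expert), 3))
```

A value of 1 means the two raters agree perfectly, 0 means agreement no better than chance given their marginal label frequencies, and negative values mean systematic disagreement.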

#huggingface #daily-papers

Kimi K2.5

Similar Models

kimi-k2.5 - SWE-Bench Verified

Model List - Kimi API Open Platform

Research Papers (8)

Other

Kimi K2.5 is now available on Ollama Cloud

Welcome to Kimi API Docs - Kimi API Platform

ProCUA-SFT Technical Report

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

ETCHR: Editing To Clarify and Harness Reasoning

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

VeriGrey: Greybox Agent Validation

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
