An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.
17 results for: benchmarks
MPMMine standardizes benchmarks for constraint-acquisition research
An arXiv preprint introduces MPMMine, a benchmark suite built to supply the domain artifacts and structured data constraint-acquisition methods need for reproducible evaluation.
NVIDIA Vera CPU Runs Fast and Sustained in Early Phoronix Tests
Initial Phoronix benchmarks published on NVIDIA's blog show the Vera CPU delivers the fast cores, memory bandwidth and full-core throughput targeted at agentic AI workloads.
Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks
An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
Authors Release OpenEval and Demand Item-Level Benchmark Standards
A position paper argues that AI evaluation must publish item-level benchmark responses and ships OpenEval (10M model responses across 155k items) to prove the point.
New Study Reveals Limits of Model-Level Evaluations in Alignment Assessments
A recent paper argues that alignment evaluation cannot rely solely on model-level assessments.
Research Proposes MedCheck Framework to Enhance Medical AI Benchmarks
A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.
ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex
ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.
New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI
A recent study critiques Shapley values, finding a misalignment between evaluation metrics and human utility.
AI Models Show Risks for Biological Misuse Amid Evolving Safeguards
Recent benchmarks reveal AI models may enable biological weaponization by low-expertise users, raising urgent policy concerns.
Xiaomi Launches MiMo-V2.5-Pro and MiMo-V2.5 at Lower Costs
Xiaomi's new MiMo models post frontier-level benchmark results while significantly reducing token costs.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance with a significantly reduced size.
Paper evaluates LLMs on Vietnamese legal text with a dual-aspect framework
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model and a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.
Merge GNN Predictions with LLM Reasoning in GLOW for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
MiniMax Open-Sources M2.7, Its First Self-Evolving Agent
MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE-Pro and 57.0% on Terminal Bench 2.