11 results for: Benchmarks
A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.
ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex
ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.
New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI
A recent study critiques Shapley values, finding misalignment between evaluation metrics and human utility.
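For context, Shapley values attribute a model's prediction to individual features by averaging each feature's marginal contribution across all coalitions. Below is a minimal sketch of the exact computation on a toy value function; the feature names and value function are illustrative, not drawn from the audited benchmarks:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: average each feature's marginal
    contribution over every coalition of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

# Toy value function standing in for a model: "age" contributes
# additively, while "income" and "zip" only pay off together.
def toy_value(coalition):
    score = 1.0 if "age" in coalition else 0.0
    if "income" in coalition and "zip" in coalition:
        score += 2.0  # interaction term split between the pair
    return score

print(shapley_values(["age", "income", "zip"], toy_value))
# -> {'age': 1.0, 'income': 1.0, 'zip': 1.0}
```

Exact computation enumerates every coalition and so scales exponentially, which is why practical explainers rely on the approximations that such benchmarks evaluate.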
AI Models Show Risks for Biological Misuse Amid Evolving Safeguards
Recent benchmarks reveal AI models may enable biological weaponization by low-expertise users, raising urgent policy concerns.
Xiaomi Launches MiMo-V2.5-Pro and MiMo-V2.5 at Lower Costs
Xiaomi's new MiMo models post frontier-level benchmark results while significantly reducing token costs.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance at a significantly smaller size.
New Benchmark Evaluates LLMs on Vietnamese Legal Text with a Dual-Aspect Framework
An arXiv paper introduces a dual-aspect benchmark for Vietnamese legal text, pairing quantitative scoring with error analysis, and compares GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
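A dual-aspect evaluation of this kind typically pairs an aggregate quantitative score with a per-example error breakdown. Here is a minimal sketch of such a harness; the data format, error categories, and stub model are assumptions for illustration, not the paper's actual framework:

```python
from collections import Counter

def evaluate(examples, model_fn):
    """Dual-aspect evaluation: an aggregate accuracy score plus a
    tally of failures by error category (simplified here to a
    pre-annotated label; real error analysis is done post hoc)."""
    correct = 0
    errors = Counter()
    for ex in examples:
        pred = model_fn(ex["question"])
        if pred.strip().lower() == ex["answer"].strip().lower():
            correct += 1
        else:
            errors[ex.get("error_category", "uncategorized")] += 1
    return {"accuracy": correct / len(examples), "errors": dict(errors)}

# Hypothetical usage with a stub model that always cites one article:
examples = [
    {"question": "Which article governs land leases?",
     "answer": "Article 179", "error_category": "citation"},
    {"question": "What is the statutory notice period?",
     "answer": "30 days", "error_category": "reasoning"},
]
print(evaluate(examples, lambda q: "Article 179"))
# -> {'accuracy': 0.5, 'errors': {'reasoning': 1}}
```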
AllenAI Launches vla-eval to Unify Vision-Language-Action Benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
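Decoupling inference from simulator execution generally means the model runs behind a small server that receives observations and returns actions over the wire. A minimal sketch of such a model server using the Python `websockets` and `msgpack` packages; the message schema and stub action are assumptions, not vla-eval's actual protocol:

```python
import asyncio
import msgpack
import websockets

async def handle(ws):
    """Serve one simulator connection: unpack each observation,
    run (stub) model inference, and send an action back."""
    async for raw in ws:
        obs = msgpack.unpackb(raw)
        # A real model server would run VLA inference on the
        # observation's image and instruction fields here.
        action = {
            "delta_pose": [0.0] * 6,
            "gripper": 0.0,
            "instruction_seen": obs.get("instruction"),
        }
        await ws.send(msgpack.packb(action))

async def main():
    # The simulator (often in a separate Docker container) connects
    # here and streams observations; inference stays isolated.
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```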
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model, along with a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.
GLOW Merges GNN Predictions with LLM Reasoning for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
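A common way to pair a GNN with an LLM for open-world KG QA is to let the GNN rank candidate entities over the incomplete graph and let the LLM reason over the top-ranked shortlist in its prompt. A minimal sketch of that fusion step; the scoring stub and prompt format are assumptions, not GLOW's published method:

```python
def gnn_score(question, candidates):
    """Stub for a pre-trained GNN link predictor: a real system
    would run message passing over the knowledge graph and score
    each candidate entity's plausibility for this question."""
    return {c: 1.0 / (rank + 1) for rank, c in enumerate(candidates)}

def build_prompt(question, scores, top_k=3):
    """Fusion step: hand the LLM only the GNN's top-k candidates,
    ranked, so it reasons over a short graph-grounded shortlist."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
    lines = [f"- {entity} (graph score {s:.2f})" for entity, s in ranked]
    return (
        f"Question: {question}\n"
        "Candidates ranked by a knowledge-graph model:\n"
        + "\n".join(lines)
        + "\nPick the best answer and justify it briefly."
    )

question = "Which river flows through Hanoi?"
candidates = ["Red River", "Mekong", "Perfume River"]
print(build_prompt(question, gnn_score(question, candidates)))
# The resulting prompt would be sent to the LLM for the final answer.
```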
MiniMax Open-Sources M2.7, Its First Self-Evolving Agent
MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE-Pro and 57.0% on Terminal Bench 2.