11 results for: Benchmarks
A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.
ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex
ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.
New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI
A recent study critiques Shapley values, finding misalignment between evaluation metrics and human utility.
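For context, Shapley values attribute a model's prediction to individual features by averaging each feature's marginal contribution across all coalitions. Below is a minimal sketch of the exact computation on a toy value function; the feature names and value function are illustrative, not drawn from the audited benchmarks:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: average each feature's marginal
    contribution over every coalition of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

# Toy value function standing in for a model: "age" contributes
# additively, while "income" and "zip" only pay off together.
def toy_value(coalition):
    score = 1.0 if "age" in coalition else 0.0
    if "income" in coalition and "zip" in coalition:
        score += 2.0  # interaction term split between the pair
    return score

print(shapley_values(["age", "income", "zip"], toy_value))
# -> {'age': 1.0, 'income': 1.0, 'zip': 1.0}
```

Exact computation enumerates every coalition and so scales exponentially, which is why practical explainers rely on the approximations that such benchmarks evaluate.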
AI Models Show Risks for Biological Misuse Amid Evolving Safeguards
Recent benchmarks reveal AI models may enable biological weaponization by low-expertise users, raising urgent policy concerns.
Xiaomi Launches MiMo-V2.5-Pro and MiMo-V2.5 at Lower Costs
Xiaomi's new MiMo models post frontier-level benchmark results while significantly reducing token costs.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance at a significantly smaller size.
New Benchmark Evaluates LLMs on Vietnamese Legal Text with a Dual-Aspect Framework
An arXiv paper introduces a dual-aspect benchmark for Vietnamese legal text, pairing quantitative scoring with error analysis, and compares GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
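A dual-aspect evaluation of this kind typically pairs an aggregate quantitative score with a per-example error breakdown. Here is a minimal sketch of such a harness; the data format, error categories, and stub model are assumptions for illustration, not the paper's actual framework:

```python
from collections import Counter

def evaluate(examples, model_fn):
    """Dual-aspect evaluation: an aggregate accuracy score plus a
    tally of failures by error category (simplified here to a
    pre-annotated label; real error analysis is done post hoc)."""
    correct = 0
    errors = Counter()
    for ex in examples:
        pred = model_fn(ex["question"])
        if pred.strip().lower() == ex["answer"].strip().lower():
            correct += 1
        else:
            errors[ex.get("error_category", "uncategorized")] += 1
    return {"accuracy": correct / len(examples), "errors": dict(errors)}

# Hypothetical usage with a stub model that always cites one article:
examples = [
    {"question": "Which article governs land leases?",
     "answer": "Article 179", "error_category": "citation"},
    {"question": "What is the statutory notice period?",
     "answer": "30 days", "error_category": "reasoning"},
]
print(evaluate(examples, lambda q: "Article 179"))
# -> {'accuracy': 0.5, 'errors': {'reasoning': 1}}
```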
AllenAI Launches vla-eval to Unify Vision-Language-Action Benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
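Decoupling inference from simulator execution generally means the model runs behind a small server that receives observations and returns actions over the wire. A minimal sketch of such a model server using the Python `websockets` and `msgpack` packages; the message schema and stub action are assumptions, not vla-eval's actual protocol:

```python
import asyncio
import msgpack
import websockets

async def handle(ws):
    """Serve one simulator connection: unpack each observation,
    run (stub) model inference, and send an action back."""
    async for raw in ws:
        obs = msgpack.unpackb(raw)
        # A real model server would run VLA inference on the
        # observation's image and instruction fields here.
        action = {
            "delta_pose": [0.0] * 6,
            "gripper": 0.0,
            "instruction_seen": obs.get("instruction"),
        }
        await ws.send(msgpack.packb(action))

async def main():
    # The simulator (often in a separate Docker container) connects
    # here and streams observations; inference stays isolated.
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```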
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model, along with a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.
GLOW Merges GNN Predictions with LLM Reasoning for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
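A common way to pair a GNN with an LLM for open-world KG QA is to let the GNN rank candidate entities over the incomplete graph and let the LLM reason over the top-ranked shortlist in its prompt. A minimal sketch of that fusion step; the scoring stub and prompt format are assumptions, not GLOW's published method:

```python
def gnn_score(question, candidates):
    """Stub for a pre-trained GNN link predictor: a real system
    would run message passing over the knowledge graph and score
    each candidate entity's plausibility for this question."""
    return {c: 1.0 / (rank + 1) for rank, c in enumerate(candidates)}

def build_prompt(question, scores, top_k=3):
    """Fusion step: hand the LLM only the GNN's top-k candidates,
    ranked, so it reasons over a short graph-grounded shortlist."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
    lines = [f"- {entity} (graph score {s:.2f})" for entity, s in ranked]
    return (
        f"Question: {question}\n"
        "Candidates ranked by a knowledge-graph model:\n"
        + "\n".join(lines)
        + "\nPick the best answer and justify it briefly."
    )

question = "Which river flows through Hanoi?"
candidates = ["Red River", "Mekong", "Perfume River"]
print(build_prompt(question, gnn_score(question, candidates)))
# The resulting prompt would be sent to the LLM for the final answer.
```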
MiniMax Open-Sources M2.7, Its First Self-Evolving Agent
MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE-Pro and 57.0% on Terminal Bench 2.