Search: Bench | CurrentLens.com

Models & Launches

Google Releases Gemini-SQL2; Gemini 3.1 Pro Scores 80.04% on BIRD

CurrentLens
Jun 13, 2026

Google Research announced Gemini-SQL2, a Gemini 3.1 Pro-powered text-to-SQL capability that posted 80.04% execution accuracy on the BIRD single-model leaderboard.

Models & Launches

DKPS method cuts model-evaluation queries using cached responses

CurrentLens
Jun 6, 2026

An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.

Open Source & Research

MPMMine standardizes benchmarks for constraint-acquisition research

CurrentLens
May 27, 2026

An arXiv preprint introduces MPMMine, a benchmark suite built to supply the domain artifacts and structured data constraint-acquisition methods need for reproducible evaluation.

Chips & Infrastructure

NVIDIA Vera CPU Runs Fast and Sustained in Early Phoronix Tests

CurrentLens
May 27, 2026

Initial Phoronix benchmarks published on NVIDIA's blog show the Vera CPU delivers the fast cores, memory bandwidth and full-core throughput targeted at agentic AI workloads.

Open Source & Research

Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks

CurrentLens
May 25, 2026

An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.

Models & Launches

Authors Release OpenEval and Demand Item-Level Benchmark Standards

CurrentLens
May 25, 2026

A position paper argues AI evaluation must publish item-level benchmark responses and ships OpenEval (10M model responses across 155k items) to prove the point.

Models & Launches

New Study Reveals Limits of Model-Level Evaluations in Alignment Assessments

CurrentLens
May 8, 2026

A recent paper argues that alignment evaluation cannot rely solely on model-level assessments.

Open Source & Research

OpenClassGen Provides Extensive Python Classes for LLM Research

CurrentLens
May 3, 2026

OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.

Open Source & Research

RPC-Bench Introduces Fine-Grained Benchmark for Research Paper Comprehension

CurrentLens
May 1, 2026

RPC-Bench is a new fine-grained benchmark that targets gaps in how well AI models comprehend academic papers.

Science & Healthcare

Research Proposes MedCheck Framework to Enhance Medical AI Benchmarks

CurrentLens
Apr 30, 2026

A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.

Open Source & Research

ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex

CurrentLens
Apr 30, 2026

ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.

Open Source & Research

New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI

CurrentLens
Apr 28, 2026

A recent study critiques Shapley value benchmarks, finding a misalignment between evaluation metrics and human utility.

Models & Launches

AI Models Show Risks for Biological Misuse Amid Evolving Safeguards

CurrentLens
Apr 24, 2026

Recent benchmarks reveal AI models may enable biological weaponization by low-expertise users, raising urgent policy concerns.

Models & Launches

Xiaomi Launches MiMo-V2.5-Pro and MiMo-V2.5 at Lower Costs

CurrentLens
Apr 23, 2026

Xiaomi's new MiMo models reach frontier-level benchmark scores while significantly reducing token costs.

AI in Coding

Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks

CurrentLens
Apr 23, 2026

The new Qwen 3.6-27B model delivers superior coding performance with a significantly reduced size.

Open Source & Research

RARE Introduces Framework for Evaluating High-Similarity Document Retrieval

CurrentLens
Apr 23, 2026

The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in legal and financial sectors.

Open Source & Research

Paper Evaluates LLMs on Vietnamese Legal Text With a Dual-Aspect Framework

CurrentLens
Apr 21, 2026

An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.

Models & Launches

OpenAI Releases ChatGPT Images 2.0

CurrentLens
Apr 21, 2026

OpenAI published ChatGPT Images 2.0; Simon Willison ran a Where's‑Waldo‑style prompt to compare it with gpt-image-1 and rival models.

Models & Launches

AllenAI launches vla-eval to unify Vision-Language-Action benchmarking

CurrentLens
Apr 21, 2026

vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.

Models & Launches

Qwen3.6-35B-A3B bests Claude Opus 4.7 on Willison's pelican test

CurrentLens
Apr 16, 2026

Simon Willison reports that a local, quantized Qwen3.6-35B-A3B run produced better pelican and flamingo illustrations than Anthropic's Claude Opus 4.7.

Science & Healthcare

EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks

CurrentLens
Apr 16, 2026

EVE publishes EVE-Instruct, a 24B Mistral-based model and a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.

Open Source & Research

Merge GNN Predictions with LLM Reasoning in GLOW for Open-World QA

CurrentLens
Apr 16, 2026

GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.

Open Source & Research

MiniMax Open-Sources M2.7, Its First Self-Evolving Agent

CurrentLens
Apr 13, 2026

MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE‑Pro and 57.0% on Terminal Bench 2.

Latest
Trending

Agents & Automation

OpenAI Launches Three Academy Courses on Agents and Workflows

CurrentLens
Jun 13, 2026

OpenAI released three Academy courses focused on practical AI skills, building repeatable workflows, and applying agents in everyday work.

Science & Healthcare

Africa CDC and WHO launch $518M continental Ebola response plan

CurrentLens
Jun 6, 2026

A six-month 'One Response' plan targets the Bundibugyo Ebola outbreak with unified coordination, surveillance, clinical care and community engagement across affected countries.

Policy & Safety

HASC adds right-to-repair language to FY27 defense policy bill

CurrentLens
Jun 6, 2026

The House Armed Services Committee inserted right-to-repair provisions into its FY27 defense policy draft, aiming to ease barriers that limit troops' ability to fix equipment.

AI Creative

Startups Pull Users Off Phones With In-Person Games and DIY Cyberdecks

CurrentLens
Jun 6, 2026

TechCrunch highlights founders building physical social products: Board raised funding for in-person games, and cyberdeck DIYs are going viral.

Agents & Automation

MicroPython WASM Sandbox Enables Safer Datasette Plugin Execution

CurrentLens
Jun 6, 2026

Simon Willison published an alpha MicroPython-in-WASM sandbox (micropython-wasm) and a Datasette plugin (datasette-agent-micropython) to run plugin code with constrained access.

23 results for: Bench

MiniMax Open-Sources M2.7, Its First Self-Evolving Agent

OpenAI pushes to lock users and expand enterprise in internal memo

NVIDIA Launches Ising AI Models to Tackle Noisy Qubits

Microsoft Tests OpenClaw-Style Agents for Copilot

Anthropic Briefed Trump Administration on Mythos, Co‑Founder Confirms