Search: benchmarking | CurrentLens.com

Open Source & Research

OpenClassGen Provides Extensive Python Classes for LLM Research

CurrentLens
May 3, 2026

OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.

Open Source & Research

RPC-Bench Introduces Fine-Grained Benchmark for Research Paper Comprehension

CurrentLens
May 1, 2026

RPC-Bench addresses gaps in understanding academic papers for AI models with a new benchmark.

Open Source & Research

ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex

CurrentLens
Apr 30, 2026

ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.

Open Source & Research

RARE Introduces Framework for Evaluating High-Similarity Document Retrieval

CurrentLens
Apr 23, 2026

The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in legal and financial sectors.

Models & Launches

OpenAI Releases ChatGPT Images 2.0

CurrentLens
Apr 21, 2026

OpenAI published ChatGPT Images 2.0; Simon Willison ran a Where's‑Waldo‑style prompt to compare it with gpt-image-1 and rival models.

Models & Launches

AllenAI launches vla-eval to unify Vision-Language-Action benchmarking

CurrentLens
Apr 21, 2026

vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.

Latest
Trending

Science & Healthcare

Africa CDC and WHO launch $518M continental Ebola response plan

CurrentLens
Jun 6, 2026

A six-month 'One Response' plan targets the Bundibugyo Ebola outbreak with unified coordination, surveillance, clinical care and community engagement across affected countries.

Policy & Safety

HASC adds right-to-repair language to FY27 defense policy bill

CurrentLens
Jun 6, 2026

The House Armed Services Committee inserted right-to-repair provisions into its FY27 defense policy draft, aiming to ease barriers that limit troops' ability to fix equipment.

AI Creative

Startups Pull Users Off Phones With In-Person Games and DIY Cyberdecks

CurrentLens
Jun 6, 2026

TechCrunch highlights founders building physical social products: Board raised funding for in-person games, and cyberdeck DIYs are going viral.

Agents & Automation

MicroPython WASM Sandbox Enables Safer Datasette Plugin Execution

CurrentLens
Jun 6, 2026

Simon Willison published an alpha MicroPython-in-WASM sandbox (micropython-wasm) and a Datasette plugin (datasette-agent-micropython) to run plugin code with constrained access.

Models & Launches

DKPS method cuts model-evaluation queries using cached responses

CurrentLens
Jun 6, 2026

An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.