An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.
PIGMENT extends quantitative diffusion MRI to sparse, multi-site and low-field scans
A physics-informed foundation model called PIGMENT learns a universal microstructure prior and adapts zero-shot to individual diffusion MRI scans, enabling reliable maps from sparse and heterogeneous data.
MPMMine standardizes benchmarks for constraint-acquisition research
An arXiv preprint introduces MPMMine, a benchmark suite built to supply the domain artifacts and structured data constraint-acquisition methods need for reproducible evaluation.
ATOM Report Finds Chinese Open Models Overtook Western Peers in 2025
A new ATOM analysis of about 1,500 open language models maps downloads, derivatives, inference share and performance, and reports Chinese models surpassed U.S.
Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks
An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
Authors Release OpenEval and Demand Item-Level Benchmark Standards
A position paper argues AI evaluation must publish item-level benchmark responses and ships OpenEval (10M model responses across 155k items) to prove the point.
OpenClassGen Provides Extensive Python Classes for LLM Research
OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.
RPC-Bench Introduces Fine-Grained Benchmark for Research Paper Comprehension
RPC-Bench addresses gaps in understanding academic papers for AI models with a new benchmark.
Research Proposes MedCheck Framework to Enhance Medical AI Benchmarks
A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.
New LLM Framework Enhances Mathematical Reasoning Evaluation
A novel LLM-based framework provides flexible evaluation of mathematical reasoning, addressing limitations of symbolic methods.
Test-Time Matching Enhances Compositional Reasoning in Multimodal Models
A new test-time matching method improves compositional reasoning in AI models, achieving state-of-the-art results.
DenoiseRank Introduces Generative Approach to Learning to Rank
DenoiseRank leverages diffusion models for a fresh generative angle on learning to rank tasks.
OpenCLAW-P2P v6.0 Enhances Decentralized AI Peer Review with New Features
OpenCLAW-P2P v6.0 introduces advanced subsystems for decentralized AI peer review, improving paper resilience and retrieval.
Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
OpenAI Makes ChatGPT Free for Verified U.S. Healthcare Professionals
OpenAI has announced that verified U.S. physicians, nurse practitioners, and pharmacists can now access ChatGPT for Clinicians at no charge.
Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview
Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early Claude Mythos Preview in collaboration with Anthropic.
Benchmark evaluates LLMs on Vietnamese legal text with a dual-aspect framework
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.
Full fine-tuning concentrates LLM attribution in code-compliance models
An arXiv study uses perturbation-based attribution to compare FFT, LoRA, and quantized LoRA across model sizes and finds FFT yields more focused interpretive patterns.
Simon Willison maps Claude system prompts into a Git commit timeline
Simon Willison turned Anthropic’s published Claude system prompts into per-model Markdown files with fake git commits so changes can be browsed on GitHub.
NVIDIA Launches Ising Open Models to Accelerate Quantum-Processor Development
NVIDIA introduced Ising, a family of open-source quantum AI models intended to help researchers and enterprises design quantum processors that can run useful applications.
OpenAI Launches GPT-Rosalind to Accelerate Life-Sciences Research
OpenAI introduced GPT-Rosalind, a frontier reasoning model aimed at speeding drug discovery, genomics, protein reasoning, and scientific workflows.
Researchers Build an Index to Measure the Human Relationship with Nature
Conservationists are moving from exclusionary models toward metrics that count human stewardship alongside ecological health.
Merge GNN Predictions with LLM Reasoning in GLOW for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.