An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.
21 results for: evaluation
MPMMine standardizes benchmarks for constraint-acquisition research
An arXiv preprint introduces MPMMine, a benchmark suite built to supply the domain artifacts and structured data that constraint-acquisition methods need for reproducible evaluation.
Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks
An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
Authors Release OpenEval and Demand Item-Level Benchmark Standards
A position paper argues that AI evaluation must publish item-level benchmark responses and ships OpenEval (10M model responses across 155k items) to prove the point.
Multimodal LLMs Underperform in Real-World Dermatology Evaluation
A new study reveals that multimodal large language models struggle with clinical dermatology tasks.
New Study Reveals Limits of Model-Level Evaluations in Alignment Assessments
A recent paper argues that alignment evaluation cannot rely solely on model-level assessments.
OpenClassGen Provides Extensive Python Classes for LLM Research
OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.
Aymara AI Launches Safety Evaluation System for 20 Language Models
Aymara AI unveils a platform for custom safety evaluations of large language models, revealing performance gaps.
Research Proposes MedCheck Framework to Enhance Medical AI Benchmarks
A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.
ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex
ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.
Experts Assess LLM Performance on Japanese Bar Exam's Open-Ended Tasks
A new study evaluates LLMs' legal reasoning using the Japanese bar exam's writing component.
New LLM Framework Enhances Mathematical Reasoning Evaluation
A novel LLM-based framework provides flexible evaluation of mathematical reasoning, addressing limitations of symbolic methods.
New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI
A recent study critiques Shapley values, finding a misalignment between evaluation metrics and human utility.
Test-Time Matching Enhances Compositional Reasoning in Multimodal Models
A new test-time matching method improves compositional reasoning in AI models, achieving state-of-the-art results.
Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
RARE Introduces Framework for Evaluating High-Similarity Document Retrieval
The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in legal and financial sectors.
RepIt Framework Enables Concept-Specific Refusal in Language Models
A new framework exposes vulnerabilities in language model safety evaluations through concept-specific manipulations.
Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview
Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early Claude Mythos Preview in collaboration with Anthropic.
arXiv paper evaluates LLMs on Vietnamese legal text with a dual-aspect framework
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
GLOW Merges GNN Predictions with LLM Reasoning for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.