AI Quick Take
- Paper argues aggregate scores hide validity problems and that item-level response data must be standard evaluation infrastructure.
- OpenEval, an item-level archive of 10M responses over 155k items under a unified schema, demonstrates how such data reveals low-quality items and construct misalignment.
A position paper on arXiv argues that AI evaluation must stop hiding behind aggregate scores and instead make item-level benchmark responses a default part of evaluation infrastructure. To demonstrate feasibility and value, the authors published OpenEval: an item-level archive that contains 10 million model responses covering 155,000 items from widely used benchmarks, organized under a unified schema.
The paper identifies three recurring failures of current benchmarking practice (underspecified item selection, construct misalignment between tests and what they claim to measure, and weak generalization) and links those failures to the field’s focus on summary metrics. With item-level data, evaluators can identify bad or ambiguous items, document where benchmarks diverge from their stated constructs, and recover validity evidence about a benchmark’s internal structure. The authors argue that standardized releases unlock transparency, replicability, and auditability that aggregate scores cannot provide.
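To make the "identify bad or ambiguous items" point concrete, here is a minimal sketch of a classical item analysis over item-level responses. It is an illustration under stated assumptions, not the paper's method or OpenEval's schema: the `responses` layout, the function names, and the toy data (five hypothetical models, five hypothetical items) are all invented for this example. For each item it computes the proportion of models answering correctly and an item-rest correlation (the correlation between an item's score and each model's total on the remaining items), then flags items whose correlation is zero, negative, or undefined.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns None when either series has no variance."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def item_analysis(responses):
    """responses: {model_name: {item_id: 0 or 1}} (hypothetical layout)."""
    models = sorted(responses)
    items = sorted(responses[models[0]])
    report = {}
    for item in items:
        scores = [responses[m][item] for m in models]
        # Total on all *other* items, so the item is not correlated with itself.
        rest = [
            sum(v for k, v in responses[m].items() if k != item)
            for m in models
        ]
        report[item] = {
            "proportion_correct": mean(scores),
            "discrimination": pearson(scores, rest),
        }
    return report


def flag_items(report, min_discrimination=0.0):
    """Items that fail to separate stronger models from weaker ones."""
    return sorted(
        item
        for item, stats in report.items()
        if stats["discrimination"] is None
        or stats["discrimination"] <= min_discrimination
    )


# Toy data, invented for illustration: q1..q4 get harder in order, while
# q5_miskeyed is answered "correctly" only by the weakest models.
responses = {
    "model_a": {"q1": 0, "q2": 0, "q3": 0, "q4": 0, "q5_miskeyed": 1},
    "model_b": {"q1": 1, "q2": 0, "q3": 0, "q4": 0, "q5_miskeyed": 1},
    "model_c": {"q1": 1, "q2": 1, "q3": 0, "q4": 0, "q5_miskeyed": 1},
    "model_d": {"q1": 1, "q2": 1, "q3": 1, "q4": 0, "q5_miskeyed": 0},
    "model_e": {"q1": 1, "q2": 1, "q3": 1, "q4": 1, "q5_miskeyed": 0},
}

report = item_analysis(responses)
print(flag_items(report))  # ['q5_miskeyed']
```

None of this is possible from an aggregate score alone: the flagged item still contributes to every model's headline number, and only the per-item responses reveal that it ranks models in the opposite order from the rest of the benchmark.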
The proposal carries operational and political implications. Model developers and benchmark curators would need to adopt schemas and provenance practices for response-level data, and organizations that rely on benchmark claims, such as research labs, regulators, and procurement teams, would gain tools to verify those claims. The paper also addresses common objections such as dataset contamination and author burden, concluding that these concerns are tractable relative to the harm of decisions based on unverifiable metrics. The critical next step is community uptake: whether major benchmarks, conference testbeds, or platform providers accept item-level release as a norm, and how the field standardizes schemas and access controls to balance transparency with legal and privacy constraints.