A position paper argues that AI evaluations must publish item-level benchmark responses, and it ships OpenEval (10M model responses across 155k items) to prove the point.
New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI
A recent study critiques Shapley values, finding a misalignment between evaluation metrics and human utility.