The open-source harness standardizes integrations (one predict() for models, a four-method benchmark interface), speeds cross-evaluation by up to 47×, and publishes a 657-entry VLA leaderboard.
AI Quick Take
- vla-eval standardizes VLA evaluations by separating model inference from benchmark execution via WebSocket+msgpack and Docker.
- Parallel episode sharding and batch inference yield up to 47× wall-clock speedups; the project reproduces published scores and publishes a 657-result leaderboard.
AllenAI published vla-eval, an open-source evaluation harness that standardizes Vision-Language-Action (VLA) benchmarking by decoupling model inference from simulator execution. The project uses a WebSocket+msgpack protocol combined with Docker-based environment isolation, so models and benchmarks run in separate environments and users are spared from resolving conflicting dependencies or reverse-engineering undocumented preprocessing for each benchmark.
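The decoupled design amounts to a request/response exchange between a benchmark process and a model server. The sketch below shows that round-trip; `json` stands in for msgpack to keep it dependency-free (msgpack's `packb`/`unpackb` offer a similar API over binary frames), and the field names are illustrative, not the project's actual wire schema.

```python
import json  # stand-in for msgpack; the real harness sends msgpack over a WebSocket

def encode(msg: dict) -> bytes:
    # Serialize a message for the wire (msgpack would produce a compact binary frame)
    return json.dumps(msg).encode("utf-8")

def decode(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))

# Benchmark side: pack an observation and send it to the model server
request = encode({
    "type": "predict",
    "observation": {"image_shape": [224, 224, 3], "instruction": "pick up the mug"},
})

# Model side: decode the request, run inference, reply with an action vector
msg = decode(request)
assert msg["type"] == "predict"
response = encode({"type": "action", "action": [0.1, -0.2, 0.0, 0.0, 0.0, 0.0, 1.0]})
```

Because each side only sees serialized messages, the model server and the benchmark container never need to share a Python environment.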
The framework requires models to implement a single predict() method and benchmarks to expose a four-method interface, enabling automatic pairing across the full cross-evaluation matrix. vla-eval currently supports 14 simulation benchmarks and six model servers, and adds two parallelization features, episode sharding and batch inference, which the authors report yield up to 47× wall-clock speedups (for example, completing 2,000 LIBERO episodes in about 18 minutes).
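In code, that pairing contract might look like the following sketch. The single predict() requirement comes from the announcement; the four benchmark method names used here (reset, step, is_done, result) are hypothetical placeholders, since the article does not name them.

```python
from typing import Any, Protocol

class Model(Protocol):
    # The one method the harness requires of every model server.
    def predict(self, observation: dict[str, Any]) -> list[float]: ...

class Benchmark(Protocol):
    # Hypothetical four-method benchmark interface; the real method
    # names are not given in the article.
    def reset(self, episode_id: int) -> dict[str, Any]: ...
    def step(self, action: list[float]) -> dict[str, Any]: ...
    def is_done(self) -> bool: ...
    def result(self) -> dict[str, float]: ...

def run_episode(model: Model, benchmark: Benchmark, episode_id: int) -> dict[str, float]:
    # With both sides reduced to these interfaces, any model can be paired
    # with any benchmark in the cross-evaluation matrix automatically.
    obs = benchmark.reset(episode_id)
    while not benchmark.is_done():
        obs = benchmark.step(model.predict(obs))
    return benchmark.result()

# Minimal stand-ins to show the generic runner in action
class DummyModel:
    def predict(self, observation):
        return [0.0] * 7  # a no-op 7-DoF action

class DummyBenchmark:
    def __init__(self):
        self.t = 0
    def reset(self, episode_id):
        self.t = 0
        return {"step": 0}
    def step(self, action):
        self.t += 1
        return {"step": self.t}
    def is_done(self):
        return self.t >= 3
    def result(self):
        return {"success": 1.0}

print(run_episode(DummyModel(), DummyBenchmark(), 0))  # → {'success': 1.0}
```

The structural-typing (`Protocol`) style means existing model codebases need only add a predict() wrapper rather than inherit from a harness base class.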
To validate the harness, the team reproduced published scores across six VLA codebases and three benchmarks, recording the undocumented pitfalls encountered along the way. The project also publishes a VLA leaderboard aggregating 657 published results across 17 benchmarks. All artifacts, evaluation configurations, and reproduction results are available at the project's GitHub repo and its public leaderboard site.
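The episode sharding mentioned earlier is conceptually simple: partition the episode list into disjoint shards and run each shard in a parallel worker. A dependency-free sketch (worker count and the per-shard runner are illustrative; in vla-eval each shard would drive a benchmark container):

```python
from concurrent.futures import ThreadPoolExecutor

def shard(episode_ids: list[int], n_workers: int) -> list[list[int]]:
    # Round-robin split: each worker gets a disjoint subset of episodes.
    return [episode_ids[i::n_workers] for i in range(n_workers)]

def run_shard(ids: list[int]) -> int:
    # Placeholder for evaluating one shard in its own environment;
    # here every episode trivially completes.
    return len(ids)

episodes = list(range(2000))   # e.g. the 2,000 LIBERO episodes cited above
shards = shard(episodes, 8)    # 8 parallel workers (illustrative)

with ThreadPoolExecutor(max_workers=8) as pool:
    completed = sum(pool.map(run_shard, shards))

# Sharding must neither drop nor duplicate episodes
assert sorted(e for s in shards for e in s) == episodes
print(completed)  # → 2000
```

Wall-clock time then scales roughly with the slowest shard rather than the full episode count, which is where the reported up-to-47× speedups come from when combined with batched inference on the model server.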