OpenClassGen introduces a comprehensive dataset of Python classes for evaluating LLMs.
6 results for: benchmarking
RPC-Bench Introduces Fine-Grained Benchmark for Research Paper Comprehension
RPC-Bench is a new benchmark that addresses gaps in how AI models comprehend academic papers.
ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex
ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.
RARE Introduces Framework for Evaluating High-Similarity Document Retrieval
The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in the legal and financial sectors.
OpenAI Releases ChatGPT Images 2.0
OpenAI released ChatGPT Images 2.0; Simon Willison ran a Where's-Waldo-style prompt to compare it with gpt-image-1 and rival models.
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.