10 results for: benchmark
Xiaomi MiMo Models Reach Frontier Benchmarks at Lower Token Cost
Xiaomi's new MiMo models match frontier benchmark performance while significantly reducing token costs.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance at a significantly smaller size.
RARE Introduces Framework for Evaluating High-Similarity Document Retrieval
The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in legal and financial sectors.
Dual-Aspect Benchmark Evaluates LLMs on Vietnamese Legal Text
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
OpenAI Releases ChatGPT Images 2.0
OpenAI published ChatGPT Images 2.0; Simon Willison ran a Where's‑Waldo‑style prompt to compare it with gpt-image-1 and rival models.
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
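The decoupling described above can be sketched as a simple request/response exchange in which the simulator treats the model as a remote service. This is an illustrative sketch only: the message field names and handler functions below are assumptions, and `json` stands in for msgpack so the example stays stdlib-only.

```python
import json

# Hypothetical message framing for a decoupled evaluation protocol.
# Field names ("type", "step", "cmd") are illustrative, not vla-eval's
# actual wire format, which uses msgpack over WebSocket.

def encode(msg: dict) -> bytes:
    # json stands in for msgpack here to keep the sketch stdlib-only
    return json.dumps(msg).encode()

def decode(data: bytes) -> dict:
    return json.loads(data.decode())

def model_server(request: bytes) -> bytes:
    """Model side: receives an observation, returns an action."""
    obs = decode(request)
    action = {"type": "action", "step": obs["step"], "cmd": "move_arm"}
    return encode(action)

def run_episode(steps: int = 3) -> list:
    """Simulator side: drives the episode, calling the model per step."""
    actions = []
    for step in range(steps):
        request = encode({"type": "observation", "step": step, "pixels": []})
        reply = decode(model_server(request))  # stands in for a WebSocket round-trip
        actions.append(reply["cmd"])
    return actions

print(run_episode())
```

Because the simulator only sees serialized messages, the model process can run in its own Docker container and speak any language that can encode the same frames.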
Qwen3.6-35B-A3B bests Claude Opus 4.7 on Willison's pelican test
Simon Willison reports that a local, quantized Qwen3.6-35B-A3B run produced better pelican and flamingo illustrations than Anthropic's Claude Opus 4.7.
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model, along with a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.
GLOW Merges GNN Predictions with LLM Reasoning for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
MiniMax Open-Sources M2.7, Its First Self-Evolving Agent
MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE‑Pro and 57.0% on Terminal Bench 2.