An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.
Category: Models & Launches
Major model releases, flagship updates, launches, benchmarks and product unveilings.
PIGMENT extends quantitative diffusion MRI to sparse, multi-site and low-field scans
A physics-informed foundation model called PIGMENT learns a universal microstructure prior and adapts zero-shot to individual diffusion MRI scans, enabling reliable maps from sparse and heterogeneous data.
ATOM Report Finds Chinese Open Models Overtook Western Peers in 2025
A new ATOM analysis of about 1,500 open language models maps downloads, derivatives, inference share and performance, and reports Chinese models surpassed U.S.
Authors Release OpenEval and Demand Item-Level Benchmark Standards
A position paper argues AI evaluation must publish item-level benchmark responses and ships OpenEval (10M model responses across 155k items) to prove the point.
New Study Reveals Limits of Model-Level Evaluations in Alignment Assessments
A recent paper argues that alignment evaluation cannot rely solely on model-level assessments.
Aymara AI Launches Safety Evaluation System for 20 Language Models
Aymara AI unveils a platform for custom safety evaluations of large language models, revealing performance gaps.
Goodfire Launches Silico, a New Tool for Debugging LLMs
Silico allows developers to fine-tune AI model parameters during training, enhancing control.
Investors Fund Skye's AI Home Screen App Ahead of iPhone Launch
Skye's AI home screen application secures investor backing pre-launch, highlighting interest in smarter iPhones.
Microsoft Launches VibeVoice, a New Speech-to-Text Model
Microsoft introduces VibeVoice, a Whisper-style speech-to-text model with speaker diarization.
Test-Time Matching Enhances Compositional Reasoning in Multimodal Models
A new test-time matching method improves compositional reasoning in AI models, achieving state-of-the-art results.
OpenAI Introduces Parameter Golf in Model Craft Initiative
OpenAI's latest initiative, Parameter Golf, aims to refine model performance metrics.
DenoiseRank Introduces Generative Approach to Learning to Rank
DenoiseRank leverages diffusion models for a fresh generative angle on learning to rank tasks.
Nemobot Introduces Strategic AI Agents for Interactive Gaming
Nemobot leverages large language models to create customizable AI agents for strategic games.
AI Models Show Risks for Biological Misuse Amid Evolving Safeguards
Recent benchmarks reveal AI models may enable biological weaponization by low-expertise users, raising urgent policy concerns.
Xiaomi Launches MiMo-V2.5-Pro and MiMo-V2.5 at Lower Costs
Xiaomi's new MiMo models achieve frontier benchmarks while reducing token costs significantly.
OpenAI Makes ChatGPT Free for Verified U.S. Healthcare Professionals
OpenAI has announced that verified U.S. physicians, nurse practitioners, and pharmacists can now access ChatGPT for Clinicians at no charge.
RepIt Framework Enables Concept-Specific Refusal in Language Models
A new framework exposes vulnerabilities in language model safety evaluations through concept-specific manipulations.
OpenAI Adds Codex-Powered Workspace Agents to ChatGPT
OpenAI introduced workspace agents in ChatGPT: Codex-powered cloud agents designed to automate complex workflows and scale team work across tools securely.
Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview
Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early Claude Mythos Preview in collaboration with Anthropic.
Full fine-tuning concentrates LLM attribution in code-compliance models
An arXiv study uses perturbation-based attribution to compare FFT, LoRA, and quantized LoRA across model sizes and finds FFT yields more focused interpretive patterns.
OpenAI Releases ChatGPT Images 2.0
OpenAI published ChatGPT Images 2.0; Simon Willison ran a Where's‑Waldo‑style prompt to compare it with gpt-image-1 and rival models.
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
Anthropic updates Claude Opus 4.7 system prompt with new tools and tighter safety guidance
Anthropic revised the Claude Opus 4.7 system prompt to add a PowerPoint agent, expand child-safety rules, and change interaction guidance.
Anthropic Ships Claude Opus 4.7 for Agentic Coding and High‑Res Vision
Anthropic released Claude Opus 4.7, a focused successor to Opus 4.6 that emphasizes agentic software engineering, high-resolution vision and long-horizon autonomy.
Anthropic ships Claude Opus 4.7 as its most powerful generally available model
Opus 4.7 arrives as Anthropic’s strongest generally available Claude release, claiming upgrades for advanced coding, image analysis and instruction following.
OpenAI Debuts GPT-Rosalind for Drug Discovery and Genomics
OpenAI launched GPT-Rosalind, its first life‑sciences model aimed at accelerating drug discovery and genomic analysis and cutting long development timelines.
Qwen3.6-35B-A3B bests Claude Opus 4.7 on Willison's pelican test
Simon Willison reports that a local, quantized Qwen3.6-35B-A3B run produced better pelican and flamingo illustrations than Anthropic's Claude Opus 4.7.
llm-anthropic 0.25 Adds Claude-Opus-4.7 with xhigh thinking_effort
Simon Willison released llm-anthropic 0.25, which ships claude-opus-4.7 supporting thinking_effort: xhigh and new thinking flags.
Google Launches Gemini 3.1 Flash TTS with 70+ Language, Multi‑Speaker Support
Gemini 3.1 Flash TTS is a preview that refocuses Google’s speech work on expressive control, natural‑language audio tags, and native multilingual, multi‑speaker output.
DeepMind Ships Gemini Robotics‑ER 1.6 for Physical Robot Reasoning
Gemini Robotics‑ER 1.6 adds instrument-reading plus improved visual, spatial and planning skills to DeepMind's embodied-reasoning model for robots.
NVIDIA Launches Ising AI Models to Tackle Noisy Qubits
NVIDIA unveiled Ising, an open family of AI models with Calibration and Decoding domains designed to help build fault-tolerant quantum processors.
OpenAI pushes to lock in users and expand enterprise in internal memo
CRO Denise Dresser told staff to prioritize user retention and enterprise sales and to build a product 'moat' as users easily switch between top models.