A new study reveals that multimodal large language models struggle with clinical dermatology tasks.
46 results for: Model
Pentagon Sees Opportunities in Frontier AI Models Despite Mythos Concerns
Defense officials are discussing frontier AI models, focusing on potential benefits amidst risks raised by Mythos.
New Study Reveals Limits of Model-Level Evaluations in Alignment Assessments
A recent paper argues that alignment evaluation cannot solely rely on model-level assessments.
RPC-Bench Introduces Fine-Grained Benchmark for Research Paper Comprehension
RPC-Bench addresses gaps in understanding academic papers for AI models with a new benchmark.
Aymara AI Launches Safety Evaluation System for 20 Language Models
Aymara AI unveils a platform for custom safety evaluations of large language models, revealing performance gaps.
Elon Musk Reveals xAI Trained Grok Using OpenAI Models
Elon Musk testified that xAI used OpenAI's models to enhance its Grok AI, raising regulatory questions.
Research Proposes MedCheck Framework to Enhance Medical AI Benchmarks
A new framework aims to improve the assessment of medical AI benchmarks, addressing key shortcomings.
Goodfire Launches Silico, a New Tool for Debugging LLMs
Silico allows developers to fine-tune AI model parameters during training, enhancing control.
NVIDIA Nemotron 3 Nano Omni Model Launches on Amazon SageMaker JumpStart
NVIDIA now offers the Nemotron 3 Nano Omni model on Amazon SageMaker JumpStart for enterprise use.
AI Firms Limit Access to Models Amid Rising Dual-Use Risks
Leading AI companies restrict access to advanced models like GPT-Rosalind due to safety concerns.
Pentagon Integrates Google’s AI Model into GenAI.mil Amid Rising Usage
The Pentagon has incorporated Google's latest AI model into GenAI.mil as user engagement surges.
Microsoft Launches VibeVoice, a New Speech-to-Text Model
Microsoft introduces VibeVoice, a Whisper-style speech-to-text model with speaker diarization.
Test-Time Matching Enhances Compositional Reasoning in Multimodal Models
A new test-time matching method improves compositional reasoning in AI models, achieving state-of-the-art results.
Civitai Launches High-Fidelity Studious Scout LoRA for Fortnite
Civitai releases the Studious Scout 🎒 LoRA for Fortnite, designed for flexibility and character consistency.
OpenAI Introduces Parameter Golf in Model Craft Initiative
OpenAI's latest initiative, Parameter Golf, aims to refine model performance metrics.
NVIDIA Optimizes Jetson Memory Efficiency to Power Physical AI
NVIDIA reveals enhancements in Jetson's memory management, enabling larger AI models at the edge.
DenoiseRank Introduces Generative Approach to Learning to Rank
DenoiseRank leverages diffusion models for a fresh generative angle on learning to rank tasks.
Claude Code Addresses Quality Complaints After Internal Review
Claude Code's recent quality issues stem from three specific bugs, not from the models themselves.
Nemobot Introduces Strategic AI Agents for Interactive Gaming
Nemobot leverages large language models to create customizable AI agents for strategic games.
AI Models Show Risks for Biological Misuse Amid Evolving Safeguards
Recent benchmarks reveal AI models may enable biological weaponization by low-expertise users, raising urgent policy concerns.
NVIDIA Advances Optimizers to Speed Up LLM Training
NVIDIA introduces new higher-order optimizers to enhance training efficiency for large language models.
Xiaomi Launches MiMo-V2.5-Pro and MiMo-V2.5 at Lower Costs
Xiaomi's new MiMo models reach frontier-level benchmark scores while significantly reducing token costs.
ChatGPT Images 2.0 Excels in Text Generation Capabilities
OpenAI's ChatGPT Images 2.0 model showcases a surprising proficiency in text generation.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance with a significantly reduced size.
RepIt Framework Enables Concept-Specific Refusal in Language Models
A new framework exposes vulnerabilities in language model safety evaluations through concept-specific manipulations.
Full fine-tuning concentrates LLM attribution in code-compliance models
An arXiv study uses perturbation-based attribution to compare FFT, LoRA, and quantized LoRA across model sizes and finds FFT yields more focused interpretive patterns.
OpenAI Releases ChatGPT Images 2.0
OpenAI published ChatGPT Images 2.0; Simon Willison ran a Where's‑Waldo‑style prompt to compare it with gpt-image-1 and rival models.
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
Anthropic updates Claude Opus 4.7 system prompt with new tools and tighter safety guidance
Anthropic revised the Claude Opus 4.7 system prompt to add a PowerPoint agent, expand child-safety rules, and change interaction guidance.
Anthropic Ships Claude Opus 4.7 for Agentic Coding and High‑Res Vision
Anthropic released Claude Opus 4.7, a focused successor to Opus 4.6 that emphasizes agentic software engineering, high-resolution vision and long-horizon autonomy.
Simon Willison maps Claude system prompts into a Git commit timeline
Simon Willison turned Anthropic’s published Claude system prompts into per-model Markdown files with fake git commits so changes can be browsed on GitHub.
NVIDIA Launches Ising Open Models to Accelerate Quantum-Processor Development
NVIDIA introduced Ising, a family of open-source quantum AI models intended to help researchers and enterprises design quantum processors that can run useful applications.
Anthropic ships Claude Opus 4.7 as its most powerful generally available model
Opus 4.7 arrives as Anthropic’s strongest generally available Claude release, claiming upgrades for advanced coding, image analysis and instruction following.
OpenAI Launches GPT-Rosalind to Accelerate Life‑Sciences Research
OpenAI introduced GPT‑Rosalind, a frontier reasoning model aimed at speeding drug discovery, genomics, protein reasoning, and scientific workflows.
OpenAI opens GPT‑5.4‑Cyber to security vendors with $10M Trusted Access grants
OpenAI is placing GPT‑5.
Anthropic Lawsuit Exposes 'Humans-in-the-Loop' Illusion in AI Warfare
A legal fight between Anthropic and the Pentagon centers on whether commercial models can be sold for military use as AI moves beyond purely analytic roles in the conflict with Iran.
OpenAI Debuts GPT-Rosalind for Drug Discovery and Genomics
OpenAI launched GPT-Rosalind, its first life‑sciences model aimed at accelerating drug discovery and genomic analysis and cutting long development timelines.
Qwen3.6-35B-A3B bests Claude Opus 4.7 on Willison's pelican test
Simon Willison reports that a local, quantized Qwen3.6-35B-A3B run produced better pelican and flamingo illustrations than Anthropic's Claude Opus 4.
Researchers Build an Index to Measure the Human Relationship with Nature
Conservationists are moving from exclusionary models toward metrics that count human stewardship alongside ecological health.
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model and a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.
llm-anthropic 0.25 Adds Claude-Opus-4.7 with xhigh thinking_effort
Simon Willison released llm-anthropic 0.25, which ships claude-opus-4.7 supporting thinking_effort: xhigh and new thinking flags.
DeepMind Ships Gemini Robotics‑ER 1.6 for Physical Robot Reasoning
Gemini Robotics‑ER 1.6 adds instrument-reading plus improved visual, spatial and planning skills to DeepMind's embodied-reasoning model for robots.
Anthropic Briefed Trump Administration on Mythos, Co‑Founder Confirms
Jack Clark said at the Semafor summit that Anthropic provided a briefing on its Mythos model to the Trump administration while litigation is ongoing.
NVIDIA Launches Ising AI Models to Tackle Noisy Qubits
NVIDIA unveiled Ising, an open family of AI models with Calibration and Decoding domains designed to help build fault-tolerant quantum processors.
OpenAI pushes to lock in users and expand enterprise sales in internal memo
CRO Denise Dresser told staff to prioritize user retention and enterprise sales and to build a product 'moat' as users easily switch between top models.
MiniMax Open-Sources M2.7, Its First Self-Evolving Agent
MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE‑Pro and 57.0% on Terminal Bench 2.