An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
20 results for: LLM
Multimodal LLMs Underperform in Real-World Dermatology Evaluation
A new study reveals that multimodal large language models struggle with clinical dermatology tasks.
OpenClassGen Provides Extensive Python Classes for LLM Research
OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.
Zig Enforces Strict Anti-LLM Policy for Contributions
The Zig project's anti-LLM policy prohibits AI assistance in issues and pull requests, emphasizing human contributions.
Goodfire Launches Silico, a New Tool for Debugging LLMs
Silico allows developers to fine-tune AI model parameters during training, enhancing control.
Experts Assess LLM Performance on Japanese Bar Exam's Open-Ended Tasks
A new study evaluates LLMs' legal reasoning using the Japanese bar exam's writing component.
New LLM Framework Enhances Mathematical Reasoning Evaluation
A novel LLM-based framework provides flexible evaluation of mathematical reasoning, addressing limitations of symbolic methods.
OpenAI Merges Codex with GPT-5.4, Enhancing Coding Capabilities
OpenAI has integrated Codex into the GPT-5.4 framework, streamlining coding capabilities.
llm-openai-via-codex 0.1a0 Integrates LLM API with Codex CLI for Developers
The release of llm-openai-via-codex 0.1a0 simplifies API calls for developers using Codex CLI.
Hugging Face Releases ml-intern to Automate LLM Post‑Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
NVIDIA Advances Optimizers to Speed Up LLM Training
NVIDIA introduces new higher-order optimizers to enhance training efficiency for large language models.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance with a significantly reduced size.
Run Claude Cowork and Claude Code Desktop in Amazon Bedrock
AWS now supports Claude Cowork and Claude Code Desktop inside Amazon Bedrock, available either directly or via an LLM gateway to broaden use beyond individual developer desktops.
Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview
Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early Claude Mythos Preview in collaboration with Anthropic.
Evaluates LLMs on Vietnamese legal text with a dual-aspect framework
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.
Full fine-tuning concentrates LLM attribution in code-compliance models
An arXiv study uses perturbation-based attribution to compare FFT, LoRA, and quantized LoRA across model sizes and finds FFT yields more focused interpretive patterns.
Qwen3.6-35B-A3B bests Claude Opus 4.7 on Willison's pelican test
Simon Willison reports that a local, quantized Qwen3.6-35B-A3B run produced better pelican and flamingo illustrations than Anthropic's Claude Opus 4.
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model and a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.
Merge GNN Predictions with LLM Reasoning in GLOW for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
llm-anthropic 0.25 Adds Claude-Opus-4.7 with xhigh thinking_effort
Simon Willison released llm-anthropic 0.25, which ships claude-opus-4.7 supporting thinking_effort: xhigh and new thinking flags.