An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
10 results for: llms
Multimodal LLMs Underperform in Real-World Dermatology Evaluation
A new study reveals that multimodal large language models struggle with clinical dermatology tasks.
Goodfire Launches Silico, a New Tool for Debugging LLMs
Silico allows developers to fine-tune AI model parameters during training, enhancing control.
Experts Assess LLM Performance on Japanese Bar Exam's Open-Ended Tasks
A new study evaluates LLMs' legal reasoning using the Japanese bar exam's writing component.
OpenAI Merges Codex with GPT-5.4, Enhancing Coding Capabilities
OpenAI has integrated Codex into the GPT-5.4 framework, streamlining coding capabilities.
Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks
The new Qwen 3.6-27B model delivers superior coding performance with a significantly reduced size.
Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview
Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early version of Claude Mythos Preview, conducted in collaboration with Anthropic.
arXiv Paper Evaluates LLMs on Vietnamese Legal Text with a Dual-Aspect Framework
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.
Full fine-tuning concentrates LLM attribution in code-compliance models
An arXiv study uses perturbation-based attribution to compare full fine-tuning (FFT), LoRA, and quantized LoRA across model sizes and finds that FFT yields more focused interpretive patterns.