Search: LLM | CurrentLens.com

Open Source & Research

Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks

CurrentLens
May 25, 2026

An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.

Open Source & Research

Multimodal LLMs Underperform in Real-World Dermatology Evaluation

CurrentLens
May 8, 2026

A new study reveals that multimodal large language models struggle with clinical dermatology tasks.

Open Source & Research

OpenClassGen Provides Extensive Python Classes for LLM Research

CurrentLens
May 3, 2026

OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.

AI in Coding

Zig Enforces Strict Anti-LLM Policy for Contributions

CurrentLens
Apr 30, 2026

The Zig project's anti-LLM policy prohibits AI assistance in issues and pull requests, emphasizing human contributions.

Models & Launches

Goodfire Launches Silico, a New Tool for Debugging LLMs

CurrentLens
Apr 30, 2026

Silico allows developers to fine-tune AI model parameters during training, enhancing control.

Open Source & Research

Experts Assess LLM Performance on Japanese Bar Exam's Open-Ended Tasks

CurrentLens
Apr 29, 2026

A new study evaluates LLMs' legal reasoning using the Japanese bar exam's writing component.

Science & Healthcare

New LLM Framework Enhances Mathematical Reasoning Evaluation

CurrentLens
Apr 28, 2026

A novel LLM-based framework provides flexible evaluation of mathematical reasoning, addressing limitations of symbolic methods.

Agents & Automation

OpenAI Merges Codex with GPT-5.4, Enhancing Coding Capabilities

CurrentLens
Apr 26, 2026

OpenAI has integrated Codex into the GPT-5.4 framework, streamlining coding capabilities.

AI in Coding

llm-openai-via-codex 0.1a0 Integrates LLM API with Codex CLI for Developers

CurrentLens
Apr 24, 2026

The release of llm-openai-via-codex 0.1a0 simplifies API calls for developers using Codex CLI.

Open Source & Research

Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows

CurrentLens
Apr 23, 2026

ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.

Chips & Infrastructure

NVIDIA Advances Optimizers to Speed Up LLM Training

CurrentLens
Apr 23, 2026

NVIDIA introduces new higher-order optimizers to enhance training efficiency for large language models.

AI in Coding

Qwen 3.6-27B Model Surpasses Previous Coding Benchmarks

CurrentLens
Apr 23, 2026

The new Qwen 3.6-27B model delivers superior coding performance with a significantly reduced size.

AI in Coding

Run Claude Cowork and Claude Code Desktop in Amazon Bedrock

CurrentLens
Apr 22, 2026

AWS now supports Claude Cowork and Claude Code Desktop inside Amazon Bedrock, available either directly or via an LLM gateway to broaden use beyond individual developer desktops.

Models & Launches

Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview

CurrentLens
Apr 22, 2026

Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early Claude Mythos Preview in collaboration with Anthropic.

Open Source & Research

Paper Evaluates LLMs on Vietnamese Legal Text with a Dual-Aspect Framework

CurrentLens
Apr 21, 2026

An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.

Models & Launches

Full fine-tuning concentrates LLM attribution in code-compliance models

CurrentLens
Apr 21, 2026

An arXiv study uses perturbation-based attribution to compare FFT, LoRA, and quantized LoRA across model sizes and finds FFT yields more focused interpretive patterns.

Models & Launches

Qwen3.6-35B-A3B bests Claude Opus 4.7 on Willison's pelican test

CurrentLens
Apr 16, 2026

Simon Willison reports that a local, quantized Qwen3.6-35B-A3B run produced better pelican and flamingo illustrations than Anthropic's Claude Opus 4.7.

Science & Healthcare

EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks

CurrentLens
Apr 16, 2026

EVE publishes EVE-Instruct, a 24B Mistral-based model and a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.

Open Source & Research

Merge GNN Predictions with LLM Reasoning in GLOW for Open-World QA

CurrentLens
Apr 16, 2026

GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.

Models & Launches

llm-anthropic 0.25 Adds Claude-Opus-4.7 with xhigh thinking_effort

CurrentLens
Apr 16, 2026

Simon Willison released llm-anthropic 0.25, which ships claude-opus-4.7 supporting thinking_effort: xhigh and new thinking flags.

Latest

Science & Healthcare

Africa CDC and WHO launch $518M continental Ebola response plan

CurrentLens
Jun 6, 2026

A six-month 'One Response' plan targets the Bundibugyo Ebola outbreak with unified coordination, surveillance, clinical care and community engagement across affected countries.

Policy & Safety

HASC adds right-to-repair language to FY27 defense policy bill

CurrentLens
Jun 6, 2026

The House Armed Services Committee inserted right-to-repair provisions into its FY27 defense policy draft, aiming to ease barriers that limit troops' ability to fix equipment.

AI Creative

Startups Pull Users Off Phones With In-Person Games and DIY Cyberdecks

CurrentLens
Jun 6, 2026

TechCrunch highlights founders building physical social products: Board raised funding for in-person games, and cyberdeck DIYs are going viral.

Agents & Automation

MicroPython WASM Sandbox Enables Safer Datasette Plugin Execution

CurrentLens
Jun 6, 2026

Simon Willison published an alpha MicroPython-in-WASM sandbox (micropython-wasm) and a Datasette plugin (datasette-agent-micropython) to run plugin code with constrained access.

Models & Launches

DKPS method cuts model-evaluation queries using cached responses

CurrentLens
Jun 6, 2026

An arXiv paper introduces a DKPS-based approach that uses cached model outputs to predict benchmark scores while substantially reducing the number of queries.