Simon Willison published an alpha MicroPython-in-WASM sandbox (micropython-wasm) and a Datasette plugin (datasette-agent-micropython) to run plugin code with constrained access.
11 results for: dataset
MPMMine standardizes benchmarks for constraint-acquisition research
An arXiv preprint introduces MPMMine, a benchmark suite built to supply the domain artifacts and structured data constraint-acquisition methods need for reproducible evaluation.
Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks
An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
Datasette Adds Extensible 'Jump to' Menu in 1.0a30
Datasette 1.0a30 introduces a customizable, searchable 'Jump to...' menu and a plugin hook for adding entries to its index.
Authors Release OpenEval and Demand Item-Level Benchmark Standards
A position paper argues AI evaluation must publish item-level benchmark responses and ships OpenEval (10M model responses across 155k items) to prove the point.
OpenClassGen Provides Extensive Python Classes for LLM Research
OpenClassGen is a large dataset of Python classes intended to support LLM evaluation.
Experts Assess LLM Performance on Japanese Bar Exam's Open-Ended Tasks
A new study evaluates LLMs' legal reasoning using the Japanese bar exam's writing component.
AI and GPUs Accelerate Cosmic Data Analysis This Spring Astronomy Day
AI technologies and GPUs are streamlining the analysis of vast cosmic datasets for astronomers.
Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
Datasette 1.0a28 fixes alpha breakages, adds shutdown and test-cleanup APIs
Release 1.0a28 repairs compatibility regressions from 1.0a27, adds datasette.close and database.close behavior, and ships a pytest plugin to avoid file-descriptor leaks.
EVE Releases Open-Source 24B Earth-Intelligence LLM and Benchmarks
EVE publishes EVE-Instruct, a 24B Mistral-based model and a suite of Earth-science datasets, benchmarks, and tooling for domain-specific LLM deployment.