An arXiv preprint introduces MPMMine, a benchmark suite built to supply the domain artifacts and structured data that constraint-acquisition methods need for reproducible evaluation.
Category: Open Source & Research
Open models, papers, benchmarks, datasets, repositories and research-driven releases.
Paper Proposes Three-Step Framework for Knowledge-Work Benchmarks
An arXiv paper argues that LLM evaluation still mirrors traditional NLP tasks and offers a three-step method to align benchmarks with real workplace activity.
Multimodal LLMs Underperform in Real-World Dermatology Evaluation
A new study reveals that multimodal large language models struggle with clinical dermatology tasks.
OpenClassGen Provides Extensive Python Classes for LLM Research
OpenClassGen introduces a comprehensive dataset of Python classes, enhancing LLM evaluation.
RPC-Bench Introduces Fine-Grained Benchmark for Research Paper Comprehension
RPC-Bench addresses gaps in understanding academic papers for AI models with a new benchmark.
ATBench Introduces New Safety Evaluation Benchmarks for OpenClaw and Codex
ATBench unveils domain-specific benchmarks, ATBench-Claw and ATBench-Codex, enhancing trajectory safety evaluation.
Experts Assess LLM Performance on Japanese Bar Exam's Open-Ended Tasks
A new study evaluates LLMs' legal reasoning using the Japanese bar exam's writing component.
New Audit Reveals Flaws in Shapley Value Benchmarks for Explainable AI
A recent study critiques Shapley values, finding a misalignment between evaluation metrics and human utility.
New Framework Streamlines Adaptive Medical Image Processing for Clinical Settings
A novel artifact-based agent framework enhances adaptability and reproducibility in medical imaging.
Civitai Launches High-Fidelity Studious Scout LoRA for Fortnite
Civitai releases the Studious Scout 🎒 LoRA for Fortnite, designed for flexibility and character consistency.
OpenCLAW-P2P v6.0 Enhances Decentralized AI Peer Review with New Features
OpenCLAW-P2P v6.0 introduces advanced subsystems for decentralized AI peer review, improving paper resilience and retrieval.
Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
RARE Introduces Framework for Evaluating High-Similarity Document Retrieval
The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in legal and financial sectors.
Paper Evaluates LLMs on Vietnamese Legal Text with a Dual-Aspect Framework
An arXiv paper introduces a quantitative-plus-error-analysis benchmark for Vietnamese legal text, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro and Grok-1.
GLOW Merges GNN Predictions with LLM Reasoning for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
MiniMax Open-Sources M2.7, Its First Self-Evolving Agent
MiniMax published M2.7 weights on Hugging Face; the model is billed as self-evolving and posts 56.22% on SWE-Pro and 57.0% on Terminal Bench 2.