7 results for: evaluation
The RARE framework addresses evaluation flaws in redundancy-heavy document retrieval, particularly in legal and financial sectors.
RepIt Framework Enables Concept-Specific Refusal in Language Models
A new framework exposes vulnerabilities in language model safety evaluations through concept-specific manipulations.
Hugging Face Releases ml-intern to Automate LLM Post-Training Workflows
ml-intern is an open-source agent that automates literature review, dataset discovery, training script runs, and iterative evaluation for LLM post-training work.
Firefox 150 Fixes 271 Vulnerabilities Found Using Claude Mythos Preview
Mozilla patched 271 vulnerabilities after an initial security evaluation that used an early Claude Mythos Preview in collaboration with Anthropic.
Dual-Aspect Framework Evaluates LLMs on Vietnamese Legal Text
An arXiv paper introduces a benchmark for Vietnamese legal text that pairs quantitative scoring with error analysis, comparing GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
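As a rough illustration of the dual-aspect idea (not the paper's actual protocol), a harness might report a quantitative score alongside an error-category tally; the categories and scoring rule below are hypothetical:

    # Minimal sketch: quantitative accuracy plus an error-analysis tally.
    # Category names are illustrative assumptions, not the paper's taxonomy.
    from collections import Counter

    results = [  # (answer correct?, annotated error category or None)
        (True, None), (False, "citation"), (False, "terminology"), (True, None),
    ]

    accuracy = sum(ok for ok, _ in results) / len(results)   # quantitative aspect
    errors = Counter(cat for ok, cat in results if not ok)   # error-analysis aspect
    print(f"accuracy={accuracy:.2f}", dict(errors))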
AllenAI launches vla-eval to unify Vision-Language-Action benchmarking
vla-eval decouples model inference from simulator execution with a WebSocket+msgpack protocol and Docker isolation, supporting 14 benchmarks and six model servers.
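A minimal sketch of the decoupling pattern the summary describes: a model server answers msgpack-encoded observations arriving over a WebSocket, so inference and simulator execution can run in separate processes or containers. The port, message fields, and predict_action stub are assumptions, not vla-eval's actual protocol:

    # Sketch of a model server speaking msgpack over WebSocket.
    # Requires: pip install websockets msgpack
    import asyncio
    import msgpack
    import websockets

    def predict_action(observation):
        # Placeholder policy: a real server would run VLA model inference here.
        return {"action": [0.0] * 7}

    async def handle(ws):
        async for raw in ws:
            obs = msgpack.unpackb(raw, raw=False)   # decode simulator observation
            reply = predict_action(obs)             # inference, simulator-agnostic
            await ws.send(msgpack.packb(reply))     # send action back

    async def main():
        # The simulator (e.g. in its own Docker container) connects here,
        # keeping model inference isolated from simulator execution.
        async with websockets.serve(handle, "0.0.0.0", 8765):
            await asyncio.Future()  # run until cancelled

    if __name__ == "__main__":
        asyncio.run(main())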
GLOW Merges GNN Predictions with LLM Reasoning for Open-World QA
GLOW pairs a pre-trained GNN with an LLM to answer questions over incomplete knowledge graphs and ships GLOW-BENCH, a 1,000-question evaluation.
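The general pattern, as a hedged sketch (the candidate scoring and prompt format are illustrative, not GLOW's actual interface): a GNN ranks candidate entities over the incomplete graph, and the top-k list is folded into the LLM prompt for final reasoning:

    # Sketch of the GNN + LLM merge: GNN-scored candidates become LLM evidence.
    from typing import Dict, List

    def top_k_candidates(gnn_scores: Dict[str, float], k: int = 5) -> List[str]:
        # Keep only the GNN's k highest-scoring entities as answer candidates.
        return sorted(gnn_scores, key=gnn_scores.get, reverse=True)[:k]

    def build_prompt(question: str, candidates: List[str]) -> str:
        # Fold the candidates into the prompt so the LLM can reason over them,
        # or reject all of them when the knowledge graph is incomplete.
        lines = "\n".join(f"- {c}" for c in candidates)
        return (
            f"Question: {question}\n"
            f"Candidate answers from a knowledge-graph model:\n{lines}\n"
            "Pick the best answer, or answer from your own knowledge if none fit."
        )

    # Hypothetical GNN scores for a question the KG only partly covers.
    scores = {"Paris": 0.91, "Lyon": 0.42, "Marseille": 0.17}
    print(build_prompt("What is the capital of France?", top_k_candidates(scores)))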