ResearchClawBench: End-to-End Auto-Research Benchmark
Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery
What is ResearchClawBench?
ResearchClawBench is the first comprehensive benchmark designed to evaluate AI agents on complete research workflows. Unlike traditional benchmarks that test isolated capabilities, ResearchClawBench measures an agent's ability to conduct end-to-end scientific research—from literature review and hypothesis formulation to experimental design, execution, and result validation.
Why End-to-End Evaluation?
Scientific research is not a collection of isolated tasks—it's a continuous, iterative process where each step builds upon the previous. An AI system that excels at literature review but fails at experimental design isn't truly capable of conducting research. ResearchClawBench addresses this by evaluating:
- Re-Discovery Tasks: Can the agent reproduce known scientific findings?
- New-Discovery Tasks: Can the agent generate novel, valid contributions?
- Workflow Integration: Can the agent seamlessly connect all research phases?
Key Features
- Multi-Domain Tasks: tasks spanning machine learning, computational biology, physics simulations, and more
- Checklist Evaluation: fine-grained scoring with expert-designed checklists for each task
- Interactive Dashboard: real-time visualization of agent progress and research artifacts
- Public Leaderboard: compare agent performance across tasks and track progress over time
Evaluation Framework
ResearchClawBench evaluates agents on a scale from 0 to 100, with fine-grained credit assigned through each task's expert-designed checklist.
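As a minimal sketch of how checklist-based scoring could aggregate into a 0–100 task score, consider weighting each checklist item and reporting the weight-normalized percentage of items the agent satisfied. The `ChecklistItem` structure, field names, and example items below are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    # Illustrative fields; the real ResearchClawBench checklist format may differ.
    description: str
    weight: float   # relative importance of this item
    passed: bool    # did the agent satisfy this item?

def score_task(items: list[ChecklistItem]) -> float:
    """Aggregate checklist results into a 0-100 score as the
    weight-normalized fraction of credit earned."""
    total = sum(i.weight for i in items)
    earned = sum(i.weight for i in items if i.passed)
    return 100.0 * earned / total if total else 0.0

# Hypothetical checklist for a re-discovery task
checklist = [
    ChecklistItem("Identifies the original finding in the literature", 1.0, True),
    ChecklistItem("Reproduces the key experiment", 2.0, True),
    ChecklistItem("Result matches the published value within tolerance", 2.0, False),
]
print(score_task(checklist))  # 100 * 3.0 / 5.0 = 60.0
```

Weighting lets a checklist emphasize hard, discriminative steps (e.g., a matching result) over routine ones, while the normalization keeps every task on the same 0–100 scale.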
Get Started
Ready to evaluate your AI research agent? ResearchClawBench is open-source and actively maintained.