
ResearchClawBench: End-to-End Auto-Research Benchmark

Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery

What is ResearchClawBench?

ResearchClawBench is the first comprehensive benchmark designed to evaluate AI agents on complete research workflows. Unlike traditional benchmarks that test isolated capabilities, ResearchClawBench measures an agent's ability to conduct end-to-end scientific research—from literature review and hypothesis formulation to experimental design, execution, and result validation.

Why End-to-End Evaluation?

Scientific research is not a collection of isolated tasks; it's a continuous, iterative process in which each step builds on the one before. An AI system that excels at literature review but fails at experimental design isn't truly capable of conducting research. ResearchClawBench addresses this by evaluating (see the sketch after this list):

  • Re-Discovery Tasks: Can the agent reproduce known scientific findings?
  • New-Discovery Tasks: Can the agent generate novel, valid contributions?
  • Workflow Integration: Can the agent seamlessly connect all research phases?
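To make these axes concrete, a single benchmark task can be thought of as a structured record naming its type, its domain, and the workflow phases the agent must connect end to end. The sketch below is a minimal Python illustration under assumed names (TaskType, ResearchTask, and the phase labels are all hypothetical); it is not ResearchClawBench's actual schema.

    from dataclasses import dataclass
    from enum import Enum

    class TaskType(Enum):
        RE_DISCOVERY = "re-discovery"    # reproduce a known finding
        NEW_DISCOVERY = "new-discovery"  # generate a novel, valid contribution

    @dataclass
    class ResearchTask:
        """Hypothetical task record; field names are illustrative only."""
        task_id: str
        task_type: TaskType
        domain: str  # e.g. "machine learning" or "computational biology"
        # Ordered workflow phases the agent must complete and connect.
        phases: tuple[str, ...] = (
            "literature_review",
            "hypothesis_formulation",
            "experimental_design",
            "execution",
            "result_validation",
        )

    task = ResearchTask(
        task_id="ml-rediscovery-001",
        task_type=TaskType.RE_DISCOVERY,
        domain="machine learning",
    )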

Key Features

  • Multi-Domain Tasks: Spanning machine learning, computational biology, physics simulations, and more
  • Checklist Evaluation: Fine-grained scoring with expert-designed checklists for each task (see the scoring sketch after this list)
  • Interactive Dashboard: Real-time visualization of agent progress and research artifacts
  • Public Leaderboard: Compare agent performance across tasks and track progress over time
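As a rough illustration of how checklist scoring could work, the sketch below computes an agent's score as the weighted fraction of satisfied checklist items, scaled to 0-100. The item structure, the weighting scheme, and the example criteria are assumptions made for exposition, not the benchmark's published rubric.

    from dataclasses import dataclass

    @dataclass
    class ChecklistItem:
        """One expert-designed criterion; the weighting scheme is assumed."""
        description: str
        weight: float  # relative importance of this criterion
        passed: bool   # whether the agent's run satisfied it

    def checklist_score(items: list[ChecklistItem]) -> float:
        """Return the weighted fraction of satisfied items, scaled to 0-100."""
        total = sum(item.weight for item in items)
        earned = sum(item.weight for item in items if item.passed)
        return 100.0 * earned / total if total else 0.0

    items = [
        ChecklistItem("Identifies the key prior work", 1.0, True),
        ChecklistItem("Reproduces the main result within tolerance", 3.0, True),
        ChecklistItem("Validates findings on held-out data", 2.0, False),
    ]
    print(f"Checklist score: {checklist_score(items):.1f}")  # Checklist score: 66.7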

Evaluation Framework

ResearchClawBench evaluates agents on a scale from 0 to 100, where:

  • 0-30: Incomplete or Failed
  • 50: Matches Original Paper
  • 51-100: Surpasses Original
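Expressed as a lookup, the legend above maps directly to band labels. Note that the legend leaves scores of 31-49 undefined; the sketch below flags that range explicitly rather than guessing at a label.

    def score_band(score: int) -> str:
        """Map a 0-100 score to its band label from the legend above."""
        if not 0 <= score <= 100:
            raise ValueError("score must be between 0 and 100")
        if score <= 30:
            return "Incomplete or Failed"
        if score == 50:
            return "Matches Original Paper"
        if score >= 51:
            return "Surpasses Original"
        # 31-49 is not defined in the legend; report it as such.
        return "Undefined band (31-49)"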

Get Started

Ready to evaluate your AI research agent? ResearchClawBench is open-source and actively maintained.