SGI-Bench

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

10 Disciplines 1000+ Samples 5 Task Families Agentic Evaluation

Overview

Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry — Deliberation, Conception, Action, and Perception — with the versatility and proficiency of a human scientist.

SGI-Bench Overview
SGI-Bench Pipeline

SGI Framework: Iterative cycle of scientific inquiry.

Deliberation: Deep Research

Multi-hop retrieval and meta-analysis style quantitative synthesis across scientific literature.

Conception: Idea Generation

Structured idea generation and multi-dimensional comparative evaluation of scientific hypotheses.

Action: Dry & Wet Experiments

Scientific code generation (dry) and laboratory protocol development (wet) for real experiments.

Perception: Experimental Reasoning

Multimodal reasoning over process, observation, simulation, experiment, and visualization images.

Scientist-Aligned Data Construction

Scientist-Aligned Data

Raw Corpus

Expert-curated texts/images across 10 domains from Science's 125 Big Questions.

Question Construction

100+ Master's and PhD holders; continuous expert review for scientific value.

Data Cleaning

Rule-based validation, model checks, expert QA for executability and unique answers.

Difficulty Filtering

Remove samples solvable by >50% strong LLMs to maintain high challenge.

SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning 10 disciplines (inspired by Science's 125 Big Questions), questions constructed by 100+ Master's and PhD holders with continuous scientist-in-the-loop review, multi-stage cleaning to ensure executability, and difficulty filtering that removes items solved by >50% strong LLMs.

Agentic Evaluation Framework

Four Stages

Question Selection → Metric Customization → Predict & Eval → Report Generation.

Tool Pool

Web search, PDF parser, Python interpreter, file reader, metric functions.

Task Metrics

EM/SLA; Implementation Similarity; PassAll@k/SER; MCA/RV.

Customizable

Add scientist-aligned metrics (e.g., rigor, feasibility) on demand.

An agent-based evaluation stack coordinating specialized agents and tools to assess models end-to-end with task-specific and customizable metrics. By formalizing question selection, metric construction, scoring, and reporting into traceable stages, it strengthens reproducibility and offers scientist-aligned, actionable insights.

Agentic Evaluation Framework
TTRL Training Dynamics

Test-Time Reinforcement Learning

Objective

Handle no-ground-truth idea generation by optimizing novelty at test time with online retrieval as baseline.

Reward Design

R = R_format + R_novelty. Enforce XML format; reward embedding dissimilarity from retrieved works.

Training Setup

GRPO on Qwen3-8B (ms-swift); group sampling G=8, high temperature, bfloat16.

Results

Format reward saturates quickly; novelty improves from 49.36 → 62.06 without labels.

TTRL converts open-ended scientific exploration into a measurable test-time optimization process. It improves idea novelty without labels by coupling strict output structure with retrieval-grounded rewards, and generalizes to multi-objective optimization balancing creativity with rigor and feasibility.

SGIEvalAgent Demo

Watch how the agentic evaluation framework assesses models through question selection, metric customization, prediction, and reporting.

Visual Analysis Cases

Explore scientific multimodal inputs across process, observation, experiment, simulation, and visualization images.

Leaderboard

Comprehensive evaluation across five task families spanning 10 scientific disciplines.

Overview

Model Type Deep Research Idea Generation Dry Experiment Wet Experiment Exp. Reasoning SGI-Score

Overview: Deep Research uses Exact Match; Idea Generation uses four-metric average; Dry Experiment uses PassAll@5; Wet Experiment averages Action Sequence Similarity and Parameter Accuracy; Experimental Reasoning uses Multi-choice Accuracy. SGI-Score is the mean across all tasks.

Scientific Deep Research

Deep Research Case
Model Type Exact Match Step-Level Accuracy

Analysis: Models show step-level alignment (SLA up to ~65%) but low exact-match accuracy (10-20%), with brittleness in cross-paper numeric aggregation and mechanism synthesis.

Idea Generation

Idea Generation Case
Model Type Effectiveness Novelty Detailedness Feasibility Average

Analysis: Models show high Novelty but Effectiveness is moderate and Feasibility remains low. Even top models with detailed ideas struggle to produce practical plans.

Dry Experiment (Code Generation)

Dry Experiment Case
Model Type PassAll@5 PassAll@3 PassAll@1 AET (s) SER

Analysis: Most models generate syntactically valid, runnable code (SER >90%) but show low test-case correctness (PassAll@5 ~36.64% best), exposing that code fluency does not imply scientific computational competence.

Wet Experiment (Lab Protocol)

Wet Experiment Case
Model Type Action Seq. Similarity Parameter Accuracy Average

Analysis: LLMs can list reasonable lab actions but fail to organize them into correct experimental trajectories, showing low sequence similarity and parameter accuracy.

Experimental Reasoning

Experimental Reasoning Case
Model Type Multi-choice Accuracy Reasoning Validity

Analysis: Open-source models substantially lag behind closed ones in Multi-choice Accuracy, but most LLMs produce better Reasoning Validity than answer accuracy, revealing persistent bottlenecks in comparative reasoning and numeric extraction.

Conclusion

This work advances the study of Scientific General Intelligence (SGI) from both theory and practice. Grounded in the Practical Inquiry Model, we formalize SGI as the capacity to navigate the iterative cycle of Deliberation, Conception, Action, and Perception with the versatility of a human scientist. Building on this definition, we operationalize SGI through SGI-Bench, a comprehensive, scientist-aligned benchmark that instantiates four core task families: Scientific Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning.

Experiments reveal a consistent pattern: in Deep Research, models show step-level alignment but low exact-match accuracy (10-20%); in Idea Generation, hypotheses are fluent but underspecified and infeasible; in Dry Experiment, code is executable but PassAll@k remains low; in Wet Experiment, sequences show omissions and misordering; and in Experimental Reasoning, causal reasoning outperforms comparative, with persistent multimodal challenges.

Taken together, SGI-Bench clarifies both what SGI is and where current systems fail. By integrating principled task design, multi-metric evaluation, and agentic tool use, our framework provides a concrete foundation for systematically advancing SGI. The combination of numerically robust reasoning, planning-aware conception, executable experimentation, comparative multimodal inference, and dynamic test-time learning charts a clear path toward general intelligence systems capable of genuine scientific discovery.

Citation

@article{xu2025probing,
  title={Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows},
  author={Xu, Wanghan and Zhou, Yuhao and Zhou, Yifan and Cao, Qinglong and Li, Shuo and Bu, Jia and Liu, Bo and Chen, Yixin and He, Xuming and Zhao, Xiangyu and others},
  journal={arXiv preprint arXiv:2512.16969},
  year={2025}
}