Research Paper arXiv:2512.16969

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, et al.

Abstract

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and instantiate it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.

SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10–20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.

We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
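To make the retrieval-augmented novelty reward concrete, here is a minimal, self-contained sketch of the idea. This is not the paper's implementation: the bag-of-words cosine similarity, the `novelty_reward` formula (one minus the similarity to the closest retrieved prior work), and the best-of-n selection standing in for an inference-time policy update are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def novelty_reward(hypothesis: str, corpus: list[str]) -> float:
    # Illustrative reward: 1 - similarity to the closest retrieved
    # prior work, so hypotheses that echo the literature score low.
    h = Counter(hypothesis.lower().split())
    if not corpus:
        return 1.0
    max_sim = max(cosine(h, Counter(doc.lower().split())) for doc in corpus)
    return 1.0 - max_sim


def select_most_novel(candidates: list[str], corpus: list[str]) -> str:
    # Best-of-n stand-in for test-time optimization: sample several
    # candidate hypotheses and keep the one with the highest reward.
    return max(candidates, key=lambda c: novelty_reward(c, corpus))


if __name__ == "__main__":
    prior_work = [
        "protein folding via deep learning",
        "deep learning for protein structure",
    ]
    candidates = [
        "deep learning for protein folding",
        "quantum annealing for materials discovery",
    ]
    print(select_most_novel(candidates, prior_work))
```

A real TTRL setup would replace the word-count vectors with learned embeddings over a literature index and use the reward to update the policy at inference time; the ranking mechanics, however, follow the same reward-maximization pattern.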

Key Contributions

PIM Framework

Practical Inquiry Model with four dimensions: Deliberation, Conception, Action, and Perception

1000+ Expert Tasks

Cross-disciplinary samples inspired by Science's 125 Big Questions

Four Task Types

Deep research, idea generation, dry/wet experiments, and experimental reasoning

TTRL Method

Test-Time Reinforcement Learning for enhanced hypothesis novelty

Citation

@article{xu2025sgi,
  title={Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows},
  author={Xu, Wanghan and Zhou, Yuhao and Zhou, Yifan and others},
  journal={arXiv preprint arXiv:2512.16969},
  year={2025}
}