Evaluation Toolkit

SGI-Probe

A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision-language models

Composite Score

SGI-Index

A score-only composite index aggregating AstroVisBench, ChemBench, CMPhysBench, EarthSE, ResearchBench, SciCode, SFE, and TRQA into one 0-100 scientific intelligence score.


Each model occupies one position on the x-axis; bar height gives its SGI-Index on a 0-100 scale. Missing benchmark scores are excluded from the aggregate rather than filled with zero. Compute, token usage, and cost are not estimated because they are absent from the source CSV files.
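The aggregation rule above (missing scores excluded, never zero-filled) can be sketched in Python. This is a minimal illustration, assuming an unweighted mean over the available benchmarks; the actual weighting used by SGI-Index is not specified here, and the scores below are hypothetical.

```python
from statistics import mean

# Hypothetical per-model benchmark scores on a 0-100 scale.
# None marks a missing result, which is dropped from the average
# rather than counted as zero.
scores = {
    "AstroVisBench": 62.0,
    "ChemBench": 71.5,
    "CMPhysBench": None,   # missing: excluded, not zero-filled
    "EarthSE": 58.3,
    "ResearchBench": 66.1,
    "SciCode": 40.2,
    "SFE": 55.0,
    "TRQA": 69.8,
}

def sgi_index(model_scores):
    """Average the available benchmark scores into one 0-100 value."""
    available = [v for v in model_scores.values() if v is not None]
    return mean(available) if available else None

print(round(sgi_index(scores), 2))
```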

Key Insights from SGI-Index

Models are mapped by two composite dimensions: scientific knowledge and reasoning on the x-axis, research workflow and execution on the y-axis.
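The two-dimensional mapping can be sketched as a pair of per-axis composites. Which benchmark feeds which axis is an assumption made for illustration only; the source does not list the assignment, and the scores are hypothetical.

```python
# Assumed (illustrative) split of benchmarks across the two axes --
# the real SGI-Index assignment may differ.
KNOWLEDGE_BENCHMARKS = ["ChemBench", "CMPhysBench", "EarthSE", "TRQA"]
WORKFLOW_BENCHMARKS = ["AstroVisBench", "ResearchBench", "SciCode", "SFE"]

def axis_score(scores, benchmarks):
    """Mean over the listed benchmarks, skipping missing (None) entries."""
    vals = [scores[b] for b in benchmarks if scores.get(b) is not None]
    return sum(vals) / len(vals) if vals else None

def insight_point(scores):
    """Return (x, y): knowledge/reasoning vs. workflow/execution."""
    return (axis_score(scores, KNOWLEDGE_BENCHMARKS),
            axis_score(scores, WORKFLOW_BENCHMARKS))

# Hypothetical scores for one model.
model = {
    "ChemBench": 70.0, "CMPhysBench": 60.0, "EarthSE": 50.0, "TRQA": 80.0,
    "AstroVisBench": 40.0, "ResearchBench": 60.0, "SciCode": 50.0, "SFE": 50.0,
}
print(insight_point(model))
```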


Discipline Coverage

Grounded in real scientific practice across six major fields

Life Science, Astronomy, Earth Science, Chemistry, Materials Science, Physics

Quick Start

Get from clone to first scores in minutes

# Install
git clone https://github.com/InternScience/SciEvalKit.git
cd SciEvalKit
pip install -e ".[all]"  # quote the extras so shells like zsh don't expand the brackets

# Run evaluation
python run.py \
  --dataset SFE \
  --model gpt-4o \
  --mode all \
  --work-dir outputs/demo