A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision-language models
A score-only composite index aggregating AstroVisBench, ChemBench, CMPhysBench, EarthSE, ResearchBench, SciCode, SFE, and TRQA into one 0-100 scientific intelligence score.
Each bar on the x-axis is one model; bar height is that model's SGI-Index on a 0-100 y-axis. Missing benchmark scores are not filled with zero. Compute, token usage, and cost are not estimated, since they are absent from the source CSV files.
Models are mapped along two composite dimensions: scientific knowledge and reasoning on the x-axis, and research workflow and execution on the y-axis.
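A two-axis mapping like this could be computed as below. Note the benchmark-to-axis grouping here is a placeholder assumption for illustration; SciEvalKit defines the actual grouping of its eight benchmarks.

```python
# Hypothetical axis groupings -- which benchmarks feed each composite
# dimension is an assumption, not taken from SciEvalKit.
KNOWLEDGE_AXIS = ["ChemBench", "CMPhysBench", "TRQA"]    # x: knowledge & reasoning
WORKFLOW_AXIS = ["SciCode", "AstroVisBench", "EarthSE"]  # y: workflow & execution

def axis_score(scores: dict[str, float], benchmarks: list[str]) -> float:
    """Mean over the axis's benchmarks that have a score for this model."""
    vals = [scores[b] for b in benchmarks if b in scores]
    return sum(vals) / len(vals)

model = {"ChemBench": 70.0, "CMPhysBench": 55.0, "SciCode": 40.0, "EarthSE": 60.0}
x = axis_score(model, KNOWLEDGE_AXIS)  # (70 + 55) / 2 = 62.5
y = axis_score(model, WORKFLOW_AXIS)   # (40 + 60) / 2 = 50.0
print((x, y))
```

Each model then lands at one (x, y) point, separating what a model knows from how well it executes research workflows.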
Grounded in real scientific practice across six major fields
Get from clone to first scores in minutes
# Install
git clone https://github.com/InternScience/SciEvalKit.git
cd SciEvalKit
pip install -e ".[all]"  # quotes keep shells like zsh from glob-expanding .[all]
# Run evaluation
python run.py \
--dataset SFE \
--model gpt-4o \
--mode all \
--work-dir outputs/demo