A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision-language models
A score-only composite index aggregating AstroVisBench, ChemBench, CMPhysBench, EarthSE, ResearchBench, SciCode, SFE, and TRQA into one 0-100 scientific intelligence score.
Each bar on the x-axis is one model; bar height is that model's SGI-Index on a 0-100 y-axis. Missing benchmark scores are not filled with zero. Compute, token usage, and cost are not estimated, since they are absent from the source CSV files.
Models are mapped along two composite dimensions: scientific knowledge and reasoning on the x-axis, and research workflow and execution on the y-axis.
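A two-axis mapping like this could be computed as below. Note the benchmark-to-axis grouping here is a placeholder assumption for illustration; SciEvalKit defines the actual grouping of its eight benchmarks.

```python
# Hypothetical axis groupings -- which benchmarks feed each composite
# dimension is an assumption, not taken from SciEvalKit.
KNOWLEDGE_AXIS = ["ChemBench", "CMPhysBench", "TRQA"]    # x: knowledge & reasoning
WORKFLOW_AXIS = ["SciCode", "AstroVisBench", "EarthSE"]  # y: workflow & execution

def axis_score(scores: dict[str, float], benchmarks: list[str]) -> float:
    """Mean over the axis's benchmarks that have a score for this model."""
    vals = [scores[b] for b in benchmarks if b in scores]
    return sum(vals) / len(vals)

model = {"ChemBench": 70.0, "CMPhysBench": 55.0, "SciCode": 40.0, "EarthSE": 60.0}
x = axis_score(model, KNOWLEDGE_AXIS)  # (70 + 55) / 2 = 62.5
y = axis_score(model, WORKFLOW_AXIS)   # (40 + 60) / 2 = 50.0
print((x, y))
```

Each model then lands at one (x, y) point, separating what a model knows from how well it executes research workflows.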
Grounded in real scientific practice across six major fields
Get from clone to first scores in minutes
# Install
git clone https://github.com/InternScience/SciEvalKit.git
cd SciEvalKit
pip install -e ".[all]"  # quotes keep shells like zsh from glob-expanding .[all]
# Run evaluation
python run.py \
--dataset SFE \
--model gpt-4o \
--mode all \
--work-dir outputs/demo