Benchmark leaderboard

This table lists every method evaluated in the AssayBench cache, including those that did not make the main paper figures. Pick a metric, cohort, and split layout to re-rank the table and the companion chart.

From §5.1: On the test set, Gemini 3 Pro and GPT-5.4 lead on AnDCG@100, outperforming smaller open-weight LLMs, biology-specific language models and agents (Biomni and C2S-Scale), the trainable neural gene-relevance predictor, and Embedding kNN. The gene-frequency baseline is also surprisingly competitive overall: much of its signal is driven by Fitness / Proliferation / Viability screens.
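
For readers unfamiliar with ranking metrics, here is a minimal sketch of plain nDCG@k, the standard metric that AnDCG@100 appears to build on. This page does not define the "A" variant, so the code below is not the AssayBench metric itself. The function names, the gene identifiers, and the dictionary of per-gene relevance scores are all hypothetical and chosen only for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 is undiscounted)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_genes, relevance, k=100):
    """Plain nDCG@k for one screen.

    ranked_genes: gene identifiers in the order a method ranked them (hypothetical input).
    relevance: dict mapping gene identifier to a non-negative relevance score (hypothetical input).
    """
    # Genes absent from the relevance dict are treated as irrelevant (gain 0).
    gains = [relevance.get(gene, 0.0) for gene in ranked_genes]
    # The ideal ranking places the most relevant genes first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant genes, so the score is defined as 0 here
    return dcg_at_k(gains, k) / ideal_dcg


# Toy example with made-up gene names and scores.
toy_relevance = {"GENE_A": 1.0, "GENE_B": 0.0}
perfect = ndcg_at_k(["GENE_A", "GENE_B"], toy_relevance, k=100)
swapped = ndcg_at_k(["GENE_B", "GENE_A"], toy_relevance, k=100)
```

In this toy case the correctly ordered list scores 1.0, while the swapped list is penalized by the logarithmic discount applied at rank 2.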

Model | Category | Val | Test | LaTest

Notes