Benchmark leaderboard
Every method evaluated in the AssayBench cache, including ones that didn't make the main paper figures. Pick a metric, cohort, and split layout to re-rank the table and the companion chart.
From §5.1: On the test set, Gemini 3 Pro and GPT-5.4 lead on AnDCG@100, outperforming smaller open-weight LLMs; biology-specific language models and agents (Biomni and C2S-Scale); the trainable neural gene-relevance predictor; and Embedding kNN. The gene-frequency baseline is also surprisingly competitive overall, with much of its signal driven by Fitness / Proliferation / Viability screens.
| Model | Category | Val | Test | LaTest |
| --- | --- | --- | --- | --- |
Notes
- Year split: built from BioGRID screens published through 2025, with train, val, and test cohorts assigned by publication year. Used for the main benchmark and covers every method in the cache.
- Random split: screens shuffled across train/val/test irrespective of publication year. Reported here for zero-shot LLMs, agents, the classifier, and oracle kNN, with reverse-direction screens excluded.
- LaTest cohort: held-out screens published in the past six months, refreshed regularly. The same cohort is used under both split layouts.
- Lower is better for dFDR@100 and Normalized dFDR@100; higher is better for the other three metrics.
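The re-ranking behavior described in the notes can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the model names and scores are made up, and `LOWER_IS_BETTER` simply encodes the metric directions stated above.

```python
# Metrics where a smaller value means a better model, per the notes above.
LOWER_IS_BETTER = {"dFDR@100", "Normalized dFDR@100"}

def rank(rows, metric):
    """Sort (model, scores) rows so the best model on `metric` comes first.

    Higher is better by default; the direction flips for metrics listed
    in LOWER_IS_BETTER.
    """
    reverse = metric not in LOWER_IS_BETTER
    return sorted(rows, key=lambda r: r[1][metric], reverse=reverse)

# Hypothetical leaderboard rows for demonstration only.
rows = [
    ("Model A", {"AnDCG@100": 0.41, "dFDR@100": 0.32}),
    ("Model B", {"AnDCG@100": 0.55, "dFDR@100": 0.21}),
]

print([m for m, _ in rank(rows, "AnDCG@100")])  # highest AnDCG@100 first
print([m for m, _ in rank(rows, "dFDR@100")])   # lowest dFDR@100 first
```

Both orderings put Model B first here: it has the higher AnDCG@100 and the lower dFDR@100 in this toy example.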