Benchmark leaderboard
Every method evaluated in the AssayBench cache, including ones that didn't make the main paper figures. Pick a metric, cohort, and split layout to re-rank the table and the companion chart.
From §5.1: On the test set, Gemini 3 Pro and GPT-5.4 lead on AnDCG@100, outperforming smaller open-weight LLMs; biology-specific language models and agents (Biomni and C2S-Scale); the trainable neural gene-relevance predictor; and Embedding kNN. The gene-frequency baseline is also surprisingly competitive overall, with much of its signal driven by Fitness / Proliferation / Viability screens.
| Model | Category | Val | Test | LaTest |
| --- | --- | --- | --- | --- |
Notes
- Year split: built from BioGRID screens published through 2025, with train, val, and test cohorts assigned by publication year. Used for the main benchmark and covers every method in the cache.
- Random split: screens shuffled across train/val/test irrespective of publication year. Reported here for zero-shot LLMs, agents, the classifier, and oracle kNN, with reverse-direction screens excluded.
- LaTest cohort: held-out screens published in the past six months, refreshed regularly. The same cohort is used under both split layouts.
- Lower is better for dFDR@100 and Normalized dFDR@100; higher is better for the other three metrics.
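The re-ranking behavior described in the notes can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the model names and scores are made up, and `LOWER_IS_BETTER` simply encodes the metric directions stated above.

```python
# Metrics where a smaller value means a better model, per the notes above.
LOWER_IS_BETTER = {"dFDR@100", "Normalized dFDR@100"}

def rank(rows, metric):
    """Sort (model, scores) rows so the best model on `metric` comes first.

    Higher is better by default; the direction flips for metrics listed
    in LOWER_IS_BETTER.
    """
    reverse = metric not in LOWER_IS_BETTER
    return sorted(rows, key=lambda r: r[1][metric], reverse=reverse)

# Hypothetical leaderboard rows for demonstration only.
rows = [
    ("Model A", {"AnDCG@100": 0.41, "dFDR@100": 0.32}),
    ("Model B", {"AnDCG@100": 0.55, "dFDR@100": 0.21}),
]

print([m for m, _ in rank(rows, "AnDCG@100")])  # highest AnDCG@100 first
print([m for m, _ in rank(rows, "dFDR@100")])   # lowest dFDR@100 first
```

Both orderings put Model B first here: it has the higher AnDCG@100 and the lower dFDR@100 in this toy example.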