Benchmark · Phenotypic Screen Prediction · Genentech · 2026
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
Genentech, South San Francisco, CA, USA · ★ equal contribution
Building the virtual cell requires more than predicting gene expression. Can a model predict the outcome of a CRISPR screen before you run it? AssayBench frames in silico phenotypic screening as a gene-ranking task on 1,920 public CRISPR screens and gives a single, comparable yardstick (adjusted nDCG) for measuring progress across heterogeneous assays.
Abstract
Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate screen prediction as a gene-ranking task for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
By the numbers
1,920 public CRISPR screens · five broad classes of cellular phenotypes · one shared metric (adjusted nDCG) · a year-split train / val / test plus a regularly refreshed LaTest cohort
Headline findings
- Frontier generalist LLMs outperform biology-specific LLMs, biomedical agents, trainable baselines, and retrieval baselines.
- Even the best models remain far below empirical performance ceilings. Oracle kNN outperforms the top model by ~86%, and technical-replicate predictors nearly double the AnDCG@100 of Gemini 3 Pro.
- Fine-tuning (SFT, GRPO) and ensembling each push performance further; an ensemble of LLM rankings fused with reciprocal rank fusion (RRF) achieves the best overall test score (a minimal RRF sketch follows this list).
- The performance drop on our recent LaTest split, together with a citation-count effect, is consistent with partial memorization of public screen literature.
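For reference, reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The benchmark's exact fusion settings (which models are fused, the constant k, any weighting) are not reproduced here; this is only a minimal sketch of standard RRF over per-model gene rankings.

```python
def rrf_ensemble(rankings, k=60):
    """Fuse several ranked gene lists with reciprocal rank fusion (RRF).

    `rankings` holds one ranked gene list per model (best gene first).
    Each gene's fused score is the sum of 1 / (k + rank) over the lists
    that ranked it; k = 60 is the conventional RRF constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, gene in enumerate(ranking, start=1):
            scores[gene] = scores.get(gene, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: fuse two models' rankings of the same library.
fused = rrf_ensemble([
    ["TP53", "KRAS", "MYC"],   # model A
    ["KRAS", "EGFR", "TP53"],  # model B
])
```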
Explore the results
Every panel below is interactive and powered by the same results cache that produced the paper's figures.
Full benchmark table
Sort and filter every model in the cache by metric, cohort, and split layout. A dot-and-line chart updates with your selection.
Performance by phenotype
Mean AnDCG@100 per (phenotype × model) on the year-split test cohort. Viability screens are the most predictable, host-pathogen the hardest.
Qwen3.5 scaling + optimization
Mean AnDCG@100 vs parameter count for the Qwen3.5 family, plus side-by-side bars comparing base, SFT, and GRPO variants.
Citations vs performance
Gemini 3 Pro AnDCG@100 vs log(1 + citation count) across screens, plus regression coefficients for year, citations, and phenotype.
Gene-set bias
For each (model, screen), the fraction of top-100 predicted genes in curated gene sets minus the same fraction in ground truth.
Per-screen explorer
Pick any of the 1,920 screens to see every model's AnDCG@100 on that screen, alongside the screen's metadata.
Screen UMAP
UMAP over the ground-truth screen transfer matrix. Color by split, phenotype, cell type, library, or publication year.
How the task works
Each example is a free-text description of a CRISPR screen (cell line, library, perturbation, condition, phenotype) plus a list of genes in the screen library. The model returns a ranked list of genes most likely to be hits. Predictions are scored against thresholded percentile-relevance scores from the BioGRID ORCS source data, summarized by adjusted nDCG (AnDCG): a chance-corrected ranking metric that is comparable across screens of very different sizes and hit rates.
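The precise adjustment used by AnDCG is not restated here; the sketch below assumes the usual chance-correction pattern, in which the observed DCG@k is rescaled so that a random ranking scores 0 and the ideal ranking scores 1. The Monte Carlo estimate of the random baseline and k = 100 (for AnDCG@100) are illustrative choices.

```python
import numpy as np

def dcg(relevances, k=100):
    """Discounted cumulative gain over the top-k positions."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel * discounts))

def adjusted_ndcg(predicted_genes, relevance_by_gene, k=100, n_random=1000, seed=0):
    """Chance-corrected nDCG@k: (observed - E[random]) / (ideal - E[random]).

    `relevance_by_gene` maps every gene in the screen library to its
    (thresholded percentile) relevance score; predicted genes missing a
    score are treated as irrelevant.
    """
    rng = np.random.default_rng(seed)
    all_rels = np.array(list(relevance_by_gene.values()), dtype=float)

    observed = dcg([relevance_by_gene.get(g, 0.0) for g in predicted_genes], k)
    ideal = dcg(np.sort(all_rels)[::-1], k)
    # Expected DCG@k of a uniformly random ordering of the library,
    # estimated by Monte Carlo.
    expected = np.mean([dcg(rng.permutation(all_rels), k) for _ in range(n_random)])

    if ideal <= expected:  # degenerate screen, e.g. no hits above threshold
        return 0.0
    return (observed - expected) / (ideal - expected)
```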
The benchmark ships three year-split cohorts (train / val / test) built from BioGRID screens published through 2025, plus a held-out LaTest split that is refreshed regularly with screens published in the past six months and serves as an ongoing memorization probe for new frontier models.
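Putting the pieces together, a per-screen evaluation loop might look like the sketch below, which builds on `adjusted_ndcg` from the previous block. The field names (`description`, `library`, `relevance`) and the `call_model` callable are illustrative assumptions, not the benchmark's actual API.

```python
def evaluate_screen(screen, call_model, k=100):
    """Prompt a model with one screen and score its ranked gene list.

    `screen` is assumed to carry a free-text `description`, the library
    gene list `library`, and per-gene relevance scores `relevance`;
    `call_model` is any callable mapping a prompt string to model text.
    """
    prompt = (
        f"{screen['description']}\n\n"
        "Rank the following library genes from most to least likely to be "
        "a hit in this screen, one gene symbol per line:\n"
        + "\n".join(screen["library"])
    )
    response = call_model(prompt)

    # Keep genes that are actually in the library, in the order the model
    # produced them, dropping duplicates and stray lines.
    library, seen, predicted = set(screen["library"]), set(), []
    for line in response.splitlines():
        gene = line.strip()
        if gene in library and gene not in seen:
            seen.add(gene)
            predicted.append(gene)

    return adjusted_ndcg(predicted, screen["relevance"], k=k)
```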