Benchmark · Phenotypic Screen Prediction · Genentech · 2026
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
Genentech, South San Francisco, CA, USA · ★ equal contribution
Building the virtual cell requires more than predicting gene expression. Can a model predict the outcome of a CRISPR screen before you run it? AssayBench frames in silico phenotypic screening as a gene-ranking task on 1,920 public CRISPR screens and gives a single, comparable yardstick (adjusted nDCG) for measuring progress across heterogeneous assays.
Abstract
Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate screen prediction as a gene-ranking task for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
By the numbers
1,920 public CRISPR screens · five broad classes of cellular phenotypes · one shared metric (adjusted nDCG) · a year-split train / val / test plus a regularly refreshed LaTest cohort
Headline findings
- Frontier generalist LLMs outperform biology-specific LLMs, biomedical agents, trainable baselines, and retrieval baselines.
- Even the best models remain far below empirical performance ceilings. Oracle kNN outperforms the top model by ~86%, and technical-replicate predictors nearly double the AnDCG@100 of Gemini 3 Pro.
- Fine-tuning (SFT, GRPO) and ensembling each push performance further; an ensemble of LLM rankings fused with reciprocal rank fusion (RRF) achieves the best overall test score (a minimal RRF sketch follows this list).
- The performance drop on our recent LaTest split, together with a citation-count effect, is consistent with partial memorization of public screen literature.
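For reference, reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The benchmark's exact fusion settings (which models are fused, the constant k, any weighting) are not reproduced here; this is only a minimal sketch of standard RRF over per-model gene rankings.

```python
def rrf_ensemble(rankings, k=60):
    """Fuse several ranked gene lists with reciprocal rank fusion (RRF).

    `rankings` holds one ranked gene list per model (best gene first).
    Each gene's fused score is the sum of 1 / (k + rank) over the lists
    that ranked it; k = 60 is the conventional RRF constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, gene in enumerate(ranking, start=1):
            scores[gene] = scores.get(gene, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: fuse two models' rankings of the same library.
fused = rrf_ensemble([
    ["TP53", "KRAS", "MYC"],   # model A
    ["KRAS", "EGFR", "TP53"],  # model B
])
```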
Explore the results
Every panel below is interactive and powered by the same results cache that produced the paper's figures.
Full benchmark table
Sort and filter every model in the cache by metric, cohort, and split layout. A dot-and-line chart updates with your selection.
Performance by phenotype
Mean AnDCG@100 per (phenotype × model) on the year-split test cohort. Viability screens are the most predictable, host-pathogen the hardest.
Qwen3.5 scaling + optimization
Mean AnDCG@100 vs parameter count for the Qwen3.5 family, plus side-by-side bars comparing base, SFT, and GRPO variants.
Citations vs performance
Gemini 3 Pro AnDCG@100 vs log(1 + citation count) across screens, plus regression coefficients for year, citations, and phenotype.
Gene-set bias
For each (model, screen), the fraction of top-100 predicted genes in curated gene sets minus the same fraction in ground truth.
Per-screen explorer
Pick any of the 1,920 screens to see every model's AnDCG@100 on that screen, alongside the screen's metadata.
Screen UMAP
UMAP over the ground-truth screen transfer matrix. Color by split, phenotype, cell type, library, or publication year.
How the task works
Each example is a free-text description of a CRISPR screen (cell line, library, perturbation, condition, phenotype) plus a list of genes in the screen library. The model returns a ranked list of genes most likely to be hits. Predictions are scored against thresholded percentile-relevance scores from the BioGRID ORCS source data, summarized by adjusted nDCG (AnDCG): a chance-corrected ranking metric that is comparable across screens of very different sizes and hit rates.
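The precise adjustment used by AnDCG is not restated here; the sketch below assumes the usual chance-correction pattern, in which the observed DCG@k is rescaled so that a random ranking scores 0 and the ideal ranking scores 1. The Monte Carlo estimate of the random baseline and k = 100 (for AnDCG@100) are illustrative choices.

```python
import numpy as np

def dcg(relevances, k=100):
    """Discounted cumulative gain over the top-k positions."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel * discounts))

def adjusted_ndcg(predicted_genes, relevance_by_gene, k=100, n_random=1000, seed=0):
    """Chance-corrected nDCG@k: (observed - E[random]) / (ideal - E[random]).

    `relevance_by_gene` maps every gene in the screen library to its
    (thresholded percentile) relevance score; predicted genes missing a
    score are treated as irrelevant.
    """
    rng = np.random.default_rng(seed)
    all_rels = np.array(list(relevance_by_gene.values()), dtype=float)

    observed = dcg([relevance_by_gene.get(g, 0.0) for g in predicted_genes], k)
    ideal = dcg(np.sort(all_rels)[::-1], k)
    # Expected DCG@k of a uniformly random ordering of the library,
    # estimated by Monte Carlo.
    expected = np.mean([dcg(rng.permutation(all_rels), k) for _ in range(n_random)])

    if ideal <= expected:  # degenerate screen, e.g. no hits above threshold
        return 0.0
    return (observed - expected) / (ideal - expected)
```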
The benchmark ships three year-split cohorts (train / val / test) built from BioGRID screens published through 2025, plus a held-out LaTest split that is refreshed regularly with screens published in the past six months and serves as an ongoing memorization probe for new frontier models.
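Putting the pieces together, a per-screen evaluation loop might look like the sketch below, which builds on `adjusted_ndcg` from the previous block. The field names (`description`, `library`, `relevance`) and the `call_model` callable are illustrative assumptions, not the benchmark's actual API.

```python
def evaluate_screen(screen, call_model, k=100):
    """Prompt a model with one screen and score its ranked gene list.

    `screen` is assumed to carry a free-text `description`, the library
    gene list `library`, and per-gene relevance scores `relevance`;
    `call_model` is any callable mapping a prompt string to model text.
    """
    prompt = (
        f"{screen['description']}\n\n"
        "Rank the following library genes from most to least likely to be "
        "a hit in this screen, one gene symbol per line:\n"
        + "\n".join(screen["library"])
    )
    response = call_model(prompt)

    # Keep genes that are actually in the library, in the order the model
    # produced them, dropping duplicates and stray lines.
    library, seen, predicted = set(screen["library"]), set(), []
    for line in response.splitlines():
        gene = line.strip()
        if gene in library and gene not in seen:
            seen.add(gene)
            predicted.append(gene)

    return adjusted_ndcg(predicted, screen["relevance"], k=k)
```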