Memorization analysis

Why do frontier LLMs perform better on older screens than on freshly released ones? AssayBench's recent LaTest split, together with citation-count metadata, lets us probe that gap directly.

From §5.5: A regression analysis of Gemini 3 Pro performance across AssayBench as a function of screen publication year, citation count, and phenotype shows that the apparent temporal effect is largely explained by citations. This is consistent with memorization: highly cited screens are more likely to have been discussed in the literature, increasing the chance that their biological findings were present in the pretraining data of frontier models.

Citations vs Gemini 3 Pro AnDCG@100

Each point is a single screen; the x-axis is log(1 + citation count) of the source publication. Colors track the coarse phenotype class.

Regression coefficients

Linear regression of AnDCG@100 on centered publication year, log(1 + citations), and phenotype dummies. Error bars are ±1 standard error.
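The regression above can be sketched as follows. This is a minimal reconstruction on synthetic data, not the paper's actual analysis: the per-screen values, phenotype class names, and effect sizes are all invented for illustration, and only the model specification (OLS of AnDCG@100 on centered year, log(1 + citations), and phenotype dummies, with ±1 standard error) follows the description here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-screen data; the real AssayBench metadata is not reproduced here.
n = 60
year = rng.integers(2005, 2025, size=n)
citations = rng.poisson(lam=np.exp(rng.normal(3.0, 1.5, size=n)))
phenotype = rng.choice(["viability", "morphology", "reporter"], size=n)
# Simulated AnDCG@100 driven by citations, mirroring the claimed effect.
score = 0.4 + 0.05 * np.log1p(citations) + rng.normal(0, 0.05, size=n)

# Design matrix: intercept, centered publication year, log(1 + citations),
# and phenotype dummies (first class as the reference level).
levels = sorted(set(phenotype))
dummies = np.column_stack([(phenotype == lv).astype(float) for lv in levels[1:]])
X = np.column_stack([np.ones(n), year - year.mean(), np.log1p(citations), dummies])

# Ordinary least squares fit, with +/- 1 standard error as in the figure.
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

names = ["intercept", "year_centered", "log1p_citations"] + [f"pheno_{lv}" for lv in levels[1:]]
for name, b, s in zip(names, beta, se):
    print(f"{name:>16}: {b:+.3f} +/- {s:.3f}")
```

Under this setup, the year coefficient hovers near zero while the citation coefficient recovers the simulated effect, which is the qualitative pattern the paragraph below describes.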

How to read this

If frontier LLMs only "won" by memorizing public benchmarks, the year coefficient would dominate. Instead the citation coefficient absorbs most of the temporal signal, suggesting that it is not the date of publication but how widely a screen has been discussed in the literature that drives performance. That is precisely what one would expect if pretraining corpora over-represent highly cited work. Nevertheless, frontier LLMs still retain the strongest relative performance on LaTest, so memorization is part of the picture, not all of it.