AssayBench · Task & evaluation metrics · §3
Adjusted nDCG@k (AnDCG@k)
AnDCG@k is the primary metric we use to score gene rankings across heterogeneous CRISPR screens. It is a modified version of nDCG@k with two changes: it is condensed (unassayed genes are ignored, not penalized as false positives) and adjusted (a screen-specific random baseline is subtracted and the score rescaled, then random-or-worse performance is clamped to zero).
Step 1. Lift predictions into the relevance space
Let \([g_1,\dots,g_L]\) be the ranked prediction list and let \(\mathbf{y}\) denote the ground-truth relevance scores for the target screen. We construct the score sequence \(\mathbf{x}\) by assigning \(x_i = y_{g_i}\) when gene \(g_i\) is assayed in the screen, and \(x_i = \mathrm{MISSING}\) otherwise.
Predicted genes that are not measured in the target screen are tagged as MISSING rather
than treated as false positives. This matters because library composition varies wildly across
AssayBench's 1,920 screens.
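As a minimal sketch of this lifting step (the function name and the sentinel are illustrative, not from the AssayBench codebase):

```python
MISSING = object()  # sentinel for genes not measured in the target screen

def lift_to_relevance(ranked_genes, screen_scores):
    """Map a ranked gene list onto the screen's relevance scores,
    tagging unassayed genes as MISSING rather than scoring them as
    zero-relevance false positives."""
    return [screen_scores[g] if g in screen_scores else MISSING
            for g in ranked_genes]

scores = {"TP53": 2.0, "KRAS": -1.0, "MYC": 0.5}
x = lift_to_relevance(["TP53", "BRCA1", "KRAS"], scores)
# "BRCA1" is not assayed in this screen, so it is tagged MISSING
```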
Step 2. Truncate and condense
If the prediction list is shorter than \(k\), pad it with zero-relevance entries. Then truncate to the first \(k\) positions and remove all MISSING entries while preserving order. Genes ranked below \(k\) do not move up to replace missing top-\(k\) predictions. Concretely, with \(k=5\): if the truncated scores are \((x_1, \mathrm{MISSING}, x_3, x_4, \mathrm{MISSING})\), condensing yields \((x_1, x_3, x_4)\), and the gene originally ranked sixth is not promoted into the window.
Call the condensed sequence \(\mathbf{x}'=(x'_1,\dots,x'_{k'})\) where \(k' \le k\).
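The pad/truncate/condense logic can be sketched with a hypothetical helper (not the benchmark's implementation):

```python
MISSING = object()  # sentinel for unassayed genes

def condense_top_k(x, k):
    """Pad with zero-relevance entries to length k, truncate to the
    first k positions, then drop MISSING entries in order-preserving
    fashion; genes ranked below k never move up into the window."""
    padded = list(x) + [0.0] * max(0, k - len(x))
    top_k = padded[:k]
    return [v for v in top_k if v is not MISSING]

x = [2.0, MISSING, 0.5, MISSING, -1.0, 3.0]  # 3.0 is ranked sixth
xp = condense_top_k(x, k=5)
# → [2.0, 0.5, -1.0]; the sixth-ranked score never enters the top-5 window
```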
Step 3. Discounted Cumulative Gain
The discounted cumulative gain of the condensed sequence is \(\mathrm{DCG}@k = \sum_{i=1}^{k'} \frac{x'_i}{\log_2(i+1)}\). Because relevance scores can be negative for genes associated with the opposite phenotype direction, ranking such genes near the top decreases the score. This is a desirable property for directional screens (e.g. "increases drug resistance" should not look the same as "decreases drug resistance").
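A sketch of the DCG computation over the condensed sequence, assuming the standard log2 position discount with linear gain (this excerpt does not spell out the discount, so that choice is an assumption):

```python
import math

def dcg_at_k(condensed):
    """Linear-gain DCG over the condensed sequence x'. Negative
    relevance (opposite phenotype direction) subtracts from the score,
    and subtracts most when ranked near the top where the discount is
    smallest."""
    # Position i (0-based) gets discount 1 / log2(i + 2).
    return sum(v / math.log2(i + 2) for i, v in enumerate(condensed))
```

Ranking a wrong-direction gene first is strictly worse than ranking it last, which is the directional property described above.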
Step 4. Normalise against the ideal ranking
The ideal ranking is obtained by clipping negative relevance values to zero and sorting the resulting ground-truth relevance scores in descending order. Clipping means the ideal predictor is not penalized for avoiding genes associated with the opposite phenotype direction. Let \(\mathrm{IDCG}@k\) be the corresponding discounted cumulative gain. Then \(\mathrm{nDCG}@k = \mathrm{DCG}@k \,/\, \mathrm{IDCG}@k\).
Set to 0 when the denominator is zero. The negative-relevance clipping prevents the ideal normalizer from making \(\mathrm{nDCG}@k\) exceed 1.
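The clipped-ideal normalization might look like the sketch below; whether the ideal list is cut at \(k\) or \(k'\) is not stated in this excerpt, so truncation at \(k\) is an assumption:

```python
import math

def dcg(xs):
    # Linear-gain DCG with the standard log2 discount (an assumption).
    return sum(v / math.log2(i + 2) for i, v in enumerate(xs))

def ndcg_at_k(condensed, screen_scores, k):
    """nDCG@k: the ideal ranking clips negative relevance to zero,
    sorts descending, and keeps the top k. Returns 0 when IDCG is 0."""
    ideal = sorted((max(0.0, v) for v in screen_scores), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(condensed) / idcg if idcg > 0 else 0.0
```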
Step 5. Adjust against a per-screen random baseline
Raw nDCG values are not directly comparable across screens: predicting the hits of some assays is intrinsically easier than others (think viability vs reporter activity). We correct for this by subtracting a screen-specific random baseline \(\mathrm{nDCG}_{\mathrm{rand}}@k\) and rescaling: \(\mathrm{AnDCG}@k = \max\!\bigl(0,\; (\mathrm{nDCG}@k - \mathrm{nDCG}_{\mathrm{rand}}@k)\,/\,(1 - \mathrm{nDCG}_{\mathrm{rand}}@k)\bigr)\).
In practice this rescales performance so that 0 corresponds to a random ranking and 1 to the ideal ranking for every screen, regardless of how easy or hard the screen is. The expected value of nDCG@k under a uniformly random predictor can be computed analytically (see paper, Appendix on adjusted-condensed nDCG), so we don't need Monte Carlo sampling. Scores that would fall below the random baseline are reported as 0.
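Under the stated properties (0 for a random ranking, 1 for the ideal ranking, clamped below), the adjustment step can be sketched as follows; the rescaling denominator \(1 - \mathrm{nDCG}_{\mathrm{rand}}@k\) is inferred from those properties rather than quoted from the paper:

```python
def adjusted_ndcg(ndcg, ndcg_rand):
    """Subtract the screen-specific random baseline, rescale so the
    ideal ranking still maps to 1, and clamp random-or-worse
    performance to zero."""
    if ndcg_rand >= 1.0:
        return 0.0  # degenerate screen: random already matches the ideal
    return max(0.0, (ndcg - ndcg_rand) / (1.0 - ndcg_rand))
```

With this form, a screen where random scores 0.4 and one where random scores 0.05 both report 0 for random predictors and 1 for ideal ones.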
Companion metrics
Precision@k
For precision-style metrics, unscored (MISSING) predictions are condensed away before the top-\(k\) cutoff in the paper definition, where \((c_1,\dots,c_M)\) is the condensed sequence and \(k'=\min(k,M)\). Normalized Precision divides by the screen-specific maximum attainable precision.
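A sketch of Precision@k under these conventions, assuming a gene counts as a hit when its relevance is strictly positive (that threshold is an assumption of this sketch, not quoted from the paper):

```python
def precision_at_k(condensed, k):
    """Fraction of the top k' = min(k, M) scored (assayed) predictions
    that are hits; condensing happens before the cutoff."""
    kp = min(k, len(condensed))
    if kp == 0:
        return 0.0
    return sum(1 for v in condensed[:kp] if v > 0) / kp
```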
Directional false discovery rate (dFDR@k)
Reported only for directional screens, where some genes have negative relevance because they correspond to the opposite phenotype direction. This metric measures how often the model puts those "wrong-direction" genes among its top scored predictions. Normalized dFDR divides by the screen-specific maximum attainable dFDR.
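A sketch of dFDR@k, assuming the same condense-before-cutoff convention as the precision definition and counting a prediction as wrong-direction when its relevance is negative (both conventions are assumptions of this sketch):

```python
def dfdr_at_k(condensed, k):
    """Fraction of the top k' = min(k, M) scored predictions whose
    relevance is negative, i.e. genes associated with the opposite
    phenotype direction."""
    kp = min(k, len(condensed))
    if kp == 0:
        return 0.0
    return sum(1 for v in condensed[:kp] if v < 0) / kp
```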
Why this matters
With raw nDCG, a uniformly random ranking can already score 0.4 on a viability screen but only 0.05 on a sparse reporter screen, so per-screen averages are dominated by the easy assays. AnDCG@k centers every screen on its own random baseline, so mean AnDCG@100 across 1,920 heterogeneous screens is a fair measure of how well a method captures screen-specific biology.
Source: §3 (Task and evaluation metrics) of the AssayBench paper. See the arXiv version for the closed-form expression of \(\mathrm{nDCG}_{\mathrm{rand}}@k\) and additional discussion of edge cases.