AssayBench · Task & evaluation metrics · §3
Adjusted nDCG@k (AnDCG@k)
AnDCG@k is the primary metric we use to score gene rankings across heterogeneous CRISPR screens. It is a modified version of nDCG@k with two changes: it is condensed (unassayed genes are ignored, not penalized as false positives) and adjusted (a screen-specific random baseline is subtracted and the score rescaled, then random-or-worse performance is clamped to zero).
Step 1. Lift predictions into the relevance space
Let \([g_1,\dots,g_L]\) be the ranked prediction list and let \(\mathbf{y}\) denote the ground-truth relevance scores for the target screen. We construct the score sequence \(\mathbf{x}\) by assigning \(x_i = y_{g_i}\) when gene \(g_i\) is assayed in the screen, and \(x_i = \mathrm{MISSING}\) otherwise.
Predicted genes that are not measured in the target screen are tagged as MISSING rather
than treated as false positives. This matters because library composition varies wildly across
AssayBench's 1,920 screens.
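As a minimal sketch of this lifting step (the function name and the sentinel are illustrative, not from the AssayBench codebase):

```python
MISSING = object()  # sentinel for genes not measured in the target screen

def lift_to_relevance(ranked_genes, screen_scores):
    """Map a ranked gene list onto the screen's relevance scores,
    tagging unassayed genes as MISSING rather than scoring them as
    zero-relevance false positives."""
    return [screen_scores[g] if g in screen_scores else MISSING
            for g in ranked_genes]

scores = {"TP53": 2.0, "KRAS": -1.0, "MYC": 0.5}
x = lift_to_relevance(["TP53", "BRCA1", "KRAS"], scores)
# "BRCA1" is not assayed in this screen, so it is tagged MISSING
```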
Step 2. Truncate and condense
If the prediction list is shorter than \(k\), pad it with zero-relevance entries. Then truncate to the first \(k\) positions and remove all MISSING entries while preserving order. Genes ranked below \(k\) do not move up to replace missing top-\(k\) predictions. Concretely, with \(k=5\): if the truncated scores are \((x_1, \mathrm{MISSING}, x_3, x_4, \mathrm{MISSING})\), condensing yields \((x_1, x_3, x_4)\), and the gene originally ranked sixth is not promoted into the window.
Call the condensed sequence \(\mathbf{x}'=(x'_1,\dots,x'_{k'})\) where \(k' \le k\).
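The pad/truncate/condense logic can be sketched with a hypothetical helper (not the benchmark's implementation):

```python
MISSING = object()  # sentinel for unassayed genes

def condense_top_k(x, k):
    """Pad with zero-relevance entries to length k, truncate to the
    first k positions, then drop MISSING entries in order-preserving
    fashion; genes ranked below k never move up into the window."""
    padded = list(x) + [0.0] * max(0, k - len(x))
    top_k = padded[:k]
    return [v for v in top_k if v is not MISSING]

x = [2.0, MISSING, 0.5, MISSING, -1.0, 3.0]  # 3.0 is ranked sixth
xp = condense_top_k(x, k=5)
# → [2.0, 0.5, -1.0]; the sixth-ranked score never enters the top-5 window
```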
Step 3. Discounted Cumulative Gain
The discounted cumulative gain of the condensed sequence is \(\mathrm{DCG}@k = \sum_{i=1}^{k'} \frac{x'_i}{\log_2(i+1)}\). Because relevance scores can be negative for genes associated with the opposite phenotype direction, ranking such genes near the top decreases the score. This is a desirable property for directional screens (e.g. "increases drug resistance" should not look the same as "decreases drug resistance").
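A sketch of the DCG computation over the condensed sequence, assuming the standard log2 position discount with linear gain (this excerpt does not spell out the discount, so that choice is an assumption):

```python
import math

def dcg_at_k(condensed):
    """Linear-gain DCG over the condensed sequence x'. Negative
    relevance (opposite phenotype direction) subtracts from the score,
    and subtracts most when ranked near the top where the discount is
    smallest."""
    # Position i (0-based) gets discount 1 / log2(i + 2).
    return sum(v / math.log2(i + 2) for i, v in enumerate(condensed))
```

Ranking a wrong-direction gene first is strictly worse than ranking it last, which is the directional property described above.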
Step 4. Normalise against the ideal ranking
The ideal ranking is obtained by clipping negative relevance values to zero and sorting the resulting ground-truth relevance scores in descending order. Clipping means the ideal predictor is not penalized for avoiding genes associated with the opposite phenotype direction. Let \(\mathrm{IDCG}@k\) be the corresponding discounted cumulative gain. Then \(\mathrm{nDCG}@k = \mathrm{DCG}@k \,/\, \mathrm{IDCG}@k\).
Set to 0 when the denominator is zero. The negative-relevance clipping prevents the ideal normalizer from making \(\mathrm{nDCG}@k\) exceed 1.
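The clipped-ideal normalization might look like the sketch below; whether the ideal list is cut at \(k\) or \(k'\) is not stated in this excerpt, so truncation at \(k\) is an assumption:

```python
import math

def dcg(xs):
    # Linear-gain DCG with the standard log2 discount (an assumption).
    return sum(v / math.log2(i + 2) for i, v in enumerate(xs))

def ndcg_at_k(condensed, screen_scores, k):
    """nDCG@k: the ideal ranking clips negative relevance to zero,
    sorts descending, and keeps the top k. Returns 0 when IDCG is 0."""
    ideal = sorted((max(0.0, v) for v in screen_scores), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(condensed) / idcg if idcg > 0 else 0.0
```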
Step 5. Adjust against a per-screen random baseline
Raw nDCG values are not directly comparable across screens: predicting the hits of some assays is intrinsically easier than others (think viability vs reporter activity). We correct for this by subtracting a screen-specific random baseline \(\mathrm{nDCG}_{\mathrm{rand}}@k\) and rescaling: \(\mathrm{AnDCG}@k = \max\!\bigl(0,\; (\mathrm{nDCG}@k - \mathrm{nDCG}_{\mathrm{rand}}@k)\,/\,(1 - \mathrm{nDCG}_{\mathrm{rand}}@k)\bigr)\).
In practice this rescales performance so that 0 corresponds to a random ranking and 1 to the ideal ranking for every screen, regardless of how easy or hard the screen is. The expected value of nDCG@k under a uniformly random predictor can be computed analytically (see paper, Appendix on adjusted-condensed nDCG), so we don't need Monte Carlo sampling. Scores that would fall below the random baseline are reported as 0.
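Under the stated properties (0 for a random ranking, 1 for the ideal ranking, clamped below), the adjustment step can be sketched as follows; the rescaling denominator \(1 - \mathrm{nDCG}_{\mathrm{rand}}@k\) is inferred from those properties rather than quoted from the paper:

```python
def adjusted_ndcg(ndcg, ndcg_rand):
    """Subtract the screen-specific random baseline, rescale so the
    ideal ranking still maps to 1, and clamp random-or-worse
    performance to zero."""
    if ndcg_rand >= 1.0:
        return 0.0  # degenerate screen: random already matches the ideal
    return max(0.0, (ndcg - ndcg_rand) / (1.0 - ndcg_rand))
```

With this form, a screen where random scores 0.4 and one where random scores 0.05 both report 0 for random predictors and 1 for ideal ones.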
Companion metrics
Precision@k
For precision-style metrics, unscored (MISSING) predictions are condensed away before the top-\(k\) cutoff in the paper definition, where \((c_1,\dots,c_M)\) is the condensed sequence and \(k'=\min(k,M)\). Normalized Precision divides by the screen-specific maximum attainable precision.
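A sketch of Precision@k under these conventions, assuming a gene counts as a hit when its relevance is strictly positive (that threshold is an assumption of this sketch, not quoted from the paper):

```python
def precision_at_k(condensed, k):
    """Fraction of the top k' = min(k, M) scored (assayed) predictions
    that are hits; condensing happens before the cutoff."""
    kp = min(k, len(condensed))
    if kp == 0:
        return 0.0
    return sum(1 for v in condensed[:kp] if v > 0) / kp
```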
Directional false discovery rate (dFDR@k)
Reported only for directional screens, where some genes have negative relevance because they correspond to the opposite phenotype direction. This metric measures how often the model puts those "wrong-direction" genes among its top scored predictions. Normalized dFDR divides by the screen-specific maximum attainable dFDR.
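A sketch of dFDR@k, assuming the same condense-before-cutoff convention as the precision definition and counting a prediction as wrong-direction when its relevance is negative (both conventions are assumptions of this sketch):

```python
def dfdr_at_k(condensed, k):
    """Fraction of the top k' = min(k, M) scored predictions whose
    relevance is negative, i.e. genes associated with the opposite
    phenotype direction."""
    kp = min(k, len(condensed))
    if kp == 0:
        return 0.0
    return sum(1 for v in condensed[:kp] if v < 0) / kp
```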
Why this matters
With raw nDCG, a uniformly random ranking can already score 0.4 on a viability screen but only 0.05 on a sparse reporter screen, so per-screen averages are dominated by the easy assays. AnDCG@k centers every screen on its own random baseline, so mean AnDCG@100 across 1,920 heterogeneous screens is a fair measure of how well a method captures screen-specific biology.
Source: §3 (Task and evaluation metrics) of the AssayBench paper. See the arXiv version for the closed-form expression of \(\mathrm{nDCG}_{\mathrm{rand}}@k\) and additional discussion of edge cases.