Predict Effects of Alzheimer’s Variants Using ATAC Models#

In this tutorial, we demonstrate how to predict the impact of sequence variation using a trained gReLU model.

import anndata as ad
import pandas as pd
import numpy as np
import os

Load the CATlas model#

This is a binary classification model trained on snATAC-seq data from Catlas (http://catlas.org/humanenhancer/). This model predicts the probability that an input sequence will be accessible in various cell types.

import grelu.resources
model = grelu.resources.load_model(repo_id="Genentech/human-atac-catlas-model", filename='model.ckpt')

View the model’s metadata#

model.data_params is a dictionary containing metadata about the data used to train the model. Let’s look at what information is stored:

model.data_params.keys()
dict_keys(['tasks', 'train', 'val', 'test'])
for k, v in model.data_params['train'].items():
    if k !="intervals":
        print(k, model.data_params['train'][k])
bin_size 1
end both
genome hg38
label_aggfunc None
label_len 200
label_transform_func None
max_label_clip None
max_pair_shift 0
max_seq_shift 2
min_label_clip None
n_alleles 1
n_augmented 1
n_seqs 977014
n_tasks 204
padded_label_len 200
padded_seq_len 204
predict False
rc True
seq_len 200

Note the parameter seq_len. This tells us that the model was trained on 200 bp long sequences.

model.data_params['tasks'] is a large dictionary containing metadata about the output tracks that the model predicts. We can collect these into a dataframe called tasks:

tasks = pd.DataFrame(model.data_params['tasks'])
tasks.head(3)
name cell type
0 Follicular Follicular
1 Fibro General Fibro General
2 Acinar Acinar

Load Alzheimer’s Variants from GWAS Catalog (Jansen et al. 2019 meta-analysis)#

Download a small subset of variants from the AD sumstats file from this meta-analysis study. This contains 1,000 variants mapped to the hg19 genome.

import grelu.resources

variant_file = grelu.resources.download_dataset(
    repo_id="Genentech/alzheimers-variant-tutorial-data", 
    filename="variants.csv"
)
variants = pd.read_csv(variant_file)
variants.head(3)
snpid chrom pos alt ref rsid pval beta se
0 6:32630634_G_A chr6 32630634 G A 6:32630634 0.000071 0.025194 0.006339
1 6:32630797_A_G chr6 32630797 A G 6:32630797 0.000053 0.024866 0.006155
2 6:32630824_T_C chr6 32630824 T C 6:32630824 0.000088 0.026630 0.006790

Filter variants#

from grelu.data.preprocess import filter_blacklist, filter_chromosomes
from grelu.variant import filter_variants

Remove indels, since we don’t support them for now. We also remove variants where one of the alleles contains Ns.

variants = filter_variants(variants, max_del_len=0, max_insert_len=0, standard_bases=True)
Initial number of variants: 1000
Final number of variants: 989

Remove non-standard chromosomes

variants = filter_chromosomes(variants, include='autosomesXY')
Keeping 989 intervals

Remove SNPs from unmappable regions

variants = filter_blacklist(variants, genome="hg19", window=100).reset_index(drop=True)
Keeping 988 intervals
variants.head(3)
snpid chrom pos alt ref rsid pval beta se
0 6:32630634_G_A chr6 32630634 G A 6:32630634 0.000071 0.025194 0.006339
1 6:32630797_A_G chr6 32630797 A G 6:32630797 0.000053 0.024866 0.006155
2 6:32630824_T_C chr6 32630824 T C 6:32630824 0.000088 0.026630 0.006790

Predict variant effects#

The grelu.variant module contains several functions related to analysis of variants. The predict_variant_effects function takes a model and a dataframe of variants, and uses the model to predict the activity of the genomic regions containing both the ref and alt alleles. It can then compare the two predictions and return an effect size for each variant. We can also apply data augmentation, i.e. make predictions for several versions of the sequence and average them together.

import grelu.variant

odds = grelu.variant.predict_variant_effects(
    variants=variants,
    model=model, 
    devices=0, # Run on GPU 0
    num_workers=8,
    batch_size=512,
    genome="hg19",
    compare_func="log2FC", # Return the log2 fold change between alt and ref predictions
    return_ad=True, # Return an anndata object.
    rc = True, # Reverse complement the ref/alt predictions and average them.
)
making dataset

Now this (odds) is an AnnData object (similar to SingleCellExperiment in R but in Python). Might be a good point to pause here and learn more about it: https://anndata.readthedocs.io/en/latest/.

odds
AnnData object with n_obs × n_vars = 988 × 204
    obs: 'snpid', 'chrom', 'pos', 'alt', 'ref', 'rsid', 'pval', 'beta', 'se'
    var: 'cell type'

odds.var contains the cell types for which the model makes predictions.

odds.var.head(3)
cell type
name
Follicular Follicular
Fibro General Fibro General
Acinar Acinar

odds.obs contains the variants.

odds.obs.head(3)
snpid chrom pos alt ref rsid pval beta se
0 6:32630634_G_A chr6 32630634 G A 6:32630634 0.000071 0.025194 0.006339
1 6:32630797_A_G chr6 32630797 A G 6:32630797 0.000053 0.024866 0.006155
2 6:32630824_T_C chr6 32630824 T C 6:32630824 0.000088 0.026630 0.006790

And odds.X contains the predicted effect size for each variant in a numpy array of shape (n_variants x n_cell types)

print(odds.X[:5, :5])
[[ 1.107881    1.1225077   1.3193094   1.760034    1.7427868 ]
 [ 0.06262888 -0.07848518 -0.03106441  0.07413852  0.01521102]
 [-0.5139757  -0.10657616 -0.70205635 -0.56330484 -0.6706449 ]
 [ 0.6097045   0.4748386   0.7162784   0.22368696  0.4141106 ]
 [-1.253903   -0.92243546 -0.43949667  0.32733923  0.13079897]]

Select a variant with effect specific to microglia#

As an example, we search for variants that strongly disrupt accessibility in microglia but not in all cell types.

We set some arbitrary thresholds; The log2 fold change of the alt allele w.r.t the ref allele should be < -2, and the average log2 fold change of the alt allele w.r.t the ref allele should be nonnegative.

This implies that the alt allele should decrease accessibility in microglia to < 1/4 of its reference value, but should not decrease accessibility across all cell types on average.

# Calculate the mean variant effect across all cell types
mean_variant_effect = odds.X.mean(1) 
# Calculate the variant effect only in microglia
microglia_effect = odds[:, 'Microglia'].X.squeeze() 
specific_variant_idx = np.where(
    (microglia_effect < -2) & (mean_variant_effect >= 0) 
)[0]
specific_variant_idx
array([498])

We will isolate this microglia specific variant to analyze further.

variant = odds[specific_variant_idx, :] 
variant
View of AnnData object with n_obs × n_vars = 1 × 204
    obs: 'snpid', 'chrom', 'pos', 'alt', 'ref', 'rsid', 'pval', 'beta', 'se'
    var: 'cell type'
variant.obs
snpid chrom pos alt ref rsid pval beta se
498 6:76659840_T_G chr6 76659840 T G rs138940043 0.000061 0.058235 0.014525

Let’s look at the effect size (log2FC) for this variant:

variant_effect_size = variant[0, "Microglia"].X
variant_effect_size
ArrayView([[-2.7606957]], dtype=float32)

View importance scores in microglia for the bases surrounding the selected variant#

We can use several different approaches to calculate the importance of each base in the reference and alternate sequence. However, we want to score each base specifically by its importance to the model’s prediction in microglia and not the model’s predictions on other tasks.

To do this we define a transform that operates on the model’s predictions and selects only the predictions in Microglia.

from grelu.transforms.prediction_transforms import Aggregate
microglia_score = Aggregate(tasks=["Microglia"], model=model)

Next, we extract the 200 bp long genomic sequences containing the reference and alternate alleles.

ref_seq, alt_seq = grelu.variant.variant_to_seqs(
    seq_len=model.data_params['train']['seq_len'], # Create variant-centered sequence of this length
    genome='hg19',
    **variant.obs.iloc[0][["chrom", "pos", "ref", "alt"]]
)

ref_seq, alt_seq
('CAAAGATATATGTCATAATAATTATCTTTTCACTTTCTTTGTTAGGGACCAGGAATGATAAACCACTTAGTCATTTTTTAGGTTTACAAGAACTTAAGGGGAACTAAGAAAGGAACCCTTACTCCTGAACTCTCAGCCTCATCTGTGCTGGACCATTCTAACTTTGTACCCTTTCATGAGATTGATATAATTTAGAAAAT',
 'CAAAGATATATGTCATAATAATTATCTTTTCACTTTCTTTGTTAGGGACCAGGAATGATAAACCACTTAGTCATTTTTTAGGTTTACAAGAACTTAAGGTGAACTAAGAAAGGAACCCTTACTCCTGAACTCTCAGCCTCATCTGTGCTGGACCATTCTAACTTTGTACCCTTTCATGAGATTGATATAATTTAGAAAAT')

We are now ready to calculate the per-base importance scores. For this, we use the grelu.interpret.score module, which contains functions related to scoring the importance of each base in a sequence. We choose the “inputxgradient” method to score each base.

import grelu.interpret.score

ref_attrs = grelu.interpret.score.get_attributions(
    model, ref_seq, prediction_transform=microglia_score, device=0,
    seed=0, method="inputxgradient",
)

alt_attrs = grelu.interpret.score.get_attributions(
    model, alt_seq, prediction_transform=microglia_score, device=0,
    seed=0, method="inputxgradient",
)

We can visualize the attribution scores for the sequence, highlighting the mutated base. Here, we visualize the attributions for the central 100 bp of the 200 bp sequence.

import grelu.visualize
grelu.visualize.plot_attributions(
    ref_attrs, start_pos=50, end_pos=150,
    highlight_positions=[99], ticks=5,
    edgecolor='red'
)
<logomaker.src.Logo.Logo at 0x14a91de74790>
../_images/14d16aeeb7ace3e48e61ffe91c82d5f3d72e0c8a4e508cd5aac68b11c1ab4660.png
grelu.visualize.plot_attributions(
    alt_attrs, start_pos=50, end_pos=150,
    highlight_positions=[99], ticks=5,
    edgecolor='red',
)
<logomaker.src.Logo.Logo at 0x14a91db785d0>
../_images/d5e0f7a66f761040b9253824b0224707b3da8e35f3cd0ff6a7d7ef5b86219698.png

ISM#

We can also perform ISM (In silico mutagenesis) of the bases surrounding the variant to see what the effect would be if we mutated the reference allele to any other base. The ISM_predict function in grelu.interpret.score performs every possible single-base substitution on the given sequence, predicts the effect of each substitution, and optionally compares these predictions to the reference sequence to return an effect size for each substitution.

Once again, since we are interested in how important each base is to the model’s prediction in microglia, we use the microglia_score transform. This ensures that ISM will score each base’s importance to the microglial prediction only.

ism = grelu.interpret.score.ISM_predict(
    ref_seq,
    model,
    prediction_transform=microglia_score, # Focus on the prediction in microglia
    compare_func = "log2FC", # Return the log2FC of the mutated sequence prediction w.r.t the reference sequence
    devices=0, # Index of the GPU to use
    num_workers=8,
)
ism
C A A A G A T A T A ... T T T A G A A A A T
A 0.054882 0.000000 0.000000 0.000000 -0.004369 0.000000 -0.062929 0.000000 0.037061 0.000000 ... 0.022315 -0.202768 -0.114814 -0.002966 -0.105570 -0.002966 -0.002966 -0.002966 -0.002966 0.034192
C 0.000000 -0.027611 -0.002416 -0.064968 -0.063026 -0.028825 -0.080880 -0.034179 -0.051146 -0.096463 ... 0.176605 -0.177294 0.079079 0.140559 -0.074599 0.211111 0.107023 0.122998 0.092927 0.108006
G 0.024039 0.015707 -0.012429 -0.108350 0.000000 -0.302840 -0.045235 -0.045110 -0.058974 -0.321752 ... 0.069490 0.001972 0.056143 0.127993 -0.002966 0.118532 0.072232 0.066505 0.039074 0.110938
T 0.021374 -0.030615 0.034908 -0.143111 0.033140 -0.051820 0.000000 -0.116558 0.000000 -0.055416 ... 0.000000 0.000000 -0.002966 -0.041839 -0.165060 0.087063 -0.194618 -0.024649 -0.041730 -0.002966

4 rows × 200 columns

grelu.visualize.plot_ISM(
    ism, method="heatmap", center=0, figsize=(20, 1.5),
)
<Axes: >
../_images/bf103719e4d3dda0f57da387917e3fcc8af01fa83ec99a50b82430713958c11d.png
grelu.visualize.plot_ISM(
    ism, method='logo', figsize=(20, 1.5), highlight_positions=[99], edgecolor='red'
)
<logomaker.src.Logo.Logo at 0x14a91d4c0510>
../_images/d996460b3827de6556109a6f489bc704f9230ecb8d9c6d995ffa4935b8b3d87b.png

Scan variant with motifs#

We now scan the reference and alternate sequences to identify known TF motifs that may have been disrupted by the variant. We can use the grelu.interpret.motifs module to scan these sequences with TF motifs. Here, we use the HOCOMOCO v13 set.

from grelu.interpret.motifs import compare_motifs
compare_motifs(
    ref_seq=ref_seq,
    alt_seq=alt_seq,
    motifs="hocomoco_v13",
    pthresh=1e-4,
    ref_attrs=ref_attrs,
    alt_attrs=alt_attrs
)
motif start end strand fimo_p-value_alt fimo_p-value_ref fimo_score_alt fimo_score_ref motif_attr_score_alt motif_attr_score_ref site_attr_score_alt site_attr_score_ref fimo_score_diff
0 SPIB.H13CORE.1.S.C 93 106 + 0.002108 0.000090 -7.645480 5.641884 -0.000985 0.051337 -0.000291 0.011623 -13.287364
1 SPIB.H13CORE.0.P.B 94 110 + 0.002976 0.000052 -3.500456 9.686128 -0.001005 0.042656 -0.000401 0.011381 -13.186584
2 SPIB.H13CORE.2.SM.B 91 107 + 0.001238 0.000082 -7.038979 4.544859 -0.000797 0.041247 -0.000220 0.009247 -11.583838
3 SPI1.H13CORE.1.S.B 95 107 + 0.006406 0.000049 2.095837 11.517704 -0.001016 0.052423 -0.000343 0.014483 -9.421867
4 ZN701.H13CORE.0.P.B 85 104 + 0.000435 0.000036 5.164952 10.891898 -0.000414 0.017886 -0.000233 0.008311 -5.726945
5 STAT1.H13CORE.0.P.B 80 100 - 0.000047 0.000026 11.692194 12.570953 -0.000170 0.001517 -0.000188 0.001428 -0.878759
6 NR1I3.H13CORE.1.PSM.A 92 109 - 0.000099 0.090040 9.407806 -28.693682 -0.000907 0.015508 -0.000366 0.009969 38.101488
7 NR1I3.H13CORE.2.M.C 89 109 - 0.000011 0.931289 13.900074 -69.032222 -0.000684 0.009954 -0.000268 0.008749 82.932295

The fimo_score_ref column contains the score for the motif in the reference allele-containing sequence and the fimo_score_alt column contains the score in the alternate-allele containing sequence. Similarly, the fimo_p-value_ref column contains the p-value for the motif in the reference allele-containing sequence and the fimo_p-value_alt column contains the p-value in the alternate-allele containing sequence.

site_attr_score is the average attribution value for all nucleotides within the motif match site. This score gives the importance of the sequence region but does not reflect the similarity between the PWM and the shape of the attributions.

motif_attr_score is the sum of the element-wise product of the PWM and the attributions. This score is higher when the shape of the motif matches the shape of the attribution profile, and is useful for ranking multiple motifs that all match to the same sequence region.

Note that the SPI1.H13CORE.1.P.B motif is much weaker in the alternate allele-containing sequence.

Compare the variant impact to a background distribution#

We saw that the variant has a strong effect size (log2 fold change). To place this effect size in context, we create a set of background sequences by shuffling the sequence surrounding the variant while conserving dinucleotide frequency. We then insert the reference and alternate alleles in each shuffled sequence, and compute the variant effect size again.

test = grelu.variant.marginalize_variants(
    model=model,
    variants=variant.obs,
    genome="hg19",
    prediction_transform=microglia_score,
    seed=0,
    devices=0,
    n_shuffles=100,
    compare_func="log2FC",
    rc=True, # Average the result over both strands
)
Predicting variant effects
making dataset
Predicting variant effects in background sequences
Calculating background distributions
Performing 2-sided test
test
{'effect_size': [-2.758793592453003],
 'mean': [-0.002620171522721648],
 'sd': [0.33667072653770447],
 'pvalue': [2.6881005251979756e-16]}