Predict Effects of Alzheimer’s Variants Using ATAC Models#
In this tutorial, we demonstrate how to predict the impact of sequence variation using a trained gReLU model.
import anndata as ad
import pandas as pd
import numpy as np
import os
Load the CATlas model#
This is a binary classification model trained on snATAC-seq data from CATlas (http://catlas.org/humanenhancer/). This model predicts the probability that an input sequence will be accessible in various cell types.
import grelu.resources
model = grelu.resources.load_model(project='human-atac-catlas', model_name='model')
View the model’s metadata#
model.data_params is a dictionary containing metadata about the data used to train the model. Let’s look at what information is stored:
for key in list(model.data_params.keys())[:21]:
    if key != "tasks":
        print(key, model.data_params[key])
train_bin_size 1
train_chroms ['chr18', 'chr5', 'chr1', 'chr17', 'chr9', 'chr19', 'chr8', 'chr21', 'chr14', 'chr20', 'chr22', 'chr10', 'chr15', 'chr12', 'chr16', 'chr3', 'chr4', 'chr11', 'chr2', 'chr6']
train_end both
train_genome hg38
train_label_aggfunc None
train_label_len 200
train_label_transform_func None
train_max_label_clip None
train_max_pair_shift 0
train_max_seq_shift 1
train_min_label_clip None
train_n_alleles 1
train_n_augmented 1
train_n_seqs 1022128
train_n_tasks 204
train_padded_label_len 200
train_padded_seq_len 202
train_predict False
train_rc True
train_seq_len 200
Note the parameter train_seq_len. This tells us that the model was trained on 200 bp long sequences.
model.data_params['tasks'] is a large dictionary containing metadata about the output tracks that the model predicts. We can collect these into a dataframe called tasks:
tasks = pd.DataFrame(model.data_params['tasks'])
tasks.head(3)
| | name | cell type |
|---|---|---|
| 0 | Follicular | Follicular |
| 1 | Fibro General | Fibro General |
| 2 | Acinar | Acinar |
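Later in this tutorial we will focus on the Microglia track, so it is worth checking that it appears among the model’s tasks. A minimal sketch, assuming the task is named exactly “Microglia” (as it is used further below):

# Confirm that a Microglia task is present among the model's output tracks
tasks[tasks["name"] == "Microglia"]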
Load Alzheimer’s Variants from GWAS Catalog (Jansen et al. 2019 meta-analysis)#
Download a small subset of variants from the AD summary statistics (sumstats) file of this meta-analysis. The subset contains 1,000 variants mapped to the hg19 genome.
import grelu.resources
variant_dir = grelu.resources.get_artifact(
project='alzheimers-variant-tutorial',
name='dataset'
).download()
variant_file = os.path.join(variant_dir, "variants.txt")
variants = pd.read_table(variant_file)
variants.head(3)
| | snpid | chrom | pos | alt | ref | rsid | zscore | pval | nsum | neff | direction | eaf | beta | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6:32630634_G_A | chr6 | 32630634 | G | A | 6:32630634 | 3.974476 | 0.000071 | 71639 | 71639.0 | ?+?+ | 0.2237 | 0.025194 | 0.006339 |
| 1 | 6:32630797_A_G | chr6 | 32630797 | A | G | 6:32630797 | 4.040244 | 0.000053 | 71639 | 71639.0 | ?+?+ | 0.2435 | 0.024866 | 0.006155 |
| 2 | 6:32630824_T_C | chr6 | 32630824 | T | C | 6:32630824 | 3.921736 | 0.000088 | 71639 | 71639.0 | ?+?+ | 0.1859 | 0.026630 | 0.006790 |
Filter variants#
from grelu.data.preprocess import filter_blacklist, filter_chromosomes
from grelu.variant import filter_variants
Remove indels, since they are not currently supported. We also remove variants where one of the alleles contains Ns.
variants = filter_variants(variants, max_del_len=0, max_insert_len=0, standard_bases=True)
Initial number of variants: 1000
Final number of variants: 989
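To see roughly what this filter checks, the same criteria can be written in plain pandas. This is only an illustrative sketch, not the library’s implementation:

# Illustrative sketch: flag variants where both alleles are single, standard bases
is_standard_snp = (
    variants["ref"].str.len().eq(1)
    & variants["alt"].str.len().eq(1)
    & variants["ref"].isin(list("ACGT"))
    & variants["alt"].isin(list("ACGT"))
)
print(f"{is_standard_snp.sum()} of {len(variants)} variants pass the SNP filter")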
Remove non-standard chromosomes
variants = filter_chromosomes(variants, include='autosomesXY')
Keeping 989 intervals
Remove SNPs from unmappable regions
variants = filter_blacklist(variants, genome="hg19", window=100).reset_index(drop=True)
Keeping 988 intervals
variants.head(3)
| | snpid | chrom | pos | alt | ref | rsid | zscore | pval | nsum | neff | direction | eaf | beta | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6:32630634_G_A | chr6 | 32630634 | G | A | 6:32630634 | 3.974476 | 0.000071 | 71639 | 71639.0 | ?+?+ | 0.2237 | 0.025194 | 0.006339 |
| 1 | 6:32630797_A_G | chr6 | 32630797 | A | G | 6:32630797 | 4.040244 | 0.000053 | 71639 | 71639.0 | ?+?+ | 0.2435 | 0.024866 | 0.006155 |
| 2 | 6:32630824_T_C | chr6 | 32630824 | T | C | 6:32630824 | 3.921736 | 0.000088 | 71639 | 71639.0 | ?+?+ | 0.1859 | 0.026630 | 0.006790 |
Predict variant effects#
The grelu.variant module contains several functions related to analysis of variants. The predict_variant_effects function takes a model and a dataframe of variants, and uses the model to predict the activity of the genomic regions containing both the ref and alt alleles. It can then compare the two predictions and return an effect size for each variant. We can also apply data augmentation, i.e. make predictions for several versions of the sequence and average them together.
import grelu.variant
odds = grelu.variant.predict_variant_effects(
variants=variants,
model=model,
devices=0, # Run on GPU 0
num_workers=8,
batch_size=512,
genome="hg19",
compare_func="log2FC", # Return the log2 fold change between alt and ref predictions
return_ad=True, # Return an anndata object.
rc = True, # Reverse complement the ref/alt predictions and average them.
)
making dataset
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.78it/s]
The result, odds, is an AnnData object (similar to SingleCellExperiment in R, but in Python). This might be a good point to pause and learn more about AnnData: https://anndata.readthedocs.io/en/latest/.
odds
AnnData object with n_obs × n_vars = 988 × 204
obs: 'snpid', 'chrom', 'pos', 'alt', 'ref', 'rsid', 'zscore', 'pval', 'nsum', 'neff', 'direction', 'eaf', 'beta', 'se'
var: 'cell type'
odds.var contains the cell types for which the model makes predictions.
odds.var.head(3)
| name | cell type |
|---|---|
| Follicular | Follicular |
| Fibro General | Fibro General |
| Acinar | Acinar |
odds.obs contains the variants.
odds.obs.head(3)
| | snpid | chrom | pos | alt | ref | rsid | zscore | pval | nsum | neff | direction | eaf | beta | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6:32630634_G_A | chr6 | 32630634 | G | A | 6:32630634 | 3.974476 | 0.000071 | 71639 | 71639.0 | ?+?+ | 0.2237 | 0.025194 | 0.006339 |
| 1 | 6:32630797_A_G | chr6 | 32630797 | A | G | 6:32630797 | 4.040244 | 0.000053 | 71639 | 71639.0 | ?+?+ | 0.2435 | 0.024866 | 0.006155 |
| 2 | 6:32630824_T_C | chr6 | 32630824 | T | C | 6:32630824 | 3.921736 | 0.000088 | 71639 | 71639.0 | ?+?+ | 0.1859 | 0.026630 | 0.006790 |
And odds.X contains the predicted effect size for each variant, stored in a numpy array of shape (n_variants × n_cell_types).
print(odds.X[:5, :5])
[[ 1.5886321 1.2478062 1.2845564 1.868789 1.3667169 ]
[ 0.06283956 0.11273362 0.17008997 0.2536258 0.35816932]
[-0.45841014 0.00426316 -0.52971315 -0.37011045 -0.2963973 ]
[ 0.26240113 0.33035216 0.67170584 0.3984024 0.46165606]
[-1.0053136 -0.6806445 -0.08496638 0.3818305 0.03284647]]
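If you prefer a labeled table, the effect-size matrix can be wrapped in a pandas DataFrame with one row per variant and one column per cell type. This is just a convenience step, not part of the gReLU API:

# Label the effect sizes: rows = variants (by snpid), columns = cell types
effects_df = pd.DataFrame(
    odds.X,
    index=odds.obs["snpid"].values,
    columns=odds.var_names,
)
effects_df.iloc[:3, :3]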
Select a variant with effect specific to microglia#
As an example, we search for variants that strongly disrupt accessibility in microglia but not in all cell types.
We set some arbitrary thresholds: the log2 fold change of the alt allele w.r.t the ref allele in microglia should be < -2, and the average log2 fold change across all cell types should be nonnegative.
This implies that the alt allele should decrease accessibility in microglia to < 1/4 of its reference value, but should not decrease accessibility across all cell types on average.
# Calculate the mean variant effect across all cell types
mean_variant_effect = odds.X.mean(1)
# Calculate the variant effect only in microglia
microglia_effect = odds[:, 'Microglia'].X.squeeze()
specific_variant_idx = np.where(
(microglia_effect < -2) & (mean_variant_effect >= 0)
)[0]
specific_variant_idx
array([498])
We will isolate this microglia-specific variant for further analysis.
variant = odds[specific_variant_idx, :]
variant
View of AnnData object with n_obs × n_vars = 1 × 204
obs: 'snpid', 'chrom', 'pos', 'alt', 'ref', 'rsid', 'zscore', 'pval', 'nsum', 'neff', 'direction', 'eaf', 'beta', 'se'
var: 'cell type'
variant.obs
| | snpid | chrom | pos | alt | ref | rsid | zscore | pval | nsum | neff | direction | eaf | beta | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 6:76659840_T_G | chr6 | 76659840 | T | G | rs138940043 | 4.009219 | 0.000061 | 363637 | 363637.0 | ??+? | 0.00656 | 0.058235 | 0.014525 |
Let’s look at the effect size (log2FC) for this variant:
variant_effect_size = variant[0, "Microglia"].X
variant_effect_size
ArrayView([[-2.5063705]], dtype=float32)
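To interpret this value, we can convert the log2 fold change back into a plain fold change; 2^-2.5 is roughly 0.18, meaning the alt allele is predicted to reduce accessibility in microglia to about 18% of its reference value.

# Convert the log2 fold change into a fold change for easier interpretation
fold_change = 2 ** variant_effect_size.item()
print(f"Predicted alt/ref accessibility ratio in microglia: {fold_change:.2f}")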
View importance scores in microglia for the bases surrounding the selected variant#
We can use several different approaches to calculate the importance of each base in the reference and alternate sequences. However, we want to score each base specifically by its importance to the model’s prediction in microglia, and not to the model’s predictions on other tasks.
To do this, we define a transform that operates on the model’s predictions and selects only the prediction for the Microglia task.
from grelu.transforms.prediction_transforms import Aggregate
microglia_score = Aggregate(tasks=["Microglia"], model=model)
Next, we extract the 200 bp long genomic sequences containing the reference and alternate alleles.
ref_seq, alt_seq = grelu.variant.variant_to_seqs(
seq_len=model.data_params['train_seq_len'],
genome='hg19',
**variant.obs.iloc[0][["chrom", "pos", "ref", "alt"]]
)
ref_seq, alt_seq
('CAAAGATATATGTCATAATAATTATCTTTTCACTTTCTTTGTTAGGGACCAGGAATGATAAACCACTTAGTCATTTTTTAGGTTTACAAGAACTTAAGGGGAACTAAGAAAGGAACCCTTACTCCTGAACTCTCAGCCTCATCTGTGCTGGACCATTCTAACTTTGTACCCTTTCATGAGATTGATATAATTTAGAAAAT',
'CAAAGATATATGTCATAATAATTATCTTTTCACTTTCTTTGTTAGGGACCAGGAATGATAAACCACTTAGTCATTTTTTAGGTTTACAAGAACTTAAGGTGAACTAAGAAAGGAACCCTTACTCCTGAACTCTCAGCCTCATCTGTGCTGGACCATTCTAACTTTGTACCCTTTCATGAGATTGATATAATTTAGAAAAT')
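As a quick sanity check, the two sequences should be identical except at the variant position, which lies at the center of the 200 bp window (position 99, 0-based); this is the same position we highlight in the plots below.

# Find the positions at which the ref and alt sequences differ
diff_positions = [i for i, (r, a) in enumerate(zip(ref_seq, alt_seq)) if r != a]
print(diff_positions)  # expect [99]: only the central base differs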
We are now ready to calculate the per-base importance scores. For this, we use the grelu.interpret.score module, which contains functions related to scoring the importance of each base in a sequence. We choose the “inputxgradient” method to score each base.
import grelu.interpret.score
ref_attrs = grelu.interpret.score.get_attributions(
model, ref_seq, prediction_transform=microglia_score, device=0,
seed=0, method="inputxgradient",
)
alt_attrs = grelu.interpret.score.get_attributions(
model, alt_seq, prediction_transform=microglia_score, device=0,
seed=0, method="inputxgradient",
)
We can visualize the attribution scores for the sequence, highlighting the mutated base. Here, we visualize the attributions for the central 100 bp of the 200 bp sequence.
import grelu.visualize
grelu.visualize.plot_attributions(
ref_attrs, start_pos=50, end_pos=150,
highlight_positions=[99], ticks=5,
edgecolor='red'
)
<logomaker.src.Logo.Logo at 0x7f8a721e2110>
grelu.visualize.plot_attributions(
alt_attrs, start_pos=50, end_pos=150,
highlight_positions=[99], ticks=5,
edgecolor='red',
)
<logomaker.src.Logo.Logo at 0x7f8a34129c50>
ISM#
We can also perform ISM (in silico mutagenesis) of the bases surrounding the variant to see what the effect would be if we mutated the reference allele to any other base. The ISM_predict function in grelu.interpret.score performs every possible single-base substitution on the given sequence, predicts the effect of each substitution, and optionally compares these predictions to the reference sequence to return an effect size for each substitution.
Once again, since we are interested in how important each base is to the model’s prediction in microglia, we use the microglia_score transform. This ensures that ISM will score each base’s importance to the microglial prediction only.
ism = grelu.interpret.score.ISM_predict(
ref_seq,
model,
prediction_transform=microglia_score, # Focus on the prediction in microglia
compare_func = "log2FC", # Return the log2FC of the mutated sequence prediction w.r.t the reference sequence
devices=0, # Index of the GPU to use
num_workers=8,
)
Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 26.28it/s]
ism
| | C | A | A | A | G | A | T | A | T | A | ... | T | T | T | A | G | A | A | A | A | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.035059 | 0.000000 | 0.000000 | 0.000000 | 0.175448 | 0.000000 | 0.068264 | 0.000000 | 0.279259 | 0.000000 | ... | 0.055060 | 0.271744 | -0.123421 | -0.000112 | 0.096319 | -0.000112 | -0.000112 | -0.000112 | -0.000112 | 0.025718 |
| C | 0.000000 | 0.033303 | 0.094729 | 0.088286 | 0.242377 | 0.156310 | 0.097453 | 0.381278 | 0.206919 | 0.283350 | ... | 0.102014 | -0.033543 | -0.023884 | 0.074331 | -0.127204 | 0.100797 | -0.022535 | 0.024198 | 0.077191 | 0.031655 |
| G | -0.043938 | 0.027918 | 0.074443 | 0.077228 | 0.000000 | 0.046280 | 0.066102 | 0.355099 | 0.204960 | -0.027415 | ... | 0.012160 | -0.067337 | -0.113562 | -0.052966 | -0.000112 | 0.149694 | -0.022391 | 0.045327 | -0.012053 | 0.024655 |
| T | 0.059993 | 0.051857 | 0.068118 | -0.242041 | 0.206191 | 0.037834 | 0.000000 | -0.063626 | 0.000000 | 0.225319 | ... | 0.000000 | 0.000000 | -0.000112 | -0.032892 | 0.082135 | -0.003328 | -0.192516 | 0.036364 | -0.064144 | -0.000112 |
4 rows × 200 columns
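Since the variant lies at position 99, we can pull out that single column of the ISM matrix to see the predicted log2 fold change for every possible base at the variant position (a small convenience step on the 4 × 200 dataframe shown above):

# Predicted effect of each possible base at the variant position (column 99)
ism.iloc[:, 99]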
grelu.visualize.plot_ISM(
ism, method="heatmap", center=0, figsize=(20, 1.5),
)
<Axes: >
grelu.visualize.plot_ISM(
ism, method='logo', figsize=(20, 1.5), highlight_positions=[99], edgecolor='red'
)
<logomaker.src.Logo.Logo at 0x7f8a2dece810>
Scan variant with JASPAR consensus motifs#
We now scan the sequences immediately around the variant to identify known TF motifs that may have been disrupted by the variant.
# Select the central bases of the sequence
import grelu.sequence.utils

central_ref_seq = grelu.sequence.utils.resize(ref_seq, 18)
central_alt_seq = grelu.sequence.utils.resize(alt_seq, 18)
central_ref_seq, central_alt_seq
('ACTTAAGGGGAACTAAGA', 'ACTTAAGGTGAACTAAGA')
We can use the grelu.interpret.motifs module to scan these sequences with TF motifs. Here, we use a reference set of non-redundant motifs (https://www.vierstra.org/resources/motif_clustering) that are provided with gReLU as consensus.
# Scan with motifs
import grelu.interpret.motifs
scan = grelu.interpret.motifs.compare_motifs(
ref_seq=central_ref_seq,
alt_seq=central_alt_seq,
motifs="consensus",
pthresh=5e-4,
)
scan
Read 637 motifs from file.
| sequence | motif | start | end | strand | alt | ref | foldChange |
|---|---|---|---|---|---|---|---|
| 0 | AC0176:MZF:C2H2_ZF | 13 | 6 | - | 0.000000 | 9.000000 | 0.0 |
| 1 | AC0177:ZNF_MZF:C2H2_ZF | 16 | 5 | - | 0.000000 | 8.431193 | 0.0 |
| 2 | AC0227:SPI_BCL11A:Ets | 5 | 16 | + | 0.000000 | 11.954545 | 0.0 |
| 3 | AC0622:ELF_SPIB:Ets | 5 | 14 | + | 0.000000 | 8.308943 | 0.0 |
| 4 | AC0079:ZNF:C2H2_ZF | 14 | 1 | - | 1.918367 | 0.000000 | inf |
The ref column contains the score for the motif in the reference allele-containing sequence, and the alt column contains the score in the alternate allele-containing sequence.
Note that the AC0622:ELF_SPIB:Ets motif is lost in the alternate allele-containing sequence.
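We can make this explicit by filtering the scan results for motifs that match the reference sequence but not the alternate sequence, using the ref and alt score columns shown above:

# Motifs that match the reference sequence but are lost in the alternate sequence
scan[(scan["ref"] > 0) & (scan["alt"] == 0)]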
Compare the variant impact to a background distribution#
We saw that the variant has a strong effect size (log2 fold change). To place this effect size in context, we create a set of background sequences by shuffling the sequence surrounding the variant while conserving dinucleotide frequency. We then insert the reference and alternate alleles in each shuffled sequence, and compute the variant effect size again.
test = grelu.variant.marginalize_variants(
model=model,
variants=variant.obs,
genome="hg19",
prediction_transform=microglia_score,
seed=0,
n_shuffles=100,
compare_func="log2FC",
rc=True, # Average the result over both strands
)
Predicting variant effects
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 0.48it/s]
Predicting variant effects in background sequences
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████| 4/4 [00:12<00:00, 0.33it/s]
Calculating background distributions
Performing 2-sided test
test
{'effect_size': [-2.504143476486206],
'mean': [-0.08852208405733109],
'sd': [0.4351801872253418],
'pvalue': [2.8427864717814608e-08]}
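For intuition, the reported p-value reflects how far the observed effect size lies from the mean of the background distribution. We can recompute the corresponding z-score directly from the returned dictionary:

# Distance of the observed effect from the background mean, in standard deviations
z = (test["effect_size"][0] - test["mean"][0]) / test["sd"][0]
print(f"z-score: {z:.2f}")  # roughly -5.6, consistent with the very small p-value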