Model-guided design of regulatory elements#

In this example, we use gReLU to perform model-guided design of accessible DNA elements for microglial cells.

import anndata
import numpy as np
import pandas as pd
import os
import importlib
import re

%matplotlib inline

Load the pre-trained Catlas model from the model zoo#

This is a binary classification model trained on snATAC-seq data from Catlas (http://catlas.org/humanenhancer/).

import grelu.resources
model = grelu.resources.load_model(project='human-atac-catlas', model_name="model")

View the model’s metadata#

model.data_params is a dictionary containing metadata about the data used to train the model. Let’s look at what information is stored:

for key in list(model.data_params.keys())[:21]:
    if key !="tasks":
        print(key, model.data_params[key])
train_bin_size 1
train_chroms ['chr18', 'chr5', 'chr1', 'chr17', 'chr9', 'chr19', 'chr8', 'chr21', 'chr14', 'chr20', 'chr22', 'chr10', 'chr15', 'chr12', 'chr16', 'chr3', 'chr4', 'chr11', 'chr2', 'chr6']
train_end both
train_genome hg38
train_label_aggfunc None
train_label_len 200
train_label_transform_func None
train_max_label_clip None
train_max_pair_shift 0
train_max_seq_shift 1
train_min_label_clip None
train_n_alleles 1
train_n_augmented 1
train_n_seqs 1022128
train_n_tasks 204
train_padded_label_len 200
train_padded_seq_len 202
train_predict False
train_rc True
train_seq_len 200

Note the parameter train_seq_len. This tells us that the model was trained on 200 bp long sequences.

model.data_params['tasks'] is a large dictionary containing metadata about the output tracks that the model predicts. We can collect these into a dataframe called tasks:

tasks = pd.DataFrame(model.data_params["tasks"])
tasks.head()
name cell type
0 Follicular Follicular
1 Fibro General Fibro General
2 Acinar Acinar
3 T Lymphocyte 1 (CD8+) T Lymphocyte 1 (CD8+)
4 T lymphocyte 2 (CD4+) T lymphocyte 2 (CD4+)

Example 1: Starting from a random sequence, evolve a sequence with high accessibility in microglia using directed evolution.#

Generate a random starting sequence#

import grelu.sequence.utils
start_seq = grelu.sequence.utils.generate_random_sequences(
    n=1, # number of sequences to generate
    seq_len=model.data_params['train_seq_len'], # Length of the generated sequence
    seed=0,
    output_format="strings"
)[0]
start_seq
'ATCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTCTTGTACCCTATGATTGTGTAGAAACCGAACTACGGTACCTCCTGTTGGTAGTCACGATAGATTATAAAAGTATGTTCCCACCCTATCGACGAGACTGGCATCCTAGGTGTTTGCGGTGTTGGTACGTGCGCAGGTATGTAAGAGTGGTAAACGA'

Create the objective to optimize - predictions in microglia#

In the first example, we want to create a sequence that has a high probability of accessibility in microglia (regardless of what it does in other cell types). We first must define the objective that we want to optimize. Such objective functions are defined in grelu.transforms.prediction_transforms. We pick the Aggregate class, which tells the model to aggregate predictions over a subset of output tasks or positions.

from grelu.transforms.prediction_transforms import Aggregate

microglia_score = Aggregate(
    tasks = ["Microglia"],
    model = model,
)

Run directed evolution to maximize the prediction in microglia, ignoring other cell types#

grelu.design contains algorithms for model-guided sequence design. We pick the evolve function, which performs directed evolution. Note that it selects the sequence with the highest objective function value at every iteration, which in this case is the highest predicted accessibility in microglia. If you wanted to instead select the sequence with the lowest prediction in microglia, you could add weight=-1 to the Aggregate object.

import grelu.design

output = grelu.design.evolve(
    [start_seq], # The initial sequences
    model, 
    prediction_transform=microglia_score, # Objective to optimize 
    max_iter=5, # Number of iterations for directed evolution
    num_workers=8,
    devices=0,
    return_seqs="all", # Return all the evolved sequences
)
Iteration 0
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.20it/s]

Best value at iteration 0: 0.194
Iteration 1
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 52.49it/s]
Best value at iteration 1: 0.449
Iteration 2
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 100.15it/s]
Best value at iteration 2: 0.655
Iteration 3
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 90.07it/s]
Best value at iteration 3: 0.896
Iteration 4
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 92.75it/s]
Best value at iteration 4: 0.976
Iteration 5
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 86.54it/s]
Best value at iteration 5: 0.991
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:00<00:00, 108.08it/s]

Examine the output#

The output of grelu.design.evolve is a dataframe which contains all the sequences generated during directed evolution, and the model’s predictions. It also contains the position that was mutated and the base that was substituted at that position.

output.head()
iter start_seq best_in_iter prediction_score seq_score total_score seq position allele Microglia
0 0 0 True 0.193559 0 0.193559 ATCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... NaN NaN 0.193414
1 1 0 False 0.173063 0 0.173063 CTCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 0.0 C 0.173063
2 1 0 False 0.176796 0 0.176796 GTCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 0.0 G 0.176796
3 1 0 False 0.185264 0 0.185264 TTCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 0.0 T 0.185264
4 1 0 False 0.180785 0 0.180785 AACATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 1.0 A 0.180785

We can visualize the model’s predictions on the sequences generated at each iteration:

import grelu.visualize
grelu.visualize.plot_evolution(output, outlier_size=.1)
../_images/cee6c6bc4cc0dc3ce05d28ccad465a98a8dce374a7106c8866fa37baf601d96b.png

We see that the predicted probability of accessibility in Microglia increases at each iteration.

Compare the initial and final sequence#

Let us take the best sequence from the final iteration (iteration 5):

end_seq = output[output.iter==5].sort_values('total_score').iloc[-1].seq
end_seq
'ATCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTCTTGTACCCTATGATTGGGTAGAAACCGAACTACGGTACTTCCTGTTGGTAGTCACGATAGATTATAAAAGTATGTTCCCGCCCTCTCGACGAGACTGGCATCCTAGGTGTTTGCGGTGTTGGTACGTGCGCAGTTATGTAAGAGTGGTAAACGA'

We can compare this to the sequence we started with to see what changes were made:

import grelu.sequence.mutate
mutated_positions = grelu.sequence.mutate.seq_differences(start_seq, end_seq, verbose=True)
Position: 62 Reference base: T Alternate base: G Reference sequence: GATTGTGTAG
Position: 84 Reference base: C Alternate base: T Reference sequence: GGTACCTCCT
Position: 125 Reference base: A Alternate base: G Reference sequence: TTCCCACCCT
Position: 130 Reference base: A Alternate base: C Reference sequence: ACCCTATCGA
Position: 179 Reference base: G Alternate base: T Reference sequence: CGCAGGTATG

Next, we will use the grelu.interpret module to understand why these changes were made. One way to do this is by calculating per-base importance scores for the start and end sequence. gReLU provides several methods to do this, including In Silico Mutagenesis (ISM), DeepShap, InputxGradient and Integrated Gradients. Let’s first try inputxgradient on the start and end sequences.

import grelu.interpret.score

start_attrs = grelu.interpret.score.get_attributions(
    model, start_seq, prediction_transform=microglia_score, device=0,
    method="inputxgradient",
)
end_attrs = grelu.interpret.score.get_attributions(
    model, end_seq, prediction_transform=microglia_score, device=0,
    method="inputxgradient",
)
grelu.visualize.plot_attributions(
    start_attrs, 
    highlight_positions=mutated_positions, # Highlight the mutated positions
    ticks=10,
)
<logomaker.src.Logo.Logo at 0x7f126438f1d0>
../_images/a868c912d616048a900951db0769174c12d1b639da4d027993edba73bd34c324.png
grelu.visualize.plot_attributions(
    end_attrs,
    highlight_positions=mutated_positions, # Highlight the mutated positions,
    ticks=10,
)
<logomaker.src.Logo.Logo at 0x7f136c567810>
../_images/1cebdfc9eb0dda2c4df99b607cfd73fa116f424bb83409aac3606455a1b815af.png

We see that several of the mutated bases are in regions of negative attributions, i.e. which reduce the model’s predictions.

Example 2: Use sequence constraints#

gReLU’s evolution function allows you to impose additional constraints upon the evolutionary process. For example, we can try to promote or avoid instances of a specific pattern or motif in our designed sequences. In this example, we suppose that we want to avoid occurrences of the sequence pattern CG or its reverse complement GC.

Let us count how many times this pattern occurs in our initial and final sequences.

unwanted_seqs = ["CG", "GC"]

n_initial = np.sum([len(re.findall(x, start_seq)) for x in unwanted_seqs])
n_final = np.sum([len(re.findall(x, end_seq)) for x in unwanted_seqs])

n_initial, n_final
(17, 19)

Create additional design objective: A sequence transform#

We see that although directed evolution improved the predicted accessibility of the sequence in microglia, it also increased the number of unwanted sequence patterns. So, we want to add a penalty for these patterns in our evolutionary process. We accomplish this by creating another transform, this time one that operates on the sequence instead of the model’s predictions. These transforms are found in the grelu.transforms.seq_transforms module.

Here, we use the PatternScore class, which counts the number of times given subsequences occur in a sequence, and assigns the sequence a score based on this count. If you want to instead count the number of matches of a motif, use the MotifScore class.

from grelu.transforms.seq_transforms import PatternScore

We create an instance of this class that counts the occurrences of the unwanted subsequences and assigns a score of -0.1 for each occurrence. The negative sign means that sequences containing these subsequences will be penalized during evolution.

gc_penalizer = PatternScore(
    patterns = unwanted_seqs, # Patterns to count
    weights=[-.1, -.1], # Score to assign to each occurrence of each pattern
)

Run directed evolution including the sequence penalty#

output = grelu.design.evolve(
    [start_seq], # The initial sequences
    model, 
    prediction_transform=microglia_score, # Prediction objective to optimize 
    seq_transform=gc_penalizer, # Sequence objective to optimize
    max_iter=5, # Number of iterations for directed evolution
    num_workers=8,
    devices=0,
    return_seqs="all", # Return all the evolved sequences
)
Iteration 0
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.16it/s]
Best value at iteration 0: -1.506
Iteration 1
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.69it/s]
Best value at iteration 1: -1.251
Iteration 2
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 85.23it/s]
Best value at iteration 2: -1.045
Iteration 3
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 85.19it/s]
Best value at iteration 3: -0.810
Iteration 4
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 87.44it/s]
Best value at iteration 4: -0.638
Iteration 5
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 91.66it/s]
Best value at iteration 5: -0.502
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:00<00:00, 106.15it/s]

Examine results#

Let’s examine the new optimized sequence and see whether it has fewer instances of GC and CG.

end_seq = output[output.iter==5].sort_values('total_score').iloc[-1].seq
np.sum([len(re.findall(x, end_seq)) for x in unwanted_seqs])
14

We see that the number of GC/CG occurrences has been reduced. When we visualize the output of evolution, we see that both the prediction objective (accessibility in microglia) and the sequence objective have been optimized:

grelu.visualize.plot_evolution(output, outlier_size=.1)
../_images/f7ee253d278684cc3bf05e8d1db325ec28bb295ba7593188df7aec6ffc689a74.png

Example 3: Evolve a real genomic sequence toward microglia-specific accessibility using directed evolution#

In this example, we want to evolve a sequence that has a high predicted probability of being accessible in microglia, but low predicted probability of accessibility in other cell types (e.g. neurons). To increase the realism of the final sequence, we also want to start with a set of real genomic sequences instead of random sequences. These should be sequences that were not seen by the model during training or validation.

We can examine the model’s data_params to see which chromosomes were used for training and validation:

Select genomic starting sequences#

import grelu.data.utils

chromosomes_seen_by_model = model.data_params["train_chroms"] + model.data_params["val_chroms"]
all_autosomes = grelu.data.utils.get_chromosomes("autosomes")
set(all_autosomes).difference(chromosomes_seen_by_model)
{'chr13'}

We find that chromosome 13 was not seen by the model during training. We can now generate some random regions from this chromosome:

from grelu.io.genome import read_sizes
sizes = read_sizes(genome="hg38")
chr_size = sizes.loc[sizes.chrom=="chr13", "size"].values[0]
chr_size
114364328
rng = np.random.RandomState(0)

starts = rng.randint(low=1000000, high=chr_size-1000000, size=10)
start_intervals = pd.DataFrame({
    "chrom":"chr11",
    "start": starts,
    "end": starts + model.data_params["train_seq_len"],
})

start_intervals
chrom start end
0 chr11 76434668 76434868
1 chr11 3215104 3215304
2 chr11 39712131 39712331
3 chr11 61969723 61969923
4 chr11 59554051 59554251
5 chr11 16039847 16040047
6 chr11 75753033 75753233
7 chr11 90739541 90739741
8 chr11 41294130 41294330
9 chr11 1374564 1374764

Define design objective: cell type-specific accessibility in microglia#

We now want to optimize a more complicated objective. We want to create sequences that have a high probability of accessibility in microglia but low probability of accessibility in neurons. We begin by identifying neuron-related cell types in the model’s output:

neuron_tasks = tasks.name[tasks.name.str.contains("Neuron")]
print(neuron_tasks)
51                 Enteric Neuron
109     Fetal Excitatory Neuron 3
110     Fetal Excitatory Neuron 4
111     Fetal Excitatory Neuron 5
112     Fetal Excitatory Neuron 6
113     Fetal Excitatory Neuron 7
114     Fetal Excitatory Neuron 8
125          Fetal Adrenal Neuron
145          Fetal Retinal Neuron
146          Fetal Enteric Neuron
148        Fetal Placental Neuron
164     Fetal Inhibitory Neuron 1
165     Fetal Inhibitory Neuron 2
166     Fetal Excitatory Neuron 9
167    Fetal Excitatory Neuron 10
202     Fetal Excitatory Neuron 1
203     Fetal Excitatory Neuron 2
Name: name, dtype: object

We now define our objective using the Specificity class of transforms. This class takes the model’s predictions and calculates a score by comparing predictions in defined on-target tasks (in this case Microglia) and off-target tasks (in this case, neurons). We can also specify a relative weight for the off-target tasks, how to aggregate the on- and off-target predictions, and soft thresholds for off-target expression.

from grelu.transforms.prediction_transforms import Specificity

mcg_specificity = Specificity(
    on_tasks = "Microglia", # We want high predictions in these
    off_tasks = neuron_tasks, # We want low predictions in these
    on_aggfunc = "mean", 
    off_aggfunc = "max", # Compare the on-target prediction to the highest prediction in any off-target task.
    model=model,
    compare_func="subtract", # Each sequence will get a score which is the ratio of the on-target and off-target scores
)

Run directed evolution#

Note that here, we start with multiple sequences and set for_each=False. This means that the model will evaluate all the starting sequences, choose the best one (with highest microglia-specific predicted accessibility) and continue with that sequence. If we set for_each=True instead, we would run directed evolution independently on each starting sequence. That option can be used to generate a more diverse set of evolved sequences.

output = grelu.design.evolve(
    start_intervals, # Start from the natural sequences
    model, 
    genome="hg38",
    prediction_transform=mcg_specificity, # Optimize the specific accessibility score
    max_iter=5,
    num_workers=8,
    devices=0,
    return_seqs="all",
    for_each=False,
)
Iteration 0
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.15it/s]

Best value at iteration 0: 0.094
Iteration 1
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 55.15it/s]
Best value at iteration 1: 0.284
Iteration 2
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 84.12it/s]
Best value at iteration 2: 0.436
Iteration 3
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 85.93it/s]
Best value at iteration 3: 0.549
Iteration 4
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 74.50it/s]
Best value at iteration 4: 0.677
Iteration 5
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 90.54it/s]
Best value at iteration 5: 0.744
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:00<00:00, 100.07it/s]

Visualize the output#

We can plot the accessibility of the sequences at each iteration:

grelu.visualize.plot_evolution(output, figsize=(6, 14), outlier_size=0.2)
../_images/fd77364f17bfee00397fe358d2a54d274632eebf2e1d0d35e38df55552e0f51a.png

Compare start and end sequences#

start_seq = output[output.iter==0].sort_values('total_score').iloc[-1].seq
end_seq = output[output.iter==5].sort_values('total_score').iloc[-1].seq

mutated_positions = grelu.sequence.mutate.seq_differences(start_seq, end_seq, verbose=True)
Position: 68 Reference base: G Alternate base: A Reference sequence: GCGGAGGTTG
Position: 111 Reference base: C Alternate base: A Reference sequence: CTGGGCGACA
Position: 149 Reference base: A Alternate base: G Reference sequence: AAAGAAGAAG
Position: 160 Reference base: A Alternate base: G Reference sequence: AAAAAAAGAA
Position: 170 Reference base: G Alternate base: A Reference sequence: GTGGAGGTTC

This time, let’s use Integrated Gradients to visualize the importance of these bases that were mutated at each iteration of evolution. Note also that we supply prediction_transform=mcg_specificity, so that we calculate the importance of each base to the specificity of accessibility in microglia.

start_attrs = grelu.interpret.score.get_attributions(
    model, start_seq, prediction_transform=mcg_specificity, device=0, method="integratedgradients",
)
end_attrs = grelu.interpret.score.get_attributions(
    model, end_seq, prediction_transform=mcg_specificity, device=0, method="integratedgradients",
)
grelu.visualize.plot_attributions(start_attrs, highlight_positions=mutated_positions)
<logomaker.src.Logo.Logo at 0x7f138c110fd0>
../_images/53b8231abe26ef37876048f0f32dc3f93879a454073fc8378833d214bc8dc97c.png
grelu.visualize.plot_attributions(end_attrs, highlight_positions=mutated_positions)
<logomaker.src.Logo.Logo at 0x7f138c083f50>
../_images/8b2db54596fd9bfc6a58f1f6c8b12361ed08d71920383af08d9ef93f8272d1f4.png

It seems that the evolution process has created new motifs. We can identify the specific motifs that were altered by scanning the reference and alternate sequence with consensus JASPAR motifs and comparing the output.

import grelu.interpret.motifs

comparison = grelu.interpret.motifs.compare_motifs(
    ref_seq = start_seq,
    alt_seq = end_seq,
    motifs="hocomoco_v12",
    pthresh=5e-4,
    rc=True, # Scan both strands of the sequence
)
comparison.head()
sequence motif start end strand alt ref foldChange
0 ZN362.H12CORE.0.P.C 154 177 + -3.969035 0.0 -inf
1 Z585B.H12CORE.0.P.C 147 171 + -14.712123 0.0 -inf
2 ZN565.H12CORE.0.P.C 43 74 + -5.353975 0.0 -inf
3 ZN778.H12CORE.1.P.B 98 129 - -11.470738 0.0 -inf
4 ZN611.H12CORE.1.P.C 105 130 + -21.699276 0.0 -inf
comparison[comparison.motif=="SPI1.H12CORE.0.P.B"].sort_values("start")
sequence motif start end strand alt ref foldChange
837 SPI1.H12CORE.0.P.B 59 73 + 7.007089 0.000000 inf
694 SPI1.H12CORE.0.P.B 143 157 + 11.429639 6.883668 1.660400
318 SPI1.H12CORE.0.P.B 155 169 + 12.444161 14.239082 0.873944
890 SPI1.H12CORE.0.P.B 161 175 + 5.971476 0.000000 inf

The ref column contains the score for the motif in the original sequence and the alt column contains the score in the end sequence. We see that the evolution process has created new instances of the SPI1.H12CORE.0.P.B motif, while leaving the already existing instance unchanged.

Example 4: Evolution by motif scanning#

So far, we have been performing directed evolution by in silico mutagenesis: at each iteration, we create all possible single-base substitutions and choose the best one. Another approach is to insert a given motif into every possible position in the starting sequence, and choose the best one. We can run this approach by choosing method="pattern" in grelu.design.evolve.

Let’s start by getting the sequence of the SPI1.H12CORE.0.P.B motif.

import grelu.io.motifs

motifs = grelu.io.motifs.read_meme_file("hocomoco_v12", names=["SPI1.H12CORE.0.P.B"])
patterns = grelu.interpret.motifs.motifs_to_strings(motifs)

print(patterns)
['AAAAGAGGAAGTGA']

Run evolution#

We will now try to run the evolution function for microglia-specific accessibility, this time by scanning the starting sequence with the motif and finding the best position to substitute in this motif. You can do this with multiple motifs with different weights or penalties.

In addition, we demonstrate another constraint in the grelu.design.evolve function; you can constrain which positions can be altered, whether by ISM or by motif substitution. Here, we will specify that the motif should only be substituted in the first 100 bases of the sequence.

output = grelu.design.evolve(
    start_intervals, # Start from the natural sequences
    model, 
    genome="hg38",
    prediction_transform=mcg_specificity, # Optimize the specific accessibility score
    max_iter=2, # Since we are modifying a whole motif, we only run 2 iterations
    method="pattern", # Evolution by pattern substitution
    patterns=patterns, # SPIB motif
    num_workers=8,
    devices=0,
    return_seqs="all",
    for_each=False,
    positions=range(100), # Positions allowed for substitution
)
Iteration 0
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.59it/s]
Best value at iteration 0: 0.094
Iteration 1
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 42.38it/s]
Best value at iteration 1: 0.851
Iteration 2
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 69.87it/s]
Best value at iteration 2: 0.954
Predicting DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 61.69it/s]
grelu.visualize.plot_evolution(output, figsize=(5,10))
../_images/69b82a03e4db2e3df01761eecab4abe56265e65a48a5764edab67e2c5564614b.png

Let’s see where the motifs were inserted:

end_seq = output[output.iter==2].sort_values('total_score').iloc[-1].seq
end_attrs = grelu.interpret.score.get_attributions(
    model, end_seq, prediction_transform=mcg_specificity, device=0, method="integratedgradients",
)
grelu.visualize.plot_attributions(end_attrs)
<logomaker.src.Logo.Logo at 0x7f99307c1cd0>
../_images/2056d9ce0ef69249cb360945df4b539958ea1d77565d9e6db854909ff3010d2e.png

Example 5: Ledidi#

We also offer the Ledidi method by Schreiber et al. (2020) (https://www.biorxiv.org/content/10.1101/2020.05.21.109686v1). This is a gradient-based approach for sequence optimization. Here, we run Ledidi for the same objective as in example 3.

Ledidi requires a single starting sequence.

start_seq
'GCGTGCGCCTTTAATCCTGGCTACTAAGGAGGCCGAGGAAGGAGAATAGCTTGACCTCAGGAGGCGGAGGTTGCAGTGAGCTGAGATTGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAGAAGAAGAAAAAAAAGAAGTGGAGGTTCAGAGGTTTGGAAAGAAACAGACCAA'
output = grelu.design.ledidi(
    start_seq,
    model,
    prediction_transform=mcg_specificity,
    max_iter=2000,
    devices=0,
    lr=5e-3,
)
iter=I	input_loss= 0.0	output_loss=-0.09399	total_loss=-9.399	time=0.0282
iter=100	input_loss=0.09375	output_loss=-0.09989	total_loss=-9.895	time=1.395
iter=200	input_loss=0.09375	output_loss=-0.09334	total_loss=-9.241	time=1.326
iter=300	input_loss=0.4375	output_loss=-0.09927	total_loss=-9.49	time=1.314
iter=400	input_loss=1.438	output_loss=-0.1364	total_loss=-12.21	time=1.305
iter=500	input_loss=6.594	output_loss=-0.4359	total_loss=-37.0	time=1.318
iter=600	input_loss=12.66	output_loss=-0.7089	total_loss=-58.24	time=1.375
iter=700	input_loss=16.19	output_loss=-0.8076	total_loss=-64.57	time=1.301
iter=800	input_loss=15.78	output_loss=-0.8285	total_loss=-67.07	time=1.317
iter=900	input_loss=16.53	output_loss=-0.8214	total_loss=-65.61	time=1.321
iter=1000	input_loss=17.22	output_loss=-0.8737	total_loss=-70.15	time=1.323
iter=1100	input_loss=16.03	output_loss=-0.8872	total_loss=-72.69	time=1.343
iter=1200	input_loss=16.75	output_loss=-0.9087	total_loss=-74.12	time=1.292
iter=1300	input_loss=18.22	output_loss=-0.9163	total_loss=-73.42	time=1.276
iter=1400	input_loss=17.09	output_loss=-0.9221	total_loss=-75.12	time=1.375
iter=1500	input_loss=16.81	output_loss=-0.9323	total_loss=-76.42	time=1.281
iter=1600	input_loss=17.06	output_loss=-0.9418	total_loss=-77.12	time=1.268
iter=1700	input_loss=17.28	output_loss=-0.9428	total_loss=-77.0	time=1.275
iter=1800	input_loss=16.22	output_loss=-0.9512	total_loss=-78.9	time=1.341
iter=1900	input_loss=15.84	output_loss=-0.9442	total_loss=-78.57	time=1.312
iter=2000	input_loss=15.84	output_loss=-0.9331	total_loss=-77.46	time=1.313

This returns a list of optimized sequences. Let’s look at the first sequence in the list:

end_seq = output[0]
end_seq
'GCGTGCGGCTTTAATCCTGGCTACTAAGGAGGATGAGGAAGGAGAATAGCTTGACCACAGAATGAGGAAGTTGCAGTGAGCTGAGGTTGCGAAACTGCACTCCAGCCTGGGTGACAGAGCCAGACTCCGTCTCAAAAAAGAAAAAAAGCAGAAGAAAAAGGAGAAGTGGAGGTTCAGAGGTTTGGAAAGAAACAGACCAA'
mutated_positions = grelu.sequence.mutate.seq_differences(start_seq, end_seq)
Position: 7 Reference base: C Alternate base: G Reference sequence: GTGCGCCTTT
Position: 32 Reference base: C Alternate base: A Reference sequence: GGAGGCCGAG
Position: 33 Reference base: C Alternate base: T Reference sequence: GAGGCCGAGG
Position: 56 Reference base: T Alternate base: A Reference sequence: TGACCTCAGG
Position: 60 Reference base: G Alternate base: A Reference sequence: CTCAGGAGGC
Position: 62 Reference base: G Alternate base: T Reference sequence: CAGGAGGCGG
Position: 64 Reference base: C Alternate base: A Reference sequence: GGAGGCGGAG
Position: 68 Reference base: G Alternate base: A Reference sequence: GCGGAGGTTG
Position: 85 Reference base: A Alternate base: G Reference sequence: CTGAGATTGC
Position: 91 Reference base: C Alternate base: A Reference sequence: TTGCGCCACT
Position: 92 Reference base: C Alternate base: A Reference sequence: TGCGCCACTG
Position: 111 Reference base: C Alternate base: T Reference sequence: CTGGGCGACA
Position: 120 Reference base: G Alternate base: C Reference sequence: AGAGCGAGAC
Position: 139 Reference base: A Alternate base: G Reference sequence: AAAAAAAAAA
Position: 148 Reference base: A Alternate base: C Reference sequence: AAAAGAAGAA
Position: 159 Reference base: A Alternate base: G Reference sequence: AAAAAAAAGA
Position: 160 Reference base: A Alternate base: G Reference sequence: AAAAAAAGAA