Model-guided design of regulatory elements#

In this example, we use gReLU to perform model-guided design of accessible DNA elements for microglial cells.

import anndata
import numpy as np
import pandas as pd
import os
import importlib
import re

%matplotlib inline

Load the pre-trained Catlas model from the model zoo#

This is a binary classification model trained on snATAC-seq data from Catlas (http://catlas.org/humanenhancer/).

import grelu.resources
model = grelu.resources.load_model(project='human-atac-catlas', model_name="model")

View the model’s metadata#

model.data_params is a dictionary containing metadata about the data used to train the model. Let’s look at what information is stored:

model.data_params.keys()
dict_keys(['tasks', 'train', 'val', 'test'])
for k, v in model.data_params['train'].items():
    if k !='intervals':
        print(k, v)
bin_size 1
end both
genome hg38
label_aggfunc None
label_len 200
label_transform_func None
max_label_clip None
max_pair_shift 0
max_seq_shift 2
min_label_clip None
n_alleles 1
n_augmented 1
n_seqs 977014
n_tasks 204
padded_label_len 200
padded_seq_len 204
predict False
rc True
seq_len 200

Note the parameter seq_len. This tells us that the model was trained on 200 bp long sequences.

model.data_params['tasks'] is a large dictionary containing metadata about the output tracks that the model predicts. We can collect these into a dataframe called tasks:

tasks = pd.DataFrame(model.data_params["tasks"])
tasks.head(3)
name cell type
0 Follicular Follicular
1 Fibro General Fibro General
2 Acinar Acinar

model.data_params['train']['intervals'] is a dictionary containing the genomic intervals used to train the model. Similarly, model.data_params['test']['intervals'] contains genomic intervals in the test set. We can collect these into a dataframe called test_intervals:

test_intervals = pd.DataFrame(model.data_params['test']['intervals'])
test_intervals.head(3)
chrom start end cre_class in_fetal in_adult cre_module width cre_idx enformer_split split
0 chr1 143497510 143497710 Promoter Proximal no yes 113 400 53530 test test
1 chr1 143498052 143498252 Promoter Proximal no yes 4 400 53531 test test
2 chr1 143498633 143498833 Promoter yes no 46 400 53532 test test

Example 1: Starting from a random sequence, evolve a sequence with high accessibility in microglia using directed evolution.#

Generate a random starting sequence#

import grelu.sequence.utils
start_seq = grelu.sequence.utils.generate_random_sequences(
    n=1, # number of sequences to generate
    seq_len=model.data_params['train']['seq_len'], # Length of the generated sequence
    seed=0,
    output_format="strings"
)[0]
start_seq
'ATCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTCTTGTACCCTATGATTGTGTAGAAACCGAACTACGGTACCTCCTGTTGGTAGTCACGATAGATTATAAAAGTATGTTCCCACCCTATCGACGAGACTGGCATCCTAGGTGTTTGCGGTGTTGGTACGTGCGCAGGTATGTAAGAGTGGTAAACGA'

Create the objective to optimize - predictions in microglia#

In the first example, we want to create a sequence that has a high probability of accessibility in microglia (regardless of what it does in other cell types). We first must define the objective that we want to optimize. Such objective functions are defined in grelu.transforms.prediction_transforms. We pick the Aggregate class, which tells the model to aggregate predictions over a subset of output tasks or positions.

from grelu.transforms.prediction_transforms import Aggregate

microglia_score = Aggregate(
    tasks = ["Microglia"],
    model = model,
)

Run directed evolution to maximize the prediction in microglia, ignoring other cell types#

grelu.design contains algorithms for model-guided sequence design. We pick the evolve function, which performs directed evolution. Note that it selects the sequence with the highest objective function value at every iteration, which in this case is the highest predicted accessibility in microglia. If you wanted to instead select the sequence with the lowest prediction in microglia, you could add weight=-1 to the Aggregate object.

import grelu.design

output = grelu.design.evolve(
    [start_seq], # The initial sequences
    model, 
    prediction_transform=microglia_score, # Objective to optimize 
    max_iter=5, # Number of iterations for directed evolution
    num_workers=8,
    devices=0,
    return_seqs="all", # Return all the evolved sequences
)
Iteration 0
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.00it/s]
Best value at iteration 0: 0.241
Iteration 1
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 25.95it/s]
Best value at iteration 1: 0.352
Iteration 2
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 42.68it/s]
Best value at iteration 2: 0.621
Iteration 3
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 42.48it/s]
Best value at iteration 3: 0.820
Iteration 4
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 41.97it/s]
Best value at iteration 4: 0.912
Iteration 5
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 41.64it/s]
Best value at iteration 5: 0.957
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 47/47 [00:01<00:00, 43.55it/s]

Examine the output#

The output of grelu.design.evolve is a dataframe which contains all the sequences generated during directed evolution, and the model’s predictions. It also contains the position that was mutated and the base that was substituted at that position.

output.head()
iter start_seq best_in_iter prediction_score seq_score total_score seq position allele Microglia
0 0 0 True 0.240779 0 0.240779 ATCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... NaN NaN 0.240672
1 1 0 False 0.234935 0 0.234935 CTCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 0.0 C 0.234935
2 1 0 False 0.236990 0 0.236990 GTCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 0.0 G 0.236990
3 1 0 False 0.240781 0 0.240781 TTCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 0.0 T 0.240781
4 1 0 False 0.236180 0 0.236180 AACATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTC... 1.0 A 0.236180

We can visualize the model’s predictions on the sequences generated at each iteration:

import grelu.visualize
grelu.visualize.plot_evolution(output, outlier_size=.1)
../_images/5f3de48cb3c4c5b212cfb0d129f5c62fcc30a5b29374952c382b2195891ce89b.png

We see that the predicted probability of accessibility in Microglia increases at each iteration.

Compare the initial and final sequence#

Let us take the best sequence from the final iteration (iteration 5):

end_seq = output[output.iter==5].sort_values('total_score').iloc[-1].seq
end_seq
'ATCATTTTCTCGATGAAAGCGTTGACCCCACCTATCGTTAGTACTCTTGTACCCTATGATTGGGTAGAAACCGAACTACGGTACTTCCTGTTGGTAGTCACGATAGATTATAAAAGTATGTTCCCGCCCTATCGACGAGACTGGCATCCTAGGTGTTTGCGGTGTTGGTACGTGCGCAGGTCTGTAAGAGTGGTAAACGA'

We can compare this to the sequence we started with to see what changes were made:

import grelu.sequence.mutate
mutated_positions = grelu.sequence.mutate.seq_differences(start_seq, end_seq, verbose=True)
Position: 31 Reference base: A Alternate base: C Reference sequence: CCCACATATC
Position: 62 Reference base: T Alternate base: G Reference sequence: GATTGTGTAG
Position: 84 Reference base: C Alternate base: T Reference sequence: GGTACCTCCT
Position: 125 Reference base: A Alternate base: G Reference sequence: TTCCCACCCT
Position: 181 Reference base: A Alternate base: C Reference sequence: CAGGTATGTA

Next, we will use the grelu.interpret module to understand why these changes were made. One way to do this is by calculating per-base importance scores for the start and end sequence. gReLU provides several methods to do this, including In Silico Mutagenesis (ISM), DeepShap, InputxGradient and Integrated Gradients. Let’s first try inputxgradient on the start and end sequences.

import grelu.interpret.score

start_attrs = grelu.interpret.score.get_attributions(
    model, start_seq, prediction_transform=microglia_score, device=0,
    method="inputxgradient",
)
end_attrs = grelu.interpret.score.get_attributions(
    model, end_seq, prediction_transform=microglia_score, device=0,
    method="inputxgradient",
)
grelu.visualize.plot_attributions(
    start_attrs, 
    highlight_positions=mutated_positions, # Highlight the mutated positions
    ticks=10,
)
<logomaker.src.Logo.Logo at 0x7f37d8157910>
../_images/3fcacda4dec3574c5a4f7c51ad131d09678fe35a464a6910a9c92f99d02fd355.png
grelu.visualize.plot_attributions(
    end_attrs,
    highlight_positions=mutated_positions, # Highlight the mutated positions,
    ticks=10,
)
<logomaker.src.Logo.Logo at 0x7f37d3f11590>
../_images/f4b512660f5a41c1d71a2c4e86c894da65254a36adee776e74a601e4ce046c29.png

We see that several of the mutated bases are in regions of negative attributions, i.e. which reduce the model’s predictions.

Example 2: Use sequence constraints#

gReLU’s evolution function allows you to impose additional constraints upon the evolutionary process. For example, we can try to promote or avoid instances of a specific pattern or motif in our designed sequences. In this example, we suppose that we want to avoid occurrences of the sequence pattern CG or its reverse complement GC.

Let us count how many times this pattern occurs in our initial and final sequences.

unwanted_seqs = ["CG", "GC"]

n_initial = np.sum([len(re.findall(x, start_seq)) for x in unwanted_seqs])
n_final = np.sum([len(re.findall(x, end_seq)) for x in unwanted_seqs])

n_initial, n_final
(17, 19)

Create additional design objective: A sequence transform#

We see that although directed evolution improved the predicted accessibility of the sequence in microglia, it also increased the number of unwanted sequence patterns. So, we want to add a penalty for these patterns in our evolutionary process. We accomplish this by creating another transform, this time one that operates on the sequence instead of the model’s predictions. These transforms are found in the grelu.transforms.seq_transforms module.

Here, we use the PatternScore class, which counts the number of times given subsequences occur in a sequence, and assigns the sequence a score based on this count. If you want to instead count the number of matches of a motif, use the MotifScore class.

from grelu.transforms.seq_transforms import PatternScore

We create an instance of this class that counts the occurrences of the unwanted subsequences and assigns a score of -0.1 for each occurrence. The negative sign means that sequences containing these subsequences will be penalized during evolution.

gc_penalizer = PatternScore(
    patterns = unwanted_seqs, # Patterns to count
    weights=[-.1, -.1], # Score to assign to each occurrence of each pattern
)

Run directed evolution including the sequence penalty#

output = grelu.design.evolve(
    [start_seq], # The initial sequences
    model, 
    prediction_transform=microglia_score, # Prediction objective to optimize 
    seq_transform=gc_penalizer, # Sequence objective to optimize
    max_iter=5, # Number of iterations for directed evolution
    num_workers=8,
    devices=0,
    return_seqs="all", # Return all the evolved sequences
)
Iteration 0
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00, 39.77it/s]
Best value at iteration 0: -1.459
Iteration 1
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 40.23it/s]
Best value at iteration 1: -1.237
Iteration 2
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 41.28it/s]
Best value at iteration 2: -1.048
Iteration 3
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 41.86it/s]
Best value at iteration 3: -0.878
Iteration 4
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 41.19it/s]
Best value at iteration 4: -0.749
Iteration 5
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 40.75it/s]
Best value at iteration 5: -0.638
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 47/47 [00:01<00:00, 44.68it/s]

Examine results#

Let’s examine the new optimized sequence and see whether it has fewer instances of GC and CG.

end_seq = output[output.iter==5].sort_values('total_score').iloc[-1].seq
np.sum([len(re.findall(x, end_seq)) for x in unwanted_seqs])
9

We see that the number of GC/CG occurrences has been reduced. When we visualize the output of evolution, we see that both the prediction objective (accessibility in microglia) and the sequence objective have been optimized:

grelu.visualize.plot_evolution(output, outlier_size=.1)
../_images/aecc8045f2cb3564ec59530f352ae7205c250f3f58f1f121e8f967424fe1f672.png

Example 3: Evolve a real genomic sequence toward microglia-specific accessibility using directed evolution#

In this example, we want to evolve a sequence that has a high predicted probability of being accessible in microglia, but low predicted probability of accessibility in other cell types (e.g. neurons). To increase the realism of the final sequence, we also want to start with a set of real genomic sequences instead of random sequences. These should be sequences that were not seen by the model during training or validation.

We can examine the model’s data_params to see which chromosomes were used for training and validation:

Select genomic starting sequences#

We now select some random genomic regions from the model’s test set:

start_intervals = test_intervals.sample(10, random_state=0)

start_intervals
chrom start end cre_class in_fetal in_adult cre_module width cre_idx enformer_split split
14377 chr14 29731170 29731370 Distal no yes 122 400 307182 test test
714 chr10 37865983 37866183 Distal yes yes 14 400 115934 test test
20226 chr14 49376887 49377087 Distal no yes 122 400 313045 test test
63957 chr3 183215891 183216091 Distal yes yes 34 400 733910 test test
49631 chr19 10618186 10618386 Distal yes yes 63 400 482008 test test
47836 chr15 25797712 25797912 Distal yes yes 143 400 341079 test test
54616 chr19 38034528 38034728 Distal yes yes 15 400 491075 test test
63385 chr22 22030660 22030860 Promoter yes yes 13 400 644921 test test
41067 chr14 95312376 95312576 Distal yes no 24 400 333904 test test
68642 chr3 192483010 192483210 Distal no yes 30 400 738595 test test

Define design objective: cell type-specific accessibility in microglia#

We now want to optimize a more complicated objective. We want to create sequences that have a high probability of accessibility in microglia but low probability of accessibility in neurons. We begin by identifying neuron-related cell types in the model’s output:

neuron_tasks = tasks.name[tasks.name.str.contains("Neuron")]
print(neuron_tasks)
51                 Enteric Neuron
109     Fetal Excitatory Neuron 3
110     Fetal Excitatory Neuron 4
111     Fetal Excitatory Neuron 5
112     Fetal Excitatory Neuron 6
113     Fetal Excitatory Neuron 7
114     Fetal Excitatory Neuron 8
125          Fetal Adrenal Neuron
145          Fetal Retinal Neuron
146          Fetal Enteric Neuron
148        Fetal Placental Neuron
164     Fetal Inhibitory Neuron 1
165     Fetal Inhibitory Neuron 2
166     Fetal Excitatory Neuron 9
167    Fetal Excitatory Neuron 10
202     Fetal Excitatory Neuron 1
203     Fetal Excitatory Neuron 2
Name: name, dtype: object

We now define our objective using the Specificity class of transforms. This class takes the model’s predictions and calculates a score by comparing predictions in defined on-target tasks (in this case Microglia) and off-target tasks (in this case, neurons). We can also specify a relative weight for the off-target tasks, how to aggregate the on- and off-target predictions, and soft thresholds for off-target expression.

from grelu.transforms.prediction_transforms import Specificity

mcg_specificity = Specificity(
    on_tasks = "Microglia", # We want high predictions in these
    off_tasks = neuron_tasks, # We want low predictions in these
    on_aggfunc = "mean", 
    off_aggfunc = "max", # Compare the on-target prediction to the highest prediction in any off-target task.
    model=model,
    compare_func="subtract", # Each sequence will get a score which is the ratio of the on-target and off-target scores
)

Run directed evolution#

Note that here, we start with multiple sequences and set for_each=False. This means that the model will evaluate all the starting sequences, choose the best one (with highest microglia-specific predicted accessibility) and continue with that sequence. If we set for_each=True instead, we would run directed evolution independently on each starting sequence. That option can be used to generate a more diverse set of evolved sequences.

output = grelu.design.evolve(
    start_intervals, # Start from the natural sequences
    model, 
    genome="hg38",
    prediction_transform=mcg_specificity, # Optimize the specific accessibility score
    max_iter=5,
    num_workers=8,
    devices=0,
    return_seqs="all",
    for_each=False,
)
Iteration 0
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.46it/s]
Best value at iteration 0: 0.284
Iteration 1
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 37.94it/s]
Best value at iteration 1: 0.809
Iteration 2
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 37.63it/s]
Best value at iteration 2: 0.918
Iteration 3
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 38.95it/s]
Best value at iteration 3: 0.956
Iteration 4
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 38.43it/s]
Best value at iteration 4: 0.967
Iteration 5
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 37.82it/s]
Best value at iteration 5: 0.976
Predicting DataLoader 0: 100%|███████████████████████████████████████████████| 48/48 [00:01<00:00, 43.71it/s]

Visualize the output#

We can plot the accessibility of the sequences at each iteration:

grelu.visualize.plot_evolution(output, figsize=(6, 14), outlier_size=0.2)
../_images/b72fc039fb5fe42552dad689283e4b8eaf01f109d789d0dcfad8e5c3dfcbdcdb.png

Compare start and end sequences#

start_seq = output[output.iter==0].sort_values('total_score').iloc[-1].seq
end_seq = output[output.iter==5].sort_values('total_score').iloc[-1].seq

mutated_positions = grelu.sequence.mutate.seq_differences(start_seq, end_seq, verbose=True)
Position: 28 Reference base: A Alternate base: T Reference sequence: TCAGTACCTG
Position: 84 Reference base: G Alternate base: T Reference sequence: CGCCAGGCAC
Position: 100 Reference base: C Alternate base: G Reference sequence: CAAAACCAGA
Position: 105 Reference base: C Alternate base: A Reference sequence: CCAGACCTGA
Position: 123 Reference base: C Alternate base: G Reference sequence: CTGGGCCTGC

This time, let’s use Integrated Gradients to visualize the importance of these bases that were mutated at each iteration of evolution. Note also that we supply prediction_transform=mcg_specificity, so that we calculate the importance of each base to the specificity of accessibility in microglia.

start_attrs = grelu.interpret.score.get_attributions(
    model, start_seq, prediction_transform=mcg_specificity, device=0, method="integratedgradients",
)
end_attrs = grelu.interpret.score.get_attributions(
    model, end_seq, prediction_transform=mcg_specificity, device=0, method="integratedgradients",
)
grelu.visualize.plot_attributions(start_attrs, highlight_positions=mutated_positions)
<logomaker.src.Logo.Logo at 0x7f368c734690>
../_images/155ca9eb131cd445b9651f76f7e295a3493cae89f19c71df927a8593d9b2cda5.png
grelu.visualize.plot_attributions(end_attrs, highlight_positions=mutated_positions)
<logomaker.src.Logo.Logo at 0x7f37be04e550>
../_images/4751da3f30f6b14796e2f92c8b477e52cc63b34ba6ef22bb99d2e6e3ac4be8c0.png

It seems that the evolution process has created new motifs. We can identify the specific motifs that were altered by scanning the reference and alternate sequence with consensus JASPAR motifs and comparing the output.

import grelu.interpret.motifs

comparison = grelu.interpret.motifs.compare_motifs(
    ref_seq = start_seq,
    alt_seq = end_seq,
    motifs="hocomoco_v12",
    pthresh=1e-4,
    rc=True, # Scan both strands of the sequence
)
comparison.head()
sequence motif start end strand alt ref foldChange
0 ZN184.H12CORE.0.P.B 96 127 - -8.513401 0.000000 -inf
1 ZSC21.H12CORE.0.P.C 21 32 + 0.000000 13.521612 0.0
2 HAND2.H12CORE.1.P.B 95 109 - 0.000000 12.592777 0.0
3 HIC2.H12CORE.0.P.B 123 131 + 0.000000 10.317403 0.0
4 ZN567.H12CORE.0.P.C 80 110 + 0.000000 6.816476 0.0
comparison[comparison.motif=="SPI1.H12CORE.0.P.B"].sort_values("start")
sequence motif start end strand alt ref foldChange
141 SPI1.H12CORE.0.P.B 23 37 - 15.881978 0.00000 inf
29 SPI1.H12CORE.0.P.B 44 58 + 12.378630 12.37863 1.0
142 SPI1.H12CORE.0.P.B 96 110 + 11.530125 0.00000 inf

The ref column contains the score for the motif in the original sequence and the alt column contains the score in the end sequence. We see that the evolution process has created new instances of the SPI1.H12CORE.0.P.B motif, while leaving the already existing instance unchanged.

Example 4: Evolution by motif scanning#

So far, we have been performing directed evolution by in silico mutagenesis: at each iteration, we create all possible single-base substitutions and choose the best one. Another approach is to insert a given motif into every possible position in the starting sequence, and choose the best one. We can run this approach by choosing method="pattern" in grelu.design.evolve.

Let’s start by getting the sequence of the SPI1.H12CORE.0.P.B motif.

import grelu.io.motifs

motifs = grelu.io.motifs.read_meme_file("hocomoco_v12", names=["SPI1.H12CORE.0.P.B"])
patterns = grelu.interpret.motifs.motifs_to_strings(motifs)

print(patterns)
['AAAAGAGGAAGTGA']

Run evolution#

We will now try to run the evolution function for microglia-specific accessibility, this time by scanning the starting sequence with the motif and finding the best position to substitute in this motif. You can do this with multiple motifs with different weights or penalties.

In addition, we demonstrate another constraint in the grelu.design.evolve function; you can constrain which positions can be altered, whether by ISM or by motif substitution. Here, we will specify that the motif should only be substituted in the first 100 bases of the sequence.

output = grelu.design.evolve(
    start_intervals, # Start from the natural sequences
    model, 
    genome="hg38",
    prediction_transform=mcg_specificity, # Optimize the specific accessibility score
    max_iter=2, # Since we are modifying a whole motif, we only run 2 iterations
    method="pattern", # Evolution by pattern substitution
    patterns=patterns, # SPIB motif
    num_workers=8,
    devices=0,
    return_seqs="all",
    for_each=False,
    positions=range(100), # Positions allowed for substitution
)
Iteration 0
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00, 35.81it/s]
Best value at iteration 0: 0.284
Iteration 1
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 2/2 [00:00<00:00, 15.89it/s]
Best value at iteration 1: 0.907
Iteration 2
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 2/2 [00:00<00:00, 22.83it/s]
Best value at iteration 2: 0.951
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████| 4/4 [00:00<00:00, 24.24it/s]
grelu.visualize.plot_evolution(output, figsize=(5,10))
../_images/f7241440d96b4b921d9078fae1318a60d50a790c7e50719dde36c633b162046b.png

Let’s see where the motifs were inserted:

end_seq = output[output.iter==2].sort_values('total_score').iloc[-1].seq
end_attrs = grelu.interpret.score.get_attributions(
    model, end_seq, prediction_transform=mcg_specificity, device=0, method="integratedgradients",
)
grelu.visualize.plot_attributions(end_attrs)
<logomaker.src.Logo.Logo at 0x7f37bbf7a950>
../_images/6889d235a687e0a7ebde7694a9fdc6578e5cecc10e35f2e2674b4974f9b49dec.png

Example 5: Ledidi#

We also offer the Ledidi method by Schreiber et al. (2020) (https://www.biorxiv.org/content/10.1101/2020.05.21.109686v1). This is a gradient-based approach for sequence optimization. Here, we run Ledidi for the same objective as in example 3.

Ledidi requires a single starting sequence.

start_seq
'TATTGTTTTATATTGTTTTATACTCAGTACCTGTTTTAAGAAAAAAACAAGGAAGTGAAACCAAAGGCAGGCGGCCCGGCGCCAGGCACCAGACCCAAAACCAGACCTGAAACCAGGCCTGGGCCTGCCTGGCCTAAACCAAGTAGTTAAAAATCAACTCATGACTTAGCAACCGATGTTATCCATAGATTCCAGACATT'
output = grelu.design.ledidi(
    start_seq,
    model,
    prediction_transform=mcg_specificity,
    max_iter=2000,
    devices=0,
    lr=5e-3,
)
iter=I	input_loss=0.03125	output_loss=-0.2845	total_loss=-28.42	time=0.07945
iter=100	input_loss=0.03125	output_loss=-0.2769	total_loss=-27.66	time=5.107
iter=200	input_loss=0.1562	output_loss=-0.2932	total_loss=-29.16	time=5.086
iter=300	input_loss=0.375	output_loss=-0.2854	total_loss=-28.17	time=5.095
iter=400	input_loss=2.031	output_loss=-0.3944	total_loss=-37.41	time=5.085
iter=500	input_loss=7.594	output_loss=-0.6585	total_loss=-58.26	time=5.089
iter=600	input_loss=11.16	output_loss=-0.8318	total_loss=-72.02	time=5.088
iter=700	input_loss=13.66	output_loss=-0.8727	total_loss=-73.62	time=5.136
iter=800	input_loss=12.81	output_loss=-0.8936	total_loss=-76.55	time=5.081
iter=900	input_loss=13.16	output_loss=-0.9146	total_loss=-78.31	time=5.092
iter=1000	input_loss=12.38	output_loss=-0.9374	total_loss=-81.37	time=5.086
iter=1100	input_loss=13.56	output_loss=-0.9437	total_loss=-80.8	time=5.079
iter=1200	input_loss=15.09	output_loss=-0.9391	total_loss=-78.81	time=5.093
iter=1300	input_loss=14.94	output_loss=-0.9547	total_loss=-80.53	time=5.085
iter=1400	input_loss=15.28	output_loss=-0.9605	total_loss=-80.76	time=5.114
iter=1500	input_loss=14.12	output_loss=-0.9646	total_loss=-82.34	time=5.088
iter=1600	input_loss=15.38	output_loss=-0.9582	total_loss=-80.44	time=5.086
iter=1700	input_loss=14.97	output_loss=-0.9646	total_loss=-81.49	time=5.087
iter=1800	input_loss=15.78	output_loss=-0.9659	total_loss=-80.81	time=5.09
iter=1900	input_loss=16.06	output_loss=-0.9587	total_loss=-79.81	time=5.089
iter=2000	input_loss=14.62	output_loss=-0.9573	total_loss=-81.1	time=5.093

This returns a list of optimized sequences. Let’s look at the first sequence in the list:

end_seq = output[0]
end_seq
'TACTGTTTTATATAGTTTTATACTCAGTACAACTTTTAAGAAAAAAACAGGGAAGTGAAACCAAAGGCAGACGGCCCGGCGCCAGGCACCAGACCCAAAAGAGGAACTGAAACCAAGACTGGGGCTGCCTGGCCTAAACCAAGTAGTTAAAACTCAACTCAAGACTTAGCAACCGATGTTATCCTTAGATTCCAGACATT'
mutated_positions = grelu.sequence.mutate.seq_differences(start_seq, end_seq)
Position: 2 Reference base: T Alternate base: C Reference sequence: 
Position: 13 Reference base: T Alternate base: A Reference sequence: TATATTGTTT
Position: 30 Reference base: C Alternate base: A Reference sequence: AGTACCTGTT
Position: 31 Reference base: T Alternate base: A Reference sequence: GTACCTGTTT
Position: 32 Reference base: G Alternate base: C Reference sequence: TACCTGTTTT
Position: 49 Reference base: A Alternate base: G Reference sequence: AAACAAGGAA
Position: 70 Reference base: G Alternate base: A Reference sequence: GGCAGGCGGC
Position: 100 Reference base: C Alternate base: G Reference sequence: CAAAACCAGA
Position: 101 Reference base: C Alternate base: A Reference sequence: AAAACCAGAC
Position: 102 Reference base: A Alternate base: G Reference sequence: AAACCAGACC
Position: 105 Reference base: C Alternate base: A Reference sequence: CCAGACCTGA
Position: 115 Reference base: G Alternate base: A Reference sequence: AACCAGGCCT
Position: 117 Reference base: C Alternate base: A Reference sequence: CCAGGCCTGG
Position: 123 Reference base: C Alternate base: G Reference sequence: CTGGGCCTGC
Position: 152 Reference base: A Alternate base: C Reference sequence: TAAAAATCAA
Position: 161 Reference base: T Alternate base: A Reference sequence: ACTCATGACT
Position: 184 Reference base: A Alternate base: T Reference sequence: TATCCATAGA