grelu.interpret.motifs
Functions related to manipulating sequence motifs and scanning DNA sequences with motifs.
Functions
| motifs_to_strings | Extracts a matching DNA sequence from a motif |
| trim_pwm | Trims the edges of a PWM based on information content. |
| scan_sequences | Scan DNA sequences using motifs |
| marginalize_patterns | Runs a marginalization experiment. |
| compare_motifs | Scan sequences containing the reference and alternate alleles to identify affected motifs. |
Module Contents
- grelu.interpret.motifs.motifs_to_strings(motifs: pymemesuite.common.Motif | List[pymemesuite.common.Motif] | str, names: List[str] | None = None, sample: bool = False, rng: Generator | None = None) → str
Extracts a matching DNA sequence from a motif
- Parameters:
motifs – A pymemesuite.common.Motif object, a list of such objects, or the path to a MEME file.
names – A list of motif names to read from the MEME file, in case a MEME file is supplied in motifs. If None, all motifs in the file will be read.
sample – If True, a sequence will be sampled from the motif. Otherwise, the best match sequence will be returned.
rng – np.random.Generator object used for sampling
- Returns:
DNA sequence(s) as strings
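The difference between the best-match and sampled outputs can be illustrated with a minimal sketch. This is not gReLU's implementation: `pwm_to_string` is a hypothetical helper, the PWM is assumed to be a list of per-position (A, C, G, T) probabilities rather than a pymemesuite Motif, and the standard library's `random.Random` stands in for the NumPy generator.

```python
import random

BASES = "ACGT"

def pwm_to_string(pwm, sample=False, rng=None):
    """Return one DNA string for a PWM given as rows of (A, C, G, T) probabilities.

    Hypothetical helper for illustration; not part of gReLU.
    """
    if sample:
        rng = rng or random.Random()
        # Draw one base per position, weighted by that position's probabilities.
        return "".join(rng.choices(BASES, weights=row, k=1)[0] for row in pwm)
    # Best match: take the most probable base at every position.
    return "".join(BASES[max(range(4), key=lambda i: row[i])] for row in pwm)

# Toy PWM whose most probable bases spell AGTA.
pwm = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.05, 0.85],
    [0.80, 0.10, 0.05, 0.05],
]

best = pwm_to_string(pwm)
sampled = pwm_to_string(pwm, sample=True, rng=random.Random(0))
print(best, sampled)
```

With sample left at False the result is deterministic; with sample set to True, passing a seeded generator makes the draw reproducible.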
- grelu.interpret.motifs.trim_pwm(pwm: numpy.array, trim_threshold: float = 0.3, padding: int = 0, return_indices: bool = False) → Tuple[int] | numpy.array
Trims the edges of a PWM based on information content.
- Parameters:
pwm – PWM array of shape (L, 4)
trim_threshold – Threshold ranging from 0 to 1 to trim edge positions
padding – Number of low-information positions on either end to allow
return_indices – If True, only the indices of the positions to keep will be returned. If False, the trimmed motif will be returned.
- Returns:
np.array containing the trimmed PWM (if return_indices = False) or a tuple of ints for the start and end positions of the trimmed motif (if return_indices = True).
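Edge trimming by information content can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `trim_pwm_sketch` is a hypothetical function, the PWM is a list of probability rows, information content is computed against a uniform background, and a position is kept when its information content is at least trim_threshold times the maximum over all positions. How gReLU scores positions and applies the threshold may differ.

```python
import math

def information_content(row):
    """Information content in bits of one PWM position, uniform background assumed."""
    return 2.0 + sum(p * math.log2(p) for p in row if p > 0)

def trim_pwm_sketch(pwm, trim_threshold=0.3, padding=0, return_indices=False):
    """Trim low-information edges from a PWM (hypothetical helper)."""
    ic = [information_content(row) for row in pwm]
    # Assumption: the threshold is a fraction of the most informative position.
    cutoff = trim_threshold * max(ic)
    keep = [i for i, value in enumerate(ic) if value >= cutoff]
    # Extend the kept span by `padding` positions, staying inside the PWM.
    start = max(keep[0] - padding, 0)
    end = min(keep[-1] + 1 + padding, len(pwm))
    if return_indices:
        return start, end
    return pwm[start:end]

uniform = [0.25, 0.25, 0.25, 0.25]
pwm = [
    uniform,
    uniform,
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    uniform,
]

indices = trim_pwm_sketch(pwm, return_indices=True)            # (start, end)
trimmed = trim_pwm_sketch(pwm)                                 # trimmed rows
padded = trim_pwm_sketch(pwm, padding=1, return_indices=True)  # one extra position per side
print(indices, len(trimmed), padded)
```

Only positions between the first and last informative column are retained, so low-information positions in the middle of a motif are not removed.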
- grelu.interpret.motifs.scan_sequences(seqs: List[str], motifs: pymemesuite.common.Motif | List[pymemesuite.common.Motif] | str, names: List[str] | None = None, bg=None, seq_ids: List[str] | None = None, pthresh: float = 0.001, rc: bool = True)
Scan DNA sequences using motifs
- Parameters:
seqs – A list of DNA sequences as strings
motifs – A pymemesuite.common.Motif object, a list of such objects, or the path to a MEME file.
names – A list of motif names to read from the MEME file. If None, all motifs in the file will be read.
bg – A background distribution for motif p-value calculations. Only needed if a list of Motif objects is supplied instead of a MEME file.
seq_ids – Optional list of IDs for sequences
pthresh – p-value cutoff for binding sites
rc – If True, both the sequence and its reverse complement will be scanned. If False, only the given sequence will be scanned.
- Returns:
pd.DataFrame containing columns ‘motif’, ‘sequence’, ‘start’, ‘end’, ‘strand’, ‘score’ and ‘pval’.
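The idea of scanning both strands and reporting hits with motif, sequence, start, end, strand and score fields can be sketched in plain Python. This is a simplified stand-in, not the library's scanner: `scan_sketch` is hypothetical, motifs are passed as a dict of probability rows, hits are filtered by a log-odds score cutoff instead of a p-value, and sequences are assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def log_odds(pwm, bg=(0.25, 0.25, 0.25, 0.25), pseudocount=0.01):
    """Convert probability rows to log2 odds against a background distribution."""
    return [
        [math.log2((p + pseudocount) / (1 + 4 * pseudocount) / b) for p, b in zip(row, bg)]
        for row in pwm
    ]

def scan_sketch(seqs, motifs, seq_ids=None, score_thresh=0.0, rc=True):
    """Return a list of hit dicts; `motifs` maps a name to PWM rows (A, C, G, T)."""
    seq_ids = seq_ids or [str(i) for i in range(len(seqs))]
    hits = []
    for name, pwm in motifs.items():
        lo = log_odds(pwm)
        width = len(lo)
        for seq_id, seq in zip(seq_ids, seqs):
            strands = [("+", seq)]
            if rc:
                strands.append(("-", reverse_complement(seq)))
            for strand, s in strands:
                for i in range(len(s) - width + 1):
                    score = sum(lo[j][BASES.index(s[i + j])] for j in range(width))
                    if score >= score_thresh:
                        # Report coordinates relative to the forward strand.
                        start = i if strand == "+" else len(s) - i - width
                        hits.append({"motif": name, "sequence": seq_id, "start": start,
                                     "end": start + width, "strand": strand, "score": score})
    return hits

def consensus_pwm(word, p=0.97):
    """Build a toy PWM that strongly prefers `word`."""
    rest = (1 - p) / 3
    return [[p if base == c else rest for base in BASES] for c in word]

hits = scan_sketch(["CCGATACC", "CCTATCCC"], {"GATA": consensus_pwm("GATA")},
                   seq_ids=["s1", "s2"], score_thresh=5.0)
for hit in hits:
    print(hit)
```

The second sequence contains the motif only on the reverse strand, so it is found only because rc is True.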
- grelu.interpret.motifs.marginalize_patterns(model: Callable, patterns: str | List[str], seqs: pandas.DataFrame | List[str] | numpy.ndarray, genome: str | None = None, devices: str | int | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 0, seed: int | None = None, prediction_transform: Callable | None = None, rc: bool = False, compare_func: str | Callable | None = None) → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]
Runs a marginalization experiment.
Given a model, a pattern (short sequence) to insert, and a set of background sequences, get the predictions from the model before and after inserting the patterns into the (optionally shuffled) background sequences.
- Parameters:
model – trained model
patterns – a sequence or list of sequences to insert
seqs – background sequences
genome – Name of the genome to use if genomic intervals are supplied
devices – Device name, index, or list of indices on which to run inference
num_workers – Number of workers for inference
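batch_size – Batch size for inference
n_shuffles – Number of times to shuffle the background sequences. If 0, the sequences are used without shuffling.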
seed – Random seed
prediction_transform – A module to transform the model output
rc – If True, augment by reverse complementation
compare_func – Function to compare the predictions with and without the pattern. Options are “divide” or “subtract”. If not provided, the predictions for the shuffled sequences and each pattern will be returned.
- Returns:
preds_before – The predictions from the background sequences.
preds_after – The predictions after inserting the pattern into the background sequences.
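The structure of a marginalization experiment can be sketched with a toy model. Everything here is an assumption made for illustration: `marginalize_sketch`, `insert_pattern` and `toy_model` are hypothetical, patterns are written over the centre of each background sequence, shuffling is a plain base shuffle, and the comparison is computed as after versus before. The real function runs batched inference on a trained model and may insert, shuffle and compare differently.

```python
import random

def insert_pattern(seq, pattern):
    """Overwrite the centre of `seq` with `pattern`, keeping the length unchanged."""
    start = (len(seq) - len(pattern)) // 2
    return seq[:start] + pattern + seq[start + len(pattern):]

def shuffle_seq(seq, rng):
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

COMPARE = {
    "subtract": lambda after, before: after - before,
    "divide": lambda after, before: after / before,
}

def marginalize_sketch(model, patterns, seqs, n_shuffles=0, seed=None, compare_func=None):
    rng = random.Random(seed)
    # One group of backgrounds per input sequence: shuffled copies, or the sequence itself.
    if n_shuffles > 0:
        groups = [[shuffle_seq(s, rng) for _ in range(n_shuffles)] for s in seqs]
    else:
        groups = [[s] for s in seqs]
    preds_before = [[model(b) for b in group] for group in groups]
    preds_after = [[[model(insert_pattern(b, p)) for p in patterns] for b in group]
                   for group in groups]
    if compare_func is None:
        return preds_before, preds_after
    compare = COMPARE[compare_func] if isinstance(compare_func, str) else compare_func
    return [[[compare(after, before) for after in afters]
             for before, afters in zip(before_group, after_group)]
            for before_group, after_group in zip(preds_before, preds_after)]

def toy_model(seq):
    """Stand-in for a trained model: baseline of 1 plus the number of GATA sites."""
    return 1.0 + seq.count("GATA")

seqs = ["ACGTACGTACGT", "TTTTTTTTTTTT"]
patterns = ["GATA", "CCCC"]

before, after = marginalize_sketch(toy_model, patterns, seqs)
diff = marginalize_sketch(toy_model, patterns, seqs, compare_func="subtract")
shuffled_before, shuffled_after = marginalize_sketch(toy_model, patterns, seqs,
                                                     n_shuffles=3, seed=0)
print(before, after, diff)
```

In this sketch the outputs are nested lists indexed by background sequence, shuffle, and pattern; the GATA pattern raises the toy prediction while CCCC leaves it unchanged.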
- grelu.interpret.motifs.compare_motifs(ref_seq: str | pandas.DataFrame, motifs: pymemesuite.common.Motif | List[pymemesuite.common.Motif] | str, alt_seq: str | None = None, alt_allele: str | None = None, pos: int | None = None, names: List[str] | None = None, bg=None, pthresh: float = 0.001, rc: bool = True) → pandas.DataFrame
Scan sequences containing the reference and alternate alleles to identify affected motifs.
- Parameters:
ref_seq – The reference sequence as a string
motifs – A pymemesuite.common.Motif object, a list of such objects, or the path to a MEME file.
alt_seq – The alternate sequence as a string
ref_allele – The reference allele as a string. Only used if alt_seq is not supplied.
alt_allele – The alternate allele as a string. Only needed if alt_seq is not supplied.
pos – The position at which to substitute the alternate allele. Only needed if alt_seq is not supplied.
names – A list of motif names to read from the MEME file. If None, all motifs in the file will be read.
bg – A background distribution for motif p-value calculations. Only needed if a list of Motif objects is supplied instead of a MEME file.
pthresh – p-value cutoff for binding sites
rc – If True, both the sequence and its reverse complement will be scanned. If False, only the given sequence will be scanned.
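How alt_allele and pos relate to alt_seq, and how motif hits on the two alleles can be compared, is sketched below. The helpers `substitute_allele` and `find_sites` are hypothetical, pos is assumed to be 0-based, and exact string matching stands in for a real motif scan; none of this is taken from the library's implementation.

```python
def substitute_allele(ref_seq, alt_allele, pos):
    """Return `ref_seq` with `alt_allele` written in at position `pos` (0-based, assumed)."""
    return ref_seq[:pos] + alt_allele + ref_seq[pos + len(alt_allele):]

def find_sites(seq, motif_strings):
    """Exact-match stand-in for a motif scan: a set of (motif name, start) pairs."""
    return {
        (name, i)
        for name, word in motif_strings.items()
        for i in range(len(seq) - len(word) + 1)
        if seq[i:i + len(word)] == word
    }

ref_seq = "CCGATACC"
# Build the alternate sequence from the allele and its position.
alt_seq = substitute_allele(ref_seq, alt_allele="C", pos=3)

motifs = {"GATA": "GATA", "GCTA": "GCTA"}
ref_sites = find_sites(ref_seq, motifs)
alt_sites = find_sites(alt_seq, motifs)

lost = ref_sites - alt_sites    # sites present only with the reference allele
gained = alt_sites - ref_sites  # sites present only with the alternate allele
print(alt_seq, lost, gained)
```

Sites that appear in only one of the two sets are the ones affected by the variant.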