grelu.interpret.motifs

`grelu.interpret.motifs` contains functions for manipulating sequence motifs and scanning DNA sequences with them. The aim is not to provide a comprehensive suite of motif analysis functions, but only the functionality needed to interpret sequence-to-function deep learning models using these motifs.

Functions

`motifs_to_strings`(→ str)	Extracts a matching DNA sequence from a motif, either the best match or a sampled sequence.
`trim_pwm`(→ Union[Tuple[int], numpy.ndarray])	Trims the edges of a Position Weight Matrix (PWM) based on the information content of each position.
`scan_sequences`(seqs, motifs[, names, seq_ids, ...])	Scan a DNA sequence using motifs. Based on jmschrei/memesuite-lite.
`score_sites`(→ pandas.DataFrame)	Given a dataframe of motif matching sites identified by FIMO and a set of attributions, assigns each site a ‘site attribution score’.
`score_motifs`(→ pandas.DataFrame)	Given a dataframe of motif matching sites identified by FIMO and a set of attributions, assigns each site a ‘motif attribution score’.
`compare_motifs`(→ pandas.DataFrame)	Scan sequences containing the reference and alternate alleles to identify affected motifs.
`run_tomtom`(→ pandas.DataFrame)	Function to compare given motifs to reference motifs using the tomtom algorithm.

Module Contents

grelu.interpret.motifs.motifs_to_strings(motifs: numpy.ndarray | Dict[str, numpy.ndarray] | str, names: List[str] | None = None, sample: bool = False, rng: Generator | None = None) → str

Extracts a matching DNA sequence from a motif. If sample=False, the best match sequence is returned; otherwise, a sequence is sampled from the probability distribution at each position of the motif.

Parameters:

motifs – Either a numpy array containing a Position Probability Matrix (PPM) of shape (4, L), or a dictionary containing motif names as keys and PPMs of shape (4, L) as values, or the path to a MEME file.
names – A list of motif names to read from the MEME file, in case a MEME file is supplied in motifs. If None, all motifs in the file will be read.
sample – If True, a sequence will be sampled from the motif. Otherwise, the best match sequence will be returned.
rng – np.random.Generator object

Returns:

DNA sequence(s) as strings
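The difference between the two modes can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper name `ppm_to_string` is hypothetical, and the row order A, C, G, T for the PPM is an assumption.

```python
import numpy as np

BASES = np.array(list("ACGT"))  # assumed row order of the PPM: A, C, G, T


def ppm_to_string(ppm, sample=False, rng=None):
    """Return one DNA string for a PPM of shape (4, L). Hypothetical helper."""
    ppm = np.asarray(ppm, dtype=float)
    if sample:
        rng = np.random.default_rng() if rng is None else rng
        # Draw one base per position from that position's distribution.
        idx = [
            rng.choice(4, p=ppm[:, i] / ppm[:, i].sum())
            for i in range(ppm.shape[1])
        ]
    else:
        # Best match: the most probable base at every position.
        idx = ppm.argmax(axis=0)
    return "".join(BASES[idx])


ppm = np.array([
    [0.7, 0.1, 0.0, 0.1],  # A
    [0.1, 0.1, 0.0, 0.1],  # C
    [0.1, 0.7, 1.0, 0.1],  # G
    [0.1, 0.1, 0.0, 0.7],  # T
])
best = ppm_to_string(ppm)
sampled = ppm_to_string(ppm, sample=True, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the sampled sequence reproducible, which is presumably the purpose of the `rng` parameter.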

grelu.interpret.motifs.trim_pwm(pwm: numpy.ndarray, trim_threshold: float = 0.3, return_indices: bool = False) → Tuple[int] | numpy.ndarray

Trims the edges of a Position Weight Matrix (PWM) based on the information content of each position.

Parameters:

pwm – A numpy array of shape (4, L) containing the PWM
trim_threshold – Threshold ranging from 0 to 1 to trim edge positions
return_indices – If True, only the indices of the positions to keep will be returned. If False, the trimmed motif will be returned.

Returns:

np.array containing the trimmed PWM (if return_indices = False) or a tuple of ints for the start and end positions of the trimmed motif (if return_indices = True).
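As an illustration of edge trimming by information content, here is a sketch under stated assumptions: the input is a PPM whose columns sum to 1, information content is measured in bits against a uniform background, and a position is kept if its information content is at least `trim_threshold` times the maximum. The library's exact rule may differ, and the names `information_content` and `trim_edges` are hypothetical.

```python
import numpy as np


def information_content(ppm, eps=1e-9):
    """Per-position information content in bits for a PPM of shape (4, L)."""
    ppm = np.asarray(ppm, dtype=float)
    # 2 bits is the maximum for a 4-letter alphabet; eps avoids log2(0).
    return 2.0 + np.sum(ppm * np.log2(ppm + eps), axis=0)


def trim_edges(ppm, trim_threshold=0.3, return_indices=False):
    """Trim low-information positions from both edges (assumed rule)."""
    ppm = np.asarray(ppm, dtype=float)
    ic = information_content(ppm)
    # Positions whose information content passes the relative threshold.
    keep = np.where(ic >= trim_threshold * ic.max())[0]
    start, end = int(keep[0]), int(keep[-1]) + 1
    if return_indices:
        return start, end
    return ppm[:, start:end]


ppm = np.array([
    [0.25, 0.97, 0.01, 0.25],  # A
    [0.25, 0.01, 0.97, 0.25],  # C
    [0.25, 0.01, 0.01, 0.25],  # G
    [0.25, 0.01, 0.01, 0.25],  # T
])
indices = trim_edges(ppm, return_indices=True)  # uniform flanks are dropped
trimmed = trim_edges(ppm)
```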

grelu.interpret.motifs.scan_sequences(seqs: str | List[str], motifs: str | Dict[str, numpy.ndarray], names: List[str] | None = None, seq_ids: List[str] | None = None, pthresh: float = 0.001, rc: bool = True, bin_size: float = 0.1, eps: float = 0.0001, attrs: numpy.ndarray | None = None)

Scan a DNA sequence using motifs. Based on jmschrei/memesuite-lite.

Parameters:

seqs – A string or a list of DNA sequences as strings
motifs – A dictionary whose values are Position Probability Matrices (PPMs) of shape (4, L), or the path to a MEME file.
names – A list of motif names to read from the MEME file. If None, all motifs in the file will be read.
seq_ids – Optional list of IDs for sequences
pthresh – p-value cutoff for binding sites
rc – If True, both the sequence and its reverse complement will be scanned. If False, only the given sequence will be scanned.
bin_size – The size of the bins discretizing the PWM scores. The smaller the bin size the higher the resolution, but the less data may be available to support it. Default is 0.1.
eps – A small pseudocount to add to the motif PPMs before taking the log. Default is 0.0001.
attrs – An optional numpy array of shape (B, 4, L) containing attributions for the input sequences. If provided, the results will include site attribution and motif attribution scores for each FIMO hit.

Returns:

pd.DataFrame containing columns ‘motif’, ‘sequence’, ‘start’, ‘end’, ‘strand’, ‘score’, ‘pval’, and ‘matched_seq’.
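To make the roles of `eps` and `rc` concrete, the following sketch scores every window of a sequence against a log-odds matrix on one or both strands. It is a simplified stand-in, not the memesuite-lite algorithm: it assumes a uniform background of 0.25 and A, C, G, T row order, and it omits the p-value computation that `pthresh` and `bin_size` control. The helper names `log_odds` and `scan` are hypothetical.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed PPM row order
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds(ppm, eps=1e-4, background=0.25):
    """Add a pseudocount, renormalize, and convert a PPM to log-odds scores."""
    ppm = np.asarray(ppm, dtype=float) + eps
    ppm = ppm / ppm.sum(axis=0, keepdims=True)
    return np.log2(ppm / background)


def scan(seq, ppm, rc=True, eps=1e-4):
    """Return (start, end, strand, score, matched_seq) for every window."""
    pwm = log_odds(ppm, eps=eps)
    width = pwm.shape[1]
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        strands = [("+", window)]
        if rc:
            # Score the reverse complement of the same window as well.
            strands.append(("-", window.translate(COMPLEMENT)[::-1]))
        for strand, site in strands:
            score = sum(pwm[BASE_INDEX[b], i] for i, b in enumerate(site))
            hits.append((start, start + width, strand, float(score), site))
    return hits


# A near-deterministic motif for "ACG".
ppm = np.array([
    [1.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0],  # C
    [0.0, 0.0, 1.0],  # G
    [0.0, 0.0, 0.0],  # T
])
hits = scan("TTACGAA", ppm)
best = max(hits, key=lambda h: h[3])
```

Without the pseudocount, the zero entries of the PPM would produce log-odds of negative infinity, which is why a small `eps` is added before taking the log.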

grelu.interpret.motifs.score_sites(sites: pandas.DataFrame, attrs: numpy.ndarray, seqs: str | List[str]) → pandas.DataFrame

Given a dataframe of motif matching sites identified by FIMO and a set of attributions, this function assigns each site a ‘site attribution score’ corresponding to the average attribution value for all nucleotides within the site. This score gives the importance of the sequence region but does not reflect the similarity between the PWM and the shape of the attributions.

Parameters:

sites – A dataframe containing the output of scan_sequences
attrs – A numpy array of shape (B, 4, L) containing attributions for the sequences.
seqs – A string or a list of DNA sequences as strings, which were the input to scan_sequences.

Returns:

pd.DataFrame containing columns ‘motif’, ‘sequence’, ‘start’, ‘end’, ‘strand’, ‘score’, ‘pval’, ‘matched_seq’, and ‘site_attr_score’.
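The site attribution score described above can be sketched for a single site as follows. This assumes that `end` is exclusive and that each position's attribution is obtained by summing over the base axis (as when attributions are nonzero only at the observed base); the helper name `site_attr_score` is hypothetical and this is not the library's code.

```python
import numpy as np


def site_attr_score(attrs, seq_idx, start, end):
    """Mean per-nucleotide attribution over positions [start, end)."""
    # Collapse the base axis so that every position has a single value.
    per_position = attrs[seq_idx].sum(axis=0)
    return float(per_position[start:end].mean())


# One sequence of length 8 with attributions at three positions.
attrs = np.zeros((1, 4, 8))
attrs[0, 0, 2] = 0.6
attrs[0, 1, 3] = 0.3
attrs[0, 2, 4] = 0.9
score = site_attr_score(attrs, seq_idx=0, start=2, end=5)  # mean of 0.6, 0.3, 0.9
```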

grelu.interpret.motifs.score_motifs(sites: pandas.DataFrame, attrs: numpy.ndarray, motifs: Dict[str, numpy.ndarray] | str) → pandas.DataFrame

Given a dataframe of motif matching sites identified by FIMO and a set of attributions, this function assigns each site a ‘motif attribution score’ which is the sum of the element-wise product of the motif and the attributions. This score is higher when the shape of the motif matches the shape of the attribution profile, and is particularly useful for ranking multiple motifs that all match to the same sequence region.

Parameters:

sites – A dataframe containing the output of scan_sequences
attrs – A numpy array of shape (B, 4, L) containing attributions for the sequences.
motifs – A dictionary whose values are Position Probability Matrices (PPMs) of shape (4, L), or the path to a MEME file. This should be the same as the input passed to scan_sequences.

Returns:

pd.DataFrame containing columns ‘motif’, ‘sequence’, ‘start’, ‘end’, ‘strand’, ‘score’, ‘pval’, ‘matched_seq’, and ‘motif_attr_score’.
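The motif attribution score, a sum of the element-wise product of motif and attributions, can be sketched as below. The handling of minus-strand sites by reverse-complementing the motif is an assumption, as are the A, C, G, T row order and the hypothetical helper name `motif_attr_score`.

```python
import numpy as np


def motif_attr_score(attrs, ppm, seq_idx, start, strand="+"):
    """Sum of the element-wise product of a motif and the attribution window."""
    ppm = np.asarray(ppm, dtype=float)
    if strand == "-":
        # Reverse complement of a (4, L) matrix in A, C, G, T row order.
        ppm = ppm[::-1, ::-1]
    window = attrs[seq_idx, :, start:start + ppm.shape[1]]
    return float(np.sum(window * ppm))


# Attributions highlight an A at position 1 and a G at position 2.
attrs = np.zeros((1, 4, 6))
attrs[0, 0, 1] = 1.0
attrs[0, 2, 2] = 1.0

motif_ag = np.array([[0.9, 0.1], [0.0, 0.0], [0.1, 0.9], [0.0, 0.0]])
motif_ga = np.array([[0.1, 0.9], [0.0, 0.0], [0.9, 0.1], [0.0, 0.0]])

score_ag = motif_attr_score(attrs, motif_ag, seq_idx=0, start=1)
score_ga = motif_attr_score(attrs, motif_ga, seq_idx=0, start=1)
```

Both motifs cover the same region, but `motif_ag` scores higher because its shape agrees with the attribution profile, which is how this score helps rank overlapping motif matches.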

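grelu.interpret.motifs.compare_motifs(ref_seq, motifs, alt_seq, ref_allele, alt_allele, pos, names, pthresh, rc) → pandas.DataFrame

Scan sequences containing the reference and alternate alleles to identify affected motifs.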

Parameters:

ref_seq – The reference sequence as a string
motifs – A dictionary whose values are Position Probability Matrices (PPMs) of shape (4, L), or the path to a MEME file.
alt_seq – The alternate sequence as a string
ref_allele – The reference allele as a string. Only used if alt_seq is not supplied.
alt_allele – The alternate allele as a string. Only needed if alt_seq is not supplied.
pos – The position at which to substitute the alternate allele. Only needed if alt_seq is not supplied.
names – A list of motif names to read from the MEME file. If None, all motifs in the file will be read.
pthresh – p-value cutoff for binding sites
rc – If True, both the sequence and its reverse complement will be scanned. If False, only the given sequence will be scanned.
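When alt_seq is not supplied, the alternate sequence has to be built from ref_seq, alt_allele, and pos. A minimal sketch of that substitution is shown here, assuming that pos is 0-based and that ref_allele, when given, is used to check the reference sequence. The helper name `substitute_allele` is hypothetical and the library's conventions may differ.

```python
def substitute_allele(ref_seq, alt_allele, pos, ref_allele=None):
    """Return ref_seq with alt_allele substituted at 0-based position pos."""
    # Without a reference allele, assume the replaced span has the
    # same length as the alternate allele (a simple substitution).
    ref_len = len(ref_allele) if ref_allele is not None else len(alt_allele)
    if ref_allele is not None and ref_seq[pos:pos + ref_len] != ref_allele:
        raise ValueError("ref_allele does not match ref_seq at pos")
    return ref_seq[:pos] + alt_allele + ref_seq[pos + ref_len:]


ref_seq = "ACGTACGT"
alt_seq = substitute_allele(ref_seq, alt_allele="T", pos=2, ref_allele="G")
```

Both ref_seq and the resulting alt_seq would then be scanned with the same motifs, and sites that appear in only one of the two, or that change in score, indicate motifs affected by the variant.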

grelu.interpret.motifs.run_tomtom(motifs: Dict[str, numpy.ndarray], meme_file: str) → pandas.DataFrame

Function to compare given motifs to reference motifs using the tomtom algorithm, as implemented in memelite (jmschrei/memesuite-lite).

Parameters:

motifs – A dictionary whose values are Position Probability Matrices (PPMs) of shape (4, L).
meme_file – Path to a meme file containing reference motifs.

Returns:

Pandas dataframe containing all tomtom results.

Return type:

pandas.DataFrame
