grelu.interpret.motifs

Functions related to manipulating sequence motifs and scanning DNA sequences with motifs.

Functions

motifs_to_strings(→ str)

Extracts a matching DNA sequence from a motif

trim_pwm(→ Union[Tuple[int], numpy.array])

Trims the edges of a PWM based on information content.

scan_sequences(seqs, motifs[, names, bg, seq_ids, ...])

Scan a DNA sequence using motifs

marginalize_patterns(→ Union[numpy.ndarray, ...)

Runs a marginalization experiment.

compare_motifs(→ pandas.DataFrame)

Scan sequences containing the reference and alternate alleles to identify affected motifs.

Module Contents

grelu.interpret.motifs.motifs_to_strings(motifs: pymemesuite.common.Motif | List[pymemesuite.common.Motif] | str, names: List[str] | None = None, sample: bool = False, rng: Generator | None = None) → str

Extracts a matching DNA sequence from a motif

Parameters:
  • motifs – A pymemesuite.common.Motif object, a list of such objects, or the path to a MEME file.

  • names – A list of motif names to read from the MEME file, in case a MEME file is supplied in motifs. If None, all motifs in the file will be read.

  • sample – If True, a sequence will be sampled from the motif. Otherwise, the best match sequence will be returned.

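  • rng – Random number generator used for sampling when sample is True, for example an np.random.RandomState object. The signature annotates it as Generator.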

Returns:

DNA sequence(s) as strings
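To illustrate the two modes selected by `sample`, here is a minimal plain-Python sketch that does not use pymemesuite. It assumes a motif can be represented as a list of per-position probabilities over A, C, G and T. The names `PROBS`, `best_match` and `sample_sequence` are hypothetical and are not part of gReLU, and the library's actual implementation may differ.

```python
import random

BASES = "ACGT"

# Hypothetical toy motif: one row per position, columns are P(A), P(C), P(G), P(T).
PROBS = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.05, 0.85],
]


def best_match(probs):
    """Return the sequence made of the most probable base at each position.

    Ties are resolved in favour of the first base in A, C, G, T order.
    """
    return "".join(BASES[max(range(4), key=lambda i: row[i])] for row in probs)


def sample_sequence(probs, rng):
    """Draw one base per position according to that position's probabilities."""
    return "".join(rng.choices(BASES, weights=row, k=1)[0] for row in probs)


rng = random.Random(0)  # seeded so that the sampled sequence is reproducible
consensus = best_match(PROBS)
sampled = sample_sequence(PROBS, rng)
```

The best-match mode is deterministic, whereas the sampled mode depends on the state of the random number generator, which is why a generator can be passed in.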

grelu.interpret.motifs.trim_pwm(pwm: numpy.array, trim_threshold: float = 0.3, padding: int = 0, return_indices: bool = False) → Tuple[int] | numpy.array

Trims the edges of a PWM based on information content.

Parameters:
  • pwm – PWM array of shape (L, 4)

  • trim_threshold – Threshold ranging from 0 to 1 to trim edge positions

  • padding – Number of low-information positions on either end to allow

  • return_indices – If True, only the indices of the positions to keep will be returned. If False, the trimmed motif will be returned.

Returns:

A tuple of ints for the start and end positions of the trimmed motif (if return_indices = True) or an np.array containing the trimmed PWM (if return_indices = False).
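The idea of trimming by information content can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the gReLU implementation: it assumes the PWM holds probabilities in L rows of 4, that information content is measured in bits against a uniform background, and that a position is kept when its information content is at least `trim_threshold` times the maximum in the motif. The function names are hypothetical.

```python
import math


def information_content(row):
    """Information content of one position in bits: 2 + sum(p * log2(p))."""
    return 2.0 + sum(p * math.log2(p) for p in row if p > 0)


def trim_by_information(pwm, trim_threshold=0.3, padding=0, return_indices=False):
    """Trim low-information edge positions from a PWM given as L rows of 4 probabilities.

    Assumption: a position passes if its information content is at least
    trim_threshold times the maximum information content in the motif.
    """
    ic = [information_content(row) for row in pwm]
    cutoff = trim_threshold * max(ic)
    keep = [i for i, value in enumerate(ic) if value >= cutoff]
    # Extend the kept region by `padding` positions without leaving the motif.
    start = max(keep[0] - padding, 0)
    end = min(keep[-1] + 1 + padding, len(pwm))
    if return_indices:
        return start, end
    return pwm[start:end]


UNIFORM = [0.25, 0.25, 0.25, 0.25]
pwm = [
    UNIFORM,
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    UNIFORM,
    UNIFORM,
]

indices = trim_by_information(pwm, return_indices=True)
trimmed = trim_by_information(pwm)
padded = trim_by_information(pwm, padding=1, return_indices=True)
```

In this toy motif only the two central positions carry information, so the uniform flanks are removed unless `padding` asks for some of them to be retained.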

grelu.interpret.motifs.scan_sequences(seqs: List[str], motifs: pymemesuite.common.Motif | List[pymemesuite.common.Motif] | str, names: List[str] | None = None, bg=None, seq_ids: List[str] | None = None, pthresh: float = 0.001, rc: bool = True)

Scan a DNA sequence using motifs

Parameters:
  • seqs – A list of DNA sequences as strings

  • motifs – A list of pymemesuite.common.Motif objects, or the path to a MEME file.

  • names – A list of motif names to read from the MEME file. If None, all motifs in the file will be read.

  • bg – A background distribution for motif p-value calculations. Only needed if a list of Motif objects is supplied instead of a MEME file.

  • seq_ids – Optional list of IDs for sequences

  • pthresh – p-value cutoff for binding sites

  • rc – If True, both the sequence and its reverse complement will be scanned. If False, only the given sequence will be scanned.

Returns:

pd.DataFrame containing columns ‘motif’, ‘sequence’, ‘start’, ‘end’, ‘strand’, ‘score’ and ‘pval’.
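As a rough illustration of what motif scanning involves, the sketch below slides each motif along each sequence and scores every window on both strands. It is a simplified stand-in, not the library's method: hits are filtered by a raw log-odds score (`min_score`, a hypothetical parameter) rather than by a p-value, the background is assumed uniform, and reverse-strand hits are reported at the coordinates of the forward-strand window. Plain dictionaries are used in place of a DataFrame, with keys that mirror the documented columns except ‘pval’.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds(probs, bg=0.25, pseudocount=0.01):
    """Convert base probabilities to log2-odds scores against a uniform background."""
    return [
        [math.log2((p + pseudocount) / (1 + 4 * pseudocount) / bg) for p in row]
        for row in probs
    ]


def scan(seqs, motifs, seq_ids=None, min_score=0.0, rc=True):
    """Score every window of every sequence with every motif.

    Simplification: hits are kept if their raw score is at least min_score;
    no p-values are computed.
    """
    seq_ids = seq_ids or [str(i) for i in range(len(seqs))]
    hits = []
    for seq_id, seq in zip(seq_ids, seqs):
        for name, probs in motifs.items():
            scores = log_odds(probs)
            width = len(scores)
            for start in range(len(seq) - width + 1):
                window = seq[start:start + width]
                strands = [("+", window)]
                if rc:
                    strands.append(("-", reverse_complement(window)))
                for strand, site in strands:
                    score = sum(scores[i][BASES.index(b)] for i, b in enumerate(site))
                    if score >= min_score:
                        hits.append({"motif": name, "sequence": seq_id, "start": start,
                                     "end": start + width, "strand": strand, "score": score})
    return hits


# Hypothetical toy motif that strongly prefers the site GATA.
GATA = [
    [0.01, 0.01, 0.97, 0.01],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.97],
    [0.97, 0.01, 0.01, 0.01],
]

hits = scan(["CCGATACC", "CCTATCCC"], {"GATA": GATA}, seq_ids=["fwd", "rev"], min_score=5.0)
```

The second sequence contains TATC, the reverse complement of GATA, so it is only reported when `rc` is True.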

grelu.interpret.motifs.marginalize_patterns(model: Callable, patterns: str | List[str], seqs: pandas.DataFrame | List[str] | numpy.ndarray, genome: str | None = None, devices: str | int | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 0, seed: int | None = None, prediction_transform: Callable | None = None, rc: bool = False, compare_func: str | Callable | None = None) → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]

Runs a marginalization experiment.

Given a model, one or more patterns (short sequences) to insert, and a set of background sequences, get the model's predictions before and after inserting each pattern into the (optionally shuffled) background sequences.

Parameters:
  • model – trained model

  • patterns – a sequence or list of sequences to insert

  • seqs – background sequences

  • genome – Name of the genome to use if genomic intervals are supplied

  • devices – Device(s) on which to run inference: 'cpu', or the index (or list of indices) of the device(s) to use

  • num_workers – Number of workers for inference

  • batch_size – Batch size for inference

  • n_shuffles – Number of times to shuffle the background sequences

  • seed – Random seed

  • prediction_transform – A module to transform the model output

  • rc – If True, augment by reverse complementation

  • compare_func – Function to compare the predictions with and without the pattern. Options are “divide” or “subtract”. If not provided, the predictions for the shuffled sequences and each pattern will be returned.

Returns:

  • preds_before – The predictions from the background sequences.

  • preds_after – The predictions after inserting the pattern into the background sequences.
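The logic of a marginalization experiment can be sketched without a trained model. The example below is illustrative only and makes several assumptions that are not confirmed by this page: the pattern overwrites the centre of each background sequence, comparisons are computed as "after" relative to "before", and shuffling is skipped (as with n_shuffles = 0). `toy_model`, `insert_pattern` and `marginalize` are hypothetical names.

```python
def insert_pattern(seq, pattern):
    """Overwrite the centre of seq with pattern, keeping the length unchanged."""
    start = (len(seq) - len(pattern)) // 2
    return seq[:start] + pattern + seq[start + len(pattern):]


def toy_model(seq):
    """Hypothetical stand-in for a trained model: 1 plus the number of GATA sites."""
    return 1.0 + sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3))


def marginalize(model, patterns, seqs, compare_func=None):
    """Predict on each background before and after inserting each pattern.

    Returns (preds_before, preds_after) if compare_func is None; otherwise one
    nested list (backgrounds x patterns) comparing after against before.
    """
    preds_before = [model(s) for s in seqs]
    preds_after = [[model(insert_pattern(s, p)) for p in patterns] for s in seqs]
    if compare_func is None:
        return preds_before, preds_after
    if compare_func == "subtract":
        return [[a - b for a in row] for row, b in zip(preds_after, preds_before)]
    if compare_func == "divide":
        return [[a / b for a in row] for row, b in zip(preds_after, preds_before)]
    raise ValueError("compare_func must be None, 'subtract' or 'divide'")


backgrounds = ["CCCCCCCCCC", "TTTTTTTTTT"]
patterns = ["GATA", "CCCC"]

before, after = marginalize(toy_model, patterns, backgrounds)
difference = marginalize(toy_model, patterns, backgrounds, compare_func="subtract")
ratio = marginalize(toy_model, patterns, backgrounds, compare_func="divide")
```

Averaging the effect of a pattern over many backgrounds is what makes the result "marginal": it estimates the contribution of the pattern independently of any single sequence context.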

grelu.interpret.motifs.compare_motifs(ref_seq: str | pandas.DataFrame, motifs: pymemesuite.common.Motif | List[pymemesuite.common.Motif] | str, alt_seq: str | None = None, alt_allele: str | None = None, pos: int | None = None, names: List[str] | None = None, bg=None, pthresh: float = 0.001, rc: bool = True) → pandas.DataFrame

Scan sequences containing the reference and alternate alleles to identify affected motifs.

Parameters:
  • ref_seq – The reference sequence as a string

  • motifs – A list of pymemesuite.common.Motif objects, or the path to a MEME file.

  • alt_seq – The alternate sequence as a string

  • ref_allele – The reference allele as a string. Only used if alt_seq is not supplied.

  • alt_allele – The alternate allele as a string. Only needed if alt_seq is not supplied.

  • pos – The position at which to substitute the alternate allele. Only needed if alt_seq is not supplied.

  • names – A list of motif names to read from the MEME file. If None, all motifs in the file will be read.

  • bg – A background distribution for motif p-value calculations. Only needed if a list of Motif objects is supplied instead of a MEME file.

  • pthresh – p-value cutoff for binding sites

  • rc – If True, both the sequence and its reverse complement will be scanned. If False, only the given sequence will be scanned.
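The relationship between alt_seq, alt_allele and pos, and the idea of comparing motif matches between alleles, can be sketched as follows. This is a simplified illustration and not the library's implementation: it assumes pos is a 0-based index into ref_seq, uses exact string matching as a stand-in for motif scanning, and reports results in a hypothetical format with a ‘change’ field. All function names are hypothetical.

```python
def substitute(ref_seq, alt_allele, pos):
    """Return ref_seq with the base(s) starting at pos replaced by alt_allele.

    Assumption: pos is a 0-based index into ref_seq.
    """
    return ref_seq[:pos] + alt_allele + ref_seq[pos + len(alt_allele):]


def find_sites(seq, site):
    """Start positions of exact occurrences of site in seq (stand-in for scanning)."""
    return {i for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site}


def compare_sites(ref_seq, alt_seq, sites):
    """List sites that are present in only one of the two sequences."""
    rows = []
    for name, site in sites.items():
        ref_hits = find_sites(ref_seq, site)
        alt_hits = find_sites(alt_seq, site)
        for start in sorted(ref_hits - alt_hits):
            rows.append({"motif": name, "start": start, "change": "lost"})
        for start in sorted(alt_hits - ref_hits):
            rows.append({"motif": name, "start": start, "change": "gained"})
    return rows


ref_seq = "CCGATACC"
# Build the alternate sequence from an allele and a position (A to C at index 3).
alt_seq = substitute(ref_seq, alt_allele="C", pos=3)
rows = compare_sites(ref_seq, alt_seq, {"GATA": "GATA", "GCTA": "GCTA"})
```

Supplying alt_seq directly makes alt_allele and pos unnecessary, which matches the "only needed if alt_seq is not supplied" notes in the parameter list.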