grelu.interpret.score

Functions related to scoring the importance of individual DNA bases.

Functions

ISM_predict(→ Union[numpy.array, pandas.DataFrame])

Predicts the importance scores of each nucleotide position in a given DNA sequence using the In Silico Mutagenesis (ISM) method.

get_attributions(→ numpy.array)

Get per-nucleotide importance scores for sequences using Captum.

run_modisco(model, seqs[, genome, ...])

Run TF-Modisco to get relevant motifs for a set of inputs, and optionally score the motifs against a reference set of motifs using TOMTOM.

get_attention_scores(→ numpy.ndarray)

Get the attention scores from a model's transformer layers, for a given input sequence.

Module Contents

grelu.interpret.score.ISM_predict(seqs: pandas.DataFrame | numpy.ndarray | str, model: Callable, genome: str | None = None, prediction_transform: Callable | None = None, start_pos: int = 0, end_pos: int | None = None, compare_func: str | Callable | None = None, devices: str | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, return_df: bool = True) → numpy.array | pandas.DataFrame

Predicts the importance scores of each nucleotide position in a given DNA sequence using the In Silico Mutagenesis (ISM) method.

Parameters:
  • seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.

  • genome – Name of the genome to use if a genomic interval is supplied.

  • model – A pre-trained deep learning model

  • prediction_transform – A module to transform the model output

  • start_pos – Index of the position to start applying ISM

  • end_pos – Index of the position to stop applying ISM

  • compare_func – A function or name of a function to compare the predictions for mutated and reference sequences. Allowed names are “divide”, “subtract” and “log2FC”. If not provided, the raw predictions for both mutant and reference sequences will be returned.

  • devices – Indices of the devices on which to run inference

  • num_workers – Number of workers for inference

  • batch_size – Batch size for model inference

  • return_df – If True, the ISM results will be returned as a dataframe. Otherwise, they will be returned as a Numpy array.

Returns:

A numpy array of the predicted scores for each nucleotide position (if return_df = False) or a pandas dataframe with A, C, G, and T as row labels and the bases at each position of the sequence as column labels (if return_df = True).
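The ISM procedure and the three named comparison functions can be illustrated with a small pure-Python sketch. This is not the gReLU implementation: `toy_model`, `toy_ism` and `COMPARE_FUNCS` are hypothetical names, the "model" is a motif counter rather than a neural network, and the sketch assumes that each comparison is applied to the mutant prediction relative to the reference prediction.

```python
import math

BASES = "ACGT"

# Comparison functions mirroring the three allowed names.
# Assumption: each compares the mutant prediction against the reference.
COMPARE_FUNCS = {
    "subtract": lambda mut, ref: mut - ref,
    "divide": lambda mut, ref: mut / ref,
    "log2FC": lambda mut, ref: math.log2(mut / ref),
}


def toy_model(seq):
    """Hypothetical stand-in for a trained model: 1 plus the number of 'GATA' matches."""
    return 1.0 + sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3))


def toy_ism(seq, model, start_pos=0, end_pos=None, compare_func="subtract"):
    """Score every possible base at every position in [start_pos, end_pos)."""
    end_pos = len(seq) if end_pos is None else end_pos
    compare = COMPARE_FUNCS[compare_func]
    ref_pred = model(seq)
    scores = {base: [] for base in BASES}  # one row per substituted base
    for pos in range(start_pos, end_pos):
        for base in BASES:
            mutant = seq[:pos] + base + seq[pos + 1:]
            scores[base].append(compare(model(mutant), ref_pred))
    return scores


scores = toy_ism("CCGATACC", toy_model)
for base in BASES:
    print(base, scores[base])
```

Each row holds the effect of substituting that base at every position, which mirrors the A/C/G/T row layout described for the dataframe output. Substituting the reference base leaves the prediction unchanged, so those entries are 0 under "subtract" and "log2FC" and 1 under "divide".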

grelu.interpret.score.get_attributions(model, seqs: pandas.DataFrame | numpy.array | List[str], genome: str | None = None, prediction_transform: Callable | None = None, device: str | int = 'cpu', method: str = 'deepshap', hypothetical: bool = False, n_shuffles: int = 20, seed=None, **kwargs) → numpy.array

Get per-nucleotide importance scores for sequences using Captum.

Parameters:
  • model – A trained deep learning model

  • seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.

  • genome – Name of the genome to use if a genomic interval is supplied.

  • prediction_transform – A module to transform the model output

  • device – Index of the device to use for inference

  • method – One of “deepshap”, “saliency”, “inputxgradient” or “integratedgradients”

  • hypothetical – Whether to calculate hypothetical importance scores. Set this to True to obtain input for TF-Modisco, and to False otherwise.

  • n_shuffles – Number of times to dinucleotide-shuffle each sequence

  • seed – Random seed

  • **kwargs – Additional arguments to pass to tangermeme.deep_lift_shap.deep_lift_shap

Returns:

Per-nucleotide importance scores as a NumPy array of shape (B, 4, L).
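The relationship between hypothetical and observed scores can be sketched for one sequence (a single 4 x L slice of the returned array). The sketch assumes the convention used by TF-MoDISco, in which the observed attribution is the hypothetical score of the base actually present at each position. The helper names and the score values are made up for illustration, and the base order A, C, G, T is also an assumption.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a 4 x L nested list (rows ordered A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def project_to_observed(hyp_scores, seq):
    """Keep only the hypothetical score of the base actually present at each position."""
    encoded = one_hot(seq)
    return [
        [h * x for h, x in zip(hyp_row, enc_row)]
        for hyp_row, enc_row in zip(hyp_scores, encoded)
    ]


def per_position(scores):
    """Collapse a 4 x L score matrix to one value per position by summing over bases."""
    return [sum(column) for column in zip(*scores)]


# Made-up hypothetical scores for a single sequence of length 3 (shape 4 x L).
hyp = [
    [0.5, -0.1, 0.0],   # A
    [0.1, 0.7, -0.2],   # C
    [-0.3, 0.2, 0.9],   # G
    [0.0, -0.4, 0.1],   # T
]
actual = project_to_observed(hyp, "ACG")
print(per_position(actual))  # [0.5, 0.7, 0.9]
```

Summing over the base axis gives a single track of length L, which is a convenient form for plotting or for ranking positions.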

grelu.interpret.score.run_modisco(model, seqs: pandas.DataFrame | numpy.array | List[str], genome: str | None = None, prediction_transform: Callable | None = None, window: int = None, meme_file: str = None, out_dir: str = 'outputs', devices: str | int = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 10, seed=None, method: str = 'deepshap', **kwargs)

Run TF-Modisco to get relevant motifs for a set of inputs, and optionally score the motifs against a reference set of motifs using TOMTOM.

Parameters:
  • model – A trained deep learning model

  • seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.

  • genome – Name of the genome to use. Only used if genomic intervals are provided.

  • prediction_transform – A module to transform the model output

  • window – Sequence length over which to consider attributions

  • meme_file – Path to a MEME file containing reference motifs for TOMTOM.

  • out_dir – Output directory

  • devices – Indices of devices to use for model inference

  • num_workers – Number of workers to use for model inference

  • batch_size – Batch size to use for model inference

  • n_shuffles – Number of times to shuffle the background sequences for deepshap.

  • seed – Random seed

  • method – Either “deepshap”, “saliency” or “ism”.

  • **kwargs – Additional arguments to pass to TF-Modisco.

Raises:

NotImplementedError – if the method is not one of “deepshap”, “saliency” or “ism”
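The effect of restricting attributions to a window can be sketched as a crop of a 4 x L score matrix. This sketch assumes that the window is centered on the sequence, which the parameter description above does not state. `center_window` is a hypothetical helper and not part of gReLU.

```python
def center_window(scores, window=None):
    """Crop a 4 x L score matrix to `window` central positions (all positions if None)."""
    seq_len = len(scores[0])
    if window is None:
        return [list(row) for row in scores]
    if not 0 < window <= seq_len:
        raise ValueError(f"window must be between 1 and {seq_len}, got {window}")
    # Assumption: the window is centered; an odd remainder goes to the right flank.
    start = (seq_len - window) // 2
    return [row[start:start + window] for row in scores]


# Toy scores in which each value equals its position index.
scores = [[float(pos) for pos in range(10)] for _ in range(4)]
print(center_window(scores, window=4)[0])  # [3.0, 4.0, 5.0, 6.0]
```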

grelu.interpret.score.get_attention_scores(model, seqs: pandas.DataFrame | str | numpy.ndarray | torch.Tensor, block_idx: int | None = None, genome: str | None = None) → numpy.ndarray

Get the attention scores from a model’s transformer layers, for a given input sequence.

Parameters:
  • model – A trained deep learning model

  • seqs – Input sequences as genomic intervals, strings, or in index or one-hot encoded format.

  • block_idx – Index of the transformer layer to use, ranging from 0 to n_transformers-1. If None, attention scores from all transformer layers will be returned.

  • genome – Name of the genome to use if genomic intervals are supplied.

Returns:

Numpy array of shape (Layers, Heads, L, L) if block_idx is None or (Heads, L, L) otherwise.
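The two output shapes can be illustrated with nested lists standing in for the returned array. The helpers below are hypothetical and use uniform placeholder weights. They only show how selecting one block removes the leading axis, and how a (Heads, L, L) block might then be averaged over heads for plotting.

```python
def make_attention(n_layers, n_heads, seq_len):
    """Build uniform attention weights as a nested list of shape (Layers, Heads, L, L)."""
    row = [1.0 / seq_len] * seq_len
    return [
        [[list(row) for _ in range(seq_len)] for _ in range(n_heads)]
        for _ in range(n_layers)
    ]


def shape(nested):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def mean_over_heads(block):
    """Average a (Heads, L, L) block over its heads, giving an (L, L) matrix."""
    n_heads = len(block)
    seq_len = len(block[0])
    return [
        [sum(head[i][j] for head in block) / n_heads for j in range(seq_len)]
        for i in range(seq_len)
    ]


all_blocks = make_attention(n_layers=2, n_heads=3, seq_len=4)
one_block = all_blocks[1]  # analogous to requesting block_idx=1
print(shape(all_blocks))                  # (2, 3, 4, 4)
print(shape(one_block))                   # (3, 4, 4)
print(shape(mean_over_heads(one_block)))  # (4, 4)
```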