grelu.interpret.score
Functions related to scoring the importance of individual DNA bases.
Functions
| ISM_predict | Predicts the importance scores of each nucleotide position in a given DNA sequence using the In Silico Mutagenesis (ISM) method. |
| get_attributions | Get per-nucleotide importance scores for sequences using Captum. |
| run_modisco | Run TF-Modisco to get relevant motifs for a set of inputs, and optionally score the motifs against a reference set of motifs using TOMTOM. |
| get_attention_scores | Get the attention scores from a model's transformer layers, for a given input sequence. |
Module Contents
- grelu.interpret.score.ISM_predict(seqs: pandas.DataFrame | numpy.ndarray | str, model: Callable, genome: str | None = None, prediction_transform: Callable | None = None, start_pos: int = 0, end_pos: int | None = None, compare_func: str | Callable | None = None, devices: str | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, return_df: bool = True) -> numpy.array | pandas.DataFrame
Predicts the importance scores of each nucleotide position in a given DNA sequence using the In Silico Mutagenesis (ISM) method.
- Parameters:
seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.
genome – Name of the genome to use if a genomic interval is supplied.
model – A pre-trained deep learning model
prediction_transform – A module to transform the model output
start_pos – Index of the position to start applying ISM
end_pos – Index of the position to stop applying ISM
compare_func – A function or name of a function to compare the predictions for mutated and reference sequences. Allowed names are “divide”, “subtract” and “log2FC”. If not provided, the raw predictions for both mutant and reference sequences will be returned.
devices – Device name (such as 'cpu') or list of device indices on which to run inference
num_workers – Number of workers for inference
batch_size – Batch size for model inference
return_df – If True, the ISM results will be returned as a dataframe. Otherwise, they will be returned as a Numpy array.
- Returns:
A numpy array of the predicted scores for each nucleotide position (if return_df = False) or a pandas dataframe with A, C, G, and T as row labels and the bases at each position of the sequence as column labels (if return_df = True).
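The idea behind ISM can be illustrated with a small, self-contained sketch. This is not gReLU's implementation: `toy_model`, `ism` and the `COMPARE` table are hypothetical names, the comparison is assumed to be mutant relative to reference, and `end_pos` is treated as exclusive here. The result mirrors the dataframe layout described above, with one row per base (A, C, G, T) and one column per position.

```python
import math

BASES = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained model: 1 + number of 'GATA' matches."""
    return 1.0 + sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3))


# Assumed semantics of the three named comparison functions (mutant vs. reference).
COMPARE = {
    "subtract": lambda mut, ref: mut - ref,
    "divide": lambda mut, ref: mut / ref,
    "log2FC": lambda mut, ref: math.log2(mut / ref),
}


def ism(seq, model, start_pos=0, end_pos=None, compare_func="log2FC"):
    """Score every single-base substitution in seq[start_pos:end_pos]."""
    end_pos = len(seq) if end_pos is None else end_pos  # exclusive in this sketch
    compare = COMPARE[compare_func]
    ref = model(seq)
    scores = {}
    for base in BASES:
        row = []
        for pos in range(start_pos, end_pos):
            mutant = seq[:pos] + base + seq[pos + 1:]
            row.append(compare(model(mutant), ref))
        scores[base] = row
    return scores


seq = "CCGATACC"
scores = ism(seq, toy_model, compare_func="subtract")
for base in BASES:
    print(base, scores[base])
```

In this toy case, only substitutions that break the `GATA` match at positions 2 to 5 change the prediction, so those entries are the only non-zero scores.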
- grelu.interpret.score.get_attributions(model, seqs: pandas.DataFrame | numpy.array | List[str], genome: str | None = None, prediction_transform: Callable | None = None, device: str | int = 'cpu', method: str = 'deepshap', hypothetical: bool = False, n_shuffles: int = 20, seed=None, **kwargs) -> numpy.array
Get per-nucleotide importance scores for sequences using Captum.
- Parameters:
model – A trained deep learning model
seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.
genome – Name of the genome to use if a genomic interval is supplied.
prediction_transform – A module to transform the model output
device – Device name (such as 'cpu') or index of the device to use for inference
method – One of “deepshap”, “saliency”, “inputxgradient” or “integratedgradients”
hypothetical – Whether to calculate hypothetical importance scores. Set this to True to obtain input for TF-Modisco, and to False otherwise.
n_shuffles – Number of times to dinucleotide-shuffle each sequence
seed – Random seed
**kwargs – Additional arguments to pass to tangermeme.deep_lift_shap.deep_lift_shap
- Returns:
Per-nucleotide importance scores as numpy array of shape (B, 4, L).
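To make the (B, 4, L) output shape and two of the gradient-based methods more concrete, here is a minimal sketch on a hypothetical linear model. It does not call gReLU or Captum: the weight matrix `W`, the helper names and the A, C, G, T row order are all assumptions made for illustration. For a linear model the gradient with respect to the input is simply `W`, so "saliency" returns `W` and "inputxgradient" multiplies it elementwise by the one-hot input.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a 4 x L nested list (rows assumed ordered A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


# Hypothetical linear model: score = sum over (base, position) of W * x.
W = [
    [0.5, -1.0, 0.0],   # A
    [0.0, 2.0, 0.0],    # C
    [-0.5, 0.0, 1.5],   # G
    [0.0, 0.0, -2.0],   # T
]


def saliency(x):
    """Gradient of the linear score w.r.t. the input; for a linear model it is W for any x."""
    return [row[:] for row in W]


def input_x_gradient(x):
    """Elementwise product of the input and the gradient."""
    return [[w * v for w, v in zip(w_row, x_row)] for w_row, x_row in zip(W, x)]


seqs = ["ACG", "TCT"]
attrs = [input_x_gradient(one_hot(s)) for s in seqs]  # nested list of shape (B, 4, L)

print(len(attrs), len(attrs[0]), len(attrs[0][0]))
print(sum(map(sum, attrs[0])))  # equals the linear model's score for "ACG"
```

With input-times-gradient, only the entry for the observed base at each position can be non-zero, whereas the saliency array carries a value for all four bases at every position.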
- grelu.interpret.score.run_modisco(model, seqs: pandas.DataFrame | numpy.array | List[str], genome: str | None = None, prediction_transform: Callable | None = None, window: int = None, meme_file: str = None, out_dir: str = 'outputs', devices: str | int = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 10, seed=None, method: str = 'deepshap', **kwargs)
Run TF-Modisco to get relevant motifs for a set of inputs, and optionally score the motifs against a reference set of motifs using TOMTOM.
- Parameters:
model – A trained deep learning model
seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.
genome – Name of the genome to use. Only used if genomic intervals are provided.
prediction_transform – A module to transform the model output
window – Sequence length over which to consider attributions
meme_file – Path to a MEME file containing reference motifs for TOMTOM.
out_dir – Output directory
devices – Indices of devices to use for model inference
num_workers – Number of workers to use for model inference
batch_size – Batch size to use for model inference
n_shuffles – Number of times to shuffle the background sequences for deepshap.
seed – Random seed
method – Either “deepshap”, “saliency” or “ism”.
**kwargs – Additional arguments to pass to TF-Modisco.
- Raises:
NotImplementedError – if the method is not one of “deepshap”, “saliency” or “ism”
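The `window` parameter restricts the sequence length over which attributions are considered. The sketch below shows one plausible reading, in which the window is centered on the sequence. The centering is an assumption rather than documented behavior, and `center_window` and `crop` are hypothetical helpers, not part of gReLU.

```python
def center_window(length, window=None):
    """Return (start, end) of a centered window, or the full range if window is None."""
    if window is None:
        return 0, length
    if not 0 < window <= length:
        raise ValueError("window must be between 1 and the sequence length")
    start = (length - window) // 2  # assumption: the window is centered
    return start, start + window


def crop(attrs, window=None):
    """Crop a (B, 4, L) nested list of attributions along its last axis."""
    length = len(attrs[0][0])
    start, end = center_window(length, window)
    return [[row[start:end] for row in seq_attrs] for seq_attrs in attrs]


# One sequence, four bases, ten positions of dummy attribution values.
attrs = [[[float(pos) for pos in range(10)] for _ in range(4)]]
cropped = crop(attrs, window=4)
print(center_window(10, 4))
print(cropped[0][0])
```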
- grelu.interpret.score.get_attention_scores(model, seqs: pandas.DataFrame | str | numpy.ndarray | torch.Tensor, block_idx: int | None = None, genome: str | None = None) -> numpy.ndarray
Get the attention scores from a model’s transformer layers, for a given input sequence.
- Parameters:
model – A trained deep learning model
seqs – Input sequences as genomic intervals, strings, or in index or one-hot encoded format.
block_idx – Index of the transformer layer to use, ranging from 0 to n_transformers-1. If None, attention scores from all transformer layers will be returned.
genome – Name of the genome to use if genomic intervals are supplied.
- Returns:
Numpy array of shape (Layers, Heads, L, L) if block_idx is None or (Heads, L, L) otherwise.
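The effect of `block_idx` on the output shape can be sketched with nested lists standing in for the returned array. Everything here is illustrative: `toy_attention`, `select_block` and `shape` are hypothetical helpers, and the weights are synthetic softmax-normalized values rather than scores taken from a real model.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def toy_attention(n_layers, n_heads, length):
    """Build a (Layers, Heads, L, L) nested list of synthetic attention weights."""
    return [
        [
            [softmax([-(abs(i - j) * (layer + head + 1)) for j in range(length)])
             for i in range(length)]
            for head in range(n_heads)
        ]
        for layer in range(n_layers)
    ]


def select_block(scores, block_idx=None):
    """Return all layers if block_idx is None, otherwise a single layer."""
    return scores if block_idx is None else scores[block_idx]


def shape(nested):
    """Shape of a regular nested list, as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


scores = toy_attention(n_layers=2, n_heads=3, length=5)
print(shape(select_block(scores)))     # (2, 3, 5, 5)
print(shape(select_block(scores, 1)))  # (3, 5, 5)
```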