grelu.interpret.score
grelu.interpret.score contains functions related to scoring the importance of individual DNA bases or regions using a trained model.
gReLU uses Captum for several attribution methods, including InputXGradient, IntegratedGradients, and Saliency.
Functions
| ISM_predict | Predicts the importance scores of each nucleotide position in a given DNA sequence |
| get_attributions | Get per-nucleotide importance scores for sequences using Captum. |
| get_attention_scores | Get the attention scores from a model's transformer layers, for a given input sequence. |
Module Contents
- grelu.interpret.score.ISM_predict(seqs: pandas.DataFrame | numpy.ndarray | str | List[str], model: Callable, genome: str | None = None, prediction_transform: Callable | None = None, start_pos: int = 0, end_pos: int | None = None, compare_func: str | Callable | None = None, devices: str | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, return_df: bool = True) → numpy.array | pandas.DataFrame
Predicts the importance scores of each nucleotide position in a given DNA sequence using the In Silico Mutagenesis (ISM) method.
- Parameters:
seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.
genome – Name of the genome to use if a genomic interval is supplied.
model – A pre-trained deep learning model
prediction_transform – A module to transform the model output
start_pos – Index of the position to start applying ISM
end_pos – Index of the position to stop applying ISM
compare_func – A function or name of a function to compare the predictions for mutated and reference sequences. Allowed names are “divide”, “subtract” and “log2FC”. If not provided, the raw predictions for both mutant and reference sequences will be returned.
devices – Device name or indices of the devices on which to run inference
num_workers – Number of workers for inference
batch_size – Batch size for model inference
return_df – If True, the ISM results will be returned as a dataframe. Otherwise, they will be returned as a Numpy array.
- Returns:
A numpy array of the predicted scores for each nucleotide position (if return_df = False) or a pandas dataframe with A, C, G, and T as row labels and the bases at each position of the sequence as column labels (if return_df = True).
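The ISM procedure described above can be illustrated with a minimal, pure-Python sketch. This is not gReLU's implementation: `toy_model`, `ism_sketch`, and the `COMPARE` table are hypothetical names introduced only for illustration, and the toy model simply counts a motif instead of running a neural network. The sketch mutates each position in the range from `start_pos` to `end_pos` to every base and compares the mutant prediction with the reference prediction using one of the comparison names listed for `compare_func`. Treating `end_pos` as exclusive is an assumption of this sketch.

```python
import math

BASES = "ACGT"


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained model: 1 + number of 'GATA' motifs."""
    return 1.0 + sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3))


# Comparison functions keyed by the names allowed for compare_func.
COMPARE = {
    "subtract": lambda mut, ref: mut - ref,
    "divide": lambda mut, ref: mut / ref,
    "log2FC": lambda mut, ref: math.log2(mut / ref),
}


def ism_sketch(seq, model, start_pos=0, end_pos=None, compare_func=None):
    """Return {base: [score per mutated position]} for one sequence.

    With compare_func=None the raw mutant predictions are returned; the entry
    whose base equals the original base is then the reference prediction.
    """
    end_pos = len(seq) if end_pos is None else end_pos  # exclusive (assumption)
    ref = model(seq)
    compare = COMPARE[compare_func] if isinstance(compare_func, str) else compare_func
    scores = {}
    for base in BASES:
        row = []
        for pos in range(start_pos, end_pos):
            mutant = seq[:pos] + base + seq[pos + 1:]
            pred = model(mutant)
            row.append(pred if compare is None else compare(pred, ref))
        scores[base] = row
    return scores


ism = ism_sketch("CCGATACC", toy_model, compare_func="log2FC")
for base in BASES:
    # One row per substituted base, one column per sequence position.
    print(base, ism[base])
```

The row-per-base, column-per-position layout mirrors the dataframe described under Returns. Mutations that disrupt the motif give negative log2 fold changes, and substitutions that leave the sequence unchanged give zero.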
- grelu.interpret.score.get_attributions(model, seqs: pandas.DataFrame | numpy.array | List[str], genome: str | None = None, prediction_transform: Callable | None = None, device: str | int = 'cpu', method: str = 'deepshap', correct_grad: bool = False, hypothetical: bool = False, n_shuffles: int = 20, seed=None, **kwargs) → numpy.array
Get per-nucleotide importance scores for sequences using Captum.
- Parameters:
model – A trained deep learning model
seqs – Input DNA sequences as genomic intervals, strings, or integer-encoded form.
genome – Name of the genome to use if a genomic interval is supplied.
prediction_transform – A module to transform the model output
device – Name or index of the device to use for inference
method – One of “deepshap”, “saliency”, “inputxgradient” or “integratedgradients”
correct_grad – If True, gradients will be corrected using the method of Majdandzic et al. (PMID: 37161475). Only used with method = “saliency”.
hypothetical – Only used with method = “deepshap”. If True, the function will return hypothetical importance scores, which can be used as input for TF-MoDISco.
n_shuffles – Number of times to dinucleotide-shuffle each sequence
seed – Random seed
**kwargs – Additional arguments to pass to tangermeme.deep_lift_shap.deep_lift_shap
- Returns:
Per-nucleotide importance scores as numpy array of shape (B, 4, L).
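To make the (B, 4, L) output layout concrete, here is a minimal sketch of input × gradient attribution for a toy linear model, written in pure Python. It does not call gReLU or Captum: `one_hot`, `input_x_gradient_linear`, and `weights` are hypothetical names, and the row order A, C, G, T is an assumption of this sketch. For a linear model f(x) = Σ w·x, the gradient with respect to the input is just w, so the attribution is x·w element-wise.

```python
BASES = "ACGT"  # assumed row order of the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as a 4 x L nested list (rows A, C, G, T)."""
    return [[1.0 if s == b else 0.0 for s in seq] for b in BASES]


def input_x_gradient_linear(seqs, weights):
    """Input x gradient for a toy linear model f(x) = sum(weights * x).

    The gradient of a linear model with respect to x is `weights`, so the
    attribution at each (base, position) is x * weights. Returns a nested
    list of shape (B, 4, L).
    """
    attributions = []
    for seq in seqs:
        x = one_hot(seq)
        attributions.append(
            [[x[b][i] * weights[b][i] for i in range(len(seq))] for b in range(4)]
        )
    return attributions


# Hypothetical 4 x 3 weight matrix for sequences of length 3.
weights = [
    [0.5, -1.0, 2.0],   # A
    [1.5, 0.0, -0.5],   # C
    [0.25, 3.0, 1.0],   # G
    [-2.0, 0.75, 0.1],  # T
]
attrs = input_x_gradient_linear(["ACG", "TTT"], weights)
print(len(attrs), len(attrs[0]), len(attrs[0][0]))  # B, 4, L
```

In this toy case the scores are non-zero only at the base actually present at each position, because the one-hot input is zero everywhere else.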
- grelu.interpret.score.get_attention_scores(model, seqs: pandas.DataFrame | str | numpy.ndarray | torch.Tensor, block_idx: int | None = None, genome: str | None = None) → numpy.ndarray
Get the attention scores from a model’s transformer layers, for a given input sequence.
- Parameters:
model – A trained deep learning model
seqs – Input sequences as genomic intervals, strings, or in index or one-hot encoded format.
block_idx – Index of the transformer layer to use, ranging from 0 to n_transformers-1. If None, attention scores from all transformer layers will be returned.
genome – Name of the genome to use if genomic intervals are supplied.
- Returns:
Numpy array of shape (Layers, Heads, L, L) if block_idx is None or (Heads, L, L) otherwise.
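The two return shapes can be illustrated with a small pure-Python sketch. It assumes standard scaled dot-product attention with softmax-normalized weights; whether gReLU returns pre- or post-softmax values is not stated here. `attention_weights`, `select_block`, and `shape` are hypothetical helpers, and the toy "layers" and "heads" only rescale the same queries.

```python
import math


def attention_weights(q, k):
    """softmax(q k^T / sqrt(d)) for one head; q and k are L x d nested lists."""
    d = len(q[0])
    weights = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def select_block(scores, block_idx=None):
    """Mimic the block_idx semantics: all layers if None, else a single layer."""
    return scores if block_idx is None else scores[block_idx]


def shape(nested):
    """Shape of a regular nested list, as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy input: L = 3 positions with d = 2 features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two toy layers with two toy heads each; each head only rescales the queries.
scores = [
    [
        attention_weights([[v * (layer + head + 1) for v in t] for t in tokens], tokens)
        for head in range(2)
    ]
    for layer in range(2)
]

all_layers = select_block(scores)               # (Layers, Heads, L, L)
one_layer = select_block(scores, block_idx=1)   # (Heads, L, L)
print(shape(all_layers), shape(one_layer))
```

Under the softmax assumption, each row of an L × L matrix sums to 1 and describes how strongly one position attends to every other position.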