grelu.variant
This module provides functions to filter and process genetic variants.
Functions
| filter_variants | Filter variants by length. |
| variants_to_intervals | Return genomic intervals centered around each variant. |
| variant_to_seqs | |
| check_reference | Check that the given reference alleles match those present in the reference genome. |
| predict_variant_effects | Predict the effects of variants based on a trained model. |
| marginalize_variants | Runs a marginalization experiment. |
Module Contents
- grelu.variant.filter_variants(variants, standard_bases: bool = True, max_insert_len: int | None = 0, max_del_len: int | None = 0, inplace: bool = False, null_string: str = '-') → pandas.DataFrame | None
Filter variants by length.
- Parameters:
variants – A DataFrame of genetic variants. It should contain columns “ref” for the reference allele sequence and “alt” for the alternate allele sequence.
standard_bases – If True, drop variants whose alleles include nonstandard bases (anything other than A, C, G, or T).
max_insert_len – Maximum insertion length to allow.
max_del_len – Maximum deletion length to allow.
inplace – If False, return a copy. Otherwise, do operation in place and return None.
null_string – String used to indicate the absence of a base.
- Returns:
A DataFrame containing only the variants that pass the filters (if inplace=False).
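The filtering rules above can be illustrated with a minimal pure-Python sketch. It is not gReLU's implementation: the helper names `allele_len` and `keep_variant` are hypothetical, and it assumes that an allele equal to `null_string` has length zero, that insertion and deletion lengths are the difference between the alt and ref allele lengths, and that `None` disables a length limit.

```python
def allele_len(allele, null_string="-"):
    """Length of an allele, treating the null string as an empty allele (assumption)."""
    return 0 if allele == null_string else len(allele)


def keep_variant(ref, alt, standard_bases=True, max_insert_len=0,
                 max_del_len=0, null_string="-"):
    """Return True if a variant passes the base and length filters (hypothetical helper)."""
    if standard_bases:
        for allele in (ref, alt):
            # Any character outside A, C, G, T counts as nonstandard.
            if allele != null_string and set(allele.upper()) - set("ACGT"):
                return False
    # Positive when alt is longer than ref (insertion), negative for deletions.
    delta = allele_len(alt, null_string) - allele_len(ref, null_string)
    if max_insert_len is not None and delta > max_insert_len:
        return False
    if max_del_len is not None and -delta > max_del_len:
        return False
    return True


variants = [
    {"ref": "A", "alt": "G"},    # single-base substitution
    {"ref": "A", "alt": "ATT"},  # 2-base insertion
    {"ref": "ACG", "alt": "A"},  # 2-base deletion
    {"ref": "A", "alt": "N"},    # nonstandard base
    {"ref": "-", "alt": "C"},    # 1-base insertion written with the null string
]

# Default limits (0 and 0) keep only substitutions.
kept = [v for v in variants if keep_variant(v["ref"], v["alt"])]

# Allowing indels of up to 2 bases keeps everything except the nonstandard allele.
kept_indels = [
    v for v in variants
    if keep_variant(v["ref"], v["alt"], max_insert_len=2, max_del_len=2)
]
```

With the default limits only the substitution survives, which suggests why the defaults of 0 are a natural choice for models that expect ref and alt sequences of equal length.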
- grelu.variant.variants_to_intervals(variants: pandas.DataFrame, seq_len: int = 1, inplace: bool = False) → pandas.DataFrame
Return genomic intervals centered around each variant.
- Parameters:
variants – A DataFrame of genetic variants. It should contain columns “chrom” for the chromosome and “pos” for the position.
seq_len – Length of the resulting genomic intervals.
- Returns:
A pandas DataFrame containing genomic intervals centered on the variants.
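As a rough illustration of what "centered" means here, the sketch below computes a window of `seq_len` bases around each position. The helper `centered_interval` is hypothetical, and the exact centering rule (`start = pos - seq_len // 2`) and coordinate convention are assumptions; the library's own convention may differ by one base.

```python
def centered_interval(chrom, pos, seq_len=1):
    """Return a window of seq_len bases around pos (hypothetical helper).

    Assumes start = pos - seq_len // 2 and a half-open [start, end) interval.
    """
    start = pos - seq_len // 2
    return {"chrom": chrom, "start": start, "end": start + seq_len}


variants = [
    {"chrom": "chr1", "pos": 100},
    {"chrom": "chr2", "pos": 5000},
]

intervals = [centered_interval(v["chrom"], v["pos"], seq_len=8) for v in variants]
```

Every interval has exactly `seq_len` bases, so the resulting windows can be passed to a model with a fixed input length.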
- grelu.variant.variant_to_seqs(chrom: str, pos: int, ref: str, alt: str, genome: str, seq_len: int = 1) → Tuple[str, str]
Return the reference and alternate sequences centered on a variant.
- Parameters:
chrom – Chromosome of the variant.
pos – Position of the variant.
ref – Reference allele.
alt – Alternate allele.
seq_len – Length of the resulting sequences.
genome – Name of the genome.
- Returns:
A pair of strings centered on the variant, one containing the reference allele and one containing the alternate allele.
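The idea of a ref/alt sequence pair can be sketched on a toy genome. This is an illustration under stated assumptions, not the library's code: `TOY_GENOME` and `toy_variant_to_seqs` are hypothetical, the genome is passed as a dictionary rather than a genome name, `pos` is treated as a 0-based index, and only same-length substitutions are handled.

```python
# Toy stand-in for a reference genome (hypothetical; the real function takes a genome name).
TOY_GENOME = {"chr1": "ACGTACGTACGTACGTACGT"}


def toy_variant_to_seqs(chrom, pos, ref, alt, genome, seq_len=1):
    """Return (ref_seq, alt_seq), both seq_len long and centered on the variant.

    Assumes pos is a 0-based index of the first base of the ref allele.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles same-length substitutions")
    chrom_seq = genome[chrom]
    start = pos - seq_len // 2
    end = start + seq_len
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends past the end of the chromosome")
    ref_seq = chrom_seq[start:end]
    offset = pos - start  # index of the variant inside the window
    if ref_seq[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the genome")
    # Swap the ref allele for the alt allele, leaving the flanks unchanged.
    alt_seq = ref_seq[:offset] + alt + ref_seq[offset + len(ref):]
    return ref_seq, alt_seq


ref_seq, alt_seq = toy_variant_to_seqs("chr1", 9, "C", "T", TOY_GENOME, seq_len=6)
```

The two sequences differ only at the variant position, so any difference in model output can be attributed to the allele change.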
- grelu.variant.check_reference(variants: pandas.DataFrame, genome: str = 'hg38', null_string: str = '-') → None
Check that the given reference alleles match those present in the reference genome.
- Parameters:
variants – A DataFrame containing variant information, with columns ‘chrom’, ‘pos’, ‘ref’, and ‘alt’.
genome – Name of the genome
null_string – String used to indicate the absence of a base.
- Raises:
A warning message that lists the indices of variants whose reference allele does not match the genome.
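A reference check of this kind can be sketched as follows. The names `TOY_GENOME`, `mismatched_reference_indices`, and `toy_check_reference` are hypothetical, the genome is a dictionary instead of a genome name, positions are assumed to be 0-based, and variants whose ref allele is the null string are assumed to be skipped.

```python
import warnings

# Toy stand-in for a reference genome (hypothetical).
TOY_GENOME = {"chr1": "ACGTACGTAC"}


def mismatched_reference_indices(variants, genome, null_string="-"):
    """Return indices of variants whose ref allele differs from the genome."""
    bad = []
    for idx, v in enumerate(variants):
        ref = v["ref"]
        if ref == null_string:
            continue  # no reference bases to compare (assumption)
        observed = genome[v["chrom"]][v["pos"]:v["pos"] + len(ref)]
        if observed.upper() != ref.upper():
            bad.append(idx)
    return bad


def toy_check_reference(variants, genome, null_string="-"):
    """Warn, rather than raise, when reference alleles do not match."""
    bad = mismatched_reference_indices(variants, genome, null_string)
    if bad:
        warnings.warn(f"Reference allele does not match the genome at indices: {bad}")


variants = [
    {"chrom": "chr1", "pos": 0, "ref": "A", "alt": "G"},   # matches
    {"chrom": "chr1", "pos": 1, "ref": "G", "alt": "T"},   # genome has C here
    {"chrom": "chr1", "pos": 2, "ref": "GT", "alt": "G"},  # matches
    {"chrom": "chr1", "pos": 3, "ref": "-", "alt": "A"},   # skipped
]

bad = mismatched_reference_indices(variants, TOY_GENOME)

# Capture the warning so the example runs quietly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_check_reference(variants, TOY_GENOME)
messages = [str(w.message) for w in caught]
```

Issuing a warning instead of raising an exception lets a pipeline continue while still surfacing variants that may come from a different genome build or coordinate system.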
- grelu.variant.predict_variant_effects(variants: pandas.DataFrame, model: Callable, devices: int | str = 'cpu', seq_len: int | None = None, batch_size: int = 64, num_workers: int = 1, genome: str = 'hg38', rc: bool = False, max_seq_shift: int = 0, compare_func: str | Callable | None = 'divide', return_ad: bool = True, check_reference: bool = False) → numpy.ndarray | anndata.AnnData
Predict the effects of variants based on a trained model.
- Parameters:
variants – Dataframe containing the variants to predict effects for. Should contain columns “chrom”, “pos”, “ref” and “alt”.
model – Model used to predict the effects of the variants.
devices – Device(s) to use for prediction.
seq_len – Length of the sequences to be generated. Defaults to the length used to train the model.
batch_size – Batch size for inference.
num_workers – Number of workers to use for data loading.
genome – Name of the genome
rc – Whether to average the variant effect over both strands.
max_seq_shift – Number of bases over which to shift the variant-containing sequence and average effects.
compare_func – Function to compare the alternate and reference alleles. Defaults to “divide”; “subtract” is also supported.
return_ad – Return the results as an AnnData object. This will only work if the length of the model output is 1.
check_reference – If True, check each variant for whether the reference allele matches the sequence in the reference genome.
- Returns:
Predicted variant impact. If return_ad is True and compare_func is None, the output will be an AnnData object containing the reference allele predictions in .X and the alternate allele predictions in .layers[“alt”]. If return_ad is True and compare_func is not None, the output will be an AnnData object containing the difference or ratio between the alt and ref allele predictions in .X. If return_ad is False, the output will be a numpy array.
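The role of compare_func can be illustrated with a small sketch that works on plain lists of predictions. The function `compare_predictions` is hypothetical and is not part of gReLU; the direction of the comparison (alt relative to ref) and the `(alt, ref)` argument order for a custom callable are assumptions.

```python
def compare_predictions(ref_preds, alt_preds, compare_func="divide"):
    """Combine per-variant ref and alt predictions (hypothetical helper).

    Assumes effects are expressed as alt relative to ref.
    """
    if compare_func is None:
        # No comparison: keep the two sets of predictions separate.
        return {"ref": list(ref_preds), "alt": list(alt_preds)}
    if compare_func == "divide":
        func = lambda alt, ref: alt / ref
    elif compare_func == "subtract":
        func = lambda alt, ref: alt - ref
    elif callable(compare_func):
        func = compare_func  # assumed to accept (alt, ref)
    else:
        raise ValueError(f"unsupported compare_func: {compare_func!r}")
    return [func(alt, ref) for ref, alt in zip(ref_preds, alt_preds)]


# Made-up predictions for three variants, for illustration only.
ref_preds = [2.0, 4.0, 1.0]
alt_preds = [4.0, 2.0, 1.0]

ratios = compare_predictions(ref_preds, alt_preds, "divide")
diffs = compare_predictions(ref_preds, alt_preds, "subtract")
separate = compare_predictions(ref_preds, alt_preds, None)
```

A ratio is scale-free but undefined when the reference prediction is zero, while a difference keeps the model's output units; which one is appropriate depends on how the model's output is scaled.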
- grelu.variant.marginalize_variants(model: Callable, variants: pandas.DataFrame, genome: str, seq_len: int | None = None, devices: str | int | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 20, seed: int | None = None, prediction_transform: Callable | None = None, compare_func: str | Callable = 'log2FC', rc: bool = False, max_seq_shift: int = 0)
Runs a marginalization experiment.
Given a model, a pattern (short sequence) to insert, and a set of background sequences, get the predictions from the model before and after inserting the patterns into the (optionally shuffled) background sequences.
- Parameters:
model – Trained model.
variants – A DataFrame containing variants.
seq_len – The length of genomic sequences to extract surrounding the variants
genome – Name of the genome to use
devices – Device name, index, or list of indices of the device(s) on which to run inference
num_workers – Number of workers for inference
batch_size – Batch size for inference
n_shuffles – Number of times to shuffle background sequences
seed – Random seed
prediction_transform – A module to transform the model output
compare_func – Function to compare the alternate and reference alleles. Defaults to “log2FC”; “divide” and “subtract” are also listed as options. If None, the separate predictions for each allele will be returned.
rc – If True, reverse complement the sequences for augmentation and average the variant effect
max_seq_shift – Maximum number of bases to shift the sequences for augmentation
- Returns:
Either the predictions for the ref and alt alleles (if compare_func is None), or the comparison between them (if compare_func is not None).
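The general shape of a marginalization experiment can be sketched with a toy model. Everything here is an assumption made for illustration: `toy_model`, `shuffle_seq`, and `toy_marginalize` are hypothetical, the shuffle is a simple permutation of bases (the library may use a different shuffling scheme), the alleles are placed at the center of each shuffled background, and the comparison is a log2 ratio of alt to ref.

```python
import math
import random


def toy_model(seq):
    """Stand-in for a trained model: 1 + number of G/C bases (always positive)."""
    return 1 + sum(base in "GC" for base in seq)


def shuffle_seq(seq, rng):
    """Permute the bases of a sequence (simple shuffle; an assumption)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def toy_marginalize(seq, ref, alt, n_shuffles=20, seed=None):
    """Return one log2(alt / ref) effect per shuffled background."""
    rng = random.Random(seed)
    center = len(seq) // 2
    effects = []
    for _ in range(n_shuffles):
        background = shuffle_seq(seq, rng)
        # Place each allele at the center of the same shuffled background.
        left, right = background[:center], background[center + len(ref):]
        with_ref = left + ref + right
        with_alt = left + alt + right
        effects.append(math.log2(toy_model(with_alt) / toy_model(with_ref)))
    return effects


effects = toy_marginalize("ACGTACGTACG", ref="A", alt="G", n_shuffles=20, seed=0)
mean_effect = sum(effects) / len(effects)
```

Averaging over many shuffled backgrounds estimates the effect of the allele itself, separated from the specific sequence context in which it occurs; fixing the seed makes the shuffles reproducible.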