grelu.variant

grelu.variant provides functions to filter and process genetic variants.

Functions

`filter_variants`(→ Optional[pandas.DataFrame])	Filter variants by length.
`variants_to_intervals`(→ pandas.DataFrame)	Return genomic intervals centered around each variant.
`variant_to_seqs`(→ Tuple[str, str])	Return a pair of sequences centered on a variant, one with the reference allele and one with the alternate allele.
`check_reference`(→ None)	Check that the given reference alleles match those present in the reference genome.
`predict_variant_effects`(→ Union[numpy.ndarray, ...)	Predict the effects of variants based on a trained model.
`marginalize_variants`(model, variants, genome[, ...])	Runs a marginalization experiment.

Module Contents

grelu.variant.filter_variants(variants, standard_bases: bool = True, max_insert_len: int | None = 0, max_del_len: int | None = 0, inplace: bool = False, null_string: str = '-') → pandas.DataFrame | None

Filter variants by allele length and, optionally, by base content.

Parameters:

variants – A DataFrame of genetic variants. It should contain columns “ref” for the reference allele sequence and “alt” for the alternate allele sequence.
standard_bases – If True, drop variants whose alleles include nonstandard bases (other than A,C,G,T).
max_insert_len – Maximum insertion length to allow.
max_del_len – Maximum deletion length to allow.
inplace – If False, return a copy. Otherwise, do operation in place and return None.
null_string – String used to indicate the absence of a base.

Returns:

A DataFrame containing only the variants that pass the filters (if inplace=False).
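The filtering rules described above can be illustrated in plain pandas. This is a minimal sketch under stated assumptions, not gReLU's implementation: `filter_by_length_sketch` is a hypothetical helper, and it assumes that insertion and deletion lengths are measured as the difference between the alt and ref allele lengths, with the null string counted as an empty allele.

```python
import pandas as pd

STANDARD_BASES = set("ACGT")


def filter_by_length_sketch(variants, standard_bases=True, max_insert_len=0,
                            max_del_len=0, null_string="-"):
    """Hypothetical stand-in for a variant length filter (not gReLU's code)."""
    # Assumption: the null string represents an empty allele.
    ref = variants["ref"].replace(null_string, "")
    alt = variants["alt"].replace(null_string, "")
    keep = pd.Series(True, index=variants.index)
    if standard_bases:
        # Drop rows where either allele contains a base other than A, C, G, T.
        keep &= (ref + alt).map(lambda s: set(s) <= STANDARD_BASES).astype(bool)
    # Positive difference = insertion, negative difference = deletion.
    diff = alt.str.len() - ref.str.len()
    if max_insert_len is not None:
        keep &= diff <= max_insert_len
    if max_del_len is not None:
        keep &= -diff <= max_del_len
    return variants[keep].copy()


variants = pd.DataFrame({
    "ref": ["A", "AT", "-", "N"],
    "alt": ["G", "A", "CCC", "T"],
})
snvs_only = filter_by_length_sketch(variants)
with_indels = filter_by_length_sketch(variants, max_insert_len=None, max_del_len=1)
```

With the defaults, only the single-base substitution in the first row survives. Passing `None` for a length limit disables that limit in this sketch.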

grelu.variant.variants_to_intervals(variants: pandas.DataFrame, seq_len: int = 1, inplace: bool = False) → pandas.DataFrame

Return genomic intervals centered around each variant.

Parameters:

variants – A DataFrame of genetic variants. It should contain columns “chrom” for the chromosome and “pos” for the position.
seq_len – Length of the resulting genomic intervals.
inplace – If False, return a copy. Otherwise, do operation in place.

Returns:

A pandas dataframe containing genomic intervals centered on the variants.
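To make "centered on the variants" concrete, here is a minimal pandas sketch. It is not the library's code: `centered_intervals_sketch` is a hypothetical helper, the output column names `start` and `end` are assumptions, and so is the convention that the window begins `seq_len // 2` bases upstream of the position.

```python
import pandas as pd


def centered_intervals_sketch(variants, seq_len=1):
    """Hypothetical helper: fixed-length windows centered on each 'pos'."""
    # Assumption: the window starts seq_len // 2 bases upstream of the variant.
    starts = variants["pos"] - seq_len // 2
    return pd.DataFrame({
        "chrom": variants["chrom"],
        "start": starts,
        "end": starts + seq_len,
    })


variants = pd.DataFrame({"chrom": ["chr1", "chr2"], "pos": [1000, 5000]})
intervals = centered_intervals_sketch(variants, seq_len=100)
```

Every interval produced this way has length `seq_len`, regardless of the variant's alleles.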

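grelu.variant.variant_to_seqs(chrom: str, pos: int, ref: str, alt: str, genome: str, seq_len: int = 1) → Tuple[str, str]

Return a pair of sequences centered on a variant, one containing the reference allele and one containing the alternate allele.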

Parameters:

chrom – Chromosome on which the variant is located.
pos – Position of the variant on the chromosome.
ref – Reference allele sequence.
alt – Alternate allele sequence.
genome – Name of the genome.
seq_len – Length of the resulting sequences.

Returns:

A pair of strings centered on the variant, one containing the reference allele and one containing the alternate allele.
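The idea of producing a matched pair of sequences can be sketched on a toy string in place of a real genome. Everything here is illustrative: `ref_alt_seqs_sketch` is a hypothetical helper, `pos` is treated as a 0-based index, and the way flanking bases are split around the allele is an assumption rather than the library's documented behavior.

```python
def ref_alt_seqs_sketch(chrom_seq, pos, ref, alt, seq_len):
    """Hypothetical helper: build ref and alt sequences around one variant."""
    def window(allele):
        # Split the remaining length budget between the two flanks.
        flank = seq_len - len(allele)
        left = flank // 2
        right = flank - left
        upstream = chrom_seq[pos - left:pos]
        # The downstream flank starts after the reference allele's footprint.
        downstream = chrom_seq[pos + len(ref):pos + len(ref) + right]
        return upstream + allele + downstream

    return window(ref), window(alt)


chrom_seq = "ACGTACGTACGTACGTACGT"  # toy chromosome
ref_seq, alt_seq = ref_alt_seqs_sketch(chrom_seq, pos=10, ref="G", alt="T", seq_len=9)
```

For a single-base substitution the two sequences differ only at the central position, which is what a model needs in order to compare the alleles in an identical context.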

grelu.variant.check_reference(variants: pandas.DataFrame, genome: str = 'hg38', null_string: str = '-') → None

Check that the given reference alleles match those present in the reference genome.

Parameters:

variants – A DataFrame containing variant information, with columns ‘chrom’, ‘pos’, ‘ref’, and ‘alt’.
genome – Name of the genome
null_string – String used to indicate the absence of a base.

Raises:

A warning message that lists the indices of variants whose reference allele does not match the genome.
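The check described above amounts to comparing each "ref" allele with the bases found at the same location in the genome. The sketch below does this against a toy dictionary of sequences. It is only an illustration: `check_reference_sketch` and `genome_seqs` are hypothetical, positions are assumed to be 0-based, and unlike the documented function (which returns None) the sketch also returns the mismatching indices for convenience.

```python
import warnings

import pandas as pd


def check_reference_sketch(variants, genome_seqs, null_string="-"):
    """Hypothetical helper: warn about rows whose 'ref' disagrees with a toy genome."""
    mismatched = []
    for idx, row in variants.iterrows():
        if row["ref"] == null_string:
            continue  # an empty reference allele has nothing to compare
        start = row["pos"]  # assumption: 0-based position
        observed = genome_seqs[row["chrom"]][start:start + len(row["ref"])]
        if observed != row["ref"]:
            mismatched.append(idx)
    if mismatched:
        warnings.warn(
            f"Reference allele does not match the genome at indices: {mismatched}"
        )
    return mismatched


genome_seqs = {"chr1": "ACGTACGT"}  # toy genome
variants = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "pos": [0, 2, 5],
    "ref": ["A", "T", "-"],
    "alt": ["C", "G", "A"],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched = check_reference_sketch(variants, genome_seqs)
```

Emitting a warning rather than an exception lets a pipeline continue while still flagging variants that may be on the wrong genome build or coordinate system.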

grelu.variant.predict_variant_effects(variants: pandas.DataFrame, model: Callable, devices: int | str = 'cpu', seq_len: int | None = None, batch_size: int = 64, num_workers: int = 1, genome: str = 'hg38', rc: bool = False, max_seq_shift: int = 0, compare_func: str | Callable | None = 'divide', return_ad: bool = True, check_reference: bool = False, prediction_transform: Callable | None = None) → numpy.ndarray | anndata.AnnData

Predict the effects of variants based on a trained model.

Parameters:

variants – Dataframe containing the variants to predict effects for. Should contain columns “chrom”, “pos”, “ref” and “alt”.
model – Model used to predict the effects of the variants.
devices – Device(s) to use for prediction.
seq_len – Length of the sequences to be generated. Defaults to the length used to train the model.
batch_size – Batch size to use for prediction.
num_workers – Number of workers to use for data loading.
genome – Name of the genome.
rc – Whether to average the variant effect over both strands.
max_seq_shift – Number of bases over which to shift the variant-containing sequence and average effects.
compare_func – Function to compare the alternate and reference alleles. Defaults to “divide”. “subtract” is also supported.
return_ad – Return the results as an AnnData object. This will only work if the length of the model output is 1.
check_reference – If True, check whether each variant's reference allele matches the sequence in the reference genome.
prediction_transform – A module to transform the model output.

Returns:

Predicted variant impact. If return_ad is True and compare_func is None, the output is an AnnData object containing the reference allele predictions in .X and the alternate allele predictions in .layers[“alt”]. If return_ad is True and compare_func is not None, the output is an AnnData object containing the difference or ratio between the alt and ref allele predictions in .X. If return_ad is False, the output is a NumPy array.
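The two string options for compare_func correspond to a ratio and a difference between allele predictions. The NumPy sketch below shows the arithmetic on made-up numbers. The prediction values are invented for illustration, and the direction of the comparison (alt relative to ref) is an assumption based on the return description above.

```python
import numpy as np

# Made-up predictions for three variants, one model output each.
ref_preds = np.array([[2.0], [4.0], [1.0]])
alt_preds = np.array([[4.0], [2.0], [1.0]])

# Assumed meaning of "divide": ratio of alt to ref predictions.
ratio = alt_preds / ref_preds

# Assumed meaning of "subtract": alt minus ref predictions.
difference = alt_preds - ref_preds
```

A ratio of 1 or a difference of 0 means the model predicts no change between alleles. Which comparison is more appropriate depends on the scale of the model output, for example whether it has already been log-transformed.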

grelu.variant.marginalize_variants(model: Callable, variants: pandas.DataFrame, genome: str, seq_len: int | None = None, devices: str | int | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 20, seed: int | None = None, prediction_transform: Callable | None = None, compare_func: str | Callable = 'log2FC', rc: bool = False, max_seq_shift: int = 0)

Runs a marginalization experiment.

Given a model and a set of variants, get the model's predictions for the reference and alternate alleles after inserting them into (optionally shuffled) background sequences extracted from around each variant.

Parameters:

model – Trained model.
variants – A DataFrame containing variants.
seq_len – The length of genomic sequences to extract surrounding the variants.
genome – Name of the genome to use.
devices – Device(s) on which to run inference (a name, an index, or a list of indices).
num_workers – Number of workers for inference.
batch_size – Batch size for inference.
n_shuffles – Number of times to shuffle background sequences.
seed – Random seed.
prediction_transform – A module to transform the model output.
compare_func – Function to compare the alternate and reference alleles. The default in the signature is “log2FC”; “divide” and “subtract” are also listed as options. If None, the separate predictions for each allele will be returned.
rc – If True, reverse complement the sequences for augmentation and average the variant effect.
max_seq_shift – Maximum number of bases to shift the sequences for augmentation.

Returns:

Either the predictions in the ref and alt alleles (if compare_func is None), or the comparison between them (if compare_func is not None).
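The shuffle, insert, and compare loop behind a marginalization experiment can be sketched with the standard library alone. This is a toy illustration, not gReLU's implementation: `toy_model` and `marginalize_sketch` are hypothetical, the background is shuffled base by base, and "log2FC" is assumed to mean the log2 ratio of alt to ref predictions.

```python
import math
import random


def toy_model(seq):
    """Hypothetical scorer standing in for a trained model: 1 + number of G bases."""
    return 1.0 + seq.count("G")


def marginalize_sketch(background, center, ref, alt, n_shuffles=20, seed=0):
    """Hypothetical helper: score ref and alt alleles in shuffled backgrounds."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_shuffles):
        # Shuffle the background so that only the inserted allele is constant.
        bases = list(background)
        rng.shuffle(bases)
        shuffled = "".join(bases)
        # Insert each allele at the same position in the same shuffled background.
        ref_seq = shuffled[:center] + ref + shuffled[center + len(ref):]
        alt_seq = shuffled[:center] + alt + shuffled[center + len(ref):]
        # Assumed "log2FC" comparison: log2 of alt over ref prediction.
        effects.append(math.log2(toy_model(alt_seq) / toy_model(ref_seq)))
    return effects


effects = marginalize_sketch(
    "ACGTACGTACGT", center=6, ref="A", alt="G", n_shuffles=5, seed=0
)
```

Because the surrounding context is randomized on every iteration, the spread of `effects` gives a rough sense of how much of a variant's predicted effect depends on its native sequence context.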
