grelu.variant

This module provides functions to filter and process genetic variants.

Functions

filter_variants(→ Optional[pandas.DataFrame])

Filter variants by length.

variants_to_intervals(→ pandas.DataFrame)

Return genomic intervals centered around each variant.

variant_to_seqs(→ Tuple[str, str])

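Return a pair of sequences centered on the variant, one containing the reference allele and one containing the alternate allele.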

check_reference(→ None)

Check that the given reference alleles match those present in the reference genome.

predict_variant_effects(→ Union[numpy.ndarray, ...)

Predict the effects of variants based on a trained model.

marginalize_variants(model, variants, genome[, ...])

Runs a marginalization experiment.

Module Contents

grelu.variant.filter_variants(variants, standard_bases: bool = True, max_insert_len: int | None = 0, max_del_len: int | None = 0, inplace: bool = False, null_string: str = '-') → pandas.DataFrame | None

Filter variants by length.

Parameters:
  • variants – A DataFrame of genetic variants. It should contain columns “ref” for the reference allele sequence and “alt” for the alternate allele sequence.

  • standard_bases – If True, drop variants whose alleles include nonstandard bases (anything other than A, C, G, or T).

  • max_insert_len – Maximum insertion length to allow.

  • max_del_len – Maximum deletion length to allow.

  • inplace – If False, return a copy. Otherwise, perform the operation in place and return None.

  • null_string – String used to indicate the absence of a base.

Returns:

A DataFrame containing only the variants that pass the filters (if inplace=False).
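The filters described above can be pictured with a small pure-Python sketch. This is not gReLU's implementation: the helper names `allele_length` and `passes_length_filter` are hypothetical, and the sketch assumes that the insertion or deletion length is the difference between the allele lengths, that `null_string` counts as zero bases, and that `None` means "no limit".

```python
STANDARD_BASES = set("ACGT")


def allele_length(allele, null_string="-"):
    """Length of an allele, counting the null string as zero bases (assumption)."""
    return 0 if allele == null_string else len(allele)


def passes_length_filter(ref, alt, standard_bases=True, max_insert_len=0,
                         max_del_len=0, null_string="-"):
    """Hypothetical restatement of the documented filters for a single variant."""
    if standard_bases:
        for allele in (ref, alt):
            if allele != null_string and not set(allele.upper()) <= STANDARD_BASES:
                return False
    # Positive values are insertions, negative values are deletions (assumption).
    size_change = allele_length(alt, null_string) - allele_length(ref, null_string)
    if max_insert_len is not None and size_change > max_insert_len:
        return False
    if max_del_len is not None and -size_change > max_del_len:
        return False
    return True


variants = [
    {"ref": "A", "alt": "G"},    # substitution
    {"ref": "-", "alt": "AC"},   # 2-base insertion
    {"ref": "ACG", "alt": "A"},  # 2-base deletion
    {"ref": "A", "alt": "N"},    # nonstandard base
]
kept = [v for v in variants if passes_length_filter(v["ref"], v["alt"])]
```

With the default limits of 0, only the substitution survives in this toy input. With the library itself, the corresponding call would be `grelu.variant.filter_variants(variants, max_insert_len=0, max_del_len=0)` on a DataFrame with "ref" and "alt" columns.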

grelu.variant.variants_to_intervals(variants: pandas.DataFrame, seq_len: int = 1, inplace: bool = False) → pandas.DataFrame

Return genomic intervals centered around each variant.

Parameters:
  • variants – A DataFrame of genetic variants. It should contain columns “chrom” for the chromosome and “pos” for the position.

  • seq_len – Length of the resulting genomic intervals.

Returns:

A pandas DataFrame containing genomic intervals centered on the variants.
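As a rough illustration of "centered around each variant", the sketch below computes a window of length `seq_len` around a position. The helper `centered_interval` is hypothetical, and the rule that the window starts `seq_len // 2` bases before the position is an assumption; the library's exact coordinate convention may differ.

```python
def centered_interval(pos, seq_len=1):
    """Return (start, end) of a seq_len-long window around pos.

    Assumption: the window starts seq_len // 2 bases before pos.
    """
    start = pos - seq_len // 2
    return start, start + seq_len


variants = [{"chrom": "chr1", "pos": 1000}, {"chrom": "chr2", "pos": 50}]

intervals = []
for v in variants:
    start, end = centered_interval(v["pos"], seq_len=10)
    intervals.append({"chrom": v["chrom"], "start": start, "end": end})
```

Every interval has the same length regardless of the variant's position, which is what a fixed-input-length model needs.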

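grelu.variant.variant_to_seqs(chrom: str, pos: int, ref: str, alt: str, genome: str, seq_len: int = 1) → Tuple[str, str]

Return a pair of sequences centered on the variant, one containing the reference allele and one containing the alternate allele.
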
Parameters:
  • chrom – chromosome

  • pos – position

  • ref – reference allele

  • alt – alternate allele

  • seq_len – Length of the resulting sequences

  • genome – Name of the genome

Returns:

A pair of strings centered on the variant, one containing the reference allele and one containing the alternate allele.
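To make the returned pair concrete, here is a minimal sketch that works on an in-memory string instead of a named genome. The function `toy_variant_to_seqs` is hypothetical, and it assumes a 0-based position, alleles of equal length (a substitution), and a window that starts `seq_len // 2` bases before the variant.

```python
def toy_variant_to_seqs(sequence, pos, ref, alt, seq_len=1):
    """Build reference and alternate windows from an in-memory sequence.

    Assumptions: pos is a 0-based index into `sequence`, ref and alt have
    equal length, and the window starts seq_len // 2 bases before pos.
    """
    start = pos - seq_len // 2
    window = sequence[start:start + seq_len]
    offset = pos - start  # index of the variant inside the window
    ref_seq = window[:offset] + ref + window[offset + len(ref):]
    alt_seq = window[:offset] + alt + window[offset + len(alt):]
    return ref_seq, alt_seq


chromosome = "ACGTACGTAC"
ref_seq, alt_seq = toy_variant_to_seqs(chromosome, pos=4, ref="A", alt="T", seq_len=5)
```

The two strings differ only at the variant position, so any difference in model output can be attributed to the allele change.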

grelu.variant.check_reference(variants: pandas.DataFrame, genome: str = 'hg38', null_string: str = '-') → None

Check that the given reference alleles match those present in the reference genome.

Parameters:
  • variants – A DataFrame containing variant information, with columns ‘chrom’, ‘pos’, ‘ref’, and ‘alt’.

  • genome – Name of the genome

  • null_string – String used to indicate the absence of a base.

Raises:
  • A warning message that lists indices of variants whose reference allele does not match the genome.
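The check can be sketched against a toy in-memory genome. The function `toy_check_reference` is hypothetical; it assumes 0-based positions, skips variants whose reference allele is the null string, and returns the mismatching indices for convenience, whereas the documented function returns None.

```python
import warnings


def toy_check_reference(variants, genome_seqs, null_string="-"):
    """Warn about variants whose reference allele differs from the genome.

    Assumptions: genome_seqs maps chromosome name to sequence, pos is a
    0-based index, and variants with a null reference allele are skipped.
    """
    mismatched = []
    for idx, v in enumerate(variants):
        if v["ref"] == null_string:
            continue
        end = v["pos"] + len(v["ref"])
        observed = genome_seqs[v["chrom"]][v["pos"]:end]
        if observed.upper() != v["ref"].upper():
            mismatched.append(idx)
    if mismatched:
        warnings.warn(
            f"Reference allele does not match the genome at indices: {mismatched}"
        )
    return mismatched


genome_seqs = {"chr1": "ACGTACGT"}
variants = [
    {"chrom": "chr1", "pos": 0, "ref": "A", "alt": "G"},  # matches
    {"chrom": "chr1", "pos": 2, "ref": "T", "alt": "C"},  # genome has G here
]

# Capture the warning so the example can inspect it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched = toy_check_reference(variants, genome_seqs)
```

Running a check like this before prediction helps catch variants annotated against a different genome build or with shifted coordinates.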

grelu.variant.predict_variant_effects(variants: pandas.DataFrame, model: Callable, devices: int | str = 'cpu', seq_len: int | None = None, batch_size: int = 64, num_workers: int = 1, genome: str = 'hg38', rc: bool = False, max_seq_shift: int = 0, compare_func: str | Callable | None = 'divide', return_ad: bool = True, check_reference: bool = False) → numpy.ndarray | anndata.AnnData

Predict the effects of variants based on a trained model.

Parameters:
  • variants – DataFrame containing the variants to predict effects for. Should contain columns “chrom”, “pos”, “ref” and “alt”.

  • model – Model used to predict the effects of the variants.

  • devices – Device(s) to use for prediction.

  • seq_len – Length of the sequences to be generated. Defaults to the length used to train the model.

  • batch_size – Batch size for prediction.

  • num_workers – Number of workers to use for data loading.

  • genome – Name of the genome

  • rc – Whether to average the variant effect over both strands.

  • max_seq_shift – Number of bases over which to shift the variant-containing sequence and average effects.

  • compare_func – Function to compare the alternate and reference alleles. Defaults to “divide”; “subtract” is also supported.

  • return_ad – Return the results as an AnnData object. This will only work if the length of the model output is 1.

  • check_reference – If True, check each variant for whether the reference allele matches the sequence in the reference genome.

Returns:

Predicted variant impact. If return_ad is True and compare_func is None, the output will be an AnnData object containing the reference allele predictions in .X and the alternate allele predictions in .layers[“alt”]. If return_ad is True and compare_func is not None, the output will be an AnnData object containing the difference or ratio between the alt and ref allele predictions in .X. If return_ad is False, the output will be a NumPy array.
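The role of compare_func can be illustrated with plain numbers. The helper `compare_predictions` below is hypothetical and is not the library's code; it assumes that “divide” means alt / ref, that “subtract” means alt - ref, and that a callable receives the alternate prediction first.

```python
def compare_predictions(ref_pred, alt_pred, compare_func="divide"):
    """Combine reference and alternate predictions into one effect value.

    Assumptions: "divide" is alt / ref, "subtract" is alt - ref, and a
    callable is invoked as compare_func(alt_pred, ref_pred).
    """
    if compare_func is None:
        # No comparison: hand back both predictions unchanged.
        return ref_pred, alt_pred
    if compare_func == "divide":
        return alt_pred / ref_pred
    if compare_func == "subtract":
        return alt_pred - ref_pred
    if callable(compare_func):
        return compare_func(alt_pred, ref_pred)
    raise ValueError(f"Unsupported compare_func: {compare_func!r}")


ref_pred, alt_pred = 2.0, 3.0
ratio = compare_predictions(ref_pred, alt_pred, "divide")
difference = compare_predictions(ref_pred, alt_pred, "subtract")
separate = compare_predictions(ref_pred, alt_pred, None)
```

A ratio is usually easier to read for strictly positive outputs such as predicted coverage, while a difference is safer when predictions can be zero or negative.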

grelu.variant.marginalize_variants(model: Callable, variants: pandas.DataFrame, genome: str, seq_len: int | None = None, devices: str | int | List[int] = 'cpu', num_workers: int = 1, batch_size: int = 64, n_shuffles: int = 20, seed: int | None = None, prediction_transform: Callable | None = None, compare_func: str | Callable = 'log2FC', rc: bool = False, max_seq_shift: int = 0)

Runs a marginalization experiment.

Given a model, a pattern (short sequence) to insert, and a set of background sequences, get the predictions from the model before and after inserting the patterns into the (optionally shuffled) background sequences.

Parameters:
  • model – trained model

  • variants – a dataframe containing variants

  • seq_len – The length of genomic sequences to extract surrounding the variants

  • genome – Name of the genome to use

  • devices – Device(s) on which to run inference

  • num_workers – Number of workers for inference

  • batch_size – Batch size for inference

  • n_shuffles – Number of times to shuffle background sequences

  • seed – Random seed

  • prediction_transform – A module to transform the model output

  • compare_func – Function to compare the alternate and reference alleles. The default in the signature is “log2FC”; “divide” and “subtract” are also listed as options. If None, the separate predictions for each allele will be returned.

  • rc – If True, reverse complement the sequences for augmentation and average the variant effect

  • max_seq_shift – Maximum number of bases to shift the sequences for augmentation

Returns:

Either the predictions for the ref and alt alleles (if compare_func is None), or the comparison between them (if compare_func is not None).
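The general idea of a marginalization experiment can be sketched with a toy scoring function. Everything here is illustrative rather than gReLU's implementation: `toy_model` and `toy_marginalize` are hypothetical, alleles are assumed to be single bases placed at the center of each shuffled background, “log2FC” is taken to mean log2(alt / ref), and averaging over shuffles is an assumption about how the results are summarized.

```python
import math
import random


def toy_model(seq):
    """Stand-in scorer: GC fraction with a pseudocount (purely illustrative)."""
    return (seq.count("G") + seq.count("C") + 1) / (len(seq) + 1)


def toy_marginalize(background, ref, alt, n_shuffles=20, seed=None, model=toy_model):
    """Mean log2 fold change (alt vs ref) over shuffled backgrounds.

    Assumptions: ref and alt are single bases inserted at the center of
    each shuffled background, and the effect is log2(alt / ref).
    """
    rng = random.Random(seed)
    center = len(background) // 2
    effects = []
    for _ in range(n_shuffles):
        bases = list(background)
        rng.shuffle(bases)  # destroy the native sequence context
        shuffled = "".join(bases)
        ref_seq = shuffled[:center] + ref + shuffled[center + 1:]
        alt_seq = shuffled[:center] + alt + shuffled[center + 1:]
        effects.append(math.log2(model(alt_seq) / model(ref_seq)))
    return sum(effects) / len(effects)


effect = toy_marginalize("ACGTACGTACG", ref="A", alt="G", n_shuffles=5, seed=0)
```

Because each allele is scored in many shuffled contexts, the resulting value reflects the allele's effect independent of its native surroundings. Fixing `seed` makes the shuffles, and therefore the result, reproducible.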