decima.vep package
Submodules
decima.vep.attributions module
Variant Effect Attribution Module.
This module computes feature attributions for genetic variants. It calculates how much each position of the input sequence contributes to the model's predictions, which helps interpret variant effects in terms of transcription factor motifs.
Examples
>>> variant_effect_attribution(
...     variants="variants.vcf",
...     output_prefix="attributions",
...     tasks=[
...         "T_cell",
...         "B_cell",
...     ],
...     model=0,
...     metadata_anndata="results.h5ad",
... )
- decima.vep.attributions.variant_effect_attribution(variants, output_prefix, tasks=None, off_tasks=None, model=0, metadata_anndata=None, method='inputxgradient', transform='specificity', batch_size=1, num_workers=4, device=None, gene_col=None, distance_type='tss', min_distance=0, max_distance=inf, genome='hg38')
Computes variant effect attributions for a set of variants and writes them to an HDF5 file.
This function calculates the contribution of input features (sequence) to the model’s prediction for specific tasks (cell types), contrasting with off-target tasks if specified. It supports various attribution methods (e.g., InputXGradient) and transformations (e.g., Specificity).
- Parameters:
variants (Union[pd.DataFrame, str]) – Input variants: either a pandas DataFrame or a path to a file (.tsv, .csv, or .vcf), which is loaded on demand. The required columns/fields depend on the input format but generally include chromosome, position, reference allele, and alternate allele.
output_prefix (str) – Path prefix for the output HDF5 file(s) where attributions are saved. The argument is required (the signature gives it no default).
tasks (Union[str, List[str]], optional) – Specific task(s) or cell type(s) to compute attributions for. If None, uses all available tasks or a default set. Defaults to None.
off_tasks (Union[str, List[str]], optional) – Task(s) to use as a background or negative set for specificity calculations. Defaults to None.
model (int, optional) – Index or identifier of the model to use from the ensemble. Defaults to 0.
metadata_anndata (str, optional) – Path to the AnnData file containing model metadata and results (DecimaResult). Used to resolve task names and indices. Defaults to None.
method (str, optional) – The attribution method to use. Options: “inputxgradient”, “saliency”, “integratedgradients”. Defaults to “inputxgradient”.
transform (str, optional) – Transformation to apply to the model output before attribution. Options: “specificity” (target - off_target) or “aggregate”. Defaults to “specificity”.
batch_size (int, optional) – Batch size. Defaults to 1.
num_workers (int, optional) – Number of worker processes for data loading. Defaults to 4.
device (str, optional) – Compute device to use (e.g., “cpu”, “cuda”, “cuda:0”). If None, automatically detects available device. Defaults to None.
gene_col (str, optional) – Name of the column in variants containing gene identifiers. If provided, variants are associated with specific genes. Defaults to None.
distance_type (str, optional) – Method to calculate distance between variant and gene. Options: “tss” (Transcription Start Site). Defaults to “tss”.
min_distance (float, optional) – Minimum distance from the gene feature (e.g., TSS) for a variant to be included. Defaults to 0.
max_distance (float, optional) – Maximum distance from the gene feature (e.g., TSS) for a variant to be included. Defaults to infinity.
genome (str, optional) – Genome assembly version (e.g., “hg38”). Defaults to “hg38”.
- Returns:
List of paths to the output HDF5 files.
- Return type:
List[str]
Examples
Compute attributions for variants in a VCF file for specific tasks:
>>> variant_effect_attribution(
...     variants="variants.vcf",
...     output_prefix="attributions",
...     tasks=[
...         "T_cell",
...         "B_cell",
...     ],
...     model=0,
...     metadata_anndata="results.h5ad",
... )
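To make the “inputxgradient” method and the “specificity” transform (target - off_target) concrete, here is a minimal pure-Python sketch on a toy linear model. It is not Decima's implementation: the weights, the helper names (one_hot, specificity_gradient, input_x_gradient), and the choice to average over each task group are all assumptions made for illustration.

```python
# Toy illustration only: a linear "model" whose output for each task is
# sum(weights[task][i][j] * x[i][j]). All names and numbers are hypothetical.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def specificity_gradient(weights, tasks, off_tasks):
    """Gradient of mean(tasks) - mean(off_tasks) with respect to the input.

    For a linear model this gradient is just a signed average of the weight
    matrices, so no autograd library is needed.
    """
    length = len(next(iter(weights.values())))
    grad = [[0.0] * 4 for _ in range(length)]
    for group, sign in ((tasks, 1.0), (off_tasks, -1.0)):
        for task in group:
            for i in range(length):
                for j in range(4):
                    grad[i][j] += sign * weights[task][i][j] / len(group)
    return grad


def input_x_gradient(seq, grad):
    """Multiply the one-hot input element-wise by the gradient."""
    x = one_hot(seq)
    return [[x[i][j] * grad[i][j] for j in range(4)] for i in range(len(seq))]


# Made-up weights for a 3 bp input and two tasks.
weights = {
    "T_cell": [[0.5, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 2.0], [0.0, 0.0, 0.25, 0.0]],
    "B_cell": [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.25, 0.0]],
}

grad = specificity_gradient(weights, tasks=["T_cell"], off_tasks=["B_cell"])

# Attribute the reference and alternate sequences (C>T at the middle position).
ref_attr = input_x_gradient("ACG", grad)
alt_attr = input_x_gradient("ATG", grad)

# Collapse each position to a single score and compare alleles.
ref_scores = [sum(row) for row in ref_attr]  # [0.0, 1.0, 0.0]
alt_scores = [sum(row) for row in alt_attr]  # [0.0, 2.0, 0.0]
delta = [a - r for a, r in zip(alt_scores, ref_scores)]  # [0.0, 1.0, 0.0]
```

Weights shared by both tasks cancel in the specificity gradient, so only the T_cell-specific middle position receives a non-zero score. In this toy model positions do not interact, which is why the scores differ only at the substituted base.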
decima.vep.vep module
Variant Effect Prediction Module.
This module provides functionality to predict the effect of genetic variants on gene expression.
Examples
>>> predict_variant_effect(
... df_variant="variants.vcf",
... output_pq="predictions.parquet",
... model=0,
... )
- decima.vep.vep.predict_variant_effect(df_variant, output_pq=None, tasks=None, model='ensemble', metadata_anndata=None, chunksize=10000, batch_size=1, num_workers=16, device=None, include_cols=None, gene_col=None, distance_type='tss', min_distance=0, max_distance=inf, genome='hg38', save_replicates=False, reference_cache=True, float_precision='32')
Predict variant effects and save them to a Parquet file.
- Parameters:
df_variant (pd.DataFrame or str) – DataFrame with variant information or path to variant file
output_pq (str, optional) – Path to save the parquet file. Defaults to None.
tasks (str, optional) – Tasks to predict. Defaults to None.
model (int or str, optional) – Model to use, e.g. an index such as 0 or “ensemble”. Defaults to “ensemble” (DEFAULT_ENSEMBLE).
metadata_anndata (str, optional) – Path to anndata file. Defaults to None.
chunksize (int, optional) – Number of variants to predict in each chunk. Defaults to 10_000.
batch_size (int, optional) – Batch size. Defaults to 1.
num_workers (int, optional) – Number of workers. Defaults to 16.
device (str, optional) – Device to use. Defaults to None.
include_cols (list, optional) – Columns to include in the output. Defaults to None.
gene_col (str, optional) – Column name for gene names. Defaults to None.
distance_type (str, optional) – Type of distance. Defaults to “tss”.
min_distance (float, optional) – Minimum distance from the end of the gene. Defaults to 0 (inclusive).
max_distance (float, optional) – Maximum distance from the TSS. Defaults to inf (exclusive).
genome (str, optional) – Genome name or path to the genome fasta file. Defaults to “hg38”.
save_replicates (bool, optional) – Save the replicates in the output. Defaults to False.
reference_cache (bool, optional) – Whether to use reference cache. Defaults to True.
float_precision (str, optional) – Floating-point precision. Defaults to “32”.
- Return type:
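The interval semantics of min_distance (inclusive) and max_distance (exclusive), and the idea of processing variants in groups of chunksize, can be sketched with the standard library alone. This is an illustrative sketch, not the library's code: the helpers within_window and chunked, the variant identifiers, and the distances are hypothetical, and the order in which Decima filters and chunks variants is not specified here.

```python
import math
from itertools import islice


def within_window(distance, min_distance=0, max_distance=math.inf):
    """Return True when min_distance <= distance < max_distance."""
    return min_distance <= distance < max_distance


def chunked(items, chunksize):
    """Yield successive lists holding at most `chunksize` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Hypothetical (variant id, distance to TSS in bp) pairs.
pairs = [("v1", 0), ("v2", 500), ("v3", 1000), ("v4", 25000), ("v5", 999)]

# Keep variants with 0 <= distance < 1000: v3 sits exactly on the upper
# bound and is dropped because that bound is exclusive.
kept = [vid for vid, dist in pairs if within_window(dist, 0, 1000)]

# Process the surviving variants two at a time.
chunks = list(chunked(kept, 2))
```

With the default bounds (0 and inf) every non-negative distance passes the filter, so no variant is excluded by distance unless the caller narrows the window.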
Module contents
- decima.vep.predict_variant_effect(df_variant, output_pq=None, tasks=None, model='ensemble', metadata_anndata=None, chunksize=10000, batch_size=1, num_workers=16, device=None, include_cols=None, gene_col=None, distance_type='tss', min_distance=0, max_distance=inf, genome='hg38', save_replicates=False, reference_cache=True, float_precision='32')
Predict variant effects and save them to a Parquet file.
- Parameters:
df_variant (pd.DataFrame or str) – DataFrame with variant information or path to variant file
output_pq (str, optional) – Path to save the parquet file. Defaults to None.
tasks (str, optional) – Tasks to predict. Defaults to None.
model (int or str, optional) – Model to use, e.g. an index such as 0 or “ensemble”. Defaults to “ensemble” (DEFAULT_ENSEMBLE).
metadata_anndata (str, optional) – Path to anndata file. Defaults to None.
chunksize (int, optional) – Number of variants to predict in each chunk. Defaults to 10_000.
batch_size (int, optional) – Batch size. Defaults to 1.
num_workers (int, optional) – Number of workers. Defaults to 16.
device (str, optional) – Device to use. Defaults to None.
include_cols (list, optional) – Columns to include in the output. Defaults to None.
gene_col (str, optional) – Column name for gene names. Defaults to None.
distance_type (str, optional) – Type of distance. Defaults to “tss”.
min_distance (float, optional) – Minimum distance from the end of the gene. Defaults to 0 (inclusive).
max_distance (float, optional) – Maximum distance from the TSS. Defaults to inf (exclusive).
genome (str, optional) – Genome name or path to the genome fasta file. Defaults to “hg38”.
save_replicates (bool, optional) – Save the replicates in the output. Defaults to False.
reference_cache (bool, optional) – Whether to use reference cache. Defaults to True.
float_precision (str, optional) – Floating-point precision. Defaults to “32”.
- Return type:
- decima.vep.variant_effect_attribution(variants, output_prefix, tasks=None, off_tasks=None, model=0, metadata_anndata=None, method='inputxgradient', transform='specificity', batch_size=1, num_workers=4, device=None, gene_col=None, distance_type='tss', min_distance=0, max_distance=inf, genome='hg38')
Computes variant effect attributions for a set of variants and writes them to an HDF5 file.
This function calculates the contribution of input features (sequence) to the model’s prediction for specific tasks (cell types), contrasting with off-target tasks if specified. It supports various attribution methods (e.g., InputXGradient) and transformations (e.g., Specificity).
- Parameters:
variants (Union[pd.DataFrame, str]) – Input variants: either a pandas DataFrame or a path to a file (.tsv, .csv, or .vcf), which is loaded on demand. The required columns/fields depend on the input format but generally include chromosome, position, reference allele, and alternate allele.
output_prefix (str) – Path prefix for the output HDF5 file(s) where attributions are saved. The argument is required (the signature gives it no default).
tasks (Union[str, List[str]], optional) – Specific task(s) or cell type(s) to compute attributions for. If None, uses all available tasks or a default set. Defaults to None.
off_tasks (Union[str, List[str]], optional) – Task(s) to use as a background or negative set for specificity calculations. Defaults to None.
model (int, optional) – Index or identifier of the model to use from the ensemble. Defaults to 0.
metadata_anndata (str, optional) – Path to the AnnData file containing model metadata and results (DecimaResult). Used to resolve task names and indices. Defaults to None.
method (str, optional) – The attribution method to use. Options: “inputxgradient”, “saliency”, “integratedgradients”. Defaults to “inputxgradient”.
transform (str, optional) – Transformation to apply to the model output before attribution. Options: “specificity” (target - off_target) or “aggregate”. Defaults to “specificity”.
batch_size (int, optional) – Batch size. Defaults to 1.
num_workers (int, optional) – Number of worker processes for data loading. Defaults to 4.
device (str, optional) – Compute device to use (e.g., “cpu”, “cuda”, “cuda:0”). If None, automatically detects available device. Defaults to None.
gene_col (str, optional) – Name of the column in variants containing gene identifiers. If provided, variants are associated with specific genes. Defaults to None.
distance_type (str, optional) – Method to calculate distance between variant and gene. Options: “tss” (Transcription Start Site). Defaults to “tss”.
min_distance (float, optional) – Minimum distance from the gene feature (e.g., TSS) for a variant to be included. Defaults to 0.
max_distance (float, optional) – Maximum distance from the gene feature (e.g., TSS) for a variant to be included. Defaults to infinity.
genome (str, optional) – Genome assembly version (e.g., “hg38”). Defaults to “hg38”.
- Returns:
List of paths to the output HDF5 files.
- Return type:
List[str]
Examples
Compute attributions for variants in a VCF file for specific tasks:
>>> variant_effect_attribution(
...     variants="variants.vcf",
...     output_prefix="attributions",
...     tasks=[
...         "T_cell",
...         "B_cell",
...     ],
...     model=0,
...     metadata_anndata="results.h5ad",
... )
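Since the variants argument also accepts a .tsv path, a tab-separated input file could be prepared along the following lines. This is a sketch under assumptions: the column names (chrom, pos, ref, alt) and the rows are made up for illustration, and the names the library actually expects should be checked against its own documentation.

```python
import csv
import os
import tempfile

# Hypothetical column names covering chromosome, position, reference allele
# and alternate allele; the library may expect different names.
COLUMNS = ["chrom", "pos", "ref", "alt"]

# Made-up example variants, not real data.
rows = [
    {"chrom": "chr1", "pos": 1000018, "ref": "G", "alt": "A"},
    {"chrom": "chr1", "pos": 1002308, "ref": "T", "alt": "C"},
]


def write_variants_tsv(rows, path):
    """Write variant rows to a tab-separated file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_variants_tsv(path):
    """Read the file back as a list of dicts (all values come back as str)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Write into a fresh temporary directory to avoid clobbering existing files.
tsv_path = write_variants_tsv(rows, os.path.join(tempfile.mkdtemp(), "variants.tsv"))
loaded = read_variants_tsv(tsv_path)
```

The resulting path would then be passed as variants=tsv_path in a call like the one above.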