decima.data package

Submodules

decima.data.dataset module

class decima.data.dataset.HDF5Dataset(key, h5_file, ad=None, seq_len=524288, max_seq_shift=0, seed=0, augment_mode='random')[source]

Bases: Dataset

__getitem__(idx)[source]
__init__(key, h5_file, ad=None, seq_len=524288, max_seq_shift=0, seed=0, augment_mode='random')[source]
__len__()[source]
__parameters__ = ()
close()[source]
extract_label(idx)[source]
extract_seq(idx)[source]
extract_tasks(ad=None)[source]
class decima.data.dataset.VariantDataset(variants, metadata_anndata=None, seq_len=524288, max_seq_shift=0, include_cols=None, gene_col=None, min_from_end=0, distance_type='tss', min_distance=0, max_distance=inf)[source]

Bases: Dataset

Dataset for variant effect prediction

Parameters:
  • variants (pd.DataFrame) – DataFrame of variants

  • metadata_anndata (AnnData, optional) – AnnData object with gene metadata. Defaults to None.

  • seq_len (int) – Length of the input sequence. Defaults to 524288.

  • max_seq_shift (int) – Maximum sequence shift. Defaults to 0.

  • include_cols (list, optional) – Columns to include in the output. Defaults to None.

  • gene_col (str, optional) – Name of the column holding gene names. Defaults to None.

  • min_from_end (int) – Minimum distance from the end of the gene. Defaults to 0.

  • distance_type (str) – Type of distance. Defaults to 'tss'.

  • min_distance (int) – Minimum distance from the TSS. Defaults to 0.

  • max_distance (int or float) – Maximum distance from the TSS. Defaults to inf.

Returns:

Dataset for variant effect prediction

Return type:

Dataset

Examples

>>> import pandas as pd
>>> import anndata as ad
>>> from decima.data.dataset import VariantDataset
>>> variants = pd.read_csv("variants.csv")
>>> dataset = VariantDataset(variants)
>>> dataset[0]
{'seq': tensor([[1.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 1.0000],
                [0.0000, 1.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
                [0.0000, 0.0000, 1.0000,  ..., 0.0000, 1.0000, 0.0000],
                [0.0000, 0.0000, 0.0000,  ..., 1.0000, 0.0000, 0.0000],
                [0.0000, 0.0000, 1.0000,  ..., 1.0000, 0.0000, 0.0000]]), 'warning': []}
DEFAULT_COLUMNS = ['chrom', 'pos', 'ref', 'alt', 'gene', 'start', 'end', 'strand', 'gene_mask_start', 'gene_mask_end', 'rel_pos', 'ref_tx', 'alt_tx', 'tss_dist']
__annotations__ = {}
__getitem__(idx)[source]
__init__(variants, metadata_anndata=None, seq_len=524288, max_seq_shift=0, include_cols=None, gene_col=None, min_from_end=0, distance_type='tss', min_distance=0, max_distance=inf)[source]
__len__()[source]
__parameters__ = ()
collate_fn(batch)[source]
static overlap_genes(df_variants, df_genes, gene_col=None, include_cols=None, min_from_end=0, distance_type='tss', min_distance=0, max_distance=inf)[source]
validate_allele_seq(gene, variant)[source]
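
The min_distance, max_distance and distance_type='tss' parameters suggest that variants are filtered by their distance to a gene's transcription start site. The following is a minimal sketch of that idea only, not the library's implementation. The helper names tss_distance and within_distance are hypothetical, and the coordinate convention (TSS at start on the + strand, at end on the - strand, absolute distance, inclusive bounds) is an assumption.

```python
from math import inf


def tss_distance(pos, gene_start, gene_end, strand):
    """Signed distance from a variant position to the TSS.

    Assumption: the TSS is gene_start on the '+' strand and gene_end on
    the '-' strand; positive values point in the direction of transcription.
    """
    if strand == "+":
        return pos - gene_start
    return gene_end - pos


def within_distance(pos, gene_start, gene_end, strand,
                    min_distance=0, max_distance=inf):
    """Keep a variant only if its absolute TSS distance lies in the
    inclusive range [min_distance, max_distance] (assumed semantics)."""
    d = abs(tss_distance(pos, gene_start, gene_end, strand))
    return min_distance <= d <= max_distance


# The same variant position is near the TSS of a '+' strand gene
# but far from the TSS of a '-' strand gene with the same coordinates.
print(within_distance(1500, 1000, 5000, "+", max_distance=1000))
print(within_distance(1500, 1000, 5000, "-", max_distance=1000))
```

With the default max_distance=inf the upper bound never excludes anything, which matches the default in the signature above.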

decima.data.preprocess module

decima.data.preprocess.aggregate_anndata(ad, by_cols=['cell_type', 'tissue', 'organ', 'disease', 'study', 'dataset', 'region', 'subregion', 'celltype_coarse'], sum_cols=['n_cells'])[source]
decima.data.preprocess.assign_borzoi_folds(ad, splits)[source]
decima.data.preprocess.change_values(df, col, value_dict)[source]
decima.data.preprocess.get_frac_N(interval, genome='hg38')[source]
decima.data.preprocess.load_ncbi_string(string)[source]
decima.data.preprocess.make_inputs(gene, ad)[source]
decima.data.preprocess.match_cellranger_2024(ad, genes24)[source]
decima.data.preprocess.match_ncbi(ad, ncbi)[source]
decima.data.preprocess.match_ref_ad(ad, ref_ad)[source]
decima.data.preprocess.merge_transcripts(gtf)[source]
decima.data.preprocess.var_to_intervals(ad, chr_end_pad=10000, genome='hg38', seq_len=524288, crop_coords=163840)[source]
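
The name get_frac_N suggests it reports the fraction of ambiguous (N) bases in a genomic interval. As a hedged illustration of that quantity, the sketch below computes it on a sequence string that is already in memory. The function frac_n is hypothetical; the real function takes an interval and a genome name, and how it obtains the sequence is not shown in this reference.

```python
def frac_n(seq):
    """Fraction of bases in seq that are N (case-insensitive).

    Hypothetical stand-in: works on a plain string rather than on a
    genomic interval. Returns 0.0 for an empty sequence.
    """
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


# Half of this 8-base sequence is ambiguous.
print(frac_n("ACGTNNNN"))
```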

decima.data.read_hdf5 module

decima.data.read_hdf5.count_genes(h5_file, key=None)[source]
decima.data.read_hdf5.extract_gene_data(h5_file, gene, seq_len=524288, merge=True)[source]
decima.data.read_hdf5.get_gene_idx(h5_file, gene, key=None)[source]
decima.data.read_hdf5.index_genes(h5_file, key=None)[source]
decima.data.read_hdf5.list_genes(h5_file, key=None)[source]
decima.data.read_hdf5.mutate(seq, allele, pos)[source]
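
The signature mutate(seq, allele, pos) suggests substituting an allele into a sequence at a given position. The sketch below shows that operation on a plain string. It is an assumption-laden illustration: mutate_str is a hypothetical name, pos is taken to be 0-based, and the real function may operate on an encoded array rather than a string.

```python
def mutate_str(seq, allele, pos):
    """Return a copy of seq with allele written over it starting at pos.

    Assumptions: pos is 0-based and the allele overwrites len(allele)
    bases, so the sequence length is preserved.
    """
    if pos < 0 or pos + len(allele) > len(seq):
        raise ValueError("allele does not fit inside the sequence")
    return seq[:pos] + allele + seq[pos + len(allele):]


ref = "ACGTAC"
alt = mutate_str(ref, "T", 2)  # replace the G at index 2 with T
print(ref, alt)
```

Preserving the length keeps the mutated sequence compatible with a fixed seq_len input; handling insertions or deletions would need extra padding or trimming logic that is not sketched here.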

decima.data.read_vcf module

decima.data.write_hdf5 module

decima.data.write_hdf5.write_hdf5(file, ad, pad=0)[source]

Module contents