decima.data package
Submodules
decima.data.dataset module
- class decima.data.dataset.HDF5Dataset(key, h5_file, ad=None, seq_len=524288, max_seq_shift=0, seed=0, augment_mode='random')
Bases: Dataset
- __init__(key, h5_file, ad=None, seq_len=524288, max_seq_shift=0, seed=0, augment_mode='random')
- __parameters__ = ()
- class decima.data.dataset.VariantDataset(variants, metadata_anndata=None, seq_len=524288, max_seq_shift=0, include_cols=None, gene_col=None, min_from_end=0, distance_type='tss', min_distance=0, max_distance=inf)
Bases: Dataset
Dataset for variant effect prediction.
- Parameters:
variants (pd.DataFrame) – DataFrame with variants
metadata_anndata (AnnData, optional) – AnnData object with gene metadata (default None)
seq_len (int) – Length of the sequence in base pairs (default 524288)
max_seq_shift (int) – Maximum sequence shift in base pairs (default 0)
include_cols (list, optional) – List of columns to include in the output (default None)
gene_col (str, optional) – Column name for gene names (default None)
min_from_end (int) – Minimum distance from the end of the gene (default 0)
distance_type (str) – Type of distance (default 'tss')
min_distance (int) – Minimum distance from the TSS (default 0)
max_distance (float) – Maximum distance from the TSS (default inf, i.e. no upper limit)
- Returns:
Dataset for variant effect prediction
- Return type:
Dataset
Examples
>>> import pandas as pd
>>> import anndata as ad
>>> from decima.data.dataset import VariantDataset
>>> variants = pd.read_csv("variants.csv")
>>> dataset = VariantDataset(variants)
>>> dataset[0]
{'seq': tensor([[1.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 1.0000],
                [0.0000, 1.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
                [0.0000, 0.0000, 1.0000, ..., 0.0000, 1.0000, 0.0000],
                [0.0000, 0.0000, 0.0000, ..., 1.0000, 0.0000, 0.0000],
                [0.0000, 0.0000, 1.0000, ..., 1.0000, 0.0000, 0.0000]]),
 'warning': []}
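The min_distance and max_distance parameters suggest that variants are filtered by their distance to the TSS. The following is a minimal sketch of what such a filter might look like, using only the standard library. It is not Decima's implementation: the helper filter_by_distance, the example records, and the choice to compare absolute distances with inclusive bounds are all assumptions made for illustration. Only the field names (taken from DEFAULT_COLUMNS) and the default bounds (0 and inf) come from the documentation above.

```python
import math

# Hypothetical variant records. The keys mirror entries of DEFAULT_COLUMNS;
# the values are invented for illustration only.
variants = [
    {"chrom": "chr1", "pos": 1_000_500, "ref": "A", "alt": "G",
     "gene": "GENE_A", "tss_dist": 500},
    {"chrom": "chr1", "pos": 1_150_000, "ref": "C", "alt": "T",
     "gene": "GENE_A", "tss_dist": -150_000},
    {"chrom": "chr2", "pos": 2_000_010, "ref": "G", "alt": "A",
     "gene": "GENE_B", "tss_dist": 10},
]


def filter_by_distance(records, min_distance=0, max_distance=math.inf):
    """Keep records whose absolute TSS distance is in [min_distance, max_distance].

    Assumption: distances are compared by absolute value and both bounds
    are inclusive. The real VariantDataset filter may behave differently.
    """
    return [
        r for r in records
        if min_distance <= abs(r["tss_dist"]) <= max_distance
    ]


# Default bounds (0 and inf) keep every record.
kept_all = filter_by_distance(variants)

# An upper bound drops the distal variant.
kept_near = filter_by_distance(variants, max_distance=100_000)

# Adding a lower bound also drops the variant closest to the TSS.
kept_mid = filter_by_distance(variants, min_distance=100, max_distance=100_000)
```

With a pandas DataFrame, the same idea would be expressed as a boolean mask over the tss_dist column. Using inf as the default upper bound means no variant is excluded for being too far away unless a finite limit is passed.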
- DEFAULT_COLUMNS = ['chrom', 'pos', 'ref', 'alt', 'gene', 'start', 'end', 'strand', 'gene_mask_start', 'gene_mask_end', 'rel_pos', 'ref_tx', 'alt_tx', 'tss_dist']
- __annotations__ = {}
- __init__(variants, metadata_anndata=None, seq_len=524288, max_seq_shift=0, include_cols=None, gene_col=None, min_from_end=0, distance_type='tss', min_distance=0, max_distance=inf)
- __parameters__ = ()