decima package
Subpackages
- decima.cli package
- decima.core package
- Submodules
- decima.core.attribution module
Attribution: __init__(), __repr__(), __sub__(), chrom, end, fasta_str(), find_peaks(), from_seq(), gene_end, gene_start, peaks, peaks_to_bed(), plot_peaks(), plot_seqlogo(), save_bigwig(), save_fasta(), save_peaks(), scan_motifs(), start, strand
AttributionResult: __enter__(), __exit__(), __init__(), __repr__(), aggregate(), close(), load(), load_attribution(), open(), recursive_seqlet_calling()
VariantAttributionResult
- decima.core.metadata module
CellMetadata: name, cell_type, tissue, organ, disease, study, dataset, region, subregion, celltype_coarse, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, __annotations__, __dataclass_fields__, __dataclass_params__, __eq__(), __hash__, __init__(), __match_args__, __repr__(), co_name, co_term, frac_nan, from_series()
GeneMetadata: name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, frac_N, fold, dataset, gene_id, pearson, size_factor_pearson, __annotations__, __dataclass_fields__, __dataclass_params__, __eq__(), __hash__, __init__(), __match_args__, __repr__(), downstream_bases, ensembl_canonical_tss, from_series(), max_counts, upstream_bases
- decima.core.result module
DecimaResult: __annotations__, __init__(), __repr__(), assert_genes(), attributions(), cell_metadata, cells, correlation(), gene_metadata, gene_sequence(), genes, get_cell_metadata(), get_gene_metadata(), ground_truth, load(), load_model(), marker_zscores(), model, plot_correlation(), predicted_expression_matrix(), predicted_gene_expression(), prepare_one_hot(), query_cells(), query_tasks(), shape
- Module contents
- decima.data package
- Submodules
- decima.data.dataset module
GeneDataset
HDF5Dataset
SeqDataset
VariantDataset: DEFAULT_COLUMNS, __annotations__, __getitem__(), __init__(), __len__(), __parameters__, __repr__(), collate_fn(), overlap_genes(), predicted_expression_cache(), validate_allele_seq()
- decima.data.preprocess module
- decima.data.read_hdf5 module
- decima.data.write_hdf5 module
- Module contents
- decima.hub package
- decima.interpret package
- decima.model package
- Submodules
- decima.model.decima_model module
- decima.model.lightning module
EnsembleLightningModel: __annotations__, __init__(), add_transform(), forward(), load_from_checkpoints(), predict_on_dataset(), predict_step(), reset_transform(), test_step(), training_step(), transform(), validation_step()
GeneMaskLightningModel
LightningModel: __annotations__, __init__(), add_transform(), configure_optimizers(), count_params(), format_input(), forward(), get_task_idxs(), load_safetensor(), make_predict_loader(), make_test_loader(), make_train_loader(), on_save_checkpoint(), on_test_epoch_end(), on_validation_epoch_end(), parse_logger(), predict_on_dataset(), predict_step(), reset_transform(), test_step(), train_on_dataset(), training_step(), validation_step()
- decima.model.loss module
- decima.model.metrics module
- Module contents
- decima.plot package
- decima.tools package
- decima.train package
- decima.utils package
- Submodules
- decima.utils.dataframe module
- decima.utils.inject module
- decima.utils.io module
- decima.utils.motifs module
- decima.utils.qc module
- decima.utils.sequence module
- decima.utils.variant module
- Module contents
- decima.vep package
Submodules
decima.constants module
Decima constants.
Module contents
- class decima.DecimaResult(anndata)
Bases: object
Container for Decima results and model predictions.
This class provides a unified interface for loading pre-trained Decima models and associated metadata, making predictions, and performing attribution analyses.
- The DecimaResult object contains:
An AnnData object with gene expression and metadata
A trained model for making predictions
Methods for attribution analysis and interpretation
- Parameters:
anndata – AnnData object containing gene expression data and metadata
Examples
>>> # Load default pre-trained model and metadata
>>> result = DecimaResult.load()
>>> result.load_model(model=0)
>>> # Perform attribution analysis
>>> attributions = result.attributions(
...     gene="SPI1",
...     tasks='cell_type == "classical monocyte"',
... )
- Properties:
model: Decima model
genes: List of gene names
cells: List of cell names
cell_metadata: Cell metadata
gene_metadata: Gene metadata
shape: Shape of the expression matrix
attributions: Attributions for a gene
- attributions(gene, tasks=None, off_tasks=None, transform='specificity', method='inputxgradient', threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, genome='hg38')
Get attributions for a specific gene.
- Parameters:
gene (str) – Gene name
tasks (Optional[List[str]]) – List of cells to use as on task
off_tasks (Optional[List[str]]) – List of cells to use as off task
transform (str) – Attribution transform method
method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.
threshold (float) – Threshold for attribution analysis
min_seqlet_len (int) – Minimum length for seqlet calling
max_seqlet_len (int) – Maximum length for seqlet calling
additional_flanks (int) – Additional flanks for seqlet calling
genome (str) – Genome to use for attribution analysis. Default is “hg38”. Can be a genome name or a path to a custom genome FASTA file.
- Returns:
Container with inputs, predictions, attribution scores and TSS position
- Return type:
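To illustrate what the “inputxgradient” option means, here is a minimal sketch on a toy linear model, where the gradient with respect to each input is simply its weight. The helper name, the weights, and the positions-by-bases layout are assumptions made for this illustration; they are not taken from Decima, whose model is a deep network and whose tensor layout may differ.

```python
def input_x_gradient(one_hot, weights):
    """Input x gradient for a toy linear model f(x) = sum(w * x).

    For a linear model the gradient of f with respect to each input equals
    the matching weight, so the attribution is the elementwise product x * w.
    """
    return [
        [x * w for x, w in zip(x_row, w_row)]
        for x_row, w_row in zip(one_hot, weights)
    ]


# Toy sequence "ACG": one row per position, columns ordered A, C, G, T.
one_hot = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
# Hypothetical per-position, per-base weights of the toy model.
weights = [
    [0.5, -0.1, 0.0, 0.2],
    [0.3, 0.8, -0.4, 0.0],
    [0.0, 0.1, -0.6, 0.9],
]

attr = input_x_gradient(one_hot, weights)
# Only the observed base at each position keeps a non-zero score.
per_position = [sum(row) for row in attr]
print(per_position)
```

Because the input is one-hot, multiplying by it masks out the gradient at every base that is not present in the sequence, which is why this method yields one score per position.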
- correlation(tasks, off_tasks, dataset='test')
Compute the correlation between the ground truth and the predicted expression.
- Parameters:
tasks – List of cells to use as on task.
off_tasks – List of cells to use as off task.
dataset – Dataset to use for computation.
- Returns:
Pearson correlation coefficient.
- Return type:
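The following sketch shows one plausible reading of this computation: take the per-gene difference between the mean over on-task cells and the mean over off-task cells, once for the ground truth and once for the predictions, then compute the Pearson correlation of the two difference vectors. This is an assumption about the general idea, not the library's implementation; the helper names and toy matrices are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def on_off_difference(matrix, on_idx, off_idx):
    """Per-gene mean over on-task rows minus mean over off-task rows.

    `matrix` has one row per cell (task) and one column per gene.
    """
    n_genes = len(matrix[0])
    return [
        mean([matrix[i][g] for i in on_idx]) - mean([matrix[i][g] for i in off_idx])
        for g in range(n_genes)
    ]


# Toy data: 3 cells x 4 genes.
truth = [
    [5.0, 1.0, 3.0, 2.0],
    [4.0, 2.0, 3.0, 1.0],
    [1.0, 1.0, 3.0, 4.0],
]
preds = [
    [4.0, 1.5, 2.5, 2.0],
    [4.0, 1.5, 3.5, 2.0],
    [2.0, 1.0, 3.0, 3.0],
]

on_cells, off_cells = [0, 1], [2]
r = pearson(
    on_off_difference(truth, on_cells, off_cells),
    on_off_difference(preds, on_cells, off_cells),
)
print(round(r, 3))
```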
- classmethod load(anndata_name_or_path=None)
Load a DecimaResult object from an anndata file or a path to an anndata file.
- Parameters:
- Returns:
DecimaResult object
Examples
>>> result = DecimaResult.load()  # Load default decima metadata
>>> result = DecimaResult.load(
...     "path/to/anndata.h5ad"
... )  # Load custom anndata object from file
- load_model(model='v1_rep0', device='cpu')
Load the trained model from a checkpoint path.
- Parameters:
- Returns:
self
Examples
>>> result = DecimaResult.load()
>>> result.load_model()  # Load default model (rep0)
>>> result.load_model(model="path/to/checkpoint.ckpt")
>>> result.load_model(model=2)
- marker_zscores(tasks, off_tasks=None, layer='preds')
Compute marker z-scores to identify differentially expressed genes.
- Parameters:
tasks – Target cells. Query string or list of cell IDs.
off_tasks – Background cells. Query string, list of cell IDs, or None (uses all other cells).
layer – Expression data layer. “preds” (default), “expression”, or custom layer name.
- Returns:
Columns are ‘gene’, ‘score’ (z-score), ‘task’.
- Return type:
Examples
>>> # Classical monocytes vs all others
>>> markers = result.marker_zscores(
...     "cell_type == 'classical monocyte'"
... )
>>> top_genes = markers.nlargest(10, "score")

>>> markers = result.marker_zscores(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
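As a rough intuition for what a marker z-score captures, the sketch below standardizes each gene across all cells and then averages the standardized values over the target cells. This is an assumed, simplified definition for illustration only; the exact formula used by `marker_zscores` is not stated on this page, and the function name and toy data are hypothetical.

```python
from statistics import mean, pstdev


def toy_marker_zscores(expr, genes, on_cells):
    """Rank genes by their mean z-score in the target (on-task) cells.

    `expr` maps each cell name to a list of per-gene values, in the same
    order as `genes`. Each gene is z-scored across all cells first.
    """
    cells = list(expr)
    scores = {}
    for g, gene in enumerate(genes):
        values = [expr[c][g] for c in cells]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            # A gene with constant expression carries no marker signal.
            scores[gene] = 0.0
            continue
        scores[gene] = mean([(expr[c][g] - mu) / sigma for c in on_cells])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


genes = ["SPI1", "CD3E", "ACTB"]
expr = {
    "mono1": [5.0, 0.0, 3.0],
    "mono2": [5.0, 0.0, 3.0],
    "tcell1": [1.0, 4.0, 3.0],
    "tcell2": [1.0, 4.0, 3.0],
}

ranked = toy_marker_zscores(expr, genes, on_cells=["mono1", "mono2"])
for gene, score in ranked:
    print(gene, score)
```

Genes that are high in the target cells and low elsewhere rise to the top of the ranking, which mirrors how the `nlargest(10, "score")` call in the example above picks candidate markers.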
- property model¶
Decima model.
- plot_correlation(tasks, off_tasks, dataset='test')
Plot the correlation between the ground truth and the predicted expression.
- Parameters:
tasks – List of cells to use as on task.
off_tasks – List of cells to use as off task.
dataset – Dataset to use for computation.
- Returns:
Plot of the correlation between the ground truth and the predicted expression.
- Return type:
p9.ggplot
Examples
>>> result = DecimaResult.load()
>>> result.plot_correlation(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
- predicted_expression_matrix(genes=None, model_name=None)
Get predicted expression matrix for all or specific genes.
- predicted_gene_expression(gene, model_name)
Get predicted expression for a specific gene.
- Parameters:
gene – Gene name
model_name – Model name
- Returns:
Predicted expression for the gene
- Return type:
torch.Tensor
- prepare_one_hot(gene, variants=None, padding=0, genome='hg38')
Prepare one-hot encoding for a gene.
- Parameters:
- Returns:
One-hot encoding of the gene
- Return type:
torch.Tensor
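For readers unfamiliar with one-hot encoding of DNA, the sketch below encodes a short sequence as four rows (A, C, G, T) with one column per base. It uses plain Python lists rather than a `torch.Tensor`, and the function name is hypothetical; the actual tensor layout returned by `prepare_one_hot`, including any additional channels, is not specified on this page.

```python
ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Encode a DNA string as a 4 x len(seq) nested list.

    Rows follow the order A, C, G, T. Bases outside the alphabet
    (for example N) become all-zero columns.
    """
    seq = seq.upper()
    return [
        [1.0 if base == letter else 0.0 for base in seq]
        for letter in ALPHABET
    ]


encoded = one_hot_encode("acgtN")
for letter, row in zip(ALPHABET, encoded):
    print(letter, row)
```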
- query_cells(query)
Query cells based on a query string.
- Parameters:
query (str) – Query string
- Returns:
List of cell names
Examples
>>> result = DecimaResult.load()
>>> cells = result.query_cells(
...     "cell_type == 'classical monocyte'"
... )
>>> cells
['agg1', 'agg2', 'agg3', ...]
- decima.predict_attributions_seqlet_calling(output_prefix, genes=None, seqs=None, tasks=None, off_tasks=None, model='ensemble', metadata_anndata=None, method='inputxgradient', transform='specificity', num_workers=2, tss_distance=None, batch_size=1, top_n_markers=None, device=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13', genome='hg38')
Generate and save attribution analysis results for one or more genes. This function performs attribution analysis for the given genes and cell types, saving the following output files under the specified prefix:
- Output files:
├── {output_prefix}.attributions.h5 # Raw attribution score matrix per gene.
├── {output_prefix}.attributions.bigwig # Genome browser track of attribution as bigwig file.
├── {output_prefix}.seqlets.bed # List of attribution peaks in BED format.
├── {output_prefix}.motifs.tsv # Detected motifs in peak regions.
└── {output_prefix}.warnings.qc.log # QC warnings about prediction reliability.
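After a run, it can be handy to check which of the files listed above were actually written. The sketch below builds the expected paths from a prefix using only the suffixes shown in the tree; the helper name and the example prefix are hypothetical and not part of Decima.

```python
from pathlib import Path

# Suffixes copied from the output file listing above.
SUFFIXES = [
    ".attributions.h5",
    ".attributions.bigwig",
    ".seqlets.bed",
    ".motifs.tsv",
    ".warnings.qc.log",
]


def expected_outputs(output_prefix):
    """Return the paths that should exist for a given output prefix."""
    return [Path(f"{output_prefix}{suffix}") for suffix in SUFFIXES]


paths = expected_outputs("results/spi1_monocytes")
# Report any expected file that is not present on disk.
missing = [path for path in paths if not path.exists()]
for path in missing:
    print(f"missing: {path}")
```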
- Parameters:
output_prefix – Prefix for the output file paths
genes – Gene symbols or IDs to analyze
tasks (Optional[List[str]]) – Cell types to analyze attributions for: either a list of task names or a query string that filters cell types (e.g. “cell_type == 'classical monocyte'”). If not provided, all tasks will be analyzed.
off_tasks (Optional[List[str]]) – Optional cell types to contrast against: either a list of task names or a query string that filters cell types (e.g. “cell_type == 'classical monocyte'”). If not provided, all tasks will be used as off tasks.
model (Union[int, str, None]) – Optional model to use for attribution analysis: either a replicate number or a path to the model.
method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.
device (Optional[str]) – Device to use for attribution analysis (e.g. 'cuda', 'cpu'). If not provided, the best available device will be used automatically.
dpi – DPI for attribution plots.
- Raises:
FileExistsError – If output directory already exists.
Examples
>>> predict_attributions_seqlet_calling(
...     output_prefix="output_prefix",
...     genes=["SPI1", "CD68"],
...     tasks="cell_type == 'classical monocyte'",
... )
- decima.predict_variant_effect(df_variant, output_pq=None, tasks=None, model='ensemble', metadata_anndata=None, chunksize=10000, batch_size=1, num_workers=16, device=None, include_cols=None, gene_col=None, distance_type='tss', min_distance=0, max_distance=inf, genome='hg38', save_replicates=False, reference_cache=True, float_precision='32')
Predict variant effect and save to parquet.
- Parameters:
df_variant (pd.DataFrame or str) – DataFrame with variant information or path to variant file
output_pq (str, optional) – Path to save the parquet file. Defaults to None.
tasks (str, optional) – Tasks to predict. Defaults to None.
model (int, optional) – Model to use. Defaults to DEFAULT_ENSEMBLE.
metadata_anndata (str, optional) – Path to anndata file. Defaults to None.
chunksize (int, optional) – Number of variants to predict in each chunk. Defaults to 10_000.
batch_size (int, optional) – Batch size. Defaults to 1.
num_workers (int, optional) – Number of workers. Defaults to 16.
device (str, optional) – Device to use. Defaults to None.
include_cols (list, optional) – Columns to include in the output. Defaults to None.
gene_col (str, optional) – Column name for gene names. Defaults to None.
distance_type (str, optional) – Type of distance. Defaults to “tss”.
min_distance (float, optional) – Minimum distance from the end of the gene. Defaults to 0 (inclusive).
max_distance (float, optional) – Maximum distance from the TSS. Defaults to inf (exclusive).
genome (str, optional) – Genome name or path to the genome fasta file. Defaults to “hg38”.
save_replicates (bool, optional) – Save the replicates in the output. Defaults to False.
reference_cache (bool, optional) – Whether to use reference cache. Defaults to True.
float_precision (str, optional) – Floating-point precision. Defaults to “32”.
- Return type:
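The `min_distance` and `max_distance` bounds above are described as inclusive and exclusive respectively. The sketch below shows that half-open interval on a few toy variants, assuming an unsigned distance to the TSS; the function name, variant identifiers, and coordinates are made up for illustration, and how Decima measures distance for other `distance_type` values is not covered here.

```python
import math


def within_distance(variant_pos, tss_pos, min_distance=0, max_distance=math.inf):
    """Check min_distance <= |variant - TSS| < max_distance.

    The lower bound is inclusive and the upper bound is exclusive,
    matching the parameter descriptions above.
    """
    distance = abs(variant_pos - tss_pos)
    return min_distance <= distance < max_distance


tss = 1_000_000
variants = [
    ("var_a", 1_000_050),  # 50 bp from the TSS
    ("var_b", 1_100_000),  # exactly 100 kb away, excluded by the exclusive bound
    ("var_c", 1_000_000),  # on the TSS itself, kept by the inclusive bound
]

kept = [name for name, pos in variants if within_distance(pos, tss, 0, 100_000)]
print(kept)
```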