decima package

Subpackages

Submodules

decima.constants module

Decima constants.

Module contents

class decima.DecimaResult(anndata)[source]

Bases: object

Container for Decima results and model predictions.

This class provides a unified interface for loading pre-trained Decima models and associated metadata, making predictions, and performing attribution analyses.

The DecimaResult object contains:
  • An AnnData object with gene expression and metadata

  • A trained model for making predictions

  • Methods for attribution analysis and interpretation

Parameters:

anndata – AnnData object containing gene expression data and metadata

Examples

>>> # Load default pre-trained model and metadata
>>> result = DecimaResult.load()
>>> result.load_model(
...     model=0
... )
>>> # Perform attribution analysis
>>> attributions = result.attributions(
...     gene="SPI1",
...     tasks='cell_type == "classical monocyte"',
... )
Properties:
  • model – Decima model

  • genes – List of gene names

  • cells – List of cell names

  • cell_metadata – Cell metadata

  • gene_metadata – Gene metadata

  • shape – Shape of the expression matrix

  • attributions – Attributions for a gene

__init__(anndata)[source]
__repr__()[source]

Return repr(self).

assert_genes(genes)[source]

Check if the genes are in the dataset.

Return type:

bool

attributions(gene, tasks=None, off_tasks=None, transform='specificity', method='inputxgradient', threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, genome='hg38')[source]

Get attributions for a specific gene.

Parameters:
  • gene (str) – Gene name

  • tasks (Optional[List[str]]) – List of cells to use as on task

  • off_tasks (Optional[List[str]]) – List of cells to use as off task

  • transform (str) – Attribution transform method

  • method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.

  • threshold (float) – Threshold for attribution analysis

  • min_seqlet_len (int) – Minimum length for seqlet calling

  • max_seqlet_len (int) – Maximum length for seqlet calling

  • additional_flanks (int) – Additional flanks for seqlet calling

  • genome (str) – Genome to use for attribution analysis. Can be a genome name or a path to a custom genome fasta file. Default: “hg38”

Returns:

Container with inputs, predictions, attribution scores and TSS position

Return type:

Attribution
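
To make the “inputxgradient” option concrete, here is a minimal sketch of the general input-times-gradient idea on a toy linear model. It is not Decima's implementation: the `weights` table, the `one_hot` helper and the `input_x_gradient` function are hypothetical and exist only for illustration.

```python
# Toy illustration of input-x-gradient attribution; NOT Decima's implementation.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one 4-element row (A, C, G, T) per position."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def input_x_gradient(x, weights):
    """For a linear model f(x) = sum(w * x), the gradient df/dx equals w,
    so the attribution is the element-wise product x * w."""
    return [
        [xi * wi for xi, wi in zip(x_row, w_row)]
        for x_row, w_row in zip(x, weights)
    ]


# Hypothetical per-position, per-base weights for a 4-bp sequence.
weights = [
    [0.5, -0.1, 0.0, 0.2],
    [0.0, 0.3, -0.4, 0.1],
    [-0.2, 0.0, 0.6, 0.0],
    [0.1, 0.1, 0.1, -0.7],
]

attr = input_x_gradient(one_hot("ACGT"), weights)
# Only the observed base contributes at each position.
per_position = [sum(row) for row in attr]
print(per_position)  # [0.5, 0.3, 0.6, -0.7]
```

Because the input is one-hot, multiplying by the input keeps only the gradient at the base that is actually present, which is why the result reduces to one score per position.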

property cell_metadata: DataFrame

Cell metadata including annotations, metrics, etc.

property cells: List[str]

List of cell identifiers in the dataset.

correlation(tasks, off_tasks, dataset='test')[source]

Compute the correlation between the ground truth and the predicted expression.

Parameters:
  • tasks – List of cells to use as on task.

  • off_tasks – List of cells to use as off task.

  • dataset – Dataset to use for computation.

Returns:

Pearson correlation coefficient.

Return type:

float
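
As a rough sketch of what such a correlation could look like, the snippet below computes a Pearson coefficient between ground-truth and predicted on-minus-off differences per gene. The use of an on-minus-off difference is an assumption made for illustration, and the `pearson` and `on_off_difference` helpers and the toy matrices are hypothetical, not part of Decima.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def on_off_difference(matrix, on_cells, off_cells):
    """Per-gene mean over on cells minus mean over off cells.
    `matrix` maps a cell name to a list of per-gene values."""
    n_genes = len(next(iter(matrix.values())))

    def mean_over(cells, g):
        return sum(matrix[c][g] for c in cells) / len(cells)

    return [mean_over(on_cells, g) - mean_over(off_cells, g) for g in range(n_genes)]


# Hypothetical toy data: three cells, three genes.
truth = {"mono_1": [5.0, 1.0, 3.0], "mono_2": [7.0, 1.0, 3.0], "lymph_1": [1.0, 4.0, 3.0]}
pred = {"mono_1": [4.0, 2.0, 3.0], "mono_2": [6.0, 1.0, 2.0], "lymph_1": [2.0, 3.0, 3.0]}

on, off = ["mono_1", "mono_2"], ["lymph_1"]
r = pearson(on_off_difference(truth, on, off), on_off_difference(pred, on, off))
print(round(r, 3))
```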

property gene_metadata: DataFrame

Gene metadata.

gene_sequence(gene, stranded=True, genome='hg38')[source]

Get sequence for a gene.

Parameters:
  • gene (str) – Gene name

  • stranded (bool) – Whether to return stranded sequence

  • genome (str) – Genome name or path to the genome fasta file. Default: “hg38”

Returns:

Sequence for the gene

Return type:

str
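
A common meaning of a stranded sequence is that minus-strand genes are returned as the reverse complement of the forward-strand sequence. The sketch below assumes that interpretation; the `stranded_sequence` helper and its `strand` argument are hypothetical and do not reflect Decima's internals.

```python
# Translation table for complementing DNA, including N and lowercase bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stranded_sequence(forward_seq, strand, stranded=True):
    """Assumed behavior: reverse-complement only when the gene is on the
    minus strand and a stranded sequence was requested."""
    if stranded and strand == "-":
        return reverse_complement(forward_seq)
    return forward_seq


print(stranded_sequence("ATGC", "-"))                  # GCAT
print(stranded_sequence("ATGC", "-", stranded=False))  # ATGC
```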

property genes: List[str]

List of gene names in the dataset.

get_cell_metadata(cell)[source]

Get metadata for a specific cell.

Return type:

CellMetadata

get_gene_metadata(gene)[source]

Get metadata for a specific gene.

Return type:

GeneMetadata

property ground_truth: DataFrame

Ground truth expression matrix.

classmethod load(anndata_name_or_path=None)[source]

Load a DecimaResult object from a model name, a path to an AnnData file, or an AnnData object.

Parameters:
  • anndata_name_or_path (Union[str, AnnData, None]) – Name of the model, path to an AnnData file, or an AnnData object

  • model – Model name or path to model checkpoint. If not provided, the default model will be loaded.

Returns:

DecimaResult object

Examples

>>> result = DecimaResult.load()  # Load default decima metadata
>>> result = DecimaResult.load(
...     "path/to/anndata.h5ad"
... )  # Load custom anndata object from file
load_model(model='v1_rep0', device='cpu')[source]

Load the trained model from a checkpoint path.

Parameters:
  • model (Union[int, str, None]) – Path to model checkpoint or replicate number (0-3) for pre-trained models

  • device (str) – Device to load model on

Returns:

self

Examples

>>> result = DecimaResult.load()
>>> result.load_model()  # Load default model (rep0)
>>> result.load_model(
...     model="path/to/checkpoint.ckpt"
... )
>>> result.load_model(
...     model=2
... )
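
The `model` argument accepts either a replicate number or a path. One way such an argument could be normalized is sketched below. The mapping from an integer `n` to the name `v1_rep{n}` is an assumption inferred from the default value 'v1_rep0', and `resolve_model` is a hypothetical helper, not a Decima function.

```python
from pathlib import Path


def resolve_model(model="v1_rep0"):
    """Normalize a `model` argument to a (kind, value) pair.

    Assumptions (not confirmed by the documentation):
      * an integer replicate n in 0-3 maps to the name "v1_rep{n}"
      * a string ending in ".ckpt" is treated as a checkpoint path
    """
    if model is None:
        model = "v1_rep0"
    if isinstance(model, int):
        if not 0 <= model <= 3:
            raise ValueError(f"replicate must be in 0-3, got {model}")
        return ("pretrained", f"v1_rep{model}")
    if str(model).endswith(".ckpt"):
        return ("checkpoint", Path(model))
    return ("pretrained", str(model))


print(resolve_model(2))
print(resolve_model("path/to/checkpoint.ckpt"))
```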
marker_zscores(tasks, off_tasks=None, layer='preds')[source]

Compute marker z-scores to identify differentially expressed genes.

Parameters:
  • tasks – Target cells. Query string or list of cell IDs.

  • off_tasks – Background cells. Query string, list of cell IDs, or None (uses all other cells).

  • layer – Expression data layer. “preds” (default), “expression”, or custom layer name.

Returns:

Columns are ‘gene’, ‘score’ (z-score), ‘task’.

Return type:

pandas.DataFrame

Examples

>>> # Classical monocytes vs all others
>>> markers = result.marker_zscores(
...     "cell_type == 'classical monocyte'"
... )
>>> top_genes = markers.nlargest(
...     10, "score"
... )
>>> markers = result.marker_zscores(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
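
The exact scoring formula is not stated here. As one plausible sketch, the snippet below standardizes each gene across all cells and scores it by the mean z-score in the target cells minus the mean z-score in the background cells. The formula, the `toy_marker_zscores` helper and the toy expression values are all assumptions for illustration.

```python
from statistics import mean, pstdev


def toy_marker_zscores(expression, on_cells, off_cells=None, task="on"):
    """`expression` maps gene -> {cell: value}. Returns rows with the
    keys 'gene', 'score' and 'task', sorted by descending score."""
    rows = []
    for gene, by_cell in expression.items():
        # With no explicit background, use all other cells.
        off = list(off_cells) if off_cells is not None else [
            c for c in by_cell if c not in on_cells
        ]
        values = list(by_cell.values())
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            score = 0.0  # a constant gene carries no marker signal
        else:
            z = {c: (v - mu) / sigma for c, v in by_cell.items()}
            score = mean(z[c] for c in on_cells) - mean(z[c] for c in off)
        rows.append({"gene": gene, "score": score, "task": task})
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Hypothetical expression values for three genes in four cells.
expression = {
    "SPI1": {"mono_1": 5.0, "mono_2": 6.0, "lymph_1": 1.0, "lymph_2": 2.0},
    "CD19": {"mono_1": 1.0, "mono_2": 1.0, "lymph_1": 6.0, "lymph_2": 5.0},
    "ACTB": {"mono_1": 4.0, "mono_2": 4.0, "lymph_1": 4.0, "lymph_2": 4.0},
}

markers = toy_marker_zscores(expression, on_cells=["mono_1", "mono_2"], task="monocyte")
print([row["gene"] for row in markers])
```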
property model

Decima model.

plot_correlation(tasks, off_tasks, dataset='test')[source]

Plot the correlation between the ground truth and the predicted expression.

Parameters:
  • tasks – List of cells to use as on task.

  • off_tasks – List of cells to use as off task.

  • dataset – Dataset to use for computation.

Returns:

Plot of the correlation between the ground truth and the predicted expression.

Return type:

p9.ggplot

Examples

>>> result = DecimaResult.load()
>>> result.plot_correlation(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
predicted_expression_matrix(genes=None, model_name=None)[source]

Get predicted expression matrix for all or specific genes.

Parameters:

genes (Optional[List[str]]) – Optional list of genes to get predictions for. If None, returns all genes.

Returns:

Predicted expression matrix (cells x genes)

Return type:

pd.DataFrame

predicted_gene_expression(gene, model_name)[source]

Get predicted expression for a specific gene.

Parameters:
  • gene – Gene name

  • model_name – Model name

Returns:

Predicted expression for the gene

Return type:

torch.Tensor

prepare_one_hot(gene, variants=None, padding=0, genome='hg38')[source]

Prepare one-hot encoding for a gene.

Parameters:
  • gene (str) – Gene name

  • variants (Optional[List[Dict]]) – Optional list of variant dictionaries to inject into the sequence

  • padding (int) – Amount of padding to add on both sides of the sequence

  • genome (str) – Genome name or path to the genome fasta file. Default: “hg38”

Returns:

One-hot encoding of the gene

Return type:

torch.Tensor
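
The following sketch illustrates the idea of one-hot encoding with variant injection and padding using plain Python lists instead of tensors. The variant dictionary keys (`pos`, `ref`, `alt`), the zero-valued padding and the `toy_prepare_one_hot` helper are assumptions for illustration and may differ from what Decima expects.

```python
BASES = "ACGT"


def one_hot(seq):
    """One 4-element row (A, C, G, T) per base; unknown bases such as N
    become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def toy_prepare_one_hot(seq, variants=None, padding=0):
    """Apply single-base substitutions, one-hot encode, then pad both sides.

    Each variant is a dict with hypothetical keys:
      pos (0-based offset in `seq`), ref (expected base), alt (new base).
    """
    bases = list(seq.upper())
    for variant in variants or []:
        pos = variant["pos"]
        if bases[pos] != variant["ref"]:
            raise ValueError(f"reference mismatch at offset {pos}")
        bases[pos] = variant["alt"]
    left = [[0.0] * 4 for _ in range(padding)]
    right = [[0.0] * 4 for _ in range(padding)]
    return left + one_hot("".join(bases)) + right


encoded = toy_prepare_one_hot(
    "ACGT", variants=[{"pos": 1, "ref": "C", "alt": "T"}], padding=1
)
print(len(encoded))  # 6
```

Checking the reference base before substituting catches coordinate or strand mix-ups early, which is a frequent source of silent errors in variant scoring.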

query_cells(query)[source]

Query cells based on a query string.

Parameters:

query (str) – Query string

Returns:

List of cell names

Examples

>>> result = DecimaResult.load()
>>> cells = result.query_cells(
...     "cell_type == 'classical monocyte'"
... )
>>> cells
['agg1', 'agg2', 'agg3', ...]
query_tasks(tasks=None, off_tasks=None)[source]

Query tasks based on a query string.

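Parameters:
  • tasks – Query string or list of tasks to use as on tasks.

  • off_tasks – Query string or list of tasks to use as off tasks.
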
Returns:

List of tasks

Examples

>>> result = DecimaResult.load()
>>> tasks = result.query_tasks(
...     "cell_type == 'classical monocyte'"
... )
>>> tasks
[...]
property shape: tuple

Shape of the expression matrix (n_cells, n_genes).

decima.predict_attributions_seqlet_calling(output_prefix, genes=None, seqs=None, tasks=None, off_tasks=None, model='ensemble', metadata_anndata=None, method='inputxgradient', transform='specificity', num_workers=2, tss_distance=None, batch_size=1, top_n_markers=None, device=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13', genome='hg38')[source]

Generate and save attribution analysis results for one or more genes. This function performs attribution analysis for the given genes and cell types and saves the following output files under the given output prefix:

Output files:

├── {output_prefix}.attributions.h5 # Raw attribution score matrix per gene.

├── {output_prefix}.attributions.bigwig # Genome browser track of attribution as bigwig file.

├── {output_prefix}.seqlets.bed # List of attribution peaks in BED format.

├── {output_prefix}.motifs.tsv # Detected motifs in peak regions.

└── {output_prefix}.warnings.qc.log # QC warnings about prediction reliability.

Parameters:
  • output_prefix – Prefix (including directory) for the output files

  • genes – Gene symbols or IDs to analyze

  • tasks (Optional[List[str]]) – Cell types to analyze attributions for. Either a list of task names or a query string that selects cell types (e.g. "cell_type == 'classical monocyte'"). If not provided, all tasks will be analyzed.

  • off_tasks (Optional[List[str]]) – Optional cell types to contrast against. Either a list of task names or a query string that selects cell types (e.g. "cell_type == 'classical monocyte'"). If not provided, all tasks will be used as off tasks.

  • model (Union[int, str, None]) – Optional model to use for attribution analysis: either a replicate number or a path to the model.

  • method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.

  • device (Optional[str]) – Device to use for attribution analysis (e.g. ‘cuda’, ‘cpu’). If not provided, the best available device will be used automatically.

  • dpi – DPI for attribution plots.

Raises:

FileExistsError – If output directory already exists.

Examples

>>> predict_attributions_seqlet_calling(
...     output_prefix="output_prefix",
...     genes=[
...         "SPI1",
...         "CD68",
...     ],
...     tasks="cell_type == 'classical monocyte'",
... )
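
To see which files a given prefix would produce, the output names listed above can be derived with a few lines of standard-library code. The suffixes are taken from the list above; the `expected_outputs` and `check_no_overwrite` helpers are hypothetical and not part of Decima.

```python
from pathlib import Path

# Suffixes copied from the documented list of output files.
SUFFIXES = [
    ".attributions.h5",
    ".attributions.bigwig",
    ".seqlets.bed",
    ".motifs.tsv",
    ".warnings.qc.log",
]


def expected_outputs(output_prefix):
    """Return the paths expected for a prefix. String concatenation is used
    instead of Path.with_suffix so that dots in the prefix are preserved."""
    return [Path(f"{output_prefix}{suffix}") for suffix in SUFFIXES]


def check_no_overwrite(output_prefix):
    """Raise FileExistsError if any expected output is already present."""
    existing = [p for p in expected_outputs(output_prefix) if p.exists()]
    if existing:
        raise FileExistsError(f"refusing to overwrite: {existing[0]}")


paths = expected_outputs("results/spi1.v1")
for path in paths:
    print(path.name)
```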

decima.predict_variant_effect(df_variant, output_pq=None, tasks=None, model='ensemble', metadata_anndata=None, chunksize=10000, batch_size=1, num_workers=16, device=None, include_cols=None, gene_col=None, distance_type='tss', min_distance=0, max_distance=inf, genome='hg38', save_replicates=False, reference_cache=True, float_precision='32')[source]

Predict variant effects and save them to a Parquet file.

Parameters:
  • df_variant (pd.DataFrame or str) – DataFrame with variant information or path to variant file

  • output_pq (str, optional) – Path to save the parquet file. Defaults to None.

  • tasks (str, optional) – Tasks to predict. Defaults to None.

  • model (int, optional) – Model to use. Defaults to DEFAULT_ENSEMBLE.

  • metadata_anndata (str, optional) – Path to anndata file. Defaults to None.

  • chunksize (int, optional) – Number of variants to predict in each chunk. Defaults to 10_000.

  • batch_size (int, optional) – Batch size. Defaults to 1.

  • num_workers (int, optional) – Number of workers. Defaults to 16.

  • device (str, optional) – Device to use. Defaults to None.

  • include_cols (list, optional) – Columns to include in the output. Defaults to None.

  • gene_col (str, optional) – Column name for gene names. Defaults to None.

  • distance_type (str, optional) – Type of distance. Defaults to “tss”.

  • min_distance (float, optional) – Minimum distance from the end of the gene. Defaults to 0 (inclusive).

  • max_distance (float, optional) – Maximum distance from the TSS. Defaults to inf (exclusive).

  • genome (str, optional) – Genome name or path to the genome fasta file. Defaults to “hg38”.

  • save_replicates (bool, optional) – Save the replicates in the output. Defaults to False.

  • reference_cache (bool, optional) – Whether to use reference cache. Defaults to True.

  • float_precision (str, optional) – Floating-point precision. Defaults to “32”.

Return type:

None
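
The `min_distance` (inclusive) and `max_distance` (exclusive) bounds and the `chunksize` parameter can be illustrated with a small standard-library sketch. Treating the distance as an absolute value is an assumption, and the `within_distance` and `chunked` helpers and the variant tuples are hypothetical.

```python
from math import inf


def within_distance(distance, min_distance=0, max_distance=inf):
    """Keep a variant if min_distance <= |distance| < max_distance.
    Using the absolute distance is an assumption of this sketch."""
    return min_distance <= abs(distance) < max_distance


def chunked(items, chunksize):
    """Yield successive lists of at most `chunksize` items."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


# Hypothetical (variant_id, signed distance to TSS) pairs.
variants = [("v1", 0), ("v2", -50), ("v3", 100), ("v4", 2500), ("v5", -99)]

kept = [vid for vid, dist in variants if within_distance(dist, 0, 100)]
print(kept)  # ['v1', 'v2', 'v5']

for chunk in chunked(kept, chunksize=2):
    print(chunk)
```

With an exclusive upper bound, a variant exactly 100 bp away is dropped when `max_distance=100`, which makes adjacent distance bins non-overlapping.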