decima.core package

Submodules

decima.core.attribution module

Attribution analysis from decima model.

class decima.core.attribution.Attribution(inputs, attrs, gene='', chrom=None, start=None, end=None, strand=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')[source]

Bases: object

Attribution analysis results for a gene.

Parameters:
  • gene (Optional[str]) – Gene symbol or ID to analyze

  • inputs (Tensor) – One-hot encoded sequence

  • attrs (ndarray) – Attribution scores

  • chrom (Optional[str]) – Chromosome name

  • start (Optional[int]) – Start position

  • end (Optional[int]) – End position

  • strand (Optional[str]) – Strand

  • threshold (Optional[float]) – Threshold for peak finding

  • min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding

  • max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding

  • additional_flanks (Optional[int]) – Additional flanks to add to the gene

Returns:

Attribution analysis results for the gene and tasks

Return type:

Attribution

Examples

>>> attribution = Attribution(
...     gene="A1BG",
...     inputs=inputs,
...     attrs=attrs,
...     chrom="chr1",
...     start=100,
...     end=200,
...     strand="+",
...     threshold=5e-4,
...     min_seqlet_len=4,
...     max_seqlet_len=25,
...     additional_flanks=0,
... )
>>> attribution.plot_peaks()
>>> attribution.scan_motifs()
>>> attribution.save_bigwig(
...     "attributions.bigwig"
... )
>>> attribution.peaks_to_bed()
__init__(inputs, attrs, gene='', chrom=None, start=None, end=None, strand=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')[source]

Initialize Attribution.

Parameters:
  • inputs (Tensor) – One-hot encoded sequence

  • attrs (ndarray) – Attribution scores

  • gene (Optional[str]) – Gene name

  • chrom (Optional[str]) – Chromosome name

  • start (Optional[int]) – Start position

  • end (Optional[int]) – End position

  • strand (Optional[str]) – Strand

  • threshold (Optional[float]) – Threshold for peak finding

  • min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding

  • max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding

  • additional_flanks (Optional[int]) – Additional flanks to add to the gene

  • pattern_type (Optional[str]) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

__repr__()[source]

Return repr(self).

__sub__(other)[source]
property chrom: str

Get the chromosome name.

property end: int

Get the end position.

fasta_str()[source]

Get attribution scores as a fasta string.

static find_peaks(attrs, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')[source]

Find peaks in attribution scores.

Parameters:
  • attrs – Attribution scores

  • threshold – Threshold for peak finding

  • min_seqlet_len – Minimum sequence length for peak finding

  • max_seqlet_len – Maximum sequence length for peak finding

  • additional_flanks – Additional flanks to add to the gene

  • pattern_type – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

DataFrame of peaks with columns of:
  • “peak”: Peak name in format “pattern_type.gene@from_tss”

  • “start”: Start position of the peak

  • “end”: End position of the peak

  • “attribution”: Attribution score of the peak

  • “p-value”: P-value of the peak

  • “from_tss”: Distance from the TSS to the peak

  • “pattern_type”: Pattern type of the peak

Return type:

pd.DataFrame
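
The role of threshold, min_seqlet_len, max_seqlet_len and pattern_type can be illustrated with a deliberately simplified sketch. It is not the seqlet-calling algorithm used by find_peaks (which also reports p-values); it only groups consecutive positions whose score passes the threshold and drops runs outside the length bounds. The function name find_runs and its return format are hypothetical.

```python
def find_runs(attrs, threshold=5e-4, min_len=4, max_len=25, pattern_type="both"):
    """Return (start, end, summed_score, sign) for contiguous runs above threshold.

    Simplified stand-in for peak finding: a position is "on" when its score
    passes the threshold with a given sign; runs shorter than min_len or
    longer than max_len are dropped. Intervals are half-open: [start, end).
    """
    attrs = list(attrs)

    def sign_of(x):
        if x >= threshold:
            return "pos"
        if x <= -threshold:
            return "neg"
        return None

    peaks = []
    run_start, run_sign = None, None
    # A trailing None sentinel closes a run that reaches the end of the track.
    for i, x in enumerate(attrs + [None]):
        s = sign_of(x) if x is not None else None
        if s != run_sign:
            if run_sign is not None:
                length = i - run_start
                wanted = pattern_type in ("both", run_sign)
                if wanted and min_len <= length <= max_len:
                    peaks.append((run_start, i, sum(attrs[run_start:i]), run_sign))
            run_start, run_sign = i, s
    return peaks


# Toy track: a 5-base positive run, a 4-base negative run, a 2-base positive run.
track = [0.0] * 3 + [0.001] * 5 + [0.0] * 2 + [-0.002] * 4 + [0.0] * 2 + [0.001] * 2
for start, end, score, sign in find_runs(track):
    print(sign, start, end, score)
```

With pattern_type="pos" or pattern_type="neg", only runs of that sign are kept, which mirrors the documented meaning of the parameter.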

classmethod from_seq(inputs, tasks=None, off_tasks=None, model='v1_rep0', transform='specificity', method='inputxgradient', device='cpu', result=None, gene='', chrom=None, start=None, end=None, strand=None, gene_mask_start=None, gene_mask_end=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0)[source]

Initialize Attribution from sequence.

Parameters:
  • inputs (Union[str, Tensor, ndarray]) – Sequence to analyze: either a sequence string, or a torch.Tensor or np.ndarray with shape (4, 524288) or (5, 524288), where the fifth channel is a binary mask. If the input has only 4 channels, gene_mask_start and gene_mask_end must be provided.

  • tasks (Optional[list]) – List of cell types to analyze attributions for

  • off_tasks (Optional[list]) – List of cell types to contrast against

  • model (Union[int, str, None]) – Model to use for attribution analysis

  • transform (str) – Transformation to apply to attributions

  • method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.

  • device (Optional[str]) – Device to use for attribution analysis

  • result (Optional[str]) – Result object or path to result object or name of the model to load the result for.

  • gene (Optional[str]) – Gene name

  • chrom (Optional[str]) – Chromosome name

  • start (Optional[int]) – Start position

  • end (Optional[int]) – End position

  • strand (Optional[str]) – Strand

  • gene_mask_start – Start position of the gene mask

  • gene_mask_end – End position of the gene mask

  • threshold (Optional[float]) – Threshold for peak finding

  • min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding

  • max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding

  • additional_flanks (Optional[int]) – Additional flanks to add to the gene
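
The input layout described for inputs (four base channels plus an optional mask channel) can be sketched in plain Python. This is an illustration only: the channel order A, C, G, T, the half-open mask interval and the helper name one_hot_with_mask are assumptions, not taken from the library.

```python
def one_hot_with_mask(seq, gene_mask_start, gene_mask_end):
    """Encode a DNA string as 5 rows: A, C, G, T, then a binary gene mask."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order
    rows = [[0] * len(seq) for _ in range(5)]
    for pos, base in enumerate(seq.upper()):
        channel = index.get(base)  # N and other symbols stay all-zero
        if channel is not None:
            rows[channel][pos] = 1
    # Assumption: the mask covers the half-open interval [start, end).
    for pos in range(gene_mask_start, gene_mask_end):
        rows[4][pos] = 1
    return rows


encoded = one_hot_with_mask("ACGTN", gene_mask_start=1, gene_mask_end=3)
for row in encoded:
    print(row)
```

A real input would have 524288 columns rather than five; the toy sequence only shows how the mask channel sits alongside the one-hot channels.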

property gene_end: int

Get the gene end position.

property gene_start: int

Get the gene start position.

property peaks: DataFrame
peaks_to_bed()[source]

Convert peaks to bed format.

Returns:

Peaks in bed format where columns are:
  • chrom: Chromosome name

  • start: Start position in genome

  • end: End position in genome

  • name: Peak name in format “gene@from_tss”

  • score: Score (-log10(p-value)) clipped to 0-100 based on the seqlet calling

  • strand: Strand, always ‘.’

Return type:

pd.DataFrame
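
The score and name columns described above can be sketched as follows. This is a hedged illustration rather than the library's code: the helper names bed_score and bed_row are hypothetical, and mapping peak offsets to genome coordinates by adding the region start is an assumption that ignores strand handling.

```python
import math


def bed_score(p_value, max_score=100.0):
    """Return -log10(p_value) clipped to [0, max_score]; p <= 0 maps to max_score."""
    if p_value <= 0:
        return max_score
    return min(max(-math.log10(p_value), 0.0), max_score)


def bed_row(chrom, region_start, peak_start, peak_end, gene, from_tss, p_value):
    # Assumption: peak offsets are relative to the region start on the + strand.
    return (
        chrom,
        region_start + peak_start,
        region_start + peak_end,
        f"{gene}@{from_tss}",
        bed_score(p_value),
        ".",  # strand is reported as '.'
    )


print(bed_row("chr1", 100, 10, 20, "A1BG", -35, 1e-5))
```

Clipping keeps extremely small p-values from producing unbounded scores in the BED file.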

plot_peaks(overlapping_min_dist=1000, figsize=(10, 2))[source]

Plot attribution scores and highlight peaks.

Parameters:
  • overlapping_min_dist – Minimum distance between peaks to consider them overlapping

  • figsize – Figure size in inches (width, height)

Returns:

The plotted figure showing attribution scores with highlighted peaks

Return type:

plotnine.ggplot

Plot attribution scores around a relative location.

Parameters:
  • relative_loc – Position relative to TSS to center plot on

  • window – Number of bases to show on each side of center

Returns:

Attribution plot

Return type:

matplotlib.pyplot.Figure

save_bigwig(bigwig_path)[source]

Save attribution scores as a bigwig file.

Parameters:

bigwig_path (str) – Path to save bigwig file.

save_fasta(fasta_path)[source]

Save attribution scores as a fasta file.

save_peaks(bed_path)[source]

Save peaks to bed file.

Parameters:

bed_path (str) – Path to save bed file.

scan_motifs(motifs='hocomoco_v13', window=18, pthresh=0.0005)[source]

Scan for motifs in peak regions.

Parameters:
  • motifs (str) – Motif database to use

  • window (int) – Window size around peaks

  • pthresh (float) – P-value threshold for motif matches

Returns:

Motif scan results with columns:
  • “motif”: Motif name

  • “peak”: Peak name

  • “start”: Start position of the peak

  • “end”: End position of the peak

  • “strand”: Strand of the peak

  • “score”: FIMO score of the motif

  • “p-value”: FIMO p-value of the motif

  • “matched_seq”: Matched sequence

  • “site_attr_score”: Site attribution score

  • “motif_attr_score”: Motif attribution score

  • “from_tss”: Distance from the TSS to the peak

Return type:

pd.DataFrame

property start: int

Get the start position.

property strand: str

Get the strand.

class decima.core.attribution.AttributionResult(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]

Bases: object

Attribution result from decima model.

Parameters:
  • attribution_h5 (Union[str, List[str]]) – Path to attribution h5 file or list of paths to attribution h5 files generated by decima attributions-predict or decima attributions commands.

  • tss_distance (Optional[int]) – Distance from the TSS to include in the attribution analysis.

  • correct_grad – Whether to correct the gradient for the attribution analysis.

  • num_workers (Optional[int]) – Number of workers to use for the attribution analysis.

  • agg_func (Optional[str]) – Function to aggregate the attribution scores.

Examples

>>> with AttributionResult(
...     attribution_h5=["example/attribution.h5", "example/attribution2.h5"]
... ) as ar:
...     seqs, attrs = ar.load(genes=["SPI1"])
...     attribution = ar.load_attribution(gene="SPI1")

__enter__()[source]
__exit__(exc_type, exc_value, traceback)[source]
__init__(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]
__repr__()[source]

Return repr(self).

static aggregate(seqs, attrs, agg_func=None)[source]

Aggregate the attribution scores.

close()[source]

Close the attribution h5 files.
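
The combination of open(), close(), __enter__() and __exit__() suggests the usual context-manager pattern, in which entering the with block opens the files and leaving it closes them. A minimal sketch of that pattern, using a hypothetical H5Bundle class with placeholder handles instead of real h5 files:

```python
class H5Bundle:
    """Hypothetical stand-in that opens and closes several resources together."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.handles = None

    def open(self):
        # Placeholder strings stand in for real file handles.
        self.handles = [f"handle:{p}" for p in self.paths]

    def close(self):
        self.handles = None

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block


with H5Bundle(["a.h5", "b.h5"]) as bundle:
    print(len(bundle.handles))
print(bundle.handles)
```

Whether AttributionResult behaves exactly this way is not stated here; the sketch only shows why the with form guarantees that close() runs even if the block raises.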

load(genes, gene_mask=False, **kwargs)[source]

Load the attribution scores for a list of genes.

Parameters:
  • genes (List[str]) – List of genes to load.

  • gene_mask (bool) – Whether to mask the gene.

Returns:

A tuple (seqs, attrs), where seqs is an array of sequences and attrs is an array of attribution scores.

Return type:

tuple

load_attribution(gene, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', **kwargs)[source]

Load the attribution scores for a gene.

Parameters:
  • gene (str) – Gene to load.

  • metadata_anndata (Optional[str]) – Metadata anndata object.

  • custom_genome (bool) – Whether to use custom genome.

  • threshold (float) – Threshold for peak finding.

  • min_seqlet_len (int) – Minimum sequence length for peak finding.

  • max_seqlet_len (int) – Maximum sequence length for peak finding.

  • additional_flanks (int) – Additional flanks to add to the gene.

  • pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

Attribution object.

open()[source]

Open the attribution h5 files.

recursive_seqlet_calling(genes=None, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13', **kwargs)[source]

Perform recursive seqlet calling on the attribution scores.

Parameters:
  • genes (Optional[List[str]]) – List of genes to perform recursive seqlet calling on.

  • metadata_anndata (Optional[str]) – Metadata anndata object.

  • custom_genome (bool) – Whether to use custom genome.

  • threshold (float) – Threshold for peak finding.

  • min_seqlet_len (int) – Minimum sequence length for peak finding.

  • max_seqlet_len (int) – Maximum sequence length for peak finding.

  • additional_flanks (int) – Additional flanks to add to the gene.

  • pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

  • meme_motif_db (str) – MEME motif database to use for motif discovery.

Returns:

A tuple (df_peaks, df_motifs), where df_peaks is a DataFrame of peaks and df_motifs is a DataFrame of motifs.

Return type:

tuple

class decima.core.attribution.VariantAttributionResult(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]

Bases: AttributionResult

__init__(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]
load(variants, genes, gene_mask=False)[source]

Load the attribution scores for a list of genes and variants.

load_attribution(variant, gene, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', **kwargs)[source]

Load the attribution scores for a variant and gene.

Parameters:
  • variant – Variant to load.

  • gene (str) – Gene to load.

  • metadata_anndata (Optional[str]) – Metadata anndata object.

  • custom_genome (bool) – Whether to use custom genome.

  • threshold (float) – Threshold for peak finding.

  • min_seqlet_len (int) – Minimum sequence length for peak finding.

  • max_seqlet_len (int) – Maximum sequence length for peak finding.

  • additional_flanks (int) – Additional flanks to add to the gene.

  • pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

Attribution object.

open()[source]

Open the attribution h5 files.

recursive_seqlet_calling(variants, genes, metadata_anndata=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13')[source]

Perform recursive seqlet calling on the attribution scores.

Parameters:
  • genes (Optional[List[str]]) – List of genes to perform recursive seqlet calling on.

  • metadata_anndata (Optional[str]) – Metadata anndata object.

  • custom_genome – Whether to use custom genome.

  • threshold (float) – Threshold for peak finding.

  • min_seqlet_len (int) – Minimum sequence length for peak finding.

  • max_seqlet_len (int) – Maximum sequence length for peak finding.

  • additional_flanks (int) – Additional flanks to add to the gene.

  • pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

  • meme_motif_db (str) – MEME motif database to use for motif discovery.

Returns:

A tuple (df_peaks, df_motifs), where df_peaks is a DataFrame of peaks and df_motifs is a DataFrame of motifs.

Return type:

tuple

decima.core.metadata module

class decima.core.metadata.CellMetadata(name, cell_type, tissue, organ, disease, study, dataset, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, region=None, subregion=None, celltype_coarse=None, co_term=None, co_name=None, frac_nan=None)[source]

Bases: object

Metadata for a cell in the dataset.

name

Cell identifier

cell_type

Detailed cell type

tissue

Tissue identifier

organ

Organ name

disease

Disease state

study

Study identifier

dataset

Dataset identifier

region

Anatomical region

subregion

Anatomical subregion

celltype_coarse

Coarse cell type classification

n_cells

Number of cells

total_counts

Total count of transcripts

n_genes

Number of genes detected

size_factor

Size normalization factor

train_pearson

Pearson correlation in training set

val_pearson

Pearson correlation in validation set

test_pearson

Pearson correlation in test set

__eq__(other)

Return self==value.

__hash__ = None
__init__(name, cell_type, tissue, organ, disease, study, dataset, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, region=None, subregion=None, celltype_coarse=None, co_term=None, co_name=None, frac_nan=None)
__match_args__ = ('name', 'cell_type', 'tissue', 'organ', 'disease', 'study', 'dataset', 'n_cells', 'total_counts', 'n_genes', 'size_factor', 'train_pearson', 'val_pearson', 'test_pearson', 'region', 'subregion', 'celltype_coarse', 'co_term', 'co_name', 'frac_nan')
__repr__()

Return repr(self).

cell_type: str
celltype_coarse: Optional[str] = None
co_name: Optional[str] = None
co_term: Optional[str] = None
dataset: str
disease: str
frac_nan: Optional[float] = None
classmethod from_series(name, series)[source]

Create CellMetadata from a pandas Series.

Return type:

CellMetadata
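
A pandas Series behaves much like a mapping from column name to value, so a from_series constructor can be pictured as filtering a row down to the declared dataclass fields. The sketch below uses a plain dict in place of a Series and a hypothetical ToyCellMetadata class with only a few of the fields; it is not the library's implementation.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyCellMetadata:
    """Hypothetical, reduced version of a metadata record."""

    name: str
    cell_type: str
    n_cells: int
    region: Optional[str] = None

    @classmethod
    def from_mapping(cls, name, row):
        # Keep only keys that are declared fields; missing optional
        # fields fall back to their defaults.
        known = {f.name for f in fields(cls)} - {"name"}
        return cls(name=name, **{k: v for k, v in row.items() if k in known})


row = {"cell_type": "classical monocyte", "n_cells": 120, "unused_column": 1}
print(ToyCellMetadata.from_mapping("cell_0", row))
```

Filtering on the declared fields means extra columns in the metadata table do not break construction.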

n_cells: int
n_genes: int
name: str
organ: str
region: Optional[str] = None
size_factor: float
study: str
subregion: Optional[str] = None
test_pearson: float
tissue: str
total_counts: float
train_pearson: float
val_pearson: float
class decima.core.metadata.GeneMetadata(name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, fold, dataset, gene_id, pearson, size_factor_pearson, frac_N=None, max_counts=None, ensembl_canonical_tss=None, upstream_bases=None, downstream_bases=None)[source]

Bases: object

Metadata for a gene in the dataset.

name

Gene name

chrom

Chromosome where the gene is located

start

Start position of the region around the gene to perform predictions in the chromosome

end

End position of the region around the gene to perform predictions in the chromosome

strand

Strand orientation (+ or -)

gene_type

Type of gene (e.g., protein_coding)

frac_nan

Fraction of NaN values

mean_counts

Mean count across samples

n_tracks

Number of tracks

gene_start

Gene start position

gene_end

Gene end position

gene_length

Length of the gene

gene_mask_start

Start position of the gene mask

gene_mask_end

End position of the gene mask

frac_N

Fraction of N bases

fold

Cross-validation fold

dataset

Dataset identifier

gene_id

Ensembl gene ID

pearson

Pearson correlation

size_factor_pearson

Size factor Pearson correlation

__eq__(other)

Return self==value.

__hash__ = None
__init__(name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, fold, dataset, gene_id, pearson, size_factor_pearson, frac_N=None, max_counts=None, ensembl_canonical_tss=None, upstream_bases=None, downstream_bases=None)
__match_args__ = ('name', 'chrom', 'start', 'end', 'strand', 'gene_type', 'frac_nan', 'mean_counts', 'n_tracks', 'gene_start', 'gene_end', 'gene_length', 'gene_mask_start', 'gene_mask_end', 'fold', 'dataset', 'gene_id', 'pearson', 'size_factor_pearson', 'frac_N', 'max_counts', 'ensembl_canonical_tss', 'upstream_bases', 'downstream_bases')
__repr__()

Return repr(self).

chrom: str
dataset: str
downstream_bases: Optional[int] = None
end: int
ensembl_canonical_tss: Optional[bool] = None
fold: List[str]
frac_N: Optional[float] = None
frac_nan: float
classmethod from_series(name, series)[source]

Create GeneMetadata from a pandas Series.

Return type:

GeneMetadata

gene_end: int
gene_id: str
gene_length: int
gene_mask_end: int
gene_mask_start: int
gene_start: int
gene_type: str
max_counts: Optional[int] = None
mean_counts: float
n_tracks: int
name: str
pearson: float
size_factor_pearson: float
start: int
strand: str
upstream_bases: Optional[int] = None

decima.core.result module

class decima.core.result.DecimaResult(anndata)[source]

Bases: object

Container for Decima results and model predictions.

This class provides a unified interface for loading pre-trained Decima models and associated metadata, making predictions, and performing attribution analyses.

The DecimaResult object contains:
  • An AnnData object with gene expression and metadata

  • A trained model for making predictions

  • Methods for attribution analysis and interpretation

Parameters:

anndata – AnnData object containing gene expression data and metadata

Examples

>>> # Load default pre-trained model and metadata
>>> result = DecimaResult.load()
>>> result.load_model(
...     rep=0
... )
>>> # Perform attribution analysis
>>> attributions = result.attributions(
...     output_dir="attrs_SPI1_classical_monocytes",
...     gene="SPI1",
...     tasks='cell_type == "classical monocyte"',
... )
Properties:
  • model: Decima model

  • genes: List of gene names

  • cells: List of cell names

  • cell_metadata: Cell metadata

  • gene_metadata: Gene metadata

  • shape: Shape of the expression matrix

  • attributions: Attributions for a gene

__annotations__ = {}
__init__(anndata)[source]
__repr__()[source]

Return repr(self).

assert_genes(genes)[source]

Check if the genes are in the dataset.

Return type:

bool

attributions(gene, tasks=None, off_tasks=None, transform='specificity', method='inputxgradient', threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, genome='hg38')[source]

Get attributions for a specific gene.

Parameters:
  • gene (str) – Gene name

  • tasks (Optional[List[str]]) – List of cells to use as on task

  • off_tasks (Optional[List[str]]) – List of cells to use as off task

  • transform (str) – Attribution transform method

  • method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.

  • threshold (float) – Threshold for attribution analysis

  • min_seqlet_len (int) – Minimum length for seqlet calling

  • max_seqlet_len (int) – Maximum length for seqlet calling

  • additional_flanks (int) – Additional flanks for seqlet calling

  • genome (str) – Genome to use for attribution analysis: either a genome name or a path to a custom genome fasta file. Default: “hg38”

Returns:

Container with inputs, predictions, attribution scores and TSS position

Return type:

Attribution
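
To make the default method name concrete: "inputxgradient" conventionally refers to the element-wise product of the input and the gradient of the model output with respect to that input. The sketch below shows this on a toy linear scorer, where the gradient is simply the weight matrix. The function name, the weights, and the linear model are assumptions for illustration; Decima's model is a deep network and its gradients would come from automatic differentiation.

```python
from typing import List

Matrix = List[List[float]]


def input_x_gradient(one_hot: Matrix, weights: Matrix) -> Matrix:
    """Input x gradient for the toy model f(x) = sum(weights * x).

    For this linear model the gradient of f with respect to x equals
    `weights`, so the attribution is the element-wise product x * w.
    """
    return [
        [x * w for x, w in zip(x_row, w_row)]
        for x_row, w_row in zip(one_hot, weights)
    ]


# Rows are A, C, G, T; columns are sequence positions ("ACG").
x = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
# Hypothetical per-base, per-position weights.
w = [
    [0.5, -0.1, 0.2],
    [0.3, 0.7, -0.4],
    [-0.2, 0.1, 0.9],
    [0.0, 0.6, 0.1],
]
attrs = input_x_gradient(x, w)
# One score per position: only the observed base contributes.
per_position = [sum(col) for col in zip(*attrs)]
```

Because the input is one-hot, the product zeroes out every base that is not present, leaving one attribution value per position.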

property cell_metadata: DataFrame

Cell metadata including annotations, metrics, etc.

property cells: List[str]

List of cell identifiers in the dataset.

correlation(tasks, off_tasks, dataset='test')[source]

Compute the correlation between the ground truth and the predicted expression.

Parameters:
  • tasks – List of cells to use as on task.

  • off_tasks – List of cells to use as off task.

  • dataset – Dataset to use for computation.

Returns:

Pearson correlation coefficient.

Return type:

float
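
For reference, the Pearson correlation coefficient that this method returns can be computed as below. This is a minimal standard-library sketch of the statistic itself on made-up values; how Decima aggregates on-task and off-task cells before correlating is not specified here and is not modelled.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up ground truth and predictions for four genes.
truth = [1.0, 2.0, 3.0, 4.0]
preds = [1.1, 1.9, 3.2, 3.8]
r = pearson(truth, preds)
```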

property gene_metadata: DataFrame

Gene metadata.

gene_sequence(gene, stranded=True, genome='hg38')[source]

Get sequence for a gene.

Parameters:
  • gene (str) – Gene name

  • stranded (bool) – Whether to return stranded sequence

  • genome (str) – Genome name or path to the genome fasta file. Default: “hg38”

Returns:

Sequence for the gene

Return type:

str
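
A common convention for stranded sequence retrieval is to return the reverse complement for genes on the minus strand. The sketch below assumes that convention; the source does not confirm that gene_sequence behaves this way, and stranded_sequence is a hypothetical helper.

```python
# Translation table mapping each base to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def stranded_sequence(seq: str, strand: str, stranded: bool = True) -> str:
    """Hypothetical helper: flip the sequence only for minus-strand genes."""
    if stranded and strand == "-":
        return reverse_complement(seq)
    return seq


plus = stranded_sequence("AAGC", "+")
minus = stranded_sequence("AAGC", "-")
```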

property genes: List[str]

List of gene names in the dataset.

get_cell_metadata(cell)[source]

Get metadata for a specific cell.

Return type:

CellMetadata

get_gene_metadata(gene)[source]

Get metadata for a specific gene.

Return type:

GeneMetadata

property ground_truth: DataFrame

Ground truth expression matrix.

classmethod load(anndata_name_or_path=None)[source]

Load a DecimaResult object from a model name, a path to an anndata file, or an AnnData object.

Parameters:
  • anndata_name_or_path (Union[str, AnnData, None]) – Name of the model or path to anndata file or anndata object

  • model – Model name or path to model checkpoint. If not provided, the default model will be loaded.

Returns:

DecimaResult object

Examples

>>> result = DecimaResult.load()  # Load default decima metadata
>>> result = DecimaResult.load(
...     "path/to/anndata.h5ad"
... )  # Load custom anndata object from file
load_model(model='v1_rep0', device='cpu')[source]

Load the trained model from a checkpoint path.

Parameters:
  • model (Union[int, str, None]) – Path to model checkpoint or replicate number (0-3) for pre-trained models

  • device (str) – Device to load model on

Returns:

self

Examples

>>> result = DecimaResult.load()
>>> result.load_model()  # Load default model (rep0)
>>> result.load_model(
...     model="path/to/checkpoint.ckpt"
... )
>>> result.load_model(
...     model=2
... )
marker_zscores(tasks, off_tasks=None, layer='preds')[source]

Compute marker z-scores to identify differentially expressed genes.

Parameters:
  • tasks – Target cells. Query string or list of cell IDs.

  • off_tasks – Background cells. Query string, list of cell IDs, or None (uses all other cells).

  • layer – Expression data layer. “preds” (default), “expression”, or custom layer name.

Returns:

Columns are ‘gene’, ‘score’ (z-score), ‘task’.

Return type:

pandas.DataFrame

Examples

>>> # Classical monocytes vs all others
>>> markers = result.marker_zscores(
...     "cell_type == 'classical monocyte'"
... )
>>> top_genes = markers.nlargest(
...     10, "score"
... )
>>> markers = result.marker_zscores(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
property model

Decima model.

plot_correlation(tasks, off_tasks, dataset='test')[source]

Plot the correlation between the ground truth and the predicted expression.

Parameters:
  • tasks – List of cells to use as on task.

  • off_tasks – List of cells to use as off task.

  • dataset – Dataset to use for computation.

Returns:

Plot of the correlation between the ground truth and the predicted expression.

Return type:

p9.ggplot

Examples

>>> result = DecimaResult.load()
>>> result.plot_correlation(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
predicted_expression_matrix(genes=None, model_name=None)[source]

Get predicted expression matrix for all or specific genes.

Parameters:

genes (Optional[List[str]]) – Optional list of genes to get predictions for. If None, returns all genes.

Returns:

Predicted expression matrix (cells x genes)

Return type:

pandas.DataFrame

predicted_gene_expression(gene, model_name)[source]

Get predicted expression for a specific gene.

Parameters:
  • gene – Gene name

  • model_name – Model name

Returns:

Predicted expression for the gene

Return type:

torch.Tensor

prepare_one_hot(gene, variants=None, padding=0, genome='hg38')[source]

Prepare one-hot encoding for a gene.

Parameters:
  • gene (str) – Gene name

  • variants (Optional[List[Dict]]) – Optional list of variant dictionaries to inject into the sequence

  • padding (int) – Amount of padding to add on both sides of the sequence

  • genome (str) – Genome name or path to the genome fasta file. Default: “hg38”

Returns:

One-hot encoding of the gene

Return type:

torch.Tensor
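
As an illustration of one-hot encoding with padding, the sketch below encodes a DNA string into four rows (A, C, G, T) and adds all-zero columns on both sides. The channel order, the treatment of N as an all-zero column, and the use of nested lists instead of a torch.Tensor are assumptions made to keep the example dependency-free; prepare_one_hot may differ on each of these points.

```python
from typing import List

ALPHABET = "ACGT"  # assumed channel order


def one_hot(seq: str, padding: int = 0) -> List[List[int]]:
    """Encode `seq` as a 4 x (len(seq) + 2 * padding) matrix of 0/1."""
    width = len(seq) + 2 * padding
    rows = [[0] * width for _ in ALPHABET]
    for i, base in enumerate(seq.upper()):
        channel = ALPHABET.find(base)
        if channel >= 0:  # unknown bases such as N stay all-zero
            rows[channel][i + padding] = 1
    return rows


encoded = one_hot("ACGT")
padded = one_hot("AN", padding=1)
```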

query_cells(query)[source]

Query cells based on a query string.

Parameters:

query (str) – Query string

Returns:

List of cell names

Examples

>>> result = DecimaResult.load()
>>> cells = result.query_cells(
...     "cell_type == 'classical monocyte'"
... )
>>> cells
['agg1', 'agg2', 'agg3', ...]
query_tasks(tasks=None, off_tasks=None)[source]

Query tasks based on a query string.

Returns:

List of tasks

Examples

>>> result = DecimaResult.load()
>>> tasks = result.query_tasks(
...     "cell_type == 'classical monocyte'"
... )
>>> tasks
[...]
property shape: tuple

Shape of the expression matrix (n_cells, n_genes).

Module contents