decima.core package¶

Submodules¶

decima.core.attribution module¶

Attribution analysis from decima model.

class decima.core.attribution.Attribution(inputs, attrs, gene='', chrom=None, start=None, end=None, strand=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')[source]¶

Bases: object

Attribution analysis results for a gene.

Parameters:

inputs (Tensor) – One-hot encoded sequence
attrs (ndarray) – Attribution scores
gene (Optional[str]) – Gene symbol or ID to analyze
chrom (Optional[str]) – Chromosome name
start (Optional[int]) – Start position
end (Optional[int]) – End position
strand (Optional[str]) – Strand
threshold (Optional[float]) – Threshold for peak finding
min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding
max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding
additional_flanks (Optional[int]) – Additional flanks to add to the gene
pattern_type (Optional[str]) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

Attribution analysis results for the gene and tasks

Return type:

Attribution

Examples

>>> attribution = Attribution(
...     gene="A1BG",
...     inputs=inputs,
...     attrs=attrs,
...     chrom="chr1",
...     start=100,
...     end=200,
...     strand="+",
...     threshold=5e-4,
...     min_seqlet_len=4,
...     max_seqlet_len=25,
...     additional_flanks=0,
... )
>>> attribution.plot_peaks()
>>> attribution.scan_motifs()
>>> attribution.save_bigwig(
...     "attributions.bigwig"
... )
>>> attribution.peaks_to_bed()

__init__(inputs, attrs, gene='', chrom=None, start=None, end=None, strand=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')[source]¶

Initialize Attribution.

Parameters:

inputs (Tensor) – One-hot encoded sequence
attrs (ndarray) – Attribution scores
gene (Optional[str]) – Gene name
chrom (Optional[str]) – Chromosome name
start (Optional[int]) – Start position
end (Optional[int]) – End position
strand (Optional[str]) – Strand
threshold (Optional[float]) – Threshold for peak finding
min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding
max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding
additional_flanks (Optional[int]) – Additional flanks to add to the gene
pattern_type (Optional[str]) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

__repr__()[source]¶: Return repr(self).

__sub__(other)[source]¶

property chrom: str¶: Get the chromosome name.

property end: int¶: Get the end position.

fasta_str()[source]¶: Get attribution scores as a fasta string.

static find_peaks(attrs, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')[source]¶

Find peaks in attribution scores.

Parameters:

attrs – Attribution scores
threshold – Threshold for peak finding
min_seqlet_len – Minimum sequence length for peak finding
max_seqlet_len – Maximum sequence length for peak finding
additional_flanks – Additional flanks to add to the gene
pattern_type – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

DataFrame of peaks with columns of:

“peak”: Peak name in format “pattern_type.gene@from_tss”
“start”: Start position of the peak
“end”: End position of the peak
“attribution”: Attribution score of the peak
“p-value”: P-value of the peak
“from_tss”: Distance from the TSS to the peak
“pattern_type”: Pattern type of the peak

Return type:

pd.DataFrame
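The roles of threshold, min_seqlet_len, max_seqlet_len, and pattern_type can be illustrated with a deliberately simplified sketch. It is not Decima's seqlet-calling algorithm (which also reports p-values); it only thresholds raw scores and groups contiguous same-sign runs. The function name toy_find_peaks and the treatment of threshold as a raw-score cutoff are assumptions made for illustration.

```python
def toy_find_peaks(attrs, threshold=0.0005, min_seqlet_len=4,
                   max_seqlet_len=25, pattern_type="both"):
    """Toy peak finder: contiguous same-sign runs whose |score| exceeds threshold."""
    scores = list(attrs)

    def sign_of(score):
        # Classify one position, honouring the pattern_type filter.
        if score > threshold and pattern_type in ("both", "pos"):
            return "pos"
        if score < -threshold and pattern_type in ("both", "neg"):
            return "neg"
        return None

    peaks, run_start, run_sign = [], None, None
    # A trailing sentinel of 0.0 flushes a run that reaches the end.
    for i, score in enumerate(scores + [0.0]):
        sign = sign_of(score)
        if sign != run_sign:
            if run_sign is not None:
                length = i - run_start
                if min_seqlet_len <= length <= max_seqlet_len:
                    peaks.append({
                        "start": run_start,
                        "end": i,  # half-open interval [start, end)
                        "attribution": sum(scores[run_start:i]),
                        "pattern_type": run_sign,
                    })
            run_start, run_sign = i, sign
    return peaks
```

Runs shorter than min_seqlet_len or longer than max_seqlet_len are dropped rather than trimmed, which is a design choice of this sketch only.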

classmethod from_seq(inputs, tasks=None, off_tasks=None, model='v1_rep0', transform='specificity', method='inputxgradient', device='cpu', result=None, gene='', chrom=None, start=None, end=None, strand=None, gene_mask_start=None, gene_mask_end=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0)[source]¶

Initialize Attribution from sequence.

Parameters:

inputs (Union[str, Tensor, ndarray]) – Sequence to analyze: either a sequence string, or a torch.Tensor or np.ndarray with shape (4, 524288) or (5, 524288), where the fifth channel is a binary gene mask. If the input has only four channels, gene_mask_start and gene_mask_end must be provided.
tasks (Optional[list]) – List of cell types to analyze attributions for
off_tasks (Optional[list]) – List of cell types to contrast against
model (Union[int, str, None]) – Model to use for attribution analysis
transform (str) – Transformation to apply to attributions
method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.
device (Optional[str]) – Device to use for attribution analysis
result (Optional[str]) – Result object or path to result object or name of the model to load the result for.
gene (Optional[str]) – Gene name
chrom (Optional[str]) – Chromosome name
start (Optional[int]) – Start position
end (Optional[int]) – End position
strand (Optional[str]) – Strand
gene_mask_start – Start position of the gene mask
gene_mask_end – End position of the gene mask
threshold (Optional[float]) – Threshold for peak finding
min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding
max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding
additional_flanks (Optional[int]) – Additional flanks to add to the gene
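The array layout expected for inputs can be sketched as follows. This is a minimal illustration, assuming the channel order A, C, G, T and a mask row that is 1 on the half-open interval from gene_mask_start to gene_mask_end; neither detail is confirmed on this page. The helper encode_with_mask is hypothetical, and a short sequence stands in for the 524288-base input.

```python
def encode_with_mask(seq, gene_mask_start, gene_mask_end, alphabet="ACGT"):
    """Return a (5, len(seq)) nested list: 4 one-hot rows plus 1 binary mask row."""
    seq = seq.upper()
    # Rows 0-3: one-hot encoding; bases outside the alphabet (e.g. N) stay all-zero.
    rows = [[1 if base == letter else 0 for base in seq] for letter in alphabet]
    # Row 4: binary gene mask over [gene_mask_start, gene_mask_end).
    mask = [1 if gene_mask_start <= i < gene_mask_end else 0
            for i in range(len(seq))]
    return rows + [mask]
```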

property gene_end: int¶: Get the gene end position.

property gene_start: int¶: Get the gene start position.

property peaks: DataFrame¶

peaks_to_bed()[source]¶

Convert peaks to bed format.

Returns:

Peaks in bed format where columns are:

chrom: Chromosome name
start: Start position in genome
end: End position in genome
name: Peak name in format “gene@from_tss”
score: Score (-log10(p-value)) clipped to 0-100 based on the seqlet calling
strand: Strand, always ‘.’

Return type:

pd.DataFrame
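The score and name columns described above can be sketched in plain Python. This is an illustration under stated assumptions, not Decima's implementation: bed_score and bed_row are hypothetical helpers, and a p-value of zero is assumed to saturate at the upper bound.

```python
import math


def bed_score(p_value, max_score=100.0):
    """-log10(p-value), clipped to the range [0, max_score]."""
    if p_value <= 0:
        return max_score  # -log10(0) is unbounded, so it saturates at the cap
    return min(max(-math.log10(p_value), 0.0), max_score)


def bed_row(chrom, start, end, gene, from_tss, p_value):
    """Build one BED-style row with the six columns listed above."""
    return (chrom, start, end, f"{gene}@{from_tss}", bed_score(p_value), ".")
```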

plot_peaks(overlapping_min_dist=1000, figsize=(10, 2))[source]¶

Plot attribution scores and highlight peaks.

Parameters:

overlapping_min_dist – Minimum distance between peaks to consider them overlapping
figsize – Figure size in inches (width, height)

Returns:

The plotted figure showing attribution scores with highlighted peaks

Return type:

plotnine.ggplot

plot_seqlogo(relative_loc=0, window=50, figsize=(10, 2))[source]¶

Plot attribution scores around a relative location.

Parameters:

relative_loc – Position relative to TSS to center plot on
window – Number of bases to show on each side of center
figsize – Figure size in inches (width, height)

Returns:

Attribution plot

Return type:

matplotlib.pyplot.Figure

save_bigwig(bigwig_path)[source]¶

Save attribution scores as a bigwig file.

Parameters:: bigwig_path (str) – Path to save bigwig file.

save_fasta(fasta_path)[source]¶: Save attribution scores as a fasta file.

save_peaks(bed_path)[source]¶

Save peaks to bed file.

Parameters:: bed_path (str) – Path to save bed file.

scan_motifs(motifs='hocomoco_v13', window=18, pthresh=0.0005)[source]¶

Scan for motifs in peak regions.

Parameters:

motifs (str) – Motif database to use
window (int) – Window size around peaks
pthresh (float) – P-value threshold for motif matches

Returns:

Motif scan results with columns:

“motif”: Motif name
“peak”: Peak name
“start”: Start position of the peak
“end”: End position of the peak
“strand”: Strand of the peak
“score”: FIMO score of the motif
“p-value”: FIMO p-value of the motif
“matched_seq”: Matched sequence
“site_attr_score”: Site attribution score
“motif_attr_score”: Motif attribution score
“from_tss”: Distance from the TSS to the peak

Return type:

pd.DataFrame

property start: int¶: Get the start position.

property strand: str¶: Get the strand.

class decima.core.attribution.AttributionResult(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]¶

Bases: object

Attribution result from decima model.

Parameters:

attribution_h5 (Union[str, List[str]]) – Path to attribution h5 file or list of paths to attribution h5 files generated by decima attributions-predict or decima attributions commands.
tss_distance (Optional[int]) – Distance from the TSS to include in the attribution analysis.
correct_grad – Whether to correct the gradient for the attribution analysis.
num_workers (Optional[int]) – Number of workers to use for the attribution analysis.
agg_func (Optional[str]) – Function to aggregate the attribution scores.

Examples

>>> with AttributionResult(
...     attribution_h5=["example/attribution.h5", "example/attribution2.h5"]
... ) as ar:
...     seqs, attrs = ar.load(genes=["SPI1"])
...     attribution = ar.load_attribution(gene="SPI1")

__enter__()[source]¶

__exit__(exc_type, exc_value, traceback)[source]¶

__init__(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]¶

__repr__()[source]¶: Return repr(self).

static aggregate(seqs, attrs, agg_func=None)[source]¶: Aggregate the attribution scores.

close()[source]¶: Close the attribution h5 files.
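The combination of open(), close(), __enter__() and __exit__() follows the usual Python context-manager pattern. The sketch below shows that pattern with a hypothetical ResourceBundle class and placeholder string handles; whether AttributionResult wires its methods together in exactly this way is an assumption.

```python
class ResourceBundle:
    """Hypothetical stand-in for an object that manages several open files."""

    def __init__(self, paths):
        # Accept a single path or a list of paths.
        self.paths = [paths] if isinstance(paths, str) else list(paths)
        self.handles = None

    def open(self):
        # A real implementation would open each HDF5 file here.
        self.handles = [f"handle:{path}" for path in self.paths]

    def close(self):
        self.handles = None

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions raised inside the block
```

Using the object in a with block guarantees that close() runs even if the body raises.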

load(genes, gene_mask=False, **kwargs)[source]¶

Load the attribution scores for a list of genes.

Parameters:

genes (List[str]) – List of genes to load.
gene_mask (bool) – Whether to mask the gene.

Returns:

seqs: Array of sequences.
attrs: Array of attribution scores.

load_attribution(gene, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', **kwargs)[source]¶

Load the attribution scores for a gene.

Parameters:

gene (str) – Gene to load.
metadata_anndata (Optional[str]) – Metadata anndata object.
custom_genome (bool) – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

Attribution object.

open()[source]¶: Open the attribution h5 files.

recursive_seqlet_calling(genes=None, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13', **kwargs)[source]¶

Perform recursive seqlet calling on the attribution scores.

Parameters:

genes (Optional[List[str]]) – List of genes to perform recursive seqlet calling on.
metadata_anndata (Optional[str]) – Metadata anndata object.
custom_genome (bool) – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.
meme_motif_db (str) – MEME motif database to use for motif discovery.

Returns:

df_peaks: DataFrame of peaks.
df_motifs: DataFrame of motifs.

class decima.core.attribution.VariantAttributionResult(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]¶

Bases: AttributionResult

__init__(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)[source]¶

load(variants, genes, gene_mask=False)[source]¶: Load the attribution scores for a list of genes and variants.

load_attribution(variant, gene, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', **kwargs)[source]¶

Load the attribution scores for a variant and gene.

Parameters:

variant – Variant to load.
gene (str) – Gene to load.
metadata_anndata (Optional[str]) – Metadata anndata object.
custom_genome (bool) – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.

Returns:

Attribution object.

open()[source]¶: Open the attribution h5 files.

recursive_seqlet_calling(variants, genes, metadata_anndata=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13')[source]¶

Perform recursive seqlet calling on the attribution scores.

Parameters:

variants – List of variants to perform recursive seqlet calling on.
genes (Optional[List[str]]) – List of genes to perform recursive seqlet calling on.
metadata_anndata (Optional[str]) – Metadata anndata object.
custom_genome – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. Default is “both”, which considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.
meme_motif_db (str) – MEME motif database to use for motif discovery.

Returns:

df_peaks: DataFrame of peaks.
df_motifs: DataFrame of motifs.

decima.core.metadata module¶

class decima.core.metadata.CellMetadata(name, cell_type, tissue, organ, disease, study, dataset, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, region=None, subregion=None, celltype_coarse=None, co_term=None, co_name=None, frac_nan=None)[source]¶

Bases: object

Metadata for a cell in the dataset.

name¶: Cell identifier

cell_type¶: Detailed cell type

tissue¶: Tissue identifier

organ¶: Organ name

disease¶: Disease state

study¶: Study identifier

dataset¶: Dataset identifier

region¶: Anatomical region

subregion¶: Anatomical subregion

celltype_coarse¶: Coarse cell type classification

n_cells¶: Number of cells

total_counts¶: Total count of transcripts

n_genes¶: Number of genes detected

size_factor¶: Size normalization factor

train_pearson¶: Pearson correlation in training set

val_pearson¶: Pearson correlation in validation set

test_pearson¶: Pearson correlation in test set

__eq__(other)¶: Return self==value.

__hash__ = None¶

__init__(name, cell_type, tissue, organ, disease, study, dataset, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, region=None, subregion=None, celltype_coarse=None, co_term=None, co_name=None, frac_nan=None)¶

__match_args__ = ('name', 'cell_type', 'tissue', 'organ', 'disease', 'study', 'dataset', 'n_cells', 'total_counts', 'n_genes', 'size_factor', 'train_pearson', 'val_pearson', 'test_pearson', 'region', 'subregion', 'celltype_coarse', 'co_term', 'co_name', 'frac_nan')¶

__repr__()¶: Return repr(self).

cell_type: str¶

celltype_coarse: Optional[str] = None¶

co_name: Optional[str] = None¶

co_term: Optional[str] = None¶

dataset: str¶

disease: str¶

frac_nan: Optional[float] = None¶

classmethod from_series(name, series)[source]¶

Create CellMetadata from a pandas Series.

Return type:: CellMetadata
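The general idea of building a metadata dataclass from a named record can be sketched as below. This is an illustration only: ToyCellMetadata carries a small subset of the fields, from_mapping is a hypothetical name, and a plain dict stands in for the pandas Series. How the real from_series handles missing or extra entries is not stated on this page.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyCellMetadata:
    name: str
    cell_type: str
    n_cells: int
    region: Optional[str] = None  # optional fields default to None

    @classmethod
    def from_mapping(cls, name, record):
        """Build an instance from a name plus a dict-like record, ignoring extra keys."""
        known = {f.name for f in fields(cls)} - {"name"}
        kwargs = {key: record[key] for key in known if key in record}
        return cls(name=name, **kwargs)
```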

n_cells: int¶

n_genes: int¶

name: str¶

organ: str¶

region: Optional[str] = None¶

size_factor: float¶

study: str¶

subregion: Optional[str] = None¶

test_pearson: float¶

tissue: str¶

total_counts: float¶

train_pearson: float¶

val_pearson: float¶

class decima.core.metadata.GeneMetadata(name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, fold, dataset, gene_id, pearson, size_factor_pearson, frac_N=None, max_counts=None, ensembl_canonical_tss=None, upstream_bases=None, downstream_bases=None)[source]¶

Bases: object

Metadata for a gene in the dataset.

name¶: Gene name

chrom¶: Chromosome where the gene is located

start¶: Start position of the region around the gene to perform predictions in the chromosome

end¶: End position of the region around the gene to perform predictions in the chromosome

strand¶: Strand orientation (+ or -)

gene_type¶: Type of gene (e.g., protein_coding)

frac_nan¶: Fraction of NaN values

mean_counts¶: Mean count across samples

n_tracks¶: Number of tracks

gene_start¶: Gene start position

gene_end¶: Gene end position

gene_length¶: Length of the gene

gene_mask_start¶: Start position of the gene mask

gene_mask_end¶: End position of the gene mask

frac_N¶: Fraction of N bases

fold¶: Cross-validation fold

dataset¶: Dataset identifier

gene_id¶: Ensembl gene ID

pearson¶: Pearson correlation

size_factor_pearson¶: Size factor Pearson correlation

__eq__(other)¶: Return self==value.

__hash__ = None¶

__init__(name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, fold, dataset, gene_id, pearson, size_factor_pearson, frac_N=None, max_counts=None, ensembl_canonical_tss=None, upstream_bases=None, downstream_bases=None)¶

__match_args__ = ('name', 'chrom', 'start', 'end', 'strand', 'gene_type', 'frac_nan', 'mean_counts', 'n_tracks', 'gene_start', 'gene_end', 'gene_length', 'gene_mask_start', 'gene_mask_end', 'fold', 'dataset', 'gene_id', 'pearson', 'size_factor_pearson', 'frac_N', 'max_counts', 'ensembl_canonical_tss', 'upstream_bases', 'downstream_bases')¶

__repr__()¶: Return repr(self).

chrom: str¶

dataset: str¶

downstream_bases: Optional[int] = None¶

end: int¶

ensembl_canonical_tss: Optional[bool] = None¶

fold: List[str]¶

frac_N: Optional[float] = None¶

frac_nan: float¶

classmethod from_series(name, series)[source]¶

Create GeneMetadata from a pandas Series.

Return type:: GeneMetadata

gene_end: int¶

gene_id: str¶

gene_length: int¶

gene_mask_end: int¶

gene_mask_start: int¶

gene_start: int¶

gene_type: str¶

max_counts: Optional[int] = None¶

mean_counts: float¶

n_tracks: int¶

name: str¶

pearson: float¶

size_factor_pearson: float¶

start: int¶

strand: str¶

upstream_bases: Optional[int] = None¶

decima.core.result module¶

class decima.core.result.DecimaResult(anndata)[source]¶

Bases: object

Container for Decima results and model predictions.

This class provides a unified interface for loading pre-trained Decima models and associated metadata, making predictions, and performing attribution analyses.

The DecimaResult object contains:

An AnnData object with gene expression and metadata
A trained model for making predictions
Methods for attribution analysis and interpretation

Parameters:: anndata – AnnData object containing gene expression data and metadata

Examples

>>> # Load default pre-trained model and metadata
>>> result = DecimaResult.load()
>>> result.load_model(
...     rep=0
... )
>>> # Perform attribution analysis
>>> attributions = result.attributions(
...     gene="SPI1",
...     tasks='cell_type == "classical monocyte"',
... )

Properties:

model: Decima model
genes: List of gene names
cells: List of cell names
cell_metadata: Cell metadata
gene_metadata: Gene metadata
shape: Shape of the expression matrix
attributions: Attributions for a gene

__annotations__ = {}¶

__init__(anndata)[source]¶

__repr__()[source]¶: Return repr(self).

assert_genes(genes)[source]¶

Check if the genes are in the dataset.

Return type:

bool

attributions(gene, tasks=None, off_tasks=None, transform='specificity', method='inputxgradient', threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, genome='hg38')[source]¶

Get attributions for a specific gene.

Parameters:

gene (str) – Gene name
tasks (Optional[List[str]]) – Cells to use as the on task, given as a list of cell names or a query string
off_tasks (Optional[List[str]]) – Cells to use as the off task, given as a list of cell names or a query string
transform (str) – Attribution transform method
method (str) – Attribution method. Available options: “saliency”, “inputxgradient”, “integratedgradients”.
threshold (float) – Threshold for attribution analysis
min_seqlet_len (int) – Minimum length for seqlet calling
max_seqlet_len (int) – Maximum length for seqlet calling
additional_flanks (int) – Additional flanks for seqlet calling
genome (str) – Genome to use for attribution analysis. Can be a genome name or a path to a custom genome fasta file. Default: “hg38”

Returns:

Container with inputs, predictions, attribution scores and TSS position

Return type:

Attribution
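For orientation, the default method, “inputxgradient”, multiplies the gradient of the model output with respect to the input by the input itself. The sketch below illustrates this on a toy linear scorer whose gradient is known in closed form. The weights, the sequence and the A, C, G, T channel order are made up for illustration; Decima computes gradients through its trained network instead.

```python
# Toy linear "model": score = sum over bases b and positions i of W[b][i] * x[b][i].
# For a linear model the gradient d(score)/dx is simply W, so no autograd is needed.
BASES = "ACGT"  # assumed channel order
W = [
    [0.2, -0.1, 0.0, 0.5],  # weights for A at positions 0..3
    [0.0, 0.3, -0.4, 0.1],  # C
    [-0.2, 0.0, 0.6, 0.0],  # G
    [0.1, 0.1, 0.1, -0.3],  # T
]


def one_hot(seq):
    # Rows are channels (A, C, G, T); columns are sequence positions.
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def input_x_gradient(x, grad):
    # Element-wise product of the input and the gradient.
    return [
        [xi * gi for xi, gi in zip(x_row, g_row)]
        for x_row, g_row in zip(x, grad)
    ]


x = one_hot("AGGT")
attrs = input_x_gradient(x, W)

# Collapse the channel axis to get one attribution score per position.
per_position = [sum(column) for column in zip(*attrs)]
print(per_position)
```

Because the input is one-hot, only the weight of the base actually present at each position survives the product, which is why per-position scores can be read directly off the channel sum.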

property cell_metadata: DataFrame¶: Cell metadata including annotations, metrics, etc.

property cells: List[str]¶: List of cell identifiers in the dataset.

correlation(tasks, off_tasks, dataset='test')[source]¶

Compute the correlation between the ground truth and the predicted expression.

Parameters:

tasks – List of cells to use as on task.
off_tasks – List of cells to use as off task.
dataset – Dataset to use for computation.

Returns:

Pearson correlation coefficient.

Return type:

float
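As a reminder of what the returned value measures, here is a small sketch of the standard Pearson coefficient on two made-up vectors that stand in for measured and predicted expression. How Decima combines tasks and off_tasks before correlating is not described on this page and is not modelled here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration.
truth = [1.0, 2.0, 3.0, 4.0]
preds = [1.1, 1.9, 3.2, 3.8]
r = pearson(truth, preds)
print(round(r, 3))
```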

property gene_metadata: DataFrame¶: Gene metadata.

gene_sequence(gene, stranded=True, genome='hg38')[source]¶

Get sequence for a gene.

Parameters:

gene (str) – Gene name
stranded (bool) – Whether to return stranded sequence
genome (str) – Genome name or path to the genome fasta file. Default: “hg38”

Returns:

Sequence for the gene

Return type:

str
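The sketch below shows what a stranded sequence usually means, assuming the common convention that genes on the minus strand are returned as the reverse complement of the reference. This page does not spell out the convention, and stranded_sequence is a hypothetical helper, not part of Decima.

```python
# Translation table for complementing DNA, including N and lowercase bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


def stranded_sequence(seq: str, strand: str, stranded: bool = True) -> str:
    # Hypothetical helper: flip only when strandedness is requested
    # and the gene lies on the minus strand.
    if stranded and strand == "-":
        return reverse_complement(seq)
    return seq


print(stranded_sequence("ATGCN", "+"))
print(stranded_sequence("ATGCN", "-"))
print(stranded_sequence("ATGCN", "-", stranded=False))
```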

property genes: List[str]¶: List of gene names in the dataset.

get_cell_metadata(cell)[source]¶

Get metadata for a specific cell.

Return type:

CellMetadata

get_gene_metadata(gene)[source]¶

Get metadata for a specific gene.

Return type:

GeneMetadata

property ground_truth: DataFrame¶: Ground truth expression matrix.

classmethod load(anndata_name_or_path=None)[source]¶

Load a DecimaResult object from a model name, a path to an anndata file, or an AnnData object.

Parameters:

anndata_name_or_path (Union[str, AnnData, None]) – Name of the model or path to anndata file or anndata object
model – Model name or path to model checkpoint. If not provided, the default model will be loaded.

Returns:

DecimaResult object

Examples

>>> result = DecimaResult.load()  # Load default decima metadata
>>> result = DecimaResult.load(
...     "path/to/anndata.h5ad"
... )  # Load custom anndata object from file

load_model(model='v1_rep0', device='cpu')[source]¶

Load the trained model from a checkpoint path.

Parameters:

model (Union[int, str, None]) – Path to model checkpoint or replicate number (0-3) for pre-trained models
device (str) – Device to load model on

Returns:

self

Examples

>>> result = DecimaResult.load()
>>> result.load_model()  # Load default model (rep0)
>>> result.load_model(
...     model="path/to/checkpoint.ckpt"
... )
>>> result.load_model(
...     model=2
... )

marker_zscores(tasks, off_tasks=None, layer='preds')[source]¶

Compute marker z-scores to identify differentially expressed genes.

Parameters:

tasks – Target cells. Query string or list of cell IDs.
off_tasks – Background cells. Query string, list of cell IDs, or None (uses all other cells).
layer – Expression data layer. “preds” (default), “expression”, or custom layer name.

Returns:

DataFrame with the columns ‘gene’, ‘score’ (z-score), and ‘task’.

Return type:

pandas.DataFrame

Examples

>>> # Classical monocytes vs all others
>>> markers = result.marker_zscores(
...     "cell_type == 'classical monocyte'"
... )
>>> top_genes = markers.nlargest(
...     10, "score"
... )

>>> markers = result.marker_zscores(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
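This page does not state the exact formula behind the score. As an assumption for illustration only, the sketch below uses one common definition of a marker z-score: the mean expression in the target cells minus the mean in the background cells, divided by the background standard deviation. The function toy_marker_zscores, the gene names and the values are all hypothetical.

```python
from statistics import mean, pstdev


def toy_marker_zscores(expr, on_cells, off_cells):
    """expr maps gene -> {cell: value}; returns rows sorted by descending score.

    Assumed definition, not necessarily Decima's:
    (mean(on) - mean(off)) / population_std(off).
    """
    rows = []
    for gene, by_cell in expr.items():
        on = [by_cell[c] for c in on_cells]
        off = [by_cell[c] for c in off_cells]
        spread = pstdev(off)
        if spread == 0:
            continue  # z-score is undefined when the background has no variance
        rows.append({"gene": gene, "score": (mean(on) - mean(off)) / spread})
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Made-up expression values for two genes across four cells.
expr = {
    "GENE_A": {"c1": 5.0, "c2": 1.0, "c3": 2.0, "c4": 3.0},
    "GENE_B": {"c1": 2.0, "c2": 2.0, "c3": 4.0, "c4": 6.0},
}
rows = toy_marker_zscores(expr, on_cells=["c1"], off_cells=["c2", "c3", "c4"])
for row in rows:
    print(row["gene"], round(row["score"], 3))
```

Sorting by descending score mirrors the nlargest call in the example above: genes at the top are those most elevated in the target cells relative to the background.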

property model¶: Decima model.

plot_correlation(tasks, off_tasks, dataset='test')[source]¶

Plot the correlation between the ground truth and the predicted expression.

Parameters:

tasks – List of cells to use as on task.
off_tasks – List of cells to use as off task.
dataset – Dataset to use for computation.

Returns:

Plot of the correlation between the ground truth and the predicted expression.

Return type:

p9.ggplot

Examples

>>> result = DecimaResult.load()
>>> result.plot_correlation(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )

predicted_expression_matrix(genes=None, model_name=None)[source]¶

Get predicted expression matrix for all or specific genes.

Parameters:

genes (Optional[List[str]]) – Optional list of genes to get predictions for. If None, returns all genes.

Returns:

Predicted expression matrix (cells x genes)

Return type:

pd.DataFrame

predicted_gene_expression(gene, model_name)[source]¶

Get predicted expression for a specific gene.

Parameters:

gene – Gene name
model_name – Model name

Returns:

Predicted expression for the gene

Return type:

torch.Tensor

prepare_one_hot(gene, variants=None, padding=0, genome='hg38')[source]¶

Prepare one-hot encoding for a gene.

Parameters:

gene (str) – Gene name
variants (Optional[List[Dict]]) – Optional list of variant dictionaries to inject into the sequence
padding (int) – Amount of padding to add on both sides of the sequence
genome (str) – Genome name or path to the genome fasta file. Default: “hg38”

Returns:

One-hot encoding of the gene

Return type:

torch.Tensor
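To make the idea concrete, here is a rough sketch of injecting a single-nucleotide variant into a sequence and one-hot encoding the result. Everything in it is an assumption for illustration: the A, C, G, T channel order, the all-zero encoding of N, the apply_snv helper and its 0-based offset argument. The variant dictionary format, the padding behaviour and the tensor layout used by Decima are not described on this page.

```python
BASES = "ACGT"  # assumed channel order


def apply_snv(seq, offset, ref, alt):
    # Hypothetical variant representation: 0-based offset within seq.
    if seq[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {seq[offset]!r}")
    return seq[:offset] + alt + seq[offset + 1:]


def one_hot(seq):
    # Rows are channels (A, C, G, T); columns are positions.
    # Bases outside ACGT, such as N, become all-zero columns.
    return [[1 if base == b else 0 for base in seq.upper()] for b in BASES]


ref_seq = "ACGTN"
alt_seq = apply_snv(ref_seq, offset=2, ref="G", alt="T")
encoded = one_hot(alt_seq)

print(alt_seq)
for base, row in zip(BASES, encoded):
    print(base, row)
```

Checking the reference base before substituting catches coordinate mistakes early, which matters when variants are given in genomic coordinates and the sequence may be reverse-complemented.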

query_cells(query)[source]¶

Query cells based on a query string.

Parameters:

query (str) – Query string

Returns:

List of cell names

Examples

>>> result = DecimaResult.load()
>>> cells = result.query_cells(
...     "cell_type == 'classical monocyte'"
... )
>>> cells
['agg1', 'agg2', 'agg3', ...]

query_tasks(tasks=None, off_tasks=None)[source]¶

Query tasks based on a query string.

Parameters:

tasks (Optional[List[str]]) – Query string
off_tasks (Optional[List[str]]) – Query string

Returns:

List of tasks

Examples

>>> result = DecimaResult.load()
>>> tasks = result.query_tasks(
...     "cell_type == 'classical monocyte'"
... )
>>> tasks
[...]

property shape: tuple¶: Shape of the expression matrix (n_cells, n_genes).
