decima.core package
Submodules
decima.core.attribution module
Attribution analysis from decima model.
- class decima.core.attribution.Attribution(inputs, attrs, gene='', chrom=None, start=None, end=None, strand=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')
Bases: object
Attribution analysis results for a gene.
- Parameters:
inputs (Tensor) – One-hot encoded sequence
attrs (ndarray) – Attribution scores
gene – Gene name
min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding
max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding
additional_flanks (Optional[int]) – Additional flanks to add to the gene
- Returns:
Attribution analysis results for the gene and tasks
- Return type:
Examples
>>> attribution = Attribution(
...     gene="A1BG",
...     inputs=inputs,
...     attrs=attrs,
...     chrom="chr1",
...     start=100,
...     end=200,
...     strand="+",
...     threshold=5e-4,
...     min_seqlet_len=4,
...     max_seqlet_len=25,
...     additional_flanks=0,
... )
>>> attribution.plot_peaks()
>>> attribution.scan_motifs()
>>> attribution.save_bigwig("attributions.bigwig")
>>> attribution.peaks_to_bed()
- __init__(inputs, attrs, gene='', chrom=None, start=None, end=None, strand=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')
Initialize Attribution.
- Parameters:
inputs (Tensor) – One-hot encoded sequence
attrs (ndarray) – Attribution scores
min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding
max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding
additional_flanks (Optional[int]) – Additional flanks to add to the gene
pattern_type (Optional[str]) – Pattern type to use for peak finding. The default, “both”, considers both positive and negative patterns; “pos” or “neg” considers only positive or only negative peaks, respectively.
- static find_peaks(attrs, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both')
Find peaks in attribution scores.
- Parameters:
attrs – Attribution scores
threshold – Threshold for peak finding
min_seqlet_len – Minimum sequence length for peak finding
max_seqlet_len – Maximum sequence length for peak finding
additional_flanks – Additional flanks to add to the gene
pattern_type – Pattern type to use for peak finding. The default is “both”; “pos” or “neg” considers only positive or only negative peaks, respectively.
- Returns:
DataFrame of peaks with columns:
“peak”: Peak name in format “pattern_type.gene@from_tss”
“start”: Start position of the peak
“end”: End position of the peak
“attribution”: Attribution score of the peak
“p-value”: P-value of the peak
“from_tss”: Distance from the TSS to the peak
“pattern_type”: Pattern type of the peak
- Return type:
pd.DataFrame
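The peak naming and the effect of pattern_type can be illustrated with a small sketch. It is not the library's peak caller: peak_name and keep_peak are hypothetical helpers, the toy values are made up, and the sign-based filter is an assumption about what “pos” and “neg” select.

```python
def peak_name(pattern_type: str, gene: str, from_tss: int) -> str:
    """Build a label in the documented 'pattern_type.gene@from_tss' form."""
    return f"{pattern_type}.{gene}@{from_tss}"


def keep_peak(attribution: float, pattern_type: str = "both") -> bool:
    """Hypothetical filter: keep a peak according to the sign of its score."""
    if pattern_type == "both":
        return True
    if pattern_type == "pos":
        return attribution > 0
    if pattern_type == "neg":
        return attribution < 0
    raise ValueError(f"unknown pattern_type: {pattern_type!r}")


# Toy peaks as (attribution, from_tss) pairs; the numbers are invented.
peaks = [(0.8, -120), (-0.3, 45), (0.1, 2300)]
positive = [peak_name("pos", "SPI1", d) for a, d in peaks if keep_peak(a, "pos")]
print(positive)  # ['pos.SPI1@-120', 'pos.SPI1@2300']
```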
- classmethod from_seq(inputs, tasks=None, off_tasks=None, model='v1_rep0', transform='specificity', method='inputxgradient', device='cpu', result=None, gene='', chrom=None, start=None, end=None, strand=None, gene_mask_start=None, gene_mask_end=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0)
Initialize Attribution from sequence.
- Parameters:
inputs (Union[str, Tensor, ndarray]) – Sequence to analyze: a sequence string, or a torch.Tensor or np.ndarray with shape (4, 524288) or (5, 524288), where the fifth channel is a binary mask. If the input has only 4 channels, gene_mask_start and gene_mask_end must be provided.
tasks (Optional[list]) – List of cell types to analyze attributions for
off_tasks (Optional[list]) – List of cell types to contrast against
model (Union[int, str, None]) – Model to use for attribution analysis
transform (str) – Transformation to apply to attributions
method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.
device (Optional[str]) – Device to use for attribution analysis
result (Optional[str]) – Result object, path to a result object, or name of the model to load the result for.
gene_mask_start – Gene start position
gene_mask_end – Gene end position
min_seqlet_len (Optional[int]) – Minimum sequence length for peak finding
max_seqlet_len (Optional[int]) – Maximum sequence length for peak finding
additional_flanks (Optional[int]) – Additional flanks to add to the gene
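The two accepted array shapes can be pictured with a toy encoder written in plain Python lists. This is only a sketch under stated assumptions: the channel order A, C, G, T, the half-open mask interval, and the all-zero encoding of N are assumptions, and one_hot_with_mask is a hypothetical helper, not a library function.

```python
BASES = "ACGT"  # assumed channel order; the library's order may differ


def one_hot_with_mask(seq, gene_mask_start, gene_mask_end):
    """Return 5 rows: four base channels plus a binary gene-mask channel."""
    seq = seq.upper()
    # One row per base; bases outside ACGT (such as N) stay all-zero.
    rows = [[1 if base == b else 0 for base in seq] for b in BASES]
    # Mask row: 1 inside [gene_mask_start, gene_mask_end), 0 elsewhere.
    mask = [1 if gene_mask_start <= i < gene_mask_end else 0
            for i in range(len(seq))]
    return rows + [mask]


encoded = one_hot_with_mask("ACGTN", gene_mask_start=1, gene_mask_end=3)
for row in encoded:
    print(row)
```

A real input would have 524288 columns rather than 5; the short sequence here only keeps the output readable.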
- peaks_to_bed()
Convert peaks to bed format.
- Returns:
Peaks in bed format, with columns:
chrom: Chromosome name
start: Start position in genome
end: End position in genome
name: Peak name in format “gene@from_tss”
score: -log10(p-value) from the seqlet calling, clipped to the range 0-100
strand: Always ‘.’
- Return type:
pd.DataFrame
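The score column described above can be reproduced with a few lines of arithmetic. The sketch below is an illustration, not the library's code: bed_score is a hypothetical helper, and mapping a p-value of zero to the upper bound is an assumption made to avoid taking the logarithm of zero.

```python
import math


def bed_score(p_value: float) -> float:
    """Compute -log10(p_value) and clip it to the range 0-100."""
    if p_value <= 0:
        # Assumption: treat a zero p-value as maximally significant.
        return 100.0
    return min(max(0.0, -math.log10(p_value)), 100.0)


print(bed_score(1.0))     # 0.0
print(bed_score(1e-300))  # 100.0, because 300 is clipped to the upper bound
print(round(bed_score(5e-4), 3))
```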
- plot_peaks(overlapping_min_dist=1000, figsize=(10, 2))
Plot attribution scores and highlight peaks.
- Parameters:
overlapping_min_dist – Minimum distance between peaks to consider them overlapping
figsize – Figure size in inches (width, height)
- Returns:
The plotted figure showing attribution scores with highlighted peaks
- Return type:
plotnine.ggplot
- plot_seqlogo(relative_loc=0, window=50, figsize=(10, 2))
Plot attribution scores around a relative location.
- Parameters:
relative_loc – Position relative to TSS to center plot on
window – Number of bases to show on each side of center
figsize – Figure size in inches (width, height)
- Returns:
Attribution plot
- Return type:
matplotlib.pyplot.Figure
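How relative_loc and window select a region can be sketched as index arithmetic. The helper below is hypothetical: tss_pos (the index of the TSS within the attribution array) and seq_len are assumed names, strand handling is ignored, and the library may compute the slice differently.

```python
def seqlogo_window(tss_pos, relative_loc=0, window=50, seq_len=None):
    """Return (start, end) indices spanning `window` bases on each side."""
    center = tss_pos + relative_loc
    start = max(center - window, 0)  # do not run off the left edge
    end = center + window
    if seq_len is not None:
        end = min(end, seq_len)      # do not run off the right edge
    return start, end


print(seqlogo_window(1000, relative_loc=-200, window=50))  # (750, 850)
```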
- save_bigwig(bigwig_path)
Save attribution scores as a bigwig file.
- Parameters:
bigwig_path (str) – Path to save bigwig file.
- save_peaks(bed_path)
Save peaks to bed file.
- Parameters:
bed_path (str) – Path to save bed file.
- scan_motifs(motifs='hocomoco_v13', window=18, pthresh=0.0005)
Scan for motifs in peak regions.
- Parameters:
- Returns:
Motif scan results with columns:
“motif”: Motif name
“peak”: Peak name
“start”: Start position of the peak
“end”: End position of the peak
“strand”: Strand of the peak
“score”: FIMO score of the motif
“p-value”: FIMO p-value of the motif
“matched_seq”: Matched sequence
“site_attr_score”: Site attribution score
“motif_attr_score”: Motif attribution score
“from_tss”: Distance from the TSS to the peak
- Return type:
pd.DataFrame
- class decima.core.attribution.AttributionResult(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)
Bases: object
Attribution result from decima model.
- Parameters:
attribution_h5 (Union[str, List[str]]) – Path to attribution h5 file or list of paths to attribution h5 files generated by decima attributions-predict or decima attributions commands.
tss_distance (Optional[int]) – Distance from the TSS to include in the attribution analysis.
correct_grad – Whether to correct the gradient for the attribution analysis.
num_workers (Optional[int]) – Number of workers to use for the attribution analysis.
agg_func (Optional[str]) – Function to aggregate the attribution scores.
Examples
>>> with AttributionResult(
...     attribution_h5=["example/attribution.h5", "example/attribution2.h5"]
... ) as ar:
...     seqs, attrs = ar.load(genes=["SPI1"])
...     attribution = ar.load_attribution(gene="SPI1")
- __init__(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)
- load_attribution(gene, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', **kwargs)
Load the attribution scores for a gene.
- Parameters:
gene (str) – Gene to load.
custom_genome (bool) – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. The default is “both”; “pos” or “neg” considers only positive or only negative peaks, respectively.
- Returns:
Attribution object.
- recursive_seqlet_calling(genes=None, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13', **kwargs)
Perform recursive seqlet calling on the attribution scores.
- Parameters:
genes (Optional[List[str]]) – List of genes to perform recursive seqlet calling on.
custom_genome (bool) – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. The default is “both”; “pos” or “neg” considers only positive or only negative peaks, respectively.
meme_motif_db (str) – MEME motif database to use for motif discovery.
- Returns:
df_peaks – DataFrame of peaks.
df_motifs – DataFrame of motifs.
- class decima.core.attribution.VariantAttributionResult(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)
Bases: AttributionResult
- __annotations__ = {}
- __init__(attribution_h5, tss_distance=None, correct_grad=True, num_workers=-1, agg_func=None)
- load(variants, genes, gene_mask=False)
Load the attribution scores for a list of genes and variants.
- load_attribution(variant, gene, metadata_anndata=None, custom_genome=False, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', **kwargs)
Load the attribution scores for a variant and gene.
- Parameters:
gene (str) – Gene to load.
custom_genome (bool) – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. The default is “both”; “pos” or “neg” considers only positive or only negative peaks, respectively.
- Returns:
Attribution object.
- recursive_seqlet_calling(variants, genes, metadata_anndata=None, threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, pattern_type='both', meme_motif_db='hocomoco_v13')
Perform recursive seqlet calling on the attribution scores.
- Parameters:
genes (Optional[List[str]]) – List of genes to perform recursive seqlet calling on.
custom_genome – Whether to use custom genome.
threshold (float) – Threshold for peak finding.
min_seqlet_len (int) – Minimum sequence length for peak finding.
max_seqlet_len (int) – Maximum sequence length for peak finding.
additional_flanks (int) – Additional flanks to add to the gene.
pattern_type (str) – Pattern type to use for peak finding. The default is “both”; “pos” or “neg” considers only positive or only negative peaks, respectively.
meme_motif_db (str) – MEME motif database to use for motif discovery.
- Returns:
df_peaks – DataFrame of peaks.
df_motifs – DataFrame of motifs.
decima.core.metadata module
- class decima.core.metadata.CellMetadata(name, cell_type, tissue, organ, disease, study, dataset, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, region=None, subregion=None, celltype_coarse=None, co_term=None, co_name=None, frac_nan=None)
Bases: object
Metadata for a cell in the dataset.
- name
Cell identifier
- cell_type
Detailed cell type
- tissue
Tissue identifier
- organ
Organ name
- disease
Disease state
- study
Study identifier
- dataset
Dataset identifier
- region
Anatomical region
- subregion
Anatomical subregion
- celltype_coarse
Coarse cell type classification
- n_cells
Number of cells
- total_counts
Total count of transcripts
- n_genes
Number of genes detected
- size_factor
Size normalization factor
- train_pearson
Pearson correlation in training set
- val_pearson
Pearson correlation in validation set
- test_pearson
Pearson correlation in test set
- __annotations__ = {'cell_type': <class 'str'>, 'celltype_coarse': typing.Optional[str], 'co_name': typing.Optional[str], 'co_term': typing.Optional[str], 'dataset': <class 'str'>, 'disease': <class 'str'>, 'frac_nan': typing.Optional[float], 'n_cells': <class 'int'>, 'n_genes': <class 'int'>, 'name': <class 'str'>, 'organ': <class 'str'>, 'region': typing.Optional[str], 'size_factor': <class 'float'>, 'study': <class 'str'>, 'subregion': typing.Optional[str], 'test_pearson': <class 'float'>, 'tissue': <class 'str'>, 'total_counts': <class 'float'>, 'train_pearson': <class 'float'>, 'val_pearson': <class 'float'>}
- __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)
- __eq__(other)
Return self==value.
- __hash__ = None
- __init__(name, cell_type, tissue, organ, disease, study, dataset, n_cells, total_counts, n_genes, size_factor, train_pearson, val_pearson, test_pearson, region=None, subregion=None, celltype_coarse=None, co_term=None, co_name=None, frac_nan=None)
- __match_args__ = ('name', 'cell_type', 'tissue', 'organ', 'disease', 'study', 'dataset', 'n_cells', 'total_counts', 'n_genes', 'size_factor', 'train_pearson', 'val_pearson', 'test_pearson', 'region', 'subregion', 'celltype_coarse', 'co_term', 'co_name', 'frac_nan')
- __repr__()
Return repr(self).
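The dataclass parameters listed for CellMetadata (eq=True, frozen=False, unsafe_hash=False) imply field-by-field equality and an unhashable class, which matches the documented __hash__ = None. The sketch below shows that behavior on a hypothetical stand-in, CellMetadataSketch, which carries only a subset of the documented fields; the field values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellMetadataSketch:
    """Hypothetical stand-in with a few of the documented fields."""
    name: str
    cell_type: str
    tissue: str
    n_cells: int
    region: Optional[str] = None  # optional fields default to None


a = CellMetadataSketch("agg1", "classical monocyte", "blood", 1200)
b = CellMetadataSketch("agg1", "classical monocyte", "blood", 1200)

print(a == b)                               # True: generated __eq__ compares fields
print(CellMetadataSketch.__hash__ is None)  # True: eq without frozen disables hashing
print(a)                                    # generated __repr__ lists every field
```

One consequence is that such objects cannot be used as dictionary keys or set members; keying by the name field is a simple alternative.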
- class decima.core.metadata.GeneMetadata(name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, fold, dataset, gene_id, pearson, size_factor_pearson, frac_N=None, max_counts=None, ensembl_canonical_tss=None, upstream_bases=None, downstream_bases=None)
Bases: object
Metadata for a gene in the dataset.
- name
Gene name
- chrom
Chromosome where the gene is located
- start
Start position of the region around the gene to perform predictions in the chromosome
- end
End position of the region around the gene to perform predictions in the chromosome
- strand
Strand orientation (+ or -)
- gene_type
Type of gene (e.g., protein_coding)
- frac_nan
Fraction of NaN values
- mean_counts
Mean count across samples
- n_tracks
Number of tracks
- gene_start
Gene start position
- gene_end
Gene end position
- gene_length
Length of the gene
- gene_mask_start
Start position of the gene mask
- gene_mask_end
End position of the gene mask
- frac_N
Fraction of N bases
- fold
Cross-validation fold
- dataset
Dataset identifier
- gene_id
Ensembl gene ID
- pearson
Pearson correlation
- size_factor_pearson
Size factor Pearson correlation
- __annotations__ = {'chrom': <class 'str'>, 'dataset': <class 'str'>, 'downstream_bases': typing.Optional[int], 'end': <class 'int'>, 'ensembl_canonical_tss': typing.Optional[bool], 'fold': typing.List[str], 'frac_N': typing.Optional[float], 'frac_nan': <class 'float'>, 'gene_end': <class 'int'>, 'gene_id': <class 'str'>, 'gene_length': <class 'int'>, 'gene_mask_end': <class 'int'>, 'gene_mask_start': <class 'int'>, 'gene_start': <class 'int'>, 'gene_type': <class 'str'>, 'max_counts': typing.Optional[int], 'mean_counts': <class 'float'>, 'n_tracks': <class 'int'>, 'name': <class 'str'>, 'pearson': <class 'float'>, 'size_factor_pearson': <class 'float'>, 'start': <class 'int'>, 'strand': <class 'str'>, 'upstream_bases': typing.Optional[int]}
- __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)
- __eq__(other)
Return self==value.
- __hash__ = None
- __init__(name, chrom, start, end, strand, gene_type, frac_nan, mean_counts, n_tracks, gene_start, gene_end, gene_length, gene_mask_start, gene_mask_end, fold, dataset, gene_id, pearson, size_factor_pearson, frac_N=None, max_counts=None, ensembl_canonical_tss=None, upstream_bases=None, downstream_bases=None)
- __match_args__ = ('name', 'chrom', 'start', 'end', 'strand', 'gene_type', 'frac_nan', 'mean_counts', 'n_tracks', 'gene_start', 'gene_end', 'gene_length', 'gene_mask_start', 'gene_mask_end', 'fold', 'dataset', 'gene_id', 'pearson', 'size_factor_pearson', 'frac_N', 'max_counts', 'ensembl_canonical_tss', 'upstream_bases', 'downstream_bases')
- __repr__()
Return repr(self).
decima.core.result module
- class decima.core.result.DecimaResult(anndata)
Bases: object
Container for Decima results and model predictions.
This class provides a unified interface for loading pre-trained Decima models and associated metadata, making predictions, and performing attribution analyses.
- The DecimaResult object contains:
An AnnData object with gene expression and metadata
A trained model for making predictions
Methods for attribution analysis and interpretation
- Parameters:
anndata – AnnData object containing gene expression data and metadata
Examples
>>> # Load default pre-trained model and metadata
>>> result = DecimaResult.load()
>>> result.load_model(model="v1_rep0")
>>> # Perform attribution analysis
>>> attributions = result.attributions(
...     gene="SPI1",
...     tasks='cell_type == "classical monocyte"',
... )
- Properties:
model: Decima model
genes: List of gene names
cells: List of cell names
cell_metadata: Cell metadata
gene_metadata: Gene metadata
shape: Shape of the expression matrix
attributions: Attributions for a gene
- __annotations__ = {}
- attributions(gene, tasks=None, off_tasks=None, transform='specificity', method='inputxgradient', threshold=0.0005, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, genome='hg38')
Get attributions for a specific gene.
- Parameters:
gene (str) – Gene name
tasks (Optional[List[str]]) – List of cells to use as on task
off_tasks (Optional[List[str]]) – List of cells to use as off task
transform (str) – Attribution transform method
method (str) – Method to use for attribution analysis. Available options: “saliency”, “inputxgradient”, “integratedgradients”.
threshold (float) – Threshold for attribution analysis
min_seqlet_len (int) – Minimum length for seqlet calling
max_seqlet_len (int) – Maximum length for seqlet calling
additional_flanks (int) – Additional flanks for seqlet calling
genome (str) – Genome to use for attribution analysis. The default is “hg38”. Can be a genome name or a path to a custom genome fasta file.
- Returns:
Container with inputs, predictions, attribution scores and TSS position
- Return type:
- correlation(tasks, off_tasks, dataset='test')
Compute the correlation between the ground truth and the predicted expression.
- Parameters:
tasks – List of cells to use as on task.
off_tasks – List of cells to use as off task.
dataset – Dataset to use for computation.
- Returns:
Pearson correlation coefficient.
- Return type:
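For reference, the Pearson correlation coefficient itself can be computed as below. This sketch only shows the statistic on two equal-length lists; how the library derives the ground-truth and predicted vectors from tasks, off_tasks and dataset is not described on this page, so the inputs here are made-up numbers and pearson is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


truth = [1.0, 2.0, 3.0, 4.0]      # invented ground-truth values
predicted = [2.0, 4.0, 6.0, 8.0]  # invented predictions, perfectly linear in truth
r = pearson(truth, predicted)
print(round(r, 6))  # close to 1.0 for a perfectly linear relationship
```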
- classmethod load(anndata_name_or_path=None)
Load a DecimaResult object from an anndata name or a path to an anndata file.
- Parameters:
- Returns:
DecimaResult object
Examples
>>> result = DecimaResult.load()  # Load default decima metadata
>>> result = DecimaResult.load("path/to/anndata.h5ad")  # Load custom anndata object from file
- load_model(model='v1_rep0', device='cpu')
Load the trained model from a checkpoint path.
- Parameters:
- Returns:
self
Examples
>>> result = DecimaResult.load()
>>> result.load_model()  # Load default model (rep0)
>>> result.load_model(model="path/to/checkpoint.ckpt")
>>> result.load_model(model=2)
- marker_zscores(tasks, off_tasks=None, layer='preds')
Compute marker z-scores to identify differentially expressed genes.
- Parameters:
tasks – Target cells. Query string or list of cell IDs.
off_tasks – Background cells. Query string, list of cell IDs, or None (uses all other cells).
layer – Expression data layer. “preds” (default), “expression”, or custom layer name.
- Returns:
Columns are ‘gene’, ‘score’ (z-score), ‘task’.
- Return type:
Examples
>>> # Classical monocytes vs all others
>>> markers = result.marker_zscores("cell_type == 'classical monocyte'")
>>> top_genes = markers.nlargest(10, "score")
>>> markers = result.marker_zscores(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
- property model
Decima model.
- plot_correlation(tasks, off_tasks, dataset='test')
Plot the correlation between the ground truth and the predicted expression.
- Parameters:
tasks – List of cells to use as on task.
off_tasks – List of cells to use as off task.
dataset – Dataset to use for computation.
- Returns:
Plot of the correlation between the ground truth and the predicted expression.
- Return type:
p9.ggplot
Examples
>>> result = DecimaResult.load()
>>> result.plot_correlation(
...     tasks="cell_type == 'classical monocyte'",
...     off_tasks="cell_type == 'lymphoid progenitor'",
... )
- predicted_expression_matrix(genes=None, model_name=None)
Get predicted expression matrix for all or specific genes.
- predicted_gene_expression(gene, model_name)
Get predicted expression for a specific gene.
- Parameters:
gene – Gene name
model_name – Model name
- Returns:
Predicted expression for the gene
- Return type:
torch.Tensor
- prepare_one_hot(gene, variants=None, padding=0, genome='hg38')
Prepare one-hot encoding for a gene.
- Parameters:
- Returns:
One-hot encoding of the gene
- Return type:
torch.Tensor
- query_cells(query)
Query cells based on a query string.
- Parameters:
query (str) – Query string
- Returns:
List of cell names
Examples
>>> result = DecimaResult.load()
>>> cells = result.query_cells("cell_type == 'classical monocyte'")
>>> cells
['agg1', 'agg2', 'agg3', ...]