decima.utils package
Submodules
decima.utils.dataframe module
- class decima.utils.dataframe.ChunkDataFrameWriter(output_path, metadata=None)
Bases: object
- decima.utils.dataframe.chunk_df(df, chunksize)
Split a dataframe into chunks of chunksize rows.
- Parameters:
df (pd.DataFrame) – Input dataframe
chunksize (int) – Size of each chunk
- Returns:
Generator of dataframe chunks
- Return type:
Generator[pd.DataFrame, None, None]
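A minimal sketch of how such a generator might be written, assuming plain positional row slicing. chunk_df_sketch is a hypothetical stand-in, not the library's implementation, and the handling of a final short chunk is an assumption.

```python
from typing import Generator

import pandas as pd


def chunk_df_sketch(
    df: pd.DataFrame, chunksize: int
) -> Generator[pd.DataFrame, None, None]:
    """Yield consecutive row slices of df (hypothetical re-implementation)."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(df), chunksize):
        # iloc slices by position, so the original index is preserved.
        # Assumption: the last chunk may hold fewer than chunksize rows.
        yield df.iloc[start : start + chunksize]


df = pd.DataFrame(
    {
        "gene": ["SPI1", "GATA1", "CD19", "CD3E", "MS4A1"],
        "score": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)
chunk_sizes = [len(chunk) for chunk in chunk_df_sketch(df, chunksize=2)]
```

Because the function is a generator, only one chunk needs to be held at a time, which is the usual reason for chunking before writing to disk.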
- decima.utils.dataframe.ensemble_predictions(files, output_pq=None, save_replicates=False)
Aggregate replicates from parquet files.
- decima.utils.dataframe.read_metadata_from_replicate_parquets(files)
Read metadata from multiple parquet files and return it as a DataFrame.
This function reads the key-value metadata from each parquet file and extracts the model, distance parameters, and other metadata into a structured DataFrame. All files must contain the required metadata fields.
- Parameters:
files (List[str]) – List of parquet file paths to read metadata from
- Returns:
- DataFrame containing metadata with columns:
model: Model identifier
max_distance: Maximum distance used for predictions
min_distance: Minimum distance used for predictions
file: Source file path
- Return type:
pd.DataFrame
- Raises:
KeyError – If any required metadata field is missing from a file
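The validation described above can be sketched as follows. This assumes the key-value metadata of each file has already been read into a plain dict; metadata_table_sketch and its metadata_by_file argument are hypothetical, and the required fields are simply taken from the column list above.

```python
import pandas as pd

# Assumption: these are the required fields, copied from the documented columns.
REQUIRED_FIELDS = ("model", "max_distance", "min_distance")


def metadata_table_sketch(metadata_by_file):
    """Build one row per file, failing if a required field is absent."""
    rows = []
    for path, metadata in metadata_by_file.items():
        missing = [key for key in REQUIRED_FIELDS if key not in metadata]
        if missing:
            raise KeyError(f"{path} is missing metadata fields: {missing}")
        row = {key: metadata[key] for key in REQUIRED_FIELDS}
        row["file"] = path
        rows.append(row)
    return pd.DataFrame(rows, columns=[*REQUIRED_FIELDS, "file"])


table = metadata_table_sketch(
    {
        "rep0.parquet": {"model": "rep0", "max_distance": 1000, "min_distance": 0},
        "rep1.parquet": {"model": "rep1", "max_distance": 1000, "min_distance": 0},
    }
)
```

Failing on the first incomplete file keeps replicates with different distance settings from being combined silently.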
decima.utils.inject module
- class decima.utils.inject.SeqBuilder(chrom, start, end, anchor, track=None, genome='hg38')
Bases: object
Build the sequence from the variants.
- Parameters:
- decima.utils.inject.prepare_seq_alt_allele(gene, variants, genome='hg38')
Prepare the sequence and alt allele for a gene.
Example
--------------{---------}--------: ref
*------x------{---------}--------: alt, new sequence fetched from upstream due to deletion.
--------------{---------}--------: ref
--------------{----++---}----++--: alt, 4 bp cropped from downstream due to insertion.
              ^anchor
- Parameters:
gene (GeneMetadata) – gene metadata in the format of GeneMetadata.
variants (List[Dict]) – variants to inject in the format of [{"chrom": str, "pos": int, "ref": str, "alt": str}, ...].
- Returns:
the sequence (str) and gene mask start and end positions (int, int)
- Return type:
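The diagram above can be illustrated with a toy, pure-Python sketch on a short string. It is not the library's implementation: inject_variants_sketch, its 0-based window offsets, and the 1-based (VCF-style) reading of "pos" are all assumptions, and the gene mask positions that the real function returns are not modelled here.

```python
def inject_variants_sketch(genome_seq, start, end, anchor, variants):
    """Apply variants and return a window of length end - start.

    Hypothetical toy: start, end and anchor are 0-based, half-open offsets
    into genome_seq; each variant "pos" is assumed to be 1-based. Variants
    are assumed not to overlap each other.
    """
    upstream_len, downstream_len = anchor - start, end - anchor
    left, right = genome_seq[:anchor], genome_seq[anchor:]

    # Apply variants from the highest position down so that an edit never
    # shifts the offsets of variants that are still to be applied.
    for variant in sorted(variants, key=lambda v: v["pos"], reverse=True):
        i = variant["pos"] - 1
        j = i + len(variant["ref"])
        if genome_seq[i:j] != variant["ref"]:
            raise ValueError(f"ref allele mismatch at pos {variant['pos']}")
        if j <= anchor:
            left = left[:i] + variant["alt"] + left[j:]
        elif i >= anchor:
            right = right[: i - anchor] + variant["alt"] + right[j - anchor :]
        else:
            raise ValueError("variants spanning the anchor are not handled here")

    if len(left) < upstream_len or len(right) < downstream_len:
        raise ValueError("not enough flanking sequence to refill the window")
    # Keep the anchor fixed: refill from upstream, crop from downstream.
    return left[len(left) - upstream_len :] + right[:downstream_len]


genome_seq = "AAAACCCCGGGGTTTT"
ref_window = inject_variants_sketch(genome_seq, 4, 12, 8, [])
deletion = [{"chrom": "chr1", "pos": 6, "ref": "CC", "alt": "C"}]
del_window = inject_variants_sketch(genome_seq, 4, 12, 8, deletion)
insertion = [{"chrom": "chr1", "pos": 10, "ref": "G", "alt": "GTT"}]
ins_window = inject_variants_sketch(genome_seq, 4, 12, 8, insertion)
```

In this toy, the upstream deletion pulls one extra base into the left edge of the window, while the downstream insertion pushes two bases off the right edge, so the window length and the anchor offset stay constant in both cases.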
decima.utils.io module
- class decima.utils.io.AttributionWriter(path, genes, model_name, metadata_anndata=None, genome='hg38', bigwig=True, correct_grad_bigwig=True, custom_genes=False)
Bases: object
Write gene attribution data to HDF5 and BigWig files.
- Output files:
- HDF5 file:
genes: Gene names (string array)
attribution: Attribution scores (genes x 4 x context_size)
sequence: One-hot sequences (genes x 4 x context_size)
Attributes: model_name, genome
- BigWig file:
Mean attribution scores at genomic coordinates
- Parameters:
path – Output HDF5 file path.
genes – Gene names to write.
model_name – Model identifier for metadata.
metadata_anndata – Gene metadata source or path to the metadata anndata file. If not provided, the compatible metadata for the model will be used.
genome (str) – Reference genome version.
bigwig (bool) – Create BigWig file for genome browser.
correct_grad_bigwig (bool) – Correct gradient bigwig for bias.
custom_genes (bool) – If True, do not assert that the genes are in the result.
Examples
>>> with AttributionWriter("attrs.h5", ["SPI1"], "model_v1") as writer:
...     writer.add("SPI1", seqs=sequence_array, attrs=attribution_array)
- __init__(path, genes, model_name, metadata_anndata=None, genome='hg38', bigwig=True, correct_grad_bigwig=True, custom_genes=False)
- add(gene, seqs, attrs, gene_mask_start=None, gene_mask_end=None)
Add attribution data for a gene.
- Parameters:
gene (str) – Gene name from the genes list.
attrs (ndarray) – Attribution scores, shape (4, DECIMA_CONTEXT_SIZE).
seqs (ndarray) – One-hot DNA sequence, shape (4, DECIMA_CONTEXT_SIZE).
gene_mask_start (Optional[int]) – Gene mask start position. If None, the gene mask start position will be loaded from the result.
gene_mask_end (Optional[int]) – Gene mask end position. If None, the gene mask end position will be loaded from the result.
- class decima.utils.io.BigWigWriter(path, genome='hg38', threshold=1e-05)
Bases: object
Write genomic data to BigWig format for genome browser visualization.
Opens a BigWig file for writing, accumulates data, then writes and closes on exit.
- Parameters:
path – Output BigWig file path.
genome (str) – Reference genome name for chromosome sizes.
Examples
>>> with BigWigWriter("output.bigwig", "hg38") as writer:
...     writer.add("chr1", 1000, 2000, attribution_scores)
- class decima.utils.io.VariantAttributionWriter(path, genes, variants, model_name, metadata_anndata=None, genome='hg38')
Bases: AttributionWriter
- __annotations__ = {}
decima.utils.motifs module
decima.utils.qc module
- class decima.utils.qc.QCLogger(log_file, metadata_anndata=None)
Bases: object
Logger for QC
- Parameters:
- log_correlation(tasks, off_tasks=None, plot=True)
Log the correlation between tasks and off_tasks.
decima.utils.sequence module
- decima.utils.sequence.one_hot_to_seq(one_hot)
Convert a one-hot encoded sequence to a string.
- Parameters:
one_hot (np.ndarray or torch.Tensor) – One-hot encoded sequence
- Returns:
String representation of the sequence
- Return type:
str
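A minimal NumPy sketch of this kind of decoding, for illustration only. The channel order A, C, G, T and the (4, length) layout are assumptions, and one_hot_to_seq_sketch is a hypothetical name rather than the library function.

```python
import numpy as np

# Assumption: rows are ordered A, C, G, T.
ALPHABET = np.array(list("ACGT"))


def one_hot_to_seq_sketch(one_hot):
    """Decode a (4, length) one-hot array by taking the argmax per position."""
    one_hot = np.asarray(one_hot)
    if one_hot.ndim != 2 or one_hot.shape[0] != 4:
        raise ValueError("expected an array of shape (4, length)")
    # argmax over the channel axis gives one base index per position.
    return "".join(ALPHABET[one_hot.argmax(axis=0)])


one_hot = np.eye(4)[:, [0, 1, 2, 3, 0]]  # columns encode A, C, G, T, A
decoded = one_hot_to_seq_sketch(one_hot)  # "ACGTA"
```

Note that with a plain argmax, an all-zero column (often used for N) would decode as the first base; how the library treats such positions is not stated in this reference.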
- decima.utils.sequence.prepare_mask_gene(gene_start, gene_end, padding=0)
Prepare gene mask tensor for gene regions.
- Parameters:
gene_start – Start position of the gene
gene_end – End position of the gene
padding – Amount of padding to add on both sides. Defaults to 0
- Returns:
Gene mask tensor with 1s in gene region and 0s elsewhere
- Return type:
torch.Tensor
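A rough sketch of such a mask, written with NumPy rather than torch to keep it dependency-light. The context_size argument, the (1, length) shape, and the reading of padding as extra zero-filled positions on both sides are all assumptions, not confirmed by this reference; prepare_mask_gene_sketch is a hypothetical name.

```python
import numpy as np


def prepare_mask_gene_sketch(gene_start, gene_end, context_size, padding=0):
    """Return a (1, context_size + 2 * padding) mask with 1s over the gene."""
    mask = np.zeros((1, context_size + 2 * padding), dtype=np.float32)
    # Assumption: padding shifts the gene region right by `padding` positions.
    mask[0, gene_start + padding : gene_end + padding] = 1.0
    return mask


mask = prepare_mask_gene_sketch(2, 5, context_size=8)
padded_mask = prepare_mask_gene_sketch(2, 5, context_size=8, padding=1)
```

Under these assumptions, padding changes the total length of the mask but not the number of positions marked as gene.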