decima.utils package

Submodules

decima.utils.dataframe module

class decima.utils.dataframe.ChunkDataFrameWriter(output_path, metadata=None)[source]

Bases: object

__enter__()[source]
__exit__(exc_type, exc_val, exc_tb)[source]
__init__(output_path, metadata=None)[source]

Initialize ParquetWriter

Parameters:
  • output_path (str) – Path to the output parquet file

  • metadata (dict) – Metadata to write to the parquet file. Keys and values must be string-like / coercible to bytes.

write(chunk)[source]

Write dataframe chunk to parquet file

Parameters:

chunk (pd.DataFrame) – DataFrame chunk to write

Return type:

None

decima.utils.dataframe.chunk_df(df, chunksize)[source]

Chunk dataframe into chunks of size chunksize

Parameters:
  • df (pd.DataFrame) – Input dataframe

  • chunksize (int) – Size of each chunk

Returns:

Generator of dataframe chunks

Return type:

Generator[pd.DataFrame, None, None]
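
The chunking idea can be sketched without pandas. The example below is an illustration only, not the library's implementation: chunk_rows is a hypothetical helper that slices a plain list the same way a DataFrame could be sliced with df.iloc[i:i + chunksize]. It assumes the last chunk may be shorter than chunksize.

```python
from typing import Generator, List, Sequence


def chunk_rows(rows: Sequence, chunksize: int) -> Generator[List, None, None]:
    """Yield consecutive slices of `rows`, each with at most `chunksize` items.

    Hypothetical stand-in for chunk_df that works on a plain sequence.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(rows), chunksize):
        # Slicing past the end is safe, so the final chunk is simply shorter.
        yield list(rows[start:start + chunksize])


chunks = list(chunk_rows(list(range(7)), chunksize=3))
print(chunks)
```

Because the function is a generator, only one chunk is materialized at a time, which is the property that makes chunked writing to a single output file memory-friendly.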

decima.utils.dataframe.ensemble_predictions(files, output_pq=None, save_replicates=False)[source]

Aggregate replicates from parquet files

Parameters:
  • files (List[str]) – List of parquet files to aggregate

  • output_pq (Optional[str]) – Path to the output parquet file

Return type:

None

decima.utils.dataframe.read_metadata_from_replicate_parquets(files)[source]

Read metadata from multiple parquet files and return as a DataFrame.

This function reads key-value metadata from each parquet file and extracts model, distance parameters and other metadata into a structured DataFrame. All files must contain the required metadata fields.

Parameters:

files (List[str]) – List of parquet file paths to read metadata from

Returns:

DataFrame containing metadata with columns:
  • model: Model identifier

  • max_distance: Maximum distance used for predictions

  • min_distance: Minimum distance used for predictions

  • file: Source file path

Return type:

pd.DataFrame

Raises:

KeyError – If any required metadata field is missing from a file
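
The shape of the returned table and the KeyError behaviour can be sketched with plain dictionaries. This is a hedged illustration rather than the library's code: collect_metadata is a hypothetical helper, the per-file dictionaries stand in for the key-value metadata stored in each parquet file (byte keys and byte values), and the metadata keys are assumed to share the names of the output columns.

```python
from typing import Dict, List

# Assumption: metadata keys are named like the documented output columns.
REQUIRED_FIELDS = ("model", "max_distance", "min_distance")


def collect_metadata(file_metadata: Dict[str, Dict[bytes, bytes]]) -> List[dict]:
    """Build one row per file; raise KeyError if a required field is missing."""
    rows = []
    for path, meta in file_metadata.items():
        row = {}
        for field in REQUIRED_FIELDS:
            key = field.encode()
            if key not in meta:
                raise KeyError(f"{field!r} missing from metadata of {path}")
            row[field] = meta[key].decode()
        row["file"] = path
        rows.append(row)
    return rows


rows = collect_metadata(
    {
        "rep0.parquet": {
            b"model": b"rep0",
            b"max_distance": b"100000",
            b"min_distance": b"0",
        }
    }
)
print(rows)
```

A list of such rows can be passed directly to pd.DataFrame to obtain the tabular form described above.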

decima.utils.dataframe.write_df_chunks_to_parquet(chunks, output_path, metadata=None)[source]

Write dataframe chunks to parquet file

Parameters:
  • chunks (Iterator[pd.DataFrame]) – Iterator of dataframe chunks

  • output_path (str) – Path to the output parquet file

  • metadata (dict) – Metadata to write to the parquet file. If None, no metadata is written.

Return type:

None

decima.utils.inject module

class decima.utils.inject.SeqBuilder(chrom, start, end, anchor, track=None, genome='hg38')[source]

Bases: object

Build the sequence from the variants.

Parameters:
  • chrom (str) – chromosome

  • start (int) – start position

  • end (int) – end position

  • anchor (int) – anchor position

  • track (List[int]) – tracks position shifts due to indels.

  • genome (str) – Genome name or path to the genome fasta file. Defaults to “hg38”.

__init__(chrom, start, end, anchor, track=None, genome='hg38')[source]
concat()[source]

Build the string from sequence objects.

Returns:

the final sequence.

Return type:

str

inject(variant)[source]

Inject the variant into the sequence.

Parameters:

variant (Dict) – variant to inject in the format of {"chrom": str, "pos": int, "ref": str, "alt": str}

Returns:

self
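
Why position shifts need tracking can be shown on a plain string. The sketch below is an assumption-laden illustration, not SeqBuilder itself: inject_all is a hypothetical function, positions are taken to be 1-based as in VCF, seq_start is the 1-based coordinate of the first base of seq, and overlapping variants are not handled.

```python
from typing import Dict, List


def inject_all(seq: str, seq_start: int, variants: List[Dict]) -> str:
    """Apply variants left to right, tracking the cumulative indel shift."""
    shift = 0  # how far later positions have moved because of earlier indels
    for variant in sorted(variants, key=lambda v: v["pos"]):
        ref, alt = variant["ref"], variant["alt"]
        index = variant["pos"] - seq_start + shift
        if seq[index:index + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at {variant['pos']}")
        seq = seq[:index] + alt + seq[index + len(ref):]
        shift += len(alt) - len(ref)
    return seq


variants = [
    {"chrom": "chr1", "pos": 102, "ref": "C", "alt": "CTT"},  # 2 bp insertion
    {"chrom": "chr1", "pos": 105, "ref": "AC", "alt": "A"},   # 1 bp deletion
]
result = inject_all("ACGTACGT", seq_start=101, variants=variants)
print(result)
```

Without the running shift, the second variant would be looked up two bases too early and the reference check would fail.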

decima.utils.inject.prepare_seq_alt_allele(gene, variants, genome='hg38')[source]

Prepare the sequence and alt allele for a gene.

Example

--------------{---------}--------: ref
*------x------{---------}--------: alt
New sequence is fetched from upstream due to the deletion.

--------------{---------}--------: ref
--------------{----++---}----++--: alt
4 bp are cropped from downstream due to the insertion.

^anchor

Parameters:
  • gene (GeneMetadata) – gene metadata in the format of GeneMetadata.

  • variants (List[Dict]) – variants to inject in the format of [{"chrom": str, "pos": int, "ref": str, "alt": str}, ...].

Returns:

the sequence (str) and gene mask start and end positions (int, int)

Return type:

tuple
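
The length-preserving behaviour in the diagrams can be sketched on a toy genome. This is one possible reading of the diagrams, not the function's actual implementation: alt_window is hypothetical, coordinates are 0-based and half-open, and the genome is a plain string with enough flanking sequence on both sides.

```python
def alt_window(genome: str, start: int, end: int, anchor: int,
               pos: int, ref: str, alt: str) -> str:
    """Return a window of the original length after applying one variant.

    The anchor keeps its offset inside the window: indels upstream of the
    anchor move the window start, indels downstream change what is cropped.
    """
    if genome[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele mismatch")
    length = end - start
    mutated = genome[:pos] + alt + genome[pos + len(ref):]
    shift = len(alt) - len(ref)
    # Upstream variant: the anchor moved by `shift`, so move the start with it.
    new_start = start + shift if pos < anchor else start
    return mutated[new_start:new_start + length]


genome = "AAAACCCCGGGGTTTT"
reference = genome[4:12]                                  # "CCCCGGGG"
deletion = alt_window(genome, 4, 12, 8, 5, "CC", "C")     # upstream deletion
insertion = alt_window(genome, 4, 12, 8, 9, "G", "GTT")   # downstream insertion
print(reference, deletion, insertion)
```

In the deletion case one extra upstream base enters the window, and in the insertion case two downstream bases fall off the end, so the bases at and after the anchor stay aligned with the reference window up to the variant.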

decima.utils.io module

class decima.utils.io.AttributionWriter(path, genes, model_name, metadata_anndata=None, genome='hg38', bigwig=True, correct_grad_bigwig=True, custom_genes=False)[source]

Bases: object

Write gene attribution data to HDF5 and BigWig files.

Output files:
HDF5 file:
  • genes: Gene names (string array)

  • attribution: Attribution scores (genes x 4 x context_size)

  • sequence: One-hot sequences (genes x 4 x context_size)

  • Attributes: model_name, genome

BigWig file: Mean attribution scores at genomic coordinates

Parameters:
  • path – Output HDF5 file path.

  • genes – Gene names to write.

  • model_name – Model identifier for metadata.

  • metadata_anndata – Gene metadata source or path to the metadata anndata file. If not provided, the compatible metadata for the model will be used.

  • genome (str) – Reference genome version.

  • bigwig (bool) – Create BigWig file for genome browser.

  • correct_grad_bigwig (bool) – Correct gradient bigwig for bias.

  • custom_genes (bool) – If True, do not assert that the genes are in the result.

Examples

>>> with AttributionWriter(
...     "attrs.h5",
...     ["SPI1"],
...     "model_v1",
... ) as writer:
...     writer.add(
...         "SPI1",
...         seqs=sequence_array,
...         attrs=attribution_array,
...     )
__enter__()[source]

Context manager entry - opens files for writing.

__exit__(exc_type, exc_value, traceback)[source]

Context manager exit - closes files.

__init__(path, genes, model_name, metadata_anndata=None, genome='hg38', bigwig=True, correct_grad_bigwig=True, custom_genes=False)[source]
add(gene, seqs, attrs, gene_mask_start=None, gene_mask_end=None)[source]

Add attribution data for a gene.

Parameters:
  • gene (str) – Gene name from the genes list.

  • attrs (ndarray) – Attribution scores, shape (4, DECIMA_CONTEXT_SIZE).

  • seqs (ndarray) – One-hot DNA sequence, shape (4, DECIMA_CONTEXT_SIZE).

  • gene_mask_start (Optional[int]) – Gene mask start position. If None, the gene mask start position will be loaded from the result.

  • gene_mask_end (Optional[int]) – Gene mask end position. If None, the gene mask end position will be loaded from the result.

close()[source]

Close HDF5 file and optional BigWig file.

open()[source]

Open HDF5 file and optional BigWig file for writing.

class decima.utils.io.BigWigWriter(path, genome='hg38', threshold=1e-05)[source]

Bases: object

Write genomic data to BigWig format for genome browser visualization.

Opens a BigWig file for writing, accumulates data, then writes and closes on exit.

Parameters:
  • path – Output BigWig file path.

  • genome (str) – Reference genome name for chromosome sizes.

Examples

>>> with BigWigWriter(
...     "output.bigwig",
...     "hg38",
... ) as writer:
...     writer.add(
...         "chr1",
...         1000,
...         2000,
...         attribution_scores,
...     )
__enter__()[source]

Context manager entry - opens BigWig file.

__exit__(exc_type, exc_value, traceback)[source]

Context manager exit - closes BigWig file.

__init__(path, genome='hg38', threshold=1e-05)[source]
add(chrom, start, end, values)[source]

Add genomic interval data to be written.

Parameters:
  • chrom (str) – Chromosome name.

  • start (int) – Start position.

  • end (int) – End position.

  • values (ndarray) – Array of values for each position.

close()[source]

Write accumulated data to BigWig file and close.

open()[source]

Open BigWig file for writing and add chromosome header.

class decima.utils.io.VariantAttributionWriter(path, genes, variants, model_name, metadata_anndata=None, genome='hg38')[source]

Bases: AttributionWriter

__init__(path, genes, variants, model_name, metadata_anndata=None, genome='hg38')[source]
add(variant, gene, rel_pos, seqs_ref, attrs_ref, seqs_alt, attrs_alt, gene_mask_start, gene_mask_end)[source]

Add attribution data for a variant-gene pair.

Parameters:
  • variant (str) – Variant name from the variants list.

  • gene (str) – Gene name from the genes list.

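  • attrs_ref, attrs_alt – Attribution scores for the reference and alternate alleles, each of shape (4, DECIMA_CONTEXT_SIZE).

  • seqs_ref, seqs_alt – One-hot DNA sequences for the reference and alternate alleles, each of shape (4, DECIMA_CONTEXT_SIZE).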

close()[source]

Close HDF5 file and optional BigWig file.

open()[source]

Open HDF5 file and optional BigWig file for writing.

decima.utils.io.import_cyvcf2()[source]
decima.utils.io.read_fasta_gene_mask(fasta_file)[source]

Read the fasta file and return the gene mask

Parameters:

fasta_file (str) – Path to the fasta file

Returns:

DataFrame with the gene mask

Return type:

pd.DataFrame

decima.utils.io.read_vcf_chunks(vcf_file, chunksize)[source]

Read the vcf file and return the chunks

Parameters:
  • vcf_file (str) – Path to the vcf file

  • chunksize (int) – Size of the chunks

Returns:

Iterator of DataFrames with the chunks

Return type:

Iterator[pd.DataFrame]
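
Chunked reading of a record stream can be sketched with the standard library. The example is illustrative and does not use cyvcf2 or pandas: parse_records and read_chunks are hypothetical helpers that parse only the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and yield lists of dictionaries instead of DataFrames.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per VCF data line, skipping header and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


def read_chunks(lines: Iterable[str], chunksize: int) -> Iterator[List[dict]]:
    """Group the record stream into lists of at most `chunksize` records."""
    records = parse_records(lines)
    while True:
        chunk = list(islice(records, chunksize))
        if not chunk:
            return
        yield chunk


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tC\tT\t.\t.\t.",
    "chr2\t300\t.\tG\tGA\t.\t.\t.",
]
vcf_chunks = list(read_chunks(vcf_lines, chunksize=2))
print([len(chunk) for chunk in vcf_chunks])
```

Since the records are consumed lazily, the same pattern works when the lines come from an open file handle rather than a list.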

decima.utils.motifs module

decima.utils.motifs.motif_start_end(attributions, motif)[source]

Given attribution scores and a motif matrix, return the start and end positions of the window with the maximum score.

Parameters:
  • attributions (ndarray) – Attribution scores with shape (batch_size, seqlet_len, 4)

  • motif (ndarray) – Motif matrix with shape (motif_len, 4)

Returns:

Start and end positions of the motif with shape (batch_size, 2)

Return type:

[np.ndarray, np.ndarray]
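
A maximum-scoring window search can be sketched for a single attribution matrix. The scoring rule here is an assumption, since this page does not state it: each window is scored by the sum of elementwise products between the attributions and the motif. best_window is a hypothetical helper operating on nested lists; a batched version would apply it to every item.

```python
from typing import List, Tuple


def best_window(attr: List[List[float]], motif: List[List[float]]) -> Tuple[int, int]:
    """Return (start, end) of the window where the motif scores highest.

    attr has shape (seqlet_len, 4) and motif has shape (motif_len, 4).
    """
    seqlet_len, motif_len = len(attr), len(motif)
    if motif_len > seqlet_len:
        raise ValueError("motif is longer than the attribution window")
    best_start, best_score = 0, float("-inf")
    for start in range(seqlet_len - motif_len + 1):
        # Assumed score: sum of elementwise products over the window.
        score = sum(
            attr[start + i][base] * motif[i][base]
            for i in range(motif_len)
            for base in range(4)
        )
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_start + motif_len


attr = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
motif = [[1, 0, 0, 0], [0, 1, 0, 0]]
window = best_window(attr, motif)
print(window)
```

Ties are resolved in favour of the leftmost window because the comparison is strict.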

decima.utils.qc module

class decima.utils.qc.QCLogger(log_file, metadata_anndata=None)[source]

Bases: object

Logger for QC

Parameters:
  • log_file (str) – Path to the log file

  • metadata_anndata (str) – Path to the metadata anndata file

__enter__()[source]
__exit__(exc_type, exc_value, traceback)[source]
__init__(log_file, metadata_anndata=None)[source]
close()[source]
log(message, level='info')[source]

Log a message

Parameters:
  • message (str) – Message to log

  • level (str, optional) – Level of the message. Defaults to “info”.

log_correlation(tasks, off_tasks=None, plot=True)[source]

Log the correlation between tasks and off_tasks

Parameters:
  • tasks (str) – Tasks to use for correlation

  • off_tasks (str) – Off tasks to use for correlation

  • plot (bool, optional) – Whether to plot the correlation. Defaults to True.

log_gene(gene, threshold=0.5)[source]

Log the correlation of a gene with the model

Parameters:
  • gene (str) – Gene to log

  • threshold (float, optional) – Threshold for logging. Defaults to 0.5.

open()[source]

decima.utils.sequence module

decima.utils.sequence.one_hot_to_seq(one_hot)[source]

Convert one-hot encoded sequence to a string

Parameters:

one_hot (np.ndarray or torch.Tensor) – One-hot encoded sequence

Returns:

String representation of the sequence

Return type:

str
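
The decoding step can be sketched with nested lists. Several details are assumptions because they are not specified here: the input is taken to have shape (4, length), the row order is taken to be A, C, G, T, and all-zero columns are written as N. one_hot_to_seq_sketch is a hypothetical name, not the library function.

```python
from typing import List


def one_hot_to_seq_sketch(one_hot: List[List[float]], alphabet: str = "ACGT") -> str:
    """Decode a (4, length) one-hot matrix into a DNA string."""
    length = len(one_hot[0])
    bases = []
    for position in range(length):
        column = [row[position] for row in one_hot]
        top = max(column)
        # Assumption: a column with no positive entry is an unknown base.
        bases.append(alphabet[column.index(top)] if top > 0 else "N")
    return "".join(bases)


one_hot = [
    [1, 0, 0, 0],  # A
    [0, 1, 0, 0],  # C
    [0, 0, 1, 0],  # G
    [0, 0, 0, 0],  # T
]
decoded = one_hot_to_seq_sketch(one_hot)
print(decoded)
```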

decima.utils.sequence.prepare_mask_gene(gene_start, gene_end, padding=0)[source]

Prepare gene mask tensor for gene regions.

Parameters:
  • gene_start – Start position of the gene

  • gene_end – End position of the gene

  • padding – Amount of padding to add on both sides. Defaults to 0

Returns:

Gene mask tensor with 1s in gene region and 0s elsewhere

Return type:

torch.Tensor
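
The mask layout can be sketched as a plain list. This is an illustration under stated assumptions: gene_mask_sketch is a hypothetical helper, context_size is an invented parameter standing in for the model's context length, the gene interval is treated as half-open, and the padding argument is not modelled because its exact semantics are not specified here.

```python
from typing import List


def gene_mask_sketch(gene_start: int, gene_end: int, context_size: int) -> List[float]:
    """Return a mask with 1.0 inside [gene_start, gene_end) and 0.0 elsewhere."""
    if not 0 <= gene_start <= gene_end <= context_size:
        raise ValueError("gene interval must lie within the context window")
    return [
        1.0 if gene_start <= position < gene_end else 0.0
        for position in range(context_size)
    ]


mask = gene_mask_sketch(gene_start=2, gene_end=5, context_size=8)
print(mask)
```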

decima.utils.variant module

decima.utils.variant.process_variants(variants, ad=None, min_from_end=0)[source]

Module contents

decima.utils.get_compute_device(device=None)[source]

Get the best available device for computation.

Parameters:

device (Optional[str]) – Optional device specification. If None, automatically selects best available device.

Returns:

The selected device for computation

Return type:

torch.device