decima.utils package

Submodules

decima.utils.dataframe module

class decima.utils.dataframe.ChunkDataFrameWriter(output_path, metadata=None)

Bases: object

__enter__()

__exit__(exc_type, exc_val, exc_tb)

__init__(output_path, metadata=None)

Initialize the parquet writer.

Parameters:

output_path (str) – Path to the output parquet file
metadata (dict) – Metadata to write to the parquet file. Keys and values must be string-like / coercible to bytes.
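
Parquet libraries such as pyarrow expose file-level key-value metadata as bytes, which is why keys and values must be string-like. The following is a minimal sketch of such a coercion, assuming UTF-8 encoding. `encode_metadata` and the sample values are hypothetical and not part of decima.

```python
from typing import Dict, Optional


def encode_metadata(metadata: Optional[dict]) -> Dict[bytes, bytes]:
    """Coerce metadata keys and values to bytes (hypothetical helper)."""

    def to_bytes(value) -> bytes:
        # Bytes pass through unchanged; anything else is stringified
        # and encoded as UTF-8.
        return value if isinstance(value, bytes) else str(value).encode("utf-8")

    return {to_bytes(k): to_bytes(v) for k, v in (metadata or {}).items()}


# Illustrative values only; the real metadata keys are defined by the caller.
encoded = encode_metadata({"model": "rep0", "max_distance": 524288})
```

Values that have no meaningful string form (nested objects, arrays) would need explicit serialization, for example to JSON, before being passed as metadata.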

write(chunk)

Write dataframe chunk to parquet file

Parameters:

chunk (pd.DataFrame) – DataFrame chunk to write

Return type:

None

decima.utils.dataframe.chunk_df(df, chunksize)

Split a dataframe into chunks of size chunksize.

Parameters:

df (pd.DataFrame) – Input dataframe
chunksize (int) – Size of each chunk

Returns:

Generator of dataframe chunks

Return type:

Generator[pd.DataFrame, None, None]
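
The chunking pattern can be sketched with a plain generator. This is an illustration on a generic sequence, not decima's implementation; `chunk_rows` is a hypothetical name, and with a pandas DataFrame the slice would be `df.iloc[start:start + chunksize]`.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunk_rows(rows: Sequence[T], chunksize: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `chunksize` items."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(rows), chunksize):
        # The last chunk may be shorter than chunksize.
        yield rows[start:start + chunksize]


chunks = list(chunk_rows(list(range(7)), 3))
```

Because the function is a generator, only one chunk needs to be held in memory at a time, which is the reason to pair it with a chunk-wise writer.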

decima.utils.dataframe.ensemble_predictions(files, output_pq=None, save_replicates=False)

Aggregate replicates from parquet files

Parameters:

files (List[str]) – List of parquet files to aggregate
output_pq (Optional[str]) – Path to the output parquet file

Return type:

None

decima.utils.dataframe.read_metadata_from_replicate_parquets(files)

Read metadata from multiple parquet files and return as a DataFrame.

This function reads key-value metadata from each parquet file and extracts model, distance parameters and other metadata into a structured DataFrame. All files must contain the required metadata fields.

Parameters:

files (List[str]) – List of parquet file paths to read metadata from

Returns:

DataFrame containing metadata with columns:

model: Model identifier
max_distance: Maximum distance used for predictions
min_distance: Minimum distance used for predictions
file: Source file path

Return type:

pd.DataFrame

Raises:

KeyError – If any required metadata field is missing from a file
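
The behaviour described above (one row per file, with a KeyError when a required field is absent) can be sketched without parquet I/O. The sketch assumes the metadata has already been read into plain dictionaries; `metadata_rows`, `REQUIRED_FIELDS`, and the file names are hypothetical, and the real function returns a DataFrame rather than a list.

```python
from typing import Dict, List

# Fields listed in the documented return columns (besides "file").
REQUIRED_FIELDS = ("model", "max_distance", "min_distance")


def metadata_rows(file_metadata: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Build one record per file from already-decoded metadata."""
    rows = []
    for path, metadata in file_metadata.items():
        row = {}
        for field in REQUIRED_FIELDS:
            if field not in metadata:
                raise KeyError(f"{field!r} missing from metadata of {path}")
            row[field] = metadata[field]
        row["file"] = path
        rows.append(row)
    return rows


rows = metadata_rows({
    "rep0.parquet": {"model": "rep0", "max_distance": "1000", "min_distance": "0"},
    "rep1.parquet": {"model": "rep1", "max_distance": "1000", "min_distance": "0"},
})
```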

decima.utils.dataframe.write_df_chunks_to_parquet(chunks, output_path, metadata=None)

Write dataframe chunks to parquet file

Parameters:

chunks (Iterator[pd.DataFrame]) – Iterator of dataframe chunks
output_path (str) – Path to the output parquet file
metadata (dict) – Metadata to write to the parquet file. If None, no metadata is written.

Return type:

None

decima.utils.inject module

class decima.utils.inject.SeqBuilder(chrom, start, end, anchor, track=None)

Bases: object

Build the sequence from the variants.

Parameters:

chrom (str) – chromosome
start (int) – start position
end (int) – end position
anchor (int) – anchor position
track (List[int]) – tracks position shifts due to indels.

__init__(chrom, start, end, anchor, track=None)

concat()

Build the string from sequence objects.

Returns:

the final sequence.

Return type:

str

inject(variant)

Inject the variant into the sequence.

Parameters:

variant (Dict) – variant to inject in the format of {"chrom": str, "pos": int, "ref": str, "alt": str}

Returns:

self
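
To illustrate what injecting variants into a sequence involves, here is a minimal standalone sketch. It is not SeqBuilder's implementation: `apply_variants` is a hypothetical function, positions are assumed to be 1-based as in VCF, and variants are assumed not to overlap. Applying variants from right to left avoids having to track coordinate shifts caused by indels.

```python
from typing import Dict, List


def apply_variants(ref_seq: str, seq_start: int, variants: List[Dict]) -> str:
    """Apply non-overlapping variants to ref_seq.

    seq_start is the 1-based genomic coordinate of ref_seq[0] (assumption).
    """
    seq = ref_seq
    # Right-to-left order: an edit never moves bases located to its left.
    for variant in sorted(variants, key=lambda v: v["pos"], reverse=True):
        offset = variant["pos"] - seq_start
        ref, alt = variant["ref"], variant["alt"]
        if offset < 0 or seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {variant['pos']}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


alt_seq = apply_variants(
    "ACGTACGTAC",
    seq_start=101,
    variants=[
        {"chrom": "chr1", "pos": 103, "ref": "G", "alt": "T"},    # SNV
        {"chrom": "chr1", "pos": 105, "ref": "AC", "alt": "A"},   # 1 bp deletion
        {"chrom": "chr1", "pos": 108, "ref": "T", "alt": "TGG"},  # 2 bp insertion
    ],
)
```

The result is one base longer than the reference (two inserted, one deleted), which is why a fixed-length model input needs a subsequent length correction.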

decima.utils.inject.prepare_seq_alt_allele(gene, variants)

Prepare the sequence and alt allele for a gene.

Example

--------------{---------}--------: ref
*------x------{---------}--------: alt (new sequence fetched from upstream due to the deletion)

--------------{---------}--------: ref
--------------{----++---}----++--: alt (4 bp cropped from downstream due to the insertion)

^anchor

Parameters:

gene (GeneMetadata) – gene metadata as a GeneMetadata object.
variants (List[Dict]) – variants to inject in the format of [{"chrom": str, "pos": int, "ref": str, "alt": str}, ...].

Returns:

the sequence (str) and gene mask start and end positions (int, int)

Return type:

tuple
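
The length correction shown in the example diagram (fetch extra upstream sequence after a deletion, crop the downstream end after an insertion) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `restore_length` and all of its parameters are hypothetical, and the sequence is assumed to be already split at the anchor.

```python
def restore_length(
    upstream: str,
    downstream: str,
    n_up: int,
    n_down: int,
    extra_upstream: str = "",
    extra_downstream: str = "",
) -> str:
    """Return a sequence with exactly n_up bases before the anchor
    and n_down bases from the anchor onward.

    upstream / downstream: alt sequence before / from the anchor.
    extra_upstream / extra_downstream: additional reference bases that
    may be used to fill gaps left by deletions.
    """
    # Pad on the far side, then keep the bases closest to the anchor.
    padded_up = extra_upstream + upstream
    left = padded_up[len(padded_up) - n_up:] if n_up <= len(padded_up) else ""
    right = (downstream + extra_downstream)[:n_down]
    if len(left) != n_up or len(right) != n_down:
        raise ValueError("not enough flanking sequence to restore length")
    return left + right


# Upstream lost 1 bp to a deletion; downstream gained 2 bp from an insertion.
fixed = restore_length("ACG", "GGCCAA", n_up=4, n_down=4, extra_upstream="TT")
```

Keeping the anchor at a fixed index means coordinates relative to it, such as the gene mask start and end, only need to be adjusted for indels that fall between the anchor and the mask boundaries.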

decima.utils.io module

decima.utils.io.import_cyvcf2()

decima.utils.io.read_fasta_gene_mask(fasta_file)

Return type:

DataFrame

decima.utils.io.read_vcf_chunks(vcf_file, chunksize)

Return type:

Iterator[DataFrame]

decima.utils.variant module

decima.utils.variant.process_variants(variants, ad=None, min_from_end=0)

Module contents

decima.utils.get_compute_device(device=None)

Get the best available device for computation.

Parameters:

device (Optional[str]) – Optional device specification. If None, automatically selects the best available device.

Returns:

The selected device for computation

Return type:

torch.device
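
A common way to implement this kind of selection is a fixed priority order. The sketch below assumes CUDA is preferred over MPS and MPS over CPU; that order is an assumption, not something the documentation above states. `choose_device` is a hypothetical, dependency-free stand-in that returns a device name. With PyTorch installed, the flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the result would be wrapped in `torch.device(...)`.

```python
from typing import Optional


def choose_device(
    requested: Optional[str] = None,
    cuda_available: bool = False,
    mps_available: bool = False,
) -> str:
    """Pick a device name; an explicit request always wins."""
    if requested is not None:
        return requested
    if cuda_available:   # assumed first choice
        return "cuda"
    if mps_available:    # assumed second choice (Apple silicon)
        return "mps"
    return "cpu"         # always available


name = choose_device(cuda_available=False, mps_available=True)
```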