decima.utils package
Submodules
decima.utils.dataframe module
- class decima.utils.dataframe.ChunkDataFrameWriter(output_path, metadata=None)
Bases: object
- decima.utils.dataframe.chunk_df(df, chunksize)
Split a dataframe into chunks of size chunksize.
- Parameters:
df (pd.DataFrame) – Input dataframe
chunksize (int) – Size of each chunk
- Returns:
Generator of dataframe chunks
- Return type:
Generator[pd.DataFrame, None, None]
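A minimal sketch of what chunking with this signature might look like, assuming the chunks are consecutive row slices and the last chunk may be shorter than chunksize. The name chunk_df_sketch is hypothetical and this is not taken from the decima source.

```python
from typing import Generator

import pandas as pd


def chunk_df_sketch(
    df: pd.DataFrame, chunksize: int
) -> Generator[pd.DataFrame, None, None]:
    """Yield consecutive row slices of df (illustrative stand-in, not decima's code)."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(df), chunksize):
        # iloc slicing is positional, so a non-default index does not matter.
        yield df.iloc[start:start + chunksize]


df = pd.DataFrame({"x": range(5)})
sizes = [len(chunk) for chunk in chunk_df_sketch(df, 2)]  # [2, 2, 1]
```

Because the function is a generator, only one chunk needs to be materialised at a time, which is presumably the point when the chunks are written out incrementally.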
- decima.utils.dataframe.ensemble_predictions(files, output_pq=None, save_replicates=False)
Aggregate replicates from parquet files
- decima.utils.dataframe.read_metadata_from_replicate_parquets(files)
Read metadata from multiple parquet files and return as a DataFrame.
This function reads key-value metadata from each parquet file and extracts model, distance parameters and other metadata into a structured DataFrame. All files must contain the required metadata fields.
- Parameters:
files (List[str]) – List of parquet file paths to read metadata from
- Returns:
- DataFrame containing metadata with columns:
model: Model identifier
max_distance: Maximum distance used for predictions
min_distance: Minimum distance used for predictions
file: Source file path
- Return type:
pd.DataFrame
- Raises:
KeyError – If any required metadata field is missing from a file
decima.utils.inject module
- class decima.utils.inject.SeqBuilder(chrom, start, end, anchor, track=None)
Bases: object
Build the sequence from the variants.
- Parameters:
- decima.utils.inject.prepare_seq_alt_allele(gene, variants)
Prepare the sequence and alt allele for a gene.
Example
--------------{---------}--------: ref
*------x------{---------}--------: alt  new sequence fetched from the upstream due to deletion.

--------------{---------}--------: ref
--------------{----++---}----++--: alt  4 bp cropped from the downstream due to insertion.
              ^anchor
- Parameters:
gene (GeneMetadata) – gene metadata in the format of GeneMetadata.
variants (List[Dict]) – variants to inject in the format of [{"chrom": str, "pos": int, "ref": str, "alt": str}, ...].
- Returns:
the sequence (str) and gene mask start and end positions (int, int)
- Return type:
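The fixed-length behaviour in the example diagram can be illustrated with a toy, single-variant sketch. This is not decima's implementation: inject_variant, the plain-string genome and the 0-based coordinates are all hypothetical, and a variant that overlaps the anchor is simply treated as downstream.

```python
def inject_variant(genome: str, start: int, end: int, anchor: int,
                   pos: int, ref: str, alt: str) -> str:
    """Apply one variant, then re-crop so the window keeps its length
    and the anchor keeps its offset inside the window (toy model)."""
    if genome[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the genome at pos")
    edited = genome[:pos] + alt + genome[pos + len(ref):]
    shift = len(alt) - len(ref)  # > 0 for insertions, < 0 for deletions
    if pos + len(ref) <= anchor:
        # Upstream of the anchor: the anchor moves by `shift`, so the window
        # moves with it. A deletion pulls in extra upstream bases.
        if start + shift < 0:
            raise ValueError("not enough upstream sequence to refill the window")
        return edited[start + shift:end + shift]
    # Downstream of the anchor: the left edge stays put. An insertion pushes
    # bases past the right edge, where they are cropped.
    return edited[start:end]


genome = "AAAACCCCGGGGTTTT"
# Reference window [4, 12) is "CCCCGGGG", with the anchor at position 8.
deleted = inject_variant(genome, 4, 12, 8, pos=5, ref="CC", alt="C")
inserted = inject_variant(genome, 4, 12, 8, pos=9, ref="G", alt="GTT")
# deleted  == "ACCCGGGG"  (one extra upstream base fetched)
# inserted == "CCCCGGTT"  (two downstream bases cropped)
```

In both cases the output has the same length as the reference window and the base at the anchor offset is unchanged, which matches the two rows of the diagram.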