grelu.data.augment
grelu.data.augment contains functions to augment genomic sequences or functional genomic data.
All functions assume that the input is either:
- a 1-D numpy array of shape (length,) containing an integer-encoded DNA sequence, or
- a 2-D numpy array of shape (tasks, length) containing a label.
The augmented output must be returned in the same format. All augmentation functions also require an index (idx), which is an integer or boolean value.
This module also contains the Augmenter class, which applies multiple augmentations to a given DNA sequence or (sequence, label) pair.
Attributes
Classes
Augmenter | A class that generates augmented DNA sequences or (sequence, label) pairs.
Functions
_split_overall_idx(idx, max_values) | Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.
shift(arr, seq_len, idx) | Shift a sliding window along a sequence or label by the given number of bases.
rc_seq(seq, idx) | Reverse complement a sequence based on the index.
rc_label(label, idx) | Reverse a label based on the index.
Module Contents
- grelu.data.augment._split_overall_idx(idx: int, max_values: List[int]) -> List[List[int]]
Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.
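One way to picture this split is to read idx as a mixed-radix number with one digit per augmentation type. The sketch below is only an illustration under stated assumptions, not the library's implementation: `split_index` and `sizes` are hypothetical names, each entry of `sizes` is assumed to be an exclusive upper bound, the last position is assumed to vary fastest, and the result is a flat list rather than the nested type shown in the signature.

```python
from typing import List


def split_index(idx: int, sizes: List[int]) -> List[int]:
    """Decompose idx into one sub-index per entry of sizes.

    Hypothetical helper: sub-index i lies in range(sizes[i]) and the
    last position varies fastest (row-major order).
    """
    total = 1
    for size in sizes:
        total *= size
    if not 0 <= idx < total:
        raise IndexError(f"idx must be in [0, {total}), got {idx}")

    parts = []
    # Peel off the fastest-varying position first, then reverse.
    for size in reversed(sizes):
        idx, part = divmod(idx, size)
        parts.append(part)
    return parts[::-1]


# Two choices for the first augmentation, three for the second.
all_splits = [split_index(i, [2, 3]) for i in range(6)]
```

Under these assumptions every overall index from 0 to 5 maps to a distinct pair of sub-indices, so iterating over the overall index enumerates every combination exactly once.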
- grelu.data.augment.shift(arr: numpy.ndarray, seq_len: int, idx: int) -> numpy.ndarray
Shift a sliding window along a sequence or label by the given number of bases.
- Parameters:
arr – Numpy array with length as the last dimension.
seq_len – Desired length for the output sequence.
idx – Start position of the output window along the last dimension of arr.
- Returns:
Shifted sequence or label
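The parameters above suggest that shifting amounts to slicing a fixed-length window out of a longer array. The following is a minimal sketch of that idea, assuming idx is a zero-based start position and that the window fits inside the array; `take_window` is a hypothetical stand-in, not the library function.

```python
import numpy as np


def take_window(arr: np.ndarray, seq_len: int, idx: int) -> np.ndarray:
    """Return seq_len positions of arr starting at idx along the last axis.

    Hypothetical stand-in for a shift augmentation; assumes
    0 <= idx and idx + seq_len <= arr.shape[-1].
    """
    return arr[..., idx : idx + seq_len]


seq = np.array([0, 1, 2, 3, 0, 1, 2, 3])   # 1-D integer-encoded sequence
label = np.arange(16).reshape(2, 8)        # 2-D label of shape (tasks, length)

seq_window = take_window(seq, seq_len=6, idx=2)
label_window = take_window(label, seq_len=6, idx=2)
```

Slicing with `...` lets the same function serve both input formats, because the length is always the last dimension.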
- grelu.data.augment.rc_seq(seq: numpy.ndarray, idx: bool) -> numpy.ndarray
Reverse complement a sequence based on the index.
- Parameters:
seq – Integer-encoded sequence.
idx – If True, the reverse complement sequence will be returned. If False, the sequence will be returned unchanged.
- Returns:
Same or reverse complemented sequence
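As an illustration, reverse complementation of an integer-encoded sequence can be done with a lookup table. The sketch below assumes the encoding A=0, C=1, G=2, T=3, N=4, which is an assumption made for this example and is not stated on this page; `reverse_complement` and `COMPLEMENT` are hypothetical names.

```python
import numpy as np

# Assumed encoding (not confirmed here): A=0, C=1, G=2, T=3, N=4.
# COMPLEMENT[i] is the code of the base complementary to code i.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


def reverse_complement(seq: np.ndarray, idx: bool) -> np.ndarray:
    """Return the reverse complement of seq if idx is True, else seq unchanged."""
    if not idx:
        return seq
    # Reverse the order of the bases, then map each base to its complement.
    return COMPLEMENT[seq[::-1]]


seq = np.array([0, 0, 1, 2, 4])            # "AACGN" under the assumed encoding
rc = reverse_complement(seq, True)         # "NCGTT" under the assumed encoding
same = reverse_complement(seq, False)
```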
- grelu.data.augment.rc_label(label: numpy.ndarray, idx: bool) -> numpy.ndarray
Reverse a label based on the index.
- Parameters:
label – Numpy array with length as the last dimension
idx – If True, the label will be reversed along the length axis. If False, the label will be returned unchanged.
- Returns:
Same or reversed label
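Reversing a label along its length axis keeps it aligned with a reverse-complemented sequence. A minimal sketch, assuming the length is the last dimension as described above; `reverse_label` is a hypothetical name used only for illustration.

```python
import numpy as np


def reverse_label(label: np.ndarray, idx: bool) -> np.ndarray:
    """Reverse label along its last (length) axis if idx is True."""
    if not idx:
        return label
    return label[..., ::-1]


label = np.arange(6).reshape(2, 3)         # shape (tasks, length)
flipped = reverse_label(label, True)
unchanged = reverse_label(label, False)
```

Only the length axis is reversed; the order of the tasks is left as it was.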
- class grelu.data.augment.Augmenter(rc: bool = False, max_seq_shift: int = 0, max_pair_shift: int = 0, n_mutated_seqs: int = 0, n_mutated_bases: int | None = None, protect: List[int] = [], seq_len: int | None = None, label_len: int | None = None, seed: int | None = None, mode: str = 'serial')
A class that generates augmented DNA sequences or (sequence, label) pairs.
- Parameters:
rc – If True, augmentation by reverse complementation will be performed.
max_seq_shift – Maximum number of bases by which the sequence alone can be shifted. This is normally a small value (< 10).
max_pair_shift – Maximum number of bases by which the sequence and label can be jointly shifted. This can be a larger value.
n_mutated_seqs – Number of augmented sequences to generate by random mutation.
n_mutated_bases – The number of bases to mutate in each augmented sequence. Only used if n_mutated_seqs is greater than 0.
protect – A list of positions to protect from random mutation. Only used if n_mutated_seqs is greater than 0.
seq_len – Length of the augmented sequences.
label_len – Length of the augmented labels.
seed – Random seed for reproducibility.
mode – Either "serial" or "random". The default is "serial".
- __len__() -> int
The total number of augmented sequences that can be produced from a single DNA sequence.
- _split(idx: int) -> List[tuple]
Split an input index into indices specifying each type of augmentation.
- _get_random_idxs() -> List[tuple]
Select indices for each type of augmentation at random.
- __call__(idx: int, seq: numpy.ndarray, label: numpy.ndarray | None = None) -> numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]
Perform augmentation on a given integer-encoded DNA sequence or (sequence, label) pair.
- Parameters:
idx – Index specifying the augmentation to be performed.
seq – A single integer-encoded DNA sequence.
label – A numpy array of shape (T, L), that is (tasks, length), containing the label.
- Returns:
The augmented DNA sequence, or the augmented (sequence, label) pair if label is supplied.
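To show how a single index could select a combination of augmentations, here is a toy class that combines only reverse complementation and sequence shifting. It is a sketch under several assumptions and is not the Augmenter implementation: `ToyAugmenter` and its `max_shift` parameter are hypothetical, the input is assumed to be 2 * max_shift bases longer than seq_len, the encoding A=0, C=1, G=2, T=3, N=4 is assumed, and the way the count and the random mode are computed is a guess at one reasonable design.

```python
import numpy as np

# Assumed encoding (not confirmed here): A=0, C=1, G=2, T=3, N=4.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


class ToyAugmenter:
    """Illustrative stand-in that combines shifting and reverse complementation."""

    def __init__(self, seq_len, rc=False, max_shift=0, seed=None, mode="serial"):
        if mode not in ("serial", "random"):
            raise ValueError("mode must be 'serial' or 'random'")
        self.seq_len = seq_len
        self.n_rc = 2 if rc else 1          # unchanged, or reverse complemented
        self.n_shift = 2 * max_shift + 1    # possible window start positions
        self.mode = mode
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        # Every combination of the two augmentation types is counted once.
        return self.n_rc * self.n_shift

    def __call__(self, idx, seq):
        if self.mode == "random":
            # Ignore the supplied index and draw a combination at random.
            idx = int(self.rng.integers(len(self)))
        rc_idx, shift_idx = divmod(idx, self.n_shift)
        out = seq[shift_idx : shift_idx + self.seq_len]
        return COMPLEMENT[out[::-1]] if rc_idx else out


aug = ToyAugmenter(seq_len=4, rc=True, max_shift=1)
seq = np.array([0, 1, 1, 2, 3, 0])          # length = seq_len + 2 * max_shift
augmented = [aug(i, seq) for i in range(len(aug))]
```

In this toy design, serial mode makes the output a deterministic function of idx, so looping over `range(len(aug))` visits every combination once, while random mode draws a combination from a seeded generator on each call.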