grelu.data.augment
Functions to augment data. All functions assume that the input is either a numpy array containing an integer-encoded DNA sequence of shape (L,) or a numpy array containing a label of shape (T, L). The augmented output is returned in the same format.
Attributes
Classes
Augmenter | A class that generates augmented DNA sequences or (sequence, label) pairs.
Functions
_split_overall_idx | Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.
shift | Shift a sliding window along a sequence or label by the given number of bases.
rc_seq | Reverse complement a sequence based on the index.
rc_label | Reverse a label based on the index.
Module Contents
- grelu.data.augment._split_overall_idx(idx: int, max_values: List[int]) → List[List[int]]
Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.
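The splitting described above can be read as a mixed-radix decomposition of the index. The following is a minimal sketch of that idea, not the library's code: `split_index` is a hypothetical helper, and it assumes that each maximum value is exclusive and that the last entry of `max_values` varies fastest.

```python
from typing import List


def split_index(idx: int, max_values: List[int]) -> List[int]:
    """Hypothetical helper: decompose idx into one index per entry of max_values.

    Assumption: the i-th result lies in range(max_values[i]) and the last
    position varies fastest (row-major order).
    """
    total = 1
    for m in max_values:
        total *= m
    if not 0 <= idx < total:
        raise IndexError(f"idx must be in [0, {total})")

    parts = []
    # Peel off the fastest-varying position first, then reverse at the end.
    for m in reversed(max_values):
        idx, remainder = divmod(idx, m)
        parts.append(remainder)
    return parts[::-1]


print(split_index(7, [2, 3, 4]))
```

Under these assumptions every value of `idx` from 0 to the product of `max_values` minus 1 maps to a distinct combination of per-augmentation indices.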
- grelu.data.augment.shift(arr: numpy.ndarray, seq_len: int, idx: int) → numpy.ndarray
Shift a sliding window along a sequence or label by the given number of bases.
- Parameters:
arr – Numpy array with length as the last dimension.
seq_len – Desired length for the output sequence.
idx – Start position of the window along the last dimension.
- Returns:
Shifted sequence or label.
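The windowing described above can be illustrated with plain numpy slicing along the last axis. This is a sketch of the documented behaviour rather than the library's implementation: `shift_window` is a hypothetical stand-in, and the bounds check is an addition made here for clarity.

```python
import numpy as np


def shift_window(arr: np.ndarray, seq_len: int, idx: int) -> np.ndarray:
    """Hypothetical stand-in for shift: keep seq_len positions starting at idx."""
    # Bounds check added for this sketch; the library may behave differently.
    if idx < 0 or idx + seq_len > arr.shape[-1]:
        raise ValueError("window falls outside the array")
    # The ellipsis keeps any leading axes, so (L,) and (T, L) inputs both work.
    return arr[..., idx : idx + seq_len]


seq = np.arange(10)                   # stand-in for a sequence of shape (L,)
label = np.arange(20).reshape(2, 10)  # stand-in for a label of shape (T, L)

shifted_seq = shift_window(seq, seq_len=6, idx=2)
shifted_label = shift_window(label, seq_len=6, idx=2)
print(shifted_seq)
print(shifted_label.shape)
```

Because the slice is taken on the last axis only, the same start position selects matching windows from a sequence and from its label.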
- grelu.data.augment.rc_seq(seq: numpy.ndarray, idx: bool) → numpy.ndarray
Reverse complement a sequence based on the index.
- Parameters:
seq – Integer-encoded sequence.
idx – If True, the reverse complement sequence will be returned. If False, the sequence will be returned unchanged.
- Returns:
Same or reverse-complemented sequence.
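A conditional reverse complement on an integer-encoded sequence might look like the sketch below. The encoding A=0, C=1, G=2, T=3, N=4 is an assumption made for this example, and `rc_if` is a hypothetical name, not part of the module.

```python
import numpy as np

# Assumed encoding for this sketch: A=0, C=1, G=2, T=3, N=4.
# Position i of COMPLEMENT holds the code of the base complementary to i.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


def rc_if(seq: np.ndarray, flag: bool) -> np.ndarray:
    """Hypothetical stand-in for rc_seq: reverse complement only when flag is True."""
    if not flag:
        return seq
    # Reverse the order of the bases, then map each base to its complement.
    return COMPLEMENT[seq[::-1]]


seq = np.array([0, 0, 1, 2, 4])  # AACGN under the assumed encoding
unchanged = rc_if(seq, False)
flipped = rc_if(seq, True)       # NCGTT under the assumed encoding
print(flipped)
```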
- grelu.data.augment.rc_label(label: numpy.ndarray, idx: bool) → numpy.ndarray
Reverse a label based on the index.
- Parameters:
label – Numpy array with length as the last dimension
idx – If True, the label will be reversed along the length axis. If False, the label will be returned unchanged.
- Returns:
Same or reversed label
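Reversing a label along its length axis can be sketched with `numpy.flip`. The helper name `reverse_if` is hypothetical, and the example assumes a label of shape (T, L) with length as the last axis, as stated above.

```python
import numpy as np


def reverse_if(label: np.ndarray, flag: bool) -> np.ndarray:
    """Hypothetical stand-in for rc_label: flip the length axis only when flag is True."""
    return np.flip(label, axis=-1) if flag else label


label = np.array([[1, 2, 3],
                  [4, 5, 6]])  # shape (T, L) = (2, 3)
reversed_label = reverse_if(label, True)
same_label = reverse_if(label, False)
print(reversed_label)
```

Only the last axis is flipped, so the rows keep their order while the positions within each row are reversed.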
- class grelu.data.augment.Augmenter(rc: bool = False, max_seq_shift: int = 0, max_pair_shift: int = 0, n_mutated_seqs: int = 0, n_mutated_bases: int | None = None, protect: List[int] = [], seq_len: int | None = None, label_len: int | None = None, seed: int | None = None, mode: str = 'serial')
A class that generates augmented DNA sequences or (sequence, label) pairs.
- Parameters:
rc – If True, augmentation by reverse complementation will be performed.
max_seq_shift – Maximum number of bases by which the sequence alone can be shifted. This is normally a small value (< 10).
max_pair_shift – Maximum number of bases by which the sequence and label can be jointly shifted. This can be a larger value.
n_mutated_seqs – Number of augmented sequences to generate by random mutation.
n_mutated_bases – The number of bases to mutate in each augmented sequence. Only used if n_mutated_seqs is greater than 0.
protect – A list of positions to protect from random mutation. Only used if n_mutated_seqs is greater than 0.
seq_len – Length of the augmented sequences.
label_len – Length of the augmented labels.
seed – Random seed for reproducibility.
mode – Either “random” or “serial”. Defaults to “serial”.
- __len__() → int
The total number of augmented sequences that can be produced from a single DNA sequence.
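One plausible way to arrive at such a total is to multiply the number of options contributed by each augmentation type. The sketch below is an assumption about the counting rule, not a statement of the library's logic. It assumes that reverse complementation doubles the count, that a shift of up to n bases in either direction gives 2n + 1 positions, and that mutation contributes `n_mutated_seqs` options when it is enabled. `n_augmentations` is a hypothetical helper.

```python
def n_augmentations(
    rc: bool = False,
    max_seq_shift: int = 0,
    max_pair_shift: int = 0,
    n_mutated_seqs: int = 0,
) -> int:
    """Hypothetical count of augmented versions per input sequence."""
    total = 2 if rc else 1            # assumed: original plus reverse complement
    total *= 2 * max_seq_shift + 1    # assumed: shifts in both directions plus no shift
    total *= 2 * max_pair_shift + 1   # assumed: same rule for joint shifts
    total *= max(n_mutated_seqs, 1)   # assumed: no multiplier when mutation is off
    return total


print(n_augmentations(rc=True, max_seq_shift=2))
```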
- _split(idx: int) → List[tuple]
Split an input index into indices specifying each type of augmentation.
- _get_random_idxs() → List[tuple]
Randomly select an index for each type of augmentation.
- __call__(idx: int, seq: numpy.ndarray, label: numpy.ndarray | None = None) → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]
Perform augmentation on a given integer-encoded DNA sequence or (sequence, label) pair.
- Parameters:
idx – Index specifying the augmentation to be performed.
seq – A single integer-encoded DNA sequence.
label – A numpy array of shape (T, L) containing the label.
- Returns:
The augmented DNA sequence or (sequence, label) pair if label is supplied.
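To show how a single index could select a combination of augmentations, the sketch below maps an index onto a reverse-complement choice and a shift position for a sequence without a label. It is a toy illustration under stated assumptions, not the class's implementation. `toy_augment` is hypothetical, the encoding A=0, C=1, G=2, T=3, N=4 is assumed, the shift is applied before the reverse complement, and the input is assumed to be `2 * max_seq_shift` bases longer than `seq_len`.

```python
import numpy as np

# Assumed encoding for this sketch: A=0, C=1, G=2, T=3, N=4.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


def toy_augment(idx: int, seq: np.ndarray, seq_len: int,
                rc: bool = True, max_seq_shift: int = 1) -> np.ndarray:
    """Hypothetical index-driven augmentation of a single sequence."""
    n_shifts = 2 * max_seq_shift + 1
    n_rc = 2 if rc else 1
    # Assumed layout: the shift index varies fastest, the rc index slowest.
    rc_idx, shift_idx = divmod(idx, n_shifts)
    if not 0 <= rc_idx < n_rc:
        raise IndexError(f"idx must be in [0, {n_rc * n_shifts})")

    out = seq[shift_idx : shift_idx + seq_len]  # choose the window
    if rc_idx:
        out = COMPLEMENT[out[::-1]]             # reverse complement the window
    return out


seq = np.array([0, 1, 2, 3, 0, 1])  # ACGTAC, i.e. seq_len + 2 * max_seq_shift bases
shifted_only = toy_augment(1, seq, seq_len=4)     # CGTA
shifted_and_rc = toy_augment(4, seq, seq_len=4)   # reverse complement of CGTA
print(shifted_only, shifted_and_rc)
```

Iterating `idx` over the full range visits every combination once, which matches the idea of a serial mode. A random mode would instead draw the two indices independently.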