grelu.data.augment

Functions to augment data. All functions assume that the input is either a NumPy array containing an integer-encoded DNA sequence of shape (L,) or a NumPy array containing a label of shape (T, L), where the last dimension is the length. The augmented output is returned in the same format.

Attributes

AUGMENTATION_MULTIPLIER_FUNCS

Classes

Augmenter

A class that generates augmented DNA sequences or (sequence, label) pairs.

Functions

_get_multipliers(→ List[int])

_split_overall_idx(→ List[List[int]])

Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.

shift(→ numpy.ndarray)

Shift a sliding window along a sequence or label by the given number of bases.

rc_seq(→ numpy.ndarray)

Reverse complement a sequence if the index flag is True.

rc_label(→ numpy.ndarray)

Reverse a label along the length axis if the index flag is True.

Module Contents

grelu.data.augment.AUGMENTATION_MULTIPLIER_FUNCS

grelu.data.augment._get_multipliers(**kwargs) → List[int]

grelu.data.augment._split_overall_idx(idx: int, max_values: List[int]) → List[List[int]]

Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value
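The index splitting described above can be illustrated with a minimal sketch. This is not the library's implementation: `split_index_sketch` is a hypothetical name, it assumes each entry of `sizes` is the number of choices for one augmentation type (so each output lies in `range(size)`), and it returns a flat list rather than the `List[List[int]]` shown in the signature.

```python
from typing import List, Sequence

import numpy as np


def split_index_sketch(idx: int, sizes: Sequence[int]) -> List[int]:
    """Decompose one flat index into one index per augmentation type.

    Assumption: sizes[i] is the number of choices for type i, so the
    i-th returned index lies in range(sizes[i]).
    """
    # np.unravel_index converts a flat index into per-axis coordinates
    # for an array of the given shape (row-major order by default).
    return [int(i) for i in np.unravel_index(idx, tuple(sizes))]


# With 2 choices for the first type and 3 for the second, flat index 5
# is the last combination: 5 == 1 * 3 + 2.
parts = split_index_sketch(5, [2, 3])
```

Every flat index from 0 to the product of the sizes minus 1 maps to a distinct combination, which is what allows a single integer to enumerate all augmentations.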

grelu.data.augment.shift(arr: numpy.ndarray, seq_len: int, idx: int) → numpy.ndarray

Shift a sliding window along a sequence or label by the given number of bases.

Parameters:
  • arr – Numpy array with length as the last dimension.

  • seq_len – Desired length for the output sequence.

  • idx – Start position of the window along the last dimension of arr.

Returns:

Shifted sequence or label
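As a rough illustration of the sliding-window behaviour, the sketch below slices a window of `seq_len` positions starting at `idx` along the last dimension. `shift_sketch` is a hypothetical stand-in written for this example; the actual function may handle edge cases differently.

```python
import numpy as np


def shift_sketch(arr: np.ndarray, seq_len: int, idx: int) -> np.ndarray:
    """Return a window of seq_len positions starting at idx.

    The ellipsis keeps all leading dimensions, so the same code works
    for a sequence of shape (L,) and a label of shape (T, L).
    """
    return arr[..., idx : idx + seq_len]


seq = np.arange(10)                    # stand-in for an encoded sequence, shape (10,)
label = np.arange(20).reshape(2, 10)   # stand-in for a label, shape (2, 10)

seq_window = shift_sketch(seq, seq_len=4, idx=3)
label_window = shift_sketch(label, seq_len=4, idx=3)
```

Using the same `idx` for the sequence and the label keeps the two aligned, which is the idea behind shifting them jointly.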

grelu.data.augment.rc_seq(seq: numpy.ndarray, idx: bool) → numpy.ndarray

Reverse complement a sequence if idx is True; otherwise return it unchanged.

Parameters:
  • seq – Integer-encoded sequence.

  • idx – If True, the reverse complement sequence will be returned. If False, the sequence will be returned unchanged.

Returns:

Same or reverse complemented sequence
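The following sketch shows one way reverse complementation of an integer-encoded sequence could work. The encoding A=0, C=1, G=2, T=3, N=4 is an assumption made for this example and is not stated on this page; `rc_seq_sketch` and `COMPLEMENT` are hypothetical names.

```python
import numpy as np

# Assumed encoding (not confirmed by this page): A=0, C=1, G=2, T=3, N=4.
# COMPLEMENT[i] is the code of the base complementary to code i.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


def rc_seq_sketch(seq: np.ndarray, flag: bool) -> np.ndarray:
    """Reverse complement seq if flag is True, else return it unchanged."""
    if not flag:
        return seq
    # Look up the complement of every base, then reverse the order.
    return COMPLEMENT[seq][::-1]


aacg = np.array([0, 0, 1, 2])            # "AACG" under the assumed encoding
rc = rc_seq_sketch(aacg, True)           # "CGTT" under the assumed encoding
unchanged = rc_seq_sketch(aacg, False)
```

Applying the operation twice returns the original sequence, which is a convenient sanity check for any complement table.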

grelu.data.augment.rc_label(label: numpy.ndarray, idx: bool) → numpy.ndarray

Reverse a label along the length axis if idx is True; otherwise return it unchanged.

Parameters:
  • label – Numpy array with length as the last dimension

  • idx – If True, the label will be reversed along the length axis. If False, the label will be returned unchanged.

Returns:

Same or reversed label
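A label has no bases to complement, so the counterpart of reverse complementation is a plain reversal along the length axis. A minimal sketch, using the hypothetical name `rc_label_sketch`:

```python
import numpy as np


def rc_label_sketch(label: np.ndarray, flag: bool) -> np.ndarray:
    """Reverse label along its last (length) axis if flag is True."""
    return np.flip(label, axis=-1) if flag else label


label = np.array([[1, 2, 3],
                  [4, 5, 6]])            # shape (T=2, L=3)
reversed_label = rc_label_sketch(label, True)
same_label = rc_label_sketch(label, False)
```

Only the last axis is flipped, so each row of the label stays in place while its positions are reversed.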

class grelu.data.augment.Augmenter(rc: bool = False, max_seq_shift: int = 0, max_pair_shift: int = 0, n_mutated_seqs: int = 0, n_mutated_bases: int | None = None, protect: List[int] = [], seq_len: int | None = None, label_len: int | None = None, seed: int | None = None, mode: str = 'serial')

A class that generates augmented DNA sequences or (sequence, label) pairs.

Parameters:
  • rc – If True, augmentation by reverse complementation will be performed.

  • max_seq_shift – Maximum number of bases by which the sequence alone can be shifted. This is normally a small value (< 10).

  • max_pair_shift – Maximum number of bases by which the sequence and label can be jointly shifted. This can be a larger value.

  • n_mutated_seqs – Number of augmented sequences to generate by random mutation

  • n_mutated_bases – The number of bases to mutate in each augmented sequence. Only used if n_mutated_seqs is greater than 0.

  • protect – A list of positions to protect from random mutation. Only used if n_mutated_seqs is greater than 0.

  • seq_len – Length of the augmented sequences

  • label_len – Length of the augmented labels

  • seed – Random seed for reproducibility.

  • mode – Either “random” or “serial” (the default).

__len__() → int

The total number of augmented sequences that can be produced from a single DNA sequence
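To make the count concrete, the sketch below multiplies one factor per augmentation type. The individual factors are assumptions, not taken from this page: reverse complementation is assumed to double the count, each shift is assumed to allow offsets in both directions (giving 2 * max + 1 positions), and mutation is assumed to contribute `n_mutated_seqs` variants, or 1 when it is disabled. `count_augmentations_sketch` is a hypothetical name.

```python
from math import prod


def count_augmentations_sketch(
    rc: bool = False,
    max_seq_shift: int = 0,
    max_pair_shift: int = 0,
    n_mutated_seqs: int = 0,
) -> int:
    """Count augmented versions of one sequence under assumed multipliers."""
    multipliers = [
        2 if rc else 1,            # assumed: original + reverse complement
        2 * max_seq_shift + 1,     # assumed: offsets -max..+max
        2 * max_pair_shift + 1,    # assumed: offsets -max..+max
        max(n_mutated_seqs, 1),    # assumed: at least one (unmutated) variant
    ]
    return prod(multipliers)


no_augmentation = count_augmentations_sketch()
rc_and_shift = count_augmentations_sketch(rc=True, max_seq_shift=1)
```

With every option left at its default, each factor is 1, so a single sequence yields a single output.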

_split(idx: int) → List[tuple]

Function to split an input index into indices specifying each type of augmentation

_get_random_idxs() → List[tuple]

Function to select indices for each type of augmentation randomly

__call__(idx: int, seq: numpy.ndarray, label: numpy.ndarray | None = None) → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]

Perform augmentation on a given integer-encoded DNA sequence or (sequence, label) pair

Parameters:
  • idx – Index specifying the augmentation to be performed.

  • seq – A single integer-encoded DNA sequence of shape (L,)

  • label – A numpy array of shape (T, L) containing the label

Returns:

The augmented DNA sequence, or an augmented (sequence, label) pair if label is supplied.