grelu.data.augment

grelu.data.augment contains functions to augment genomic sequences or functional genomic data.

All functions assume that the input is either:

  1. a 1-D numpy array of shape (length,) containing an integer-encoded DNA sequence, or

  2. a 2-D numpy array of shape (tasks, length) containing a label.

The augmented output must be returned in the same format as the input. All augmentation functions also require an index (idx), an integer or boolean value that specifies the augmentation to perform.
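As a minimal illustration of the two input formats, the sketch below builds an integer-encoded sequence and a label array with plain numpy. The A=0, C=1, G=2, T=3 mapping and the helper name `encode` are assumptions made for this example and are not taken from this page.

```python
import numpy as np

# Assumed base-to-integer mapping, for illustration only.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(dna: str) -> np.ndarray:
    """Hypothetical helper: integer-encode a DNA string."""
    return np.array([BASE_TO_INT[base] for base in dna], dtype=np.int8)


# Format 1: a 1-D array of shape (length,)
seq = encode("ACGTAC")

# Format 2: a 2-D array of shape (tasks, length); here 2 tasks over 6 positions
label = np.zeros((2, 6), dtype=np.float32)

print(seq.shape, label.shape)
```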

This module also contains the Augmenter class which is responsible for applying multiple augmentations to a given DNA sequence or (sequence, label) pair.

Attributes

AUGMENTATION_MULTIPLIER_FUNCS

Classes

Augmenter

A class that generates augmented DNA sequences or (sequence, label) pairs.

Functions

_get_multipliers(→ List[int])

_split_overall_idx(→ List[List[int]])

Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.

shift(→ numpy.ndarray)

Shift a sliding window along a sequence or label by the given number of bases.

rc_seq(→ numpy.ndarray)

Reverse complement a sequence based on the index.

rc_label(→ numpy.ndarray)

Reverse a label based on the index.

Module Contents

grelu.data.augment.AUGMENTATION_MULTIPLIER_FUNCS

grelu.data.augment._get_multipliers(**kwargs) → List[int]

grelu.data.augment._split_overall_idx(idx: int, max_values: List[int]) → List[List[int]]

Given an integer index, split it into multiple indices, each ranging from 0 to a specified maximum value.
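One way to picture this splitting is as a mixed-radix decomposition of the overall index. The sketch below illustrates that idea and is not the library's implementation; the name `split_index` and the choice to let the last component vary fastest are hypothetical.

```python
from typing import List


def split_index(idx: int, max_values: List[int]) -> List[int]:
    """Hypothetical sketch: decompose idx into one index per entry of
    max_values, where component i lies in range(max_values[i])."""
    total = 1
    for m in max_values:
        total *= m
    if not 0 <= idx < total:
        raise IndexError(f"idx must be in [0, {total}), got {idx}")

    parts = []
    # Peel off the fastest-varying (last) component first.
    for m in reversed(max_values):
        idx, remainder = divmod(idx, m)
        parts.append(remainder)
    return parts[::-1]


# Every overall index maps to a distinct combination of sub-indices.
combos = [split_index(i, [2, 3]) for i in range(6)]
print(combos)
```

Under these assumptions, each overall index in range(2 * 3) yields a different pair of sub-indices, so iterating over the overall index visits every combination once.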

grelu.data.augment.shift(arr: numpy.ndarray, seq_len: int, idx: int) → numpy.ndarray

Shift a sliding window along a sequence or label by the given number of bases.

Parameters:
  • arr – Numpy array with length as the last dimension.

  • seq_len – Desired length for the output sequence.

  • idx – Start position of the window within arr.

Returns:

Shifted sequence or label
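The parameters above suggest a sliding window taken along the last axis. The following is a minimal sketch under that assumption; `shift_window` is a hypothetical stand-in written with plain numpy slicing, not the library function itself.

```python
import numpy as np


def shift_window(arr: np.ndarray, seq_len: int, idx: int) -> np.ndarray:
    """Hypothetical sketch: return seq_len positions of arr, starting at
    idx, along the last (length) axis."""
    return arr[..., idx : idx + seq_len]


seq = np.arange(10)                  # 1-D sequence of shape (10,)
label = np.arange(12).reshape(2, 6)  # 2-D label of shape (2, 6)

shifted_seq = shift_window(seq, seq_len=4, idx=3)
shifted_label = shift_window(label, seq_len=3, idx=2)
print(shifted_seq.shape, shifted_label.shape)
```

Slicing with `...` lets the same function handle both the 1-D sequence and the 2-D label, since only the last axis is cropped.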

grelu.data.augment.rc_seq(seq: numpy.ndarray, idx: bool) → numpy.ndarray

Reverse complement a sequence based on the index.

Parameters:
  • seq – Integer-encoded sequence.

  • idx – If True, the reverse complement sequence will be returned. If False, the sequence will be returned unchanged.

Returns:

Same or reverse complemented sequence
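To make the behaviour concrete, here is a small sketch of reverse complementing an integer-encoded sequence. It assumes the encoding A=0, C=1, G=2, T=3, N=4, which this page does not state; `reverse_complement` and `COMPLEMENT` are hypothetical names.

```python
import numpy as np

# Assumed encoding: A=0, C=1, G=2, T=3, N=4.
# Lookup table mapping each code to the code of its complement.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


def reverse_complement(seq: np.ndarray, idx: bool) -> np.ndarray:
    """Hypothetical sketch: return the reverse complement if idx is True,
    otherwise return seq unchanged."""
    if not idx:
        return seq
    # Reverse the sequence, then swap each base for its complement.
    return COMPLEMENT[seq[::-1]]


seq = np.array([0, 0, 1, 2])  # "AACG" under the assumed encoding
rc = reverse_complement(seq, True)
same = reverse_complement(seq, False)
print(rc, same)
```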

grelu.data.augment.rc_label(label: numpy.ndarray, idx: bool) → numpy.ndarray

Reverse a label based on the index.

Parameters:
  • label – Numpy array with length as the last dimension

  • idx – If True, the label will be reversed along the length axis. If False, the label will be returned unchanged.

Returns:

Same or reversed label
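Reversing a label along its length axis keeps it aligned with a reverse-complemented sequence. The sketch below shows that operation with `numpy.flip`; `reverse_label` is a hypothetical name used only for illustration.

```python
import numpy as np


def reverse_label(label: np.ndarray, idx: bool) -> np.ndarray:
    """Hypothetical sketch: flip the label along its last (length) axis
    if idx is True, otherwise return it unchanged."""
    return np.flip(label, axis=-1) if idx else label


label = np.array([[1, 2, 3],
                  [4, 5, 6]])  # shape (tasks, length) = (2, 3)
reversed_label = reverse_label(label, True)
unchanged_label = reverse_label(label, False)
print(reversed_label)
```

Only the length axis is flipped, so each task keeps its own row.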

class grelu.data.augment.Augmenter(rc: bool = False, max_seq_shift: int = 0, max_pair_shift: int = 0, n_mutated_seqs: int = 0, n_mutated_bases: int | None = None, protect: List[int] = [], seq_len: int | None = None, label_len: int | None = None, seed: int | None = None, mode: str = 'serial')

A class that generates augmented DNA sequences or (sequence, label) pairs.

Parameters:
  • rc – If True, augmentation by reverse complementation will be performed.

  • max_seq_shift – Maximum number of bases by which the sequence alone can be shifted. This is normally a small value (< 10).

  • max_pair_shift – Maximum number of bases by which the sequence and label can be jointly shifted. This can be a larger value.

  • n_mutated_seqs – Number of augmented sequences to generate by random mutation.

  • n_mutated_bases – The number of bases to mutate in each augmented sequence. Only used if n_mutated_seqs is greater than 0.

  • protect – A list of positions to protect from random mutation. Only used if n_mutated_seqs is greater than 0.

  • seq_len – Length of the augmented sequences.

  • label_len – Length of the augmented labels.

  • seed – Random seed for reproducibility.

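  • mode – “random” or “serial”. In “serial” mode, the idx passed to the augmenter determines which augmentation is applied; in “random” mode, the indices for each type of augmentation are selected randomly.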

protect = []
seq_len = None
label_len = None
n_mutated_bases = None
rc = False
max_seq_shift = 0
max_pair_shift = 0
n_mutated_seqs = 0
shift_label
shift_seq
mutate
max_values
products
mode = 'serial'
rng
__len__() → int

The total number of augmented sequences that can be produced from a single DNA sequence.
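If each augmentation type contributes an independent number of options, the total is the product of those counts. The sketch below works through that arithmetic under stated assumptions: a maximum shift of k bases allows 2k + 1 positions, reverse complementation doubles the count, and n_mutated_seqs = 0 contributes a factor of 1. These per-augmentation multipliers and the name `count_augmentations` are assumptions, not confirmed by this page.

```python
import math


def count_augmentations(
    rc: bool = False,
    max_seq_shift: int = 0,
    max_pair_shift: int = 0,
    n_mutated_seqs: int = 0,
) -> int:
    """Hypothetical sketch: multiply the assumed number of options per
    augmentation type to get the total number of augmented versions."""
    multipliers = [
        2 if rc else 1,          # forward strand, plus reverse complement
        2 * max_seq_shift + 1,   # assumed: shifts from -k to +k inclusive
        2 * max_pair_shift + 1,  # assumed: shifts from -k to +k inclusive
        max(n_mutated_seqs, 1),  # assumed: no mutation still counts once
    ]
    return math.prod(multipliers)


total = count_augmentations(rc=True, max_seq_shift=1, max_pair_shift=2)
print(total)
```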

_split(idx: int) → List[tuple]

Split an input index into indices specifying each type of augmentation.

_get_random_idxs() → List[tuple]

Randomly select indices for each type of augmentation.
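Random selection can be pictured as drawing one index per augmentation type from a seeded generator. This is a sketch under that assumption only; `random_indices` is a hypothetical helper built on `numpy.random.default_rng`, and the max_values used are made up for the example.

```python
import numpy as np
from typing import List, Tuple


def random_indices(max_values: List[int], rng: np.random.Generator) -> Tuple[int, ...]:
    """Hypothetical sketch: draw one index per augmentation type, with
    component i sampled uniformly from range(max_values[i])."""
    return tuple(int(rng.integers(m)) for m in max_values)


max_values = [2, 3, 5]  # illustrative number of options per augmentation type

# Seeding the generator makes the draws reproducible.
first = random_indices(max_values, np.random.default_rng(0))
second = random_indices(max_values, np.random.default_rng(0))
print(first)
```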

__call__(idx: int, seq: numpy.ndarray, label: numpy.ndarray | None = None) → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]

Perform augmentation on a given integer-encoded DNA sequence or (sequence, label) pair.

Parameters:
  • idx – Index specifying the augmentation to be performed.

  • seq – A single integer-encoded DNA sequence.

  • label – A numpy array of shape (T, L), i.e. (tasks, length), containing the label.

Returns:

The augmented DNA sequence or (sequence, label) pair if label is supplied.
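To show how a single index can drive several augmentations of a (sequence, label) pair, here is a simplified sketch that combines a joint shift with an optional reverse complement. It is not the Augmenter implementation: `augment_pair`, the A=0, C=1, G=2, T=3, N=4 encoding, the equal sequence and label lengths, and the ordering of the sub-indices are all assumptions made for illustration.

```python
import numpy as np

# Assumed encoding: A=0, C=1, G=2, T=3, N=4.
COMPLEMENT = np.array([3, 2, 1, 0, 4])


def augment_pair(idx: int, seq: np.ndarray, label: np.ndarray, out_len: int):
    """Hypothetical sketch: derive a shift index and a reverse-complement
    flag from idx, then apply both to the sequence and the label."""
    n_shifts = seq.shape[-1] - out_len + 1  # possible window start positions
    n_rc = 2                                # forward or reverse complement
    if not 0 <= idx < n_shifts * n_rc:
        raise IndexError(f"idx must be in [0, {n_shifts * n_rc})")

    # Split the overall index into (shift index, reverse-complement flag).
    shift_idx, rc_idx = divmod(idx, n_rc)

    # Shift sequence and label together so they stay aligned.
    seq = seq[shift_idx : shift_idx + out_len]
    label = label[..., shift_idx : shift_idx + out_len]

    if rc_idx:
        seq = COMPLEMENT[seq[::-1]]  # reverse complement the sequence
        label = label[..., ::-1]     # reverse the label along its length
    return seq, label


seq = np.array([0, 1, 2, 3, 0, 1])   # shape (6,)
label = np.arange(12).reshape(2, 6)  # shape (2, 6)

aug_seq, aug_label = augment_pair(3, seq, label, out_len=4)
print(aug_seq, aug_label.shape)
```

In this sketch, index 3 selects the window starting at position 1 and applies the reverse complement, so the sequence and label are cropped and flipped together.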