grelu.sequence.utils

General utilities for analysis of DNA sequences.

Attributes

Functions

get_lengths(→ Union[int, List[int]])

Given DNA sequences, return their lengths.

check_equal_lengths(→ bool)

Given DNA sequences, check whether they are all of equal length.

get_unique_length(→ int)

Check if given sequences are all of equal length and if so, return the length.

pad(→ Union[str, List[str], numpy.ndarray])

Pad the input DNA sequence(s) with Ns at the desired end to reach seq_len.

trim(→ Union[str, List[str], numpy.ndarray])

Trim DNA sequences to reach the desired length (seq_len).

resize(→ Union[str, List[str], numpy.ndarray])

Resize the given sequences to the desired length (seq_len).

reverse_complement(→ Union[str, List[str], numpy.ndarray])

Reverse complement input DNA sequences.

dinuc_shuffle(seqs[, n_shuffles, start, end, ...])

Dinucleotide shuffle the given sequences.

generate_random_sequences(→ Union[str, List[str], ...)

Generate random DNA sequences as strings or batches.

Module Contents

grelu.sequence.utils.RC_HASH: Dict[str, str]

grelu.sequence.utils.get_lengths(seqs: pandas.DataFrame | str | List[str], first_only: bool = False, input_type: str | None = None) → int | List[int]

Given DNA sequences, return their lengths.

Parameters:
  • seqs – DNA sequences as strings or genomic intervals

  • first_only – If True, return only the length of the first sequence. If False and multiple sequences are supplied, return a list with the length of each sequence.

  • input_type – Format of the input sequence. Accepted values are “intervals” or “strings”.

Returns:

The length of each sequence

Raises:

ValueError – if the input is not in interval or string format.
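The behavior described above can be illustrated with a minimal pure-Python sketch. This is not the gReLU implementation: `get_lengths_sketch` is a hypothetical name, intervals are represented here as plain `(start, end)` tuples instead of a dataframe, and the interval length is assumed to be `end - start`.

```python
def get_lengths_sketch(seqs, first_only=False):
    """Return lengths of DNA strings or (start, end) intervals.

    Hypothetical stand-in for illustration only; interval length is
    ASSUMED to be end - start.
    """
    if isinstance(seqs, str):
        # A single string yields a single integer.
        return len(seqs)
    lengths = []
    for item in seqs:
        if isinstance(item, str):
            lengths.append(len(item))
        elif isinstance(item, tuple) and len(item) == 2:
            start, end = item
            lengths.append(end - start)
        else:
            raise ValueError("input is not in interval or string format")
    return lengths[0] if first_only else lengths


print(get_lengths_sketch(["ACGT", "AC"]))                   # all lengths
print(get_lengths_sketch(["ACGT", "AC"], first_only=True))  # first only
```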

grelu.sequence.utils.check_equal_lengths(seqs: pandas.DataFrame | List[str]) → bool

Given DNA sequences, check whether they are all of equal length.

Parameters:

seqs – DNA sequences as a list of strings or a dataframe of genomic intervals

Returns:

If the sequences are all of equal length, returns True.

Otherwise, returns False.

Raises:

ValueError – if the input is not in interval or string format.

grelu.sequence.utils.get_unique_length(seqs: pandas.DataFrame | List[str] | numpy.ndarray | torch.Tensor) → int

Check if given sequences are all of equal length and if so, return the length.

Parameters:

seqs – DNA sequences or genomic intervals of equal length

Returns:

The fixed length of all the input sequences.

Raises:

ValueError – if the input is not in interval or string format.
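A rough sketch of how the two length checks relate, for string inputs only. The function names are hypothetical, and raising ValueError when lengths differ is a choice made for this sketch; the source only documents ValueError for unsupported input formats.

```python
def check_equal_lengths_sketch(seqs):
    """Return True if every sequence has the same length."""
    return len({len(s) for s in seqs}) == 1


def get_unique_length_sketch(seqs):
    """Return the shared length, or fail if the lengths differ.

    ASSUMPTION: unequal lengths raise ValueError in this sketch.
    """
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("sequences are not all of equal length")
    return lengths.pop()


print(check_equal_lengths_sketch(["ACGT", "TTTT"]))
print(get_unique_length_sketch(["ACGT", "TTTT"]))
```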

grelu.sequence.utils.pad(seqs: str | List[str] | numpy.ndarray, seq_len: int | None, end: str = 'both', input_type: str | None = None) → str | List[str] | numpy.ndarray

Pad the input DNA sequence(s) with Ns at the desired end to reach seq_len. If seq_len is not provided, it is set to the length of the longest sequence.

Parameters:
  • seqs – DNA sequences as strings or in index encoded format

  • seq_len – Desired sequence length to pad to

  • end – Which end of the sequence to pad. Accepted values are “left”, “right” and “both”.

  • input_type – Format of the input sequences. Accepted values are “strings” or “indices”.

Returns:

Padded sequences of length seq_len.

Raises:

ValueError – If the input is not in string or integer encoded format.
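The padding rules can be sketched for string inputs as follows. The names are hypothetical, and how an odd number of Ns is split when end is "both" is an assumption of this sketch (the extra N goes on the right); the source does not specify it.

```python
def pad_sketch(seq, seq_len, end="both"):
    """Pad one DNA string with Ns up to seq_len (illustration only)."""
    extra = seq_len - len(seq)
    if extra <= 0:
        return seq  # already long enough; padding never shortens
    if end == "left":
        return "N" * extra + seq
    if end == "right":
        return seq + "N" * extra
    if end == "both":
        left = extra // 2  # ASSUMPTION: any odd N goes on the right
        return "N" * left + seq + "N" * (extra - left)
    raise ValueError("end must be 'left', 'right' or 'both'")


def pad_all_sketch(seqs, seq_len=None, end="both"):
    """Pad a list of strings; default seq_len is the longest input."""
    if seq_len is None:
        seq_len = max(len(s) for s in seqs)
    return [pad_sketch(s, seq_len, end) for s in seqs]


print(pad_all_sketch(["AC", "ACGT"], end="right"))
```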

grelu.sequence.utils.trim(seqs: str | List[str] | numpy.ndarray, seq_len: int | None = None, end: str = 'both', input_type: str | None = None) → str | List[str] | numpy.ndarray

Trim DNA sequences to reach the desired length (seq_len). If seq_len is not provided, it is set to the length of the shortest sequence.

Parameters:
  • seqs – DNA sequences as strings or in index encoded format

  • seq_len – Desired sequence length to trim to

  • end – Which end of the sequence to trim. Accepted values are “left”, “right” and “both”.

  • input_type – Format of the input sequences. Accepted values are “strings” or “indices”.

Returns:

Trimmed sequences of length seq_len.

Raises:

ValueError – if the input is not in string or integer encoded format.
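Trimming mirrors padding. The sketch below covers string inputs only, uses hypothetical names, and assumes that with end set to "both" an odd surplus removes the extra base from the right; the source does not state how the surplus is split.

```python
def trim_sketch(seq, seq_len, end="both"):
    """Trim one DNA string down to seq_len (illustration only)."""
    extra = len(seq) - seq_len
    if extra <= 0:
        return seq  # already short enough; trimming never lengthens
    if end == "left":
        return seq[extra:]
    if end == "right":
        return seq[:seq_len]
    if end == "both":
        start = extra // 2  # ASSUMPTION: any odd base comes off the right
        return seq[start:start + seq_len]
    raise ValueError("end must be 'left', 'right' or 'both'")


def trim_all_sketch(seqs, seq_len=None, end="both"):
    """Trim a list of strings; default seq_len is the shortest input."""
    if seq_len is None:
        seq_len = min(len(s) for s in seqs)
    return [trim_sketch(s, seq_len, end) for s in seqs]


print(trim_all_sketch(["AACGTT", "ACGT"]))
```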

grelu.sequence.utils.resize(seqs: str | List[str] | numpy.ndarray, seq_len: int, end: str = 'both', input_type: str | None = None) → str | List[str] | numpy.ndarray

Resize the given sequences to the desired length (seq_len). Sequences shorter than seq_len will be padded with Ns. Sequences longer than seq_len will be trimmed.

Parameters:
  • seqs – DNA sequences as intervals, strings, or integer encoded format

  • seq_len – Desired length of output sequences.

  • end – Which end of the sequence to trim or extend. Accepted values are “left”, “right” or “both”.

  • input_type – Format of the input sequences. Accepted values are “intervals”, “strings” or “indices”.

Returns:

Resized sequences in the same format

Raises:

ValueError – if input sequences are not in interval, string or integer encoded format
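Resizing combines the two operations: pad when the sequence is too short, trim when it is too long. A compact sketch for a single string, under the same assumptions as before (hypothetical name, odd remainders handled on the right when end is "both"):

```python
def resize_sketch(seq, seq_len, end="both"):
    """Pad with Ns or trim so the result has exactly seq_len bases."""
    diff = seq_len - len(seq)
    if diff == 0:
        return seq
    n = abs(diff)  # number of bases to add or remove
    if end == "left":
        left = n
    elif end == "right":
        left = 0
    elif end == "both":
        left = n // 2  # ASSUMPTION: odd remainder handled on the right
    else:
        raise ValueError("end must be 'left', 'right' or 'both'")
    right = n - left
    if diff > 0:
        return "N" * left + seq + "N" * right  # too short: pad
    return seq[left:len(seq) - right]          # too long: trim


for s in ["ACG", "AACGTT", "ACGT"]:
    print(s, "->", resize_sketch(s, 4))
```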

grelu.sequence.utils.reverse_complement(seqs: [str, List[str], numpy.ndarray], input_type: str | None = None) → str | List[str] | numpy.ndarray

Reverse complement input DNA sequences.

Parameters:
  • seqs – DNA sequences as strings or index encoding

  • input_type – Format of the input sequences. Accepted values are “strings” or “indices”.

Returns:

Reverse complemented sequences in the same format as the input.

Raises:

ValueError – If the input DNA sequence is not in string or index encoded format.
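To make the operation concrete, here is a small sketch for both string and index inputs. The `COMPLEMENT` table is a hypothetical stand-in (the contents of RC_HASH are not shown in the source), and the index encoding A=0, C=1, G=2, T=3, N=4 is an assumption made for this example.

```python
# Hypothetical complement table; RC_HASH's actual contents are not
# shown in the source.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement_sketch(seq):
    """Reverse complement an uppercase DNA string."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(seq))
    except KeyError as err:
        raise ValueError(f"unsupported base: {err}") from err


def reverse_complement_indices_sketch(indices):
    """Reverse complement a list of integer-encoded bases.

    ASSUMPTION: A=0, C=1, G=2, T=3, N=4, so the complement of a base
    index i < 4 is 3 - i and N stays unchanged.
    """
    return [3 - i if i < 4 else i for i in reversed(indices)]


print(reverse_complement_sketch("AACG"))
print(reverse_complement_indices_sketch([0, 0, 1, 2]))
```

Applying either function twice returns the original sequence, which is a quick sanity check for any reverse-complement routine.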

grelu.sequence.utils.dinuc_shuffle(seqs: pandas.DataFrame | numpy.ndarray | List[str], n_shuffles: int = 1, start=0, end=-1, input_type: str | None = None, seed: int | None = None, genome: str | None = None)

Dinucleotide shuffle the given sequences.

Parameters:
  • seqs – Sequences to shuffle, as a dataframe of genomic intervals, a list of strings, or a numpy array

  • n_shuffles – Number of times to shuffle each sequence

  • input_type – Format of the input sequence. Accepted values are “strings”, “indices” and “one_hot”.

  • seed – Random seed

  • genome – Name of the genome to use if genomic intervals are supplied.

Returns:

Shuffled sequences in the same format as the input
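A dinucleotide shuffle reorders a sequence while keeping the count of every adjacent base pair unchanged. The source does not say which algorithm gReLU uses, so the sketch below shows one common approach under a hypothetical name: for each base, collect the positions that follow its occurrences, shuffle all but the last of them, and then walk through the sequence following those shuffled successors. It handles a single string and ignores the start, end and n_shuffles options.

```python
import random
from collections import Counter


def dinuc_counts(seq):
    """Count every overlapping pair of adjacent bases."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def dinuc_shuffle_sketch(seq, seed=None):
    """Shuffle a string while preserving its dinucleotide counts."""
    rng = random.Random(seed)
    if len(seq) < 3:
        return seq  # nothing can move without changing the pairs
    # successors[base] lists the positions that follow each occurrence
    # of that base, in their original order.
    successors = {}
    for i in range(len(seq) - 1):
        successors.setdefault(seq[i], []).append(i + 1)
    # Shuffle every successor except the last one for each base;
    # keeping the last one fixed guarantees the walk uses every pair.
    for positions in successors.values():
        head = positions[:-1]
        rng.shuffle(head)
        positions[:-1] = head
    used = {base: 0 for base in successors}
    pos = 0
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        base = seq[pos]
        pos = successors[base][used[base]]
        used[base] += 1
        out.append(seq[pos])
    return "".join(out)


original = "ACGTACGGTTACGATCGA"
shuffled = dinuc_shuffle_sketch(original, seed=0)
print(dinuc_counts(shuffled) == dinuc_counts(original))
```

The first and last bases stay in place with this method, and passing the same seed reproduces the same shuffle.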

grelu.sequence.utils.generate_random_sequences(seq_len: int, n: int = 1, seed: int | None = None, output_format: str = 'indices') → str | List[str] | numpy.ndarray | torch.Tensor

Generate random DNA sequences as strings or batches.

Parameters:
  • seq_len – Length of each output sequence.

  • n – Number of random sequences to generate.

  • seed – Seed value for random number generator.

  • output_format – Format in which the output should be returned. Accepted values are “strings”, “indices” and “one_hot”.

Returns:

A list of generated sequences.
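As a rough illustration of seeded generation, the sketch below draws bases uniformly and returns them as strings or as lists of integer indices. The function name is hypothetical, the A=0, C=1, G=2, T=3 encoding and the uniform base frequencies are assumptions, and the "one_hot" format is left out to keep the example dependency-free.

```python
import random

BASES = "ACGT"  # ASSUMED index encoding: A=0, C=1, G=2, T=3


def generate_random_sequences_sketch(seq_len, n=1, seed=None,
                                     output_format="indices"):
    """Generate n random DNA sequences of length seq_len."""
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    indices = [[rng.randrange(4) for _ in range(seq_len)]
               for _ in range(n)]
    if output_format == "indices":
        return indices
    if output_format == "strings":
        return ["".join(BASES[i] for i in row) for row in indices]
    raise ValueError("output_format must be 'indices' or 'strings'")


seqs = generate_random_sequences_sketch(8, n=3, seed=0,
                                        output_format="strings")
print(seqs)
```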