grelu.sequence.utils

grelu.sequence.utils contains general utilities for the analysis of DNA sequences.

Attributes

RC_HASH

Functions

`get_lengths`(→ Union[int, List[int]])	Given DNA sequences, return their lengths.
`check_equal_lengths`(→ bool)	Given DNA sequences, check whether they are all of equal length.
`get_unique_length`(→ int)	Check if given sequences are all of equal length and if so, return the length.
`pad`(→ Union[str, List[str], numpy.ndarray])	Pad the input DNA sequence(s) with Ns at the desired end to reach seq_len.
`trim`(→ Union[str, List[str], numpy.ndarray])	Trim DNA sequences to reach the desired length (seq_len).
`resize`(→ Union[str, List[str], numpy.ndarray])	Resize the given sequences to the desired length (seq_len).
`reverse_complement`(→ Union[str, List[str], numpy.ndarray])	Reverse complement the input DNA sequences.
`dinuc_shuffle`(seqs[, n_shuffles, start, end, ...])	Dinucleotide shuffle the given sequences.
`generate_random_sequences`(→ Union[str, List[str], ...)	Generate random DNA sequences as strings or batches.

Module Contents

grelu.sequence.utils.RC_HASH: Dict[str, str]

grelu.sequence.utils.get_lengths(seqs: pandas.DataFrame | str | List[str], first_only: bool = False, input_type: str | None = None) → int | List[int]

Given DNA sequences, return their lengths.

Parameters:

seqs – DNA sequences as strings or genomic intervals
first_only – If True, only return the length of the first sequence. If False, returns a list of lengths of all sequences if multiple sequences are supplied.
input_type – Format of the input sequence. Accepted values are “intervals” or “strings”.

Returns:

The length of each sequence

Raises:

ValueError – if the input is not in interval or string format.
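
The length logic described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `get_lengths_sketch` is a hypothetical name, genomic intervals are represented as `(chrom, start, end)` tuples instead of a pandas DataFrame, and the interval length is assumed to be `end - start`.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical stand-in for one row of an interval dataframe: (chrom, start, end)
Interval = Tuple[str, int, int]


def get_lengths_sketch(
    seqs: Union[str, Sequence[str], Sequence[Interval]],
    first_only: bool = False,
) -> Union[int, List[int]]:
    """Return sequence lengths for strings or interval tuples (illustrative only)."""
    if isinstance(seqs, str):
        # A single string has a single length
        return len(seqs)
    if all(isinstance(s, str) for s in seqs):
        lengths = [len(s) for s in seqs]
    elif all(isinstance(s, tuple) and len(s) == 3 for s in seqs):
        # Assumption: the length of an interval is end - start
        lengths = [end - start for _, start, end in seqs]
    else:
        raise ValueError("Input is not in interval or string format.")
    return lengths[0] if first_only else lengths


print(get_lengths_sketch(["AC", "ACGT"]))
print(get_lengths_sketch([("chr1", 100, 150)]))
```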

grelu.sequence.utils.check_equal_lengths(seqs: pandas.DataFrame | List[str]) → bool

Given DNA sequences, check whether they are all of equal length.

Parameters:

seqs – DNA sequences as a list of strings or a dataframe of genomic intervals

Returns:

True if the sequences are all of equal length; otherwise, False.

Raises:

ValueError – if the input is not in interval or string format.

grelu.sequence.utils.get_unique_length(seqs: pandas.DataFrame | List[str] | numpy.ndarray | torch.Tensor) → int

Check if given sequences are all of equal length and if so, return the length.

Parameters:

seqs – DNA sequences or genomic intervals of equal length

Returns:

The fixed length of all the input sequences.

Raises:

ValueError – if the input is not in interval or string format.
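
The relationship between the equal-length check and the unique length can be sketched for string inputs as follows. The function names are hypothetical, and raising a `ValueError` when the lengths differ is an assumption made for this sketch; the documentation above does not state how unequal lengths are reported.

```python
from typing import List


def check_equal_lengths_sketch(seqs: List[str]) -> bool:
    """Return True if all sequences share one length (illustrative only)."""
    return len({len(s) for s in seqs}) == 1


def get_unique_length_sketch(seqs: List[str]) -> int:
    """Return the shared length of all sequences (illustrative only)."""
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        # Assumption: unequal lengths are treated as an error in this sketch
        raise ValueError("Sequences are not all of equal length.")
    return lengths.pop()


print(check_equal_lengths_sketch(["ACGT", "TTGA"]))
print(get_unique_length_sketch(["ACGT", "TTGA"]))
```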

grelu.sequence.utils.pad(seqs, seq_len, end, input_type) → str | List[str] | numpy.ndarray

Pad the input DNA sequence(s) with Ns at the desired end to reach seq_len. If seq_len is not provided, it is set to the length of the longest sequence.

Parameters:

seqs – DNA sequences as strings or in index encoded format
seq_len – Desired sequence length to pad to
end – Which end of the sequence to pad. Accepted values are “left”, “right” and “both”.
input_type – Format of the input sequences. Accepted values are “strings” or “indices”.

Returns:

Padded sequences of length seq_len.

Raises:

ValueError – If the input is not in string or integer encoded format.
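
As a rough illustration of the padding behavior for string inputs, consider the sketch below. `pad_sketch` is a hypothetical name, not the library function. Two details are assumptions: with end="both" the extra N of an odd-sized pad goes on the right, and sequences already longer than seq_len are returned unchanged.

```python
from typing import List, Optional


def pad_sketch(seqs: List[str], seq_len: Optional[int] = None, end: str = "both") -> List[str]:
    """Pad strings with N at the chosen end (illustrative only)."""
    if seq_len is None:
        # Default described above: pad to the longest sequence
        seq_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        missing = max(seq_len - len(s), 0)
        if end == "left":
            left = missing
        elif end == "right":
            left = 0
        elif end == "both":
            # Assumption: any odd leftover N is placed on the right
            left = missing // 2
        else:
            raise ValueError("end must be 'left', 'right' or 'both'.")
        padded.append("N" * left + s + "N" * (missing - left))
    return padded


print(pad_sketch(["ACG", "A"], end="right"))
```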

grelu.sequence.utils.trim(seqs, seq_len, end, input_type) → str | List[str] | numpy.ndarray

Trim DNA sequences to reach the desired length (seq_len). If seq_len is not provided, it is set to the length of the shortest sequence.

Parameters:

seqs – DNA sequences as strings or in index encoded format
seq_len – Desired sequence length to trim to
end – Which end of the sequence to trim. Accepted values are “left”, “right” and “both”.
input_type – Format of the input sequences. Accepted values are “strings” or “indices”.

Returns:

Trimmed sequences of length seq_len.

Raises:

ValueError – if the input is not in string or integer encoded format.
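
The trimming behavior for string inputs might look like the following sketch. `trim_sketch` is a hypothetical name. It assumes that end="left" removes bases from the left end, end="right" removes them from the right end, and end="both" removes the smaller half from the left when the excess is odd.

```python
from typing import List, Optional


def trim_sketch(seqs: List[str], seq_len: Optional[int] = None, end: str = "both") -> List[str]:
    """Trim strings to seq_len from the chosen end (illustrative only)."""
    if seq_len is None:
        # Default described above: trim to the shortest sequence
        seq_len = min(len(s) for s in seqs)
    trimmed = []
    for s in seqs:
        extra = max(len(s) - seq_len, 0)
        if end == "left":
            start = extra  # drop all extra bases from the left
        elif end == "right":
            start = 0  # drop all extra bases from the right
        elif end == "both":
            start = extra // 2  # assumption: smaller half removed on the left
        else:
            raise ValueError("end must be 'left', 'right' or 'both'.")
        trimmed.append(s[start:start + seq_len])
    return trimmed


print(trim_sketch(["ACGTA", "ACG"]))
```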

grelu.sequence.utils.resize(seqs, seq_len, end, input_type) → str | List[str] | numpy.ndarray

Resize the given sequences to the desired length (seq_len). Sequences shorter than seq_len will be padded with Ns. Sequences longer than seq_len will be trimmed.

Parameters:

seqs – DNA sequences as intervals, strings, or integer encoded format
seq_len – Desired length of output sequences.
end – Which end of the sequence to trim or extend. Accepted values are “left”, “right” or “both”.
input_type – Format of the input sequences. Accepted values are “intervals”, “strings” or “indices”.

Returns:

Resized sequences in the same format

Raises:

ValueError – if input sequences are not in interval, string or integer encoded format
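
Unlike pad and trim, resize also accepts genomic intervals. The sketch below shows one plausible way to resize a single interval's coordinates. The function name is hypothetical, and the conventions are assumptions: "right" keeps the start fixed, "left" keeps the end fixed, and "both" keeps the integer midpoint fixed. The library's actual coordinate handling may differ.

```python
from typing import Tuple


def resize_interval_sketch(start: int, stop: int, seq_len: int, end: str = "both") -> Tuple[int, int]:
    """Return new (start, stop) coordinates spanning seq_len bases (illustrative only)."""
    if end == "right":
        # Keep the start fixed; move the right end
        return start, start + seq_len
    if end == "left":
        # Keep the end fixed; move the left end
        return stop - seq_len, stop
    if end == "both":
        # Assumption: resize symmetrically around the integer midpoint
        center = (start + stop) // 2
        new_start = center - seq_len // 2
        return new_start, new_start + seq_len
    raise ValueError("end must be 'left', 'right' or 'both'.")


print(resize_interval_sketch(100, 200, 50, end="both"))
```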

grelu.sequence.utils.reverse_complement(seqs: str | List[str] | numpy.ndarray, input_type: str | None = None) → str | List[str] | numpy.ndarray

Reverse complement the input DNA sequences.

Parameters:

seqs – DNA sequences as strings or index encoding
input_type – Format of the input sequences. Accepted values are “strings” or “indices”.

Returns:

Reverse complemented sequences in the same format as the input.

Raises:

ValueError – If the input DNA sequence is not in string or index encoded format.
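
For string inputs, reverse complementation can be sketched with a simple lookup table. `RC_MAP` below is a hypothetical stand-in for `RC_HASH`, whose contents are not shown on this page; mapping N to N is an assumption.

```python
from typing import List, Union

# Hypothetical complement table; the actual contents of RC_HASH are not documented here
RC_MAP = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement_sketch(seqs: Union[str, List[str]]) -> Union[str, List[str]]:
    """Reverse complement a string or a list of strings (illustrative only)."""
    if isinstance(seqs, str):
        # Walk the sequence backwards and complement each base
        return "".join(RC_MAP[base] for base in reversed(seqs))
    return [reverse_complement_sketch(s) for s in seqs]


print(reverse_complement_sketch("AACG"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.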

grelu.sequence.utils.dinuc_shuffle(seqs[, n_shuffles, start, end, ...])

Dinucleotide shuffle the given sequences.

Parameters:

seqs – Sequences
n_shuffles – Number of times to shuffle each sequence
input_type – Format of the input sequence. Accepted values are “strings”, “indices” and “one_hot”
seed – Random seed
genome – Name of the genome to use if genomic intervals are supplied.

Returns:

Shuffled sequences in the same format as the input
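
A dinucleotide shuffle rearranges a sequence while preserving the count of every adjacent base pair. The sketch below uses a common successor-shuffling approach for a single string; it is not confirmed to be the algorithm this library uses, and `dinuc_shuffle_sketch` and `dinuc_counts` are hypothetical names.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def dinuc_shuffle_sketch(seq: str, seed: Optional[int] = None) -> str:
    """Shuffle a string while preserving its dinucleotide counts (illustrative only)."""
    if len(seq) < 3:
        return seq
    rng = random.Random(seed)
    # For each base, record the positions of the bases that follow it
    successors: Dict[str, List[int]] = {}
    for i in range(len(seq) - 1):
        successors.setdefault(seq[i], []).append(i + 1)
    # Shuffle each successor list but keep its last entry in place,
    # which guarantees the walk below can consume every transition
    for positions in successors.values():
        last = positions.pop()
        rng.shuffle(positions)
        positions.append(last)
    # Walk from the first base, always taking the next unused successor
    used = {base: 0 for base in successors}
    pos = 0
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        base = seq[pos]
        pos = successors[base][used[base]]
        used[base] += 1
        out.append(seq[pos])
    return "".join(out)


def dinuc_counts(seq: str) -> Counter:
    """Count overlapping dinucleotides in a string."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


original = "ACGTACGGTTACCA"
shuffled = dinuc_shuffle_sketch(original, seed=0)
print(dinuc_counts(shuffled) == dinuc_counts(original))
```

With this approach the first and last bases of the sequence stay in place, and the same seed always yields the same shuffle.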

grelu.sequence.utils.generate_random_sequences(seq_len: int, n: int = 1, seed: int | None = None, output_format: str = 'indices') → str | List[str] | numpy.ndarray | torch.Tensor

Generate random DNA sequences as strings or batches.

Parameters:

seq_len – Uniform expected length of output sequences.
n – Number of random sequences to generate.
seed – Seed value for random number generator.
output_format – Format in which the output should be returned. Accepted values are “strings”, “indices” and “one_hot”

Returns:

The generated sequences, in the format specified by output_format.
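
Seeded random sequence generation can be sketched as below for string output only. The function name is hypothetical, uniform sampling over A, C, G and T is an assumption, and the sketch always returns a list, whereas the documented function can also return index or one-hot encodings.

```python
import random
from typing import List, Optional


def generate_random_sequences_sketch(seq_len: int, n: int = 1, seed: Optional[int] = None) -> List[str]:
    """Generate n random DNA strings of length seq_len (illustrative only)."""
    # A dedicated generator makes results reproducible for a given seed
    rng = random.Random(seed)
    # Assumption: each base is drawn uniformly from A, C, G, T
    return ["".join(rng.choice("ACGT") for _ in range(seq_len)) for _ in range(n)]


seqs = generate_random_sequences_sketch(seq_len=10, n=3, seed=42)
print(len(seqs), [len(s) for s in seqs])
```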
