grelu.sequence.utils
General utilities for analysis of DNA sequences.
Attributes
Functions
| get_lengths | Given DNA sequences, return their lengths. |
| check_equal_lengths | Given DNA sequences, check whether they are all of equal length. |
| get_unique_length | Check if given sequences are all of equal length and if so, return the length. |
| pad | Pad the input DNA sequence(s) with Ns at the desired end to reach seq_len. |
| trim | Trim DNA sequences to reach the desired length (seq_len). |
| resize | Resize the given sequences to the desired length (seq_len). |
| reverse_complement | Reverse complement input DNA sequences. |
| dinuc_shuffle | Dinucleotide shuffle the given sequences. |
| generate_random_sequences | Generate random DNA sequences as strings or batches. |
Module Contents
- grelu.sequence.utils.get_lengths(seqs: pandas.DataFrame | str | List[str], first_only: bool = False, input_type: str | None = None) -> int | List[int]
Given DNA sequences, return their lengths.
- Parameters:
seqs – DNA sequences as strings or genomic intervals
first_only – If True, return only the length of the first sequence. If False, return a list with the length of every sequence when multiple sequences are supplied.
input_type – Format of the input sequence. Accepted values are “intervals” or “strings”.
- Returns:
The length of each sequence, or only the length of the first sequence if first_only is True.
- Raises:
ValueError – if the input is not in interval or string format.
- grelu.sequence.utils.check_equal_lengths(seqs: pandas.DataFrame | List[str]) -> bool
Given DNA sequences, check whether they are all of equal length.
- Parameters:
seqs – DNA sequences as a list of strings or a dataframe of genomic intervals
- Returns:
True if the sequences are all of equal length, otherwise False.
- Raises:
ValueError – if the input is not in interval or string format.
- grelu.sequence.utils.get_unique_length(seqs: pandas.DataFrame | List[str] | numpy.ndarray | torch.Tensor) -> int
Check if given sequences are all of equal length and if so, return the length.
- Parameters:
seqs – DNA sequences or genomic intervals of equal length
- Returns:
The fixed length of all the input sequences.
- Raises:
ValueError – if the input is not in interval or string format.
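The three length helpers above can be illustrated with a minimal pure-Python sketch. This is not gReLU's implementation: the names `lengths_sketch` and `unique_length_sketch` are hypothetical, genomic intervals are represented here as plain `(chrom, start, end)` tuples rather than a dataframe, and raising `ValueError` on unequal lengths is an assumption made for the illustration.

```python
def lengths_sketch(seqs):
    """Return lengths of DNA strings or of (chrom, start, end) tuples."""
    if isinstance(seqs, str):
        seqs = [seqs]
    lengths = []
    for s in seqs:
        if isinstance(s, str):
            lengths.append(len(s))
        elif isinstance(s, tuple) and len(s) == 3:
            # Assumption: an interval's length is end - start.
            lengths.append(s[2] - s[1])
        else:
            raise ValueError("input is not in interval or string format")
    return lengths


def equal_lengths_sketch(seqs):
    """True if every sequence has the same length."""
    return len(set(lengths_sketch(seqs))) == 1


def unique_length_sketch(seqs):
    """Return the shared length, or fail if lengths differ."""
    lengths = set(lengths_sketch(seqs))
    if len(lengths) != 1:
        # Assumption: the failure mode is chosen for this sketch only.
        raise ValueError("sequences are not all of equal length")
    return lengths.pop()
```

For example, `lengths_sketch(["ACGT", "AC"])` gives `[4, 2]`, so `equal_lengths_sketch` reports `False` for that input.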
- grelu.sequence.utils.pad(seqs: str | List[str] | numpy.ndarray, seq_len: int | None, end: str = 'both', input_type: str | None = None) -> str | List[str] | numpy.ndarray
Pad the input DNA sequence(s) with Ns at the desired end to reach seq_len. If seq_len is not provided, it is set to the length of the longest sequence.
- Parameters:
seqs – DNA sequences as strings or in index encoded format
seq_len – Desired sequence length to pad to
end – Which end of the sequence to pad. Accepted values are “left”, “right” and “both”.
input_type – Format of the input sequences. Accepted values are “strings” or “indices”.
- Returns:
Padded sequences of length seq_len.
- Raises:
ValueError – If the input is not in string or integer encoded format.
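The padding behavior described above can be sketched for string input as follows. This is an illustration under stated assumptions, not gReLU's code: `pad_sketch` is a hypothetical name, and for `end="both"` the sketch assumes that any odd leftover N goes to the right end, which the documentation does not specify.

```python
def pad_sketch(seqs, seq_len=None, end="both"):
    """Pad each DNA string in `seqs` with Ns up to `seq_len`."""
    if end not in ("left", "right", "both"):
        raise ValueError(f"unsupported end: {end!r}")
    if seq_len is None:
        # As documented: default to the length of the longest sequence.
        seq_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        extra = max(seq_len - len(s), 0)
        # Assumption: with end="both", the odd N is placed on the right.
        left = {"left": extra, "right": 0, "both": extra // 2}[end]
        padded.append("N" * left + s + "N" * (extra - left))
    return padded
```

With no `seq_len`, `pad_sketch(["ACGT", "AC"])` pads the shorter string to four bases and returns `["ACGT", "NACN"]`.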
- grelu.sequence.utils.trim(seqs: str | List[str] | numpy.ndarray, seq_len: int | None = None, end: str = 'both', input_type: str | None = None) -> str | List[str] | numpy.ndarray
Trim DNA sequences to reach the desired length (seq_len). If seq_len is not provided, it is set to the length of the shortest sequence.
- Parameters:
seqs – DNA sequences as strings or in index encoded format
seq_len – Desired sequence length to trim to
end – Which end of the sequence to trim. Accepted values are “left”, “right” and “both”.
input_type – Format of the input sequences. Accepted values are “strings” or “indices”.
- Returns:
Trimmed sequences of length seq_len.
- Raises:
ValueError – if the input is not in string or integer encoded format.
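A comparable sketch for trimming string input is shown below. `trim_sketch` is a hypothetical name, and how `end="both"` splits an odd number of removed bases is an assumption (here the extra base is removed from the right end).

```python
def trim_sketch(seqs, seq_len=None, end="both"):
    """Trim each DNA string in `seqs` down to `seq_len` bases."""
    if end not in ("left", "right", "both"):
        raise ValueError(f"unsupported end: {end!r}")
    if seq_len is None:
        # As documented: default to the length of the shortest sequence.
        seq_len = min(len(s) for s in seqs)
    trimmed = []
    for s in seqs:
        extra = max(len(s) - seq_len, 0)
        # Number of bases dropped from the left end.
        left = {"left": extra, "right": 0, "both": extra // 2}[end]
        trimmed.append(s[left:left + seq_len])
    return trimmed
```

Here `end` names the end that loses bases, so `trim_sketch(["AACGTT"], 4, "left")` keeps the rightmost four bases.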
- grelu.sequence.utils.resize(seqs: str | List[str] | numpy.ndarray, seq_len: int, end: str = 'both', input_type: str | None = None) -> str | List[str] | numpy.ndarray
Resize the given sequences to the desired length (seq_len). Sequences shorter than seq_len will be padded with Ns. Sequences longer than seq_len will be trimmed.
- Parameters:
seqs – DNA sequences as intervals, strings, or integer encoded format
seq_len – Desired length of output sequences.
end – Which end of the sequence to trim or extend. Accepted values are “left”, “right” or “both”.
input_type – Format of the input sequences. Accepted values are “intervals”, “strings” or “indices”.
- Returns:
Resized sequences in the same format
- Raises:
ValueError – if the input sequences are not in interval, string or integer encoded format.
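Resizing combines the two operations: one call either pads or trims depending on whether the sequence is shorter or longer than `seq_len`. The single-string sketch below shows that decision. `resize_sketch` is a hypothetical name, and the treatment of odd differences for `end="both"` is an assumption rather than documented behavior.

```python
def resize_sketch(seq, seq_len, end="both"):
    """Pad `seq` with Ns or trim it so the result has `seq_len` bases."""
    if end not in ("left", "right", "both"):
        raise ValueError(f"unsupported end: {end!r}")
    diff = seq_len - len(seq)   # > 0: pad, < 0: trim, 0: unchanged
    n = abs(diff)
    # Bases added to (or removed from) the left end; the rest go right.
    left = {"left": n, "right": 0, "both": n // 2}[end]
    right = n - left
    if diff >= 0:
        return "N" * left + seq + "N" * right
    return seq[left:len(seq) - right]
```

Applied to a mixed-length batch, `[resize_sketch(s, 4) for s in ["AC", "AACGTT"]]` yields two four-base strings, one padded and one trimmed.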
- grelu.sequence.utils.reverse_complement(seqs: [str, List[str], numpy.ndarray], input_type: str | None = None) -> str | List[str] | numpy.ndarray
Reverse complement input DNA sequences.
- Parameters:
seqs – DNA sequences as strings or index encoding
input_type – Format of the input sequences. Accepted values are “strings” or “indices”.
- Returns:
Reverse-complemented sequences in the same format as the input.
- Raises:
ValueError – If the input DNA sequence is not in string or index encoded format.
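As an illustration of reverse complementation in both input formats, here is a minimal sketch. The function names are hypothetical, and the index encoding used (A=0, C=1, G=2, T=3, N=4) is an assumption for the example; check the library's own encoding before relying on it.

```python
# Complement table for string input; N maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement_sketch(seq):
    """Reverse complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_indices_sketch(indices):
    """Reverse complement an index-encoded sequence.

    Assumes A=0, C=1, G=2, T=3 and N=4, so the complement of a
    base index i < 4 is 3 - i, while N is left unchanged.
    """
    return [3 - i if i < 4 else i for i in reversed(indices)]
```

Both functions agree on the same sequence: `"AACG"` and its assumed encoding `[0, 0, 1, 2]` map to `"CGTT"` and `[1, 2, 3, 3]`.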
- grelu.sequence.utils.dinuc_shuffle(seqs: pandas.DataFrame | numpy.ndarray | List[str], n_shuffles: int = 1, start=0, end=-1, input_type: str | None = None, seed: int | None = None, genome: str | None = None)
Dinucleotide shuffle the given sequences.
- Parameters:
seqs – Sequences to shuffle, as a list of strings, a numpy array, or a dataframe of genomic intervals
n_shuffles – Number of times to shuffle each sequence
input_type – Format of the input sequence. Accepted values are “strings”, “indices” and “one_hot”
seed – Random seed
genome – Name of the genome to use if genomic intervals are supplied.
- Returns:
Shuffled sequences in the same format as the input
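A dinucleotide shuffle rearranges a sequence while keeping the count of every adjacent base pair unchanged. The sketch below shows one common way to do this for a single string: treat each base as a node, each adjacent pair as an edge, and walk the edges in a randomized order while keeping each base's last outgoing edge fixed. Whether gReLU uses this particular approach is not stated in the documentation, and `dinuc_shuffle_sketch` is a hypothetical name.

```python
import random
from collections import Counter


def dinuc_shuffle_sketch(seq, seed=None):
    """Shuffle a DNA string while preserving its dinucleotide counts."""
    rng = random.Random(seed)
    if len(seq) < 3:
        return seq
    # successors[base] lists the positions that follow each occurrence
    # of that base in the original sequence.
    successors = {}
    for i in range(len(seq) - 1):
        successors.setdefault(seq[i], []).append(i + 1)
    # Shuffle every list except its last entry; keeping the last
    # successor in place guarantees the walk below uses every pair.
    for positions in successors.values():
        head = positions[:-1]
        rng.shuffle(head)
        positions[:-1] = head
    used = Counter()  # how many successors of each base were consumed
    pos = 0
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        base = seq[pos]
        pos = successors[base][used[base]]
        used[base] += 1
        out.append(seq[pos])
    return "".join(out)
```

The result has the same length, the same first and last base, and the same dinucleotide counts as the input, and a fixed `seed` makes it reproducible.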
- grelu.sequence.utils.generate_random_sequences(seq_len: int, n: int = 1, seed: int | None = None, output_format: str = 'indices') -> str | List[str] | numpy.ndarray | torch.Tensor
Generate random DNA sequences as strings or batches.
- Parameters:
seq_len – Length of each output sequence.
n – Number of random sequences to generate.
seed – Seed value for random number generator.
output_format – Format in which the output should be returned. Accepted values are “strings”, “indices” and “one_hot”
- Returns:
The generated sequences in the requested output_format.
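Seeded generation of random sequences can be sketched with the standard library alone. This covers only string output and draws the four bases uniformly, which is an assumption about the sampling distribution; `random_sequences_sketch` is a hypothetical name, not part of gReLU.

```python
import random


def random_sequences_sketch(seq_len, n=1, seed=None):
    """Return `n` random DNA strings, each `seq_len` bases long."""
    rng = random.Random(seed)  # a fixed seed gives reproducible output
    return ["".join(rng.choices("ACGT", k=seq_len)) for _ in range(n)]
```

Calling the function twice with the same `seed` returns identical lists, which is useful when random sequences serve as a background set in repeated analyses.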