grelu.sequence.format#
- Functions related to checking the format of input DNA sequences and converting
them between accepted sequence formats.
The following are accepted sequence formats:
1. intervals: a pd.DataFrame object containing valid genomic intervals
2. strings: a string or list of strings
3. indices: a numpy array of shape (L,) or (B, L) and dtype np.int8
4. one_hot: a torch tensor of shape (4, L) or (B, 4, L) and dtype torch.float32
Attributes#
Functions#
| check_intervals | Check if a pandas dataframe contains valid genomic intervals. |
| check_string_dna | Check if an input string or list of strings contains only valid DNA bases. |
| check_indices | Check if an input array contains valid integer-encoded DNA sequences. |
| check_one_hot | Check if an input tensor contains valid one-hot encoded DNA sequences. |
| get_input_type | Given one or more DNA sequences in any accepted format, return the sequence format. |
| intervals_to_strings | Extract DNA sequences from the specified intervals in a genome. |
| strings_to_indices | Convert DNA sequence strings into integer encoded format. |
| indices_to_one_hot | Convert integer-encoded DNA sequences to one-hot encoded format. |
| strings_to_one_hot | Convert a list of DNA sequences to one-hot encoded format. |
| one_hot_to_indices | Convert a one-hot encoded sequence to integer encoded format. |
| one_hot_to_strings | Convert a one-hot encoded sequence to a list of strings. |
| indices_to_strings | Convert indices to strings. Any index outside the range 0 to 3 is converted to 'N'. |
| convert_input_type | Convert input DNA sequence data into the desired format. |
Module Contents#
- grelu.sequence.format.check_intervals(df: pandas.DataFrame) bool [source]#
Check if a pandas dataframe contains valid genomic intervals.
- Parameters:
df – Dataframe to check
- Returns:
Whether the dataframe contains valid genomic intervals
- grelu.sequence.format.check_string_dna(strings: str | List[str]) bool [source]#
Check if an input string or list of strings contains only valid DNA bases.
- Parameters:
strings – string or list of strings
- Returns:
If all the provided strings are valid DNA sequences, returns True. Otherwise, returns False.
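The check described above can be pictured with a minimal pure-Python sketch. It is not the library's implementation: the helper name `is_valid_dna` and the allowed alphabet `ACGTN` are assumptions made for illustration, and the character set that gReLU actually accepts may differ.

```python
from typing import List, Union

# Assumed alphabet for this sketch; the library's accepted characters may differ.
ALLOWED_BASES = set("ACGTN")


def is_valid_dna(strings: Union[str, List[str]]) -> bool:
    """Return True only if every string consists solely of allowed bases."""
    if isinstance(strings, str):
        strings = [strings]  # treat a single sequence as a list of one
    return all(set(s) <= ALLOWED_BASES for s in strings)


print(is_valid_dna("ACGTN"))
print(is_valid_dna(["ACGT", "ACXT"]))
```

A single invalid character in any one string makes the whole check fail, which matches the all-or-nothing return value documented above.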
- grelu.sequence.format.check_indices(indices: numpy.ndarray) bool [source]#
Check if an input array contains valid integer-encoded DNA sequences.
- Parameters:
indices – Numpy array.
- Returns:
If the array contains valid integer-encoded DNA sequences, returns True. Otherwise, returns False.
- grelu.sequence.format.check_one_hot(one_hot: torch.Tensor) bool [source]#
Check if an input tensor contains valid one-hot encoded DNA sequences.
- Parameters:
one_hot – torch tensor
- Returns:
Whether the tensor is a valid one-hot encoded DNA sequence or batch of sequences.
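As a rough illustration of what a one-hot validity check might involve, the sketch below tests a single sequence stored as a nested list of shape (4, L). The helper name `looks_one_hot` is hypothetical, and the strict rule that every position holds exactly one 1 is an assumption; the library may apply different rules, for example for unknown bases.

```python
from typing import List


def looks_one_hot(matrix: List[List[float]]) -> bool:
    """Check a (4, L) nested list: 4 rows of equal length, one 1 per column."""
    if len(matrix) != 4:
        return False
    length = len(matrix[0])
    if any(len(row) != length for row in matrix):
        return False
    for column in zip(*matrix):  # iterate over sequence positions
        if any(v not in (0.0, 1.0) for v in column) or sum(column) != 1.0:
            return False
    return True


identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(looks_one_hot(identity))
```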
- grelu.sequence.format.get_input_type(inputs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor)[source]#
Given one or more DNA sequences in any accepted format, return the sequence format.
- Parameters:
inputs – Input sequences as intervals, strings, index-encoded, or one-hot encoded
- Returns:
The input format, one of “intervals”, “strings”, “indices” or “one_hot”
- Raises:
KeyError – If the input dataframe is missing one or more of the required columns chrom, start, end.
ValueError – If the input sequence has non-allowed characters.
TypeError – If the input is not of a supported type.
- grelu.sequence.format.intervals_to_strings(intervals: pandas.DataFrame | pandas.Series | dict, genome: str) str | List[str] [source]#
Extract DNA sequences from the specified intervals in a genome.
- Parameters:
intervals – A pandas DataFrame, Series or dictionary containing the genomic interval(s) to extract.
genome – Name of the genome to use.
- Returns:
A list of DNA sequences extracted from the intervals.
- grelu.sequence.format.strings_to_indices(strings: str | List[str], add_batch_axis: bool = False) numpy.ndarray [source]#
Convert DNA sequence strings into integer encoded format.
- Parameters:
strings – A DNA sequence or list of sequences. If a list of multiple sequences is provided, they must all have equal length.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 1-dimensional array.
- Returns:
The integer-encoded sequences.
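The effect of `add_batch_axis` and the equal-length requirement can be sketched in plain Python with nested lists standing in for numpy arrays. The helper `encode`, the mapping of A, C, G, T to 0 through 3, the use of 4 for any other character, and the choice of `ValueError` are all assumptions for illustration, not the library's code.

```python
from typing import List, Union

# Assumed encoding for this sketch: A, C, G, T map to 0..3, anything else to 4.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_INDEX = 4


def encode(strings: Union[str, List[str]], add_batch_axis: bool = False):
    """Integer-encode one sequence (1-D list) or several (2-D list)."""
    if isinstance(strings, str):
        row = [BASE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in strings]
        # A single sequence only gains a batch axis when requested.
        return [row] if add_batch_axis else row
    if len({len(s) for s in strings}) > 1:
        raise ValueError("All sequences must have equal length")
    return [[BASE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in s] for s in strings]


print(encode("ACGT"))                       # shape (L,)
print(encode("ACGT", add_batch_axis=True))  # shape (1, L)
print(encode(["AC", "GT"]))                 # shape (B, L)
```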
- grelu.sequence.format.indices_to_one_hot(indices: numpy.ndarray, add_batch_axis: bool = False) torch.Tensor [source]#
Convert integer-encoded DNA sequences to one-hot encoded format.
- Parameters:
indices – Integer-encoded DNA sequences.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 2-dimensional tensor.
- Returns:
The one-hot encoded sequences.
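To make the (4, L) layout concrete, here is a small sketch that builds a one-hot matrix from integer indices using nested lists instead of a torch tensor. The helper `to_one_hot` is hypothetical, and leaving the column all zeros for indices outside 0 to 3 is an assumption of this sketch rather than documented behavior.

```python
from typing import List


def to_one_hot(indices: List[int]) -> List[List[float]]:
    """Return a (4, L) nested list; row c marks positions whose index equals c."""
    return [
        [1.0 if idx == channel else 0.0 for idx in indices]
        for channel in range(4)
    ]


for row in to_one_hot([0, 1, 2, 3]):
    print(row)
```

Each row corresponds to one base channel and each column to one sequence position, so a length-L sequence yields 4 rows of L values.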
- grelu.sequence.format.strings_to_one_hot(strings: str | List[str], add_batch_axis: bool = False) torch.Tensor [source]#
Convert a list of DNA sequences to one-hot encoded format.
- Parameters:
strings – A DNA sequence or a list of DNA sequences.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 2-dimensional tensor.
- Returns:
The one-hot encoded DNA sequence(s).
- Raises:
AssertionError – If the input sequences are not of the same length, or if the input is not a string or a list of strings.
- grelu.sequence.format.one_hot_to_indices(one_hot: torch.Tensor) numpy.ndarray [source]#
Convert a one-hot encoded sequence to integer encoded format.
- Parameters:
one_hot – A one-hot encoded DNA sequence or batch of sequences.
- Returns:
The integer-encoded sequences.
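One common way to recover indices from a one-hot matrix is to take the position of the largest value in each column. The sketch below does this for a single (4, L) nested list; the helper `to_indices` is hypothetical, and whether the library uses an argmax or handles all-zero columns differently is not stated in this page.

```python
from typing import List


def to_indices(one_hot: List[List[float]]) -> List[int]:
    """Return, for each position, the channel holding the largest value."""
    return [column.index(max(column)) for column in zip(*one_hot)]


# Rows are the four channels; columns are positions of a length-4 sequence.
example = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
print(to_indices(example))
```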
- grelu.sequence.format.one_hot_to_strings(one_hot: torch.Tensor) List[str] [source]#
Convert a one-hot encoded sequence to a list of strings.
- Parameters:
one_hot – A one-hot encoded DNA sequence or batch of sequences.
- Returns:
A list of DNA sequences.
- grelu.sequence.format.indices_to_strings(indices: numpy.ndarray) List[str] [source]#
Convert indices to strings. Any index outside the range 0 to 3 is converted to ‘N’.
- Parameters:
indices – Integer-encoded DNA sequences.
- Returns:
The input sequences as a list of strings.
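The fallback to 'N' described above can be sketched as follows, with nested lists in place of numpy arrays. The helper `decode` is hypothetical, and the base order A, C, G, T for indices 0 to 3 is an assumption of this sketch.

```python
from typing import List, Union

INDEX_TO_BASE = "ACGT"  # assumed base order for indices 0..3


def decode(indices: Union[List[int], List[List[int]]]) -> List[str]:
    """Decode a 1-D or 2-D list of indices into a list of strings."""
    if indices and isinstance(indices[0], int):
        indices = [indices]  # promote a single sequence to a batch of one
    return [
        "".join(INDEX_TO_BASE[i] if 0 <= i <= 3 else "N" for i in row)
        for row in indices
    ]


print(decode([0, 1, 2, 3, 4]))
print(decode([[0, 0], [3, 7]]))
```

Note that the result is always a list, even for a single input sequence, in line with the documented return value.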
- grelu.sequence.format.convert_input_type(inputs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor, output_type: str = 'indices', genome: str | None = None, add_batch_axis: bool = False, input_type: str | None = None) pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor [source]#
Convert input DNA sequence data into the desired format.
- Parameters:
inputs – DNA sequence(s) in one of the following formats: intervals, strings, indices, or one-hot encoded.
output_type – The desired output format.
genome – The name of the genome to use if genomic intervals are provided.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will not have a batch axis (for example, a 2-dimensional tensor for one-hot output).
input_type – Format of the input sequence (optional)
- Returns:
The converted DNA sequence(s) in the desired format.
- Raises:
ValueError – If the conversion is not possible between the input and output formats.
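A general converter of this kind is often built as a lookup from (input format, output format) pairs to conversion functions, with a `ValueError` for pairs that have no route. The sketch below shows that pattern in pure Python. The names `convert`, `_encode`, `_decode` and `CONVERTERS` are hypothetical, only two formats are covered, and none of this is taken from the library's source.

```python
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"  # assumed base order for this sketch


def _encode(seq: str) -> List[int]:
    # Unknown characters map to 4 (an assumption of this sketch).
    return [BASES.index(b) if b in BASES else 4 for b in seq]


def _decode(indices: List[int]) -> str:
    return "".join(BASES[i] if 0 <= i <= 3 else "N" for i in indices)


# Each supported (input_type, output_type) pair maps to one converter.
CONVERTERS: Dict[Tuple[str, str], Callable] = {
    ("strings", "indices"): _encode,
    ("indices", "strings"): _decode,
}


def convert(inputs, input_type: str, output_type: str = "indices"):
    """Convert between formats, raising ValueError when no route exists."""
    if input_type == output_type:
        return inputs
    converter = CONVERTERS.get((input_type, output_type))
    if converter is None:
        raise ValueError(f"Cannot convert {input_type} to {output_type}")
    return converter(inputs)


print(convert("ACGT", "strings", "indices"))
print(convert([3, 2, 1, 0], "indices", "strings"))
```

In this sketch a request such as strings to intervals raises `ValueError`, since a bare sequence carries no genomic coordinates.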