grelu.sequence.format#

grelu.sequence.format contains functions related to checking the format of input DNA sequences and converting them between accepted sequence formats.

The following are accepted sequence formats in gReLU:

1. intervals: a pd.DataFrame object containing valid genomic intervals
2. strings: a string or list of strings
3. indices: a numpy array of shape (length,) or (N, length) and dtype np.int8
4. one_hot: a torch tensor of shape (4, length) or (N, 4, length) and dtype torch.float32

Attributes#

`ALLOWED_BASES`
`STANDARD_BASES`
`BASE_TO_INDEX_HASH`
`INDEX_TO_BASE_HASH`

Functions#

`check_intervals`(→ bool)	Check if a pandas dataframe contains valid genomic intervals.
`check_string_dna`(→ bool)	Check if an input string or list of strings contains only valid DNA bases.
`check_indices`(→ bool)	Check if an input array contains valid integer-encoded DNA sequences.
`check_one_hot`(→ bool)	Check if an input tensor contains valid one-hot encoded DNA sequences.
`get_input_type`(inputs)	Given one or more DNA sequences in any accepted format, return the sequence format.
`intervals_to_strings`(→ Union[str, List[str]])	Extract DNA sequences from the specified intervals in a genome.
`strings_to_indices`(→ numpy.ndarray)	Convert DNA sequence strings into integer encoded format.
`indices_to_one_hot`(→ torch.Tensor)	Convert integer-encoded DNA sequences to one-hot encoded format.
`strings_to_one_hot`(→ torch.Tensor)	Convert a list of DNA sequences to one-hot encoded format.
`one_hot_to_indices`(→ numpy.ndarray)	Convert a one-hot encoded sequence to integer-encoded format.
`one_hot_to_strings`(→ List[str])	Convert a one-hot encoded sequence to a list of strings.
`indices_to_strings`(→ List[str])	Convert indices to strings. Any index outside the range 0 to 3 (inclusive) is converted to 'N'.
`convert_input_type`(→ Union[pandas.DataFrame, str, ...)	Convert input DNA sequence data into the desired format.

Module Contents#

grelu.sequence.format.ALLOWED_BASES: List[str] = ['A', 'C', 'G', 'T', 'N'][source]#

grelu.sequence.format.STANDARD_BASES: List[str] = ['A', 'C', 'G', 'T'][source]#

grelu.sequence.format.BASE_TO_INDEX_HASH: Dict[str, int][source]#

grelu.sequence.format.INDEX_TO_BASE_HASH: Dict[int, str][source]#
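
The contents of the two lookup tables are not listed on this page. The following is a minimal sketch of how a pair of inverse tables like these could be built, assuming (this is an assumption, not something stated above) that each base is indexed by its position in ALLOWED_BASES. The names `base_to_index` and `index_to_base` are illustrative and are not part of gReLU.

```python
from typing import Dict, List

# Values copied from the attribute listing above.
ALLOWED_BASES: List[str] = ["A", "C", "G", "T", "N"]

# Assumption: each base maps to its position in ALLOWED_BASES.
base_to_index: Dict[str, int] = {base: i for i, base in enumerate(ALLOWED_BASES)}

# The reverse table is the inverse of the forward table.
index_to_base: Dict[int, str] = {i: base for base, i in base_to_index.items()}
```

Under this assumption the four standard bases occupy indices 0 to 3 and 'N' takes index 4, which is consistent with the note in indices_to_strings that indices outside 0 to 3 decode to 'N'.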

grelu.sequence.format.check_intervals(df: pandas.DataFrame) → bool[source]#

Check if a pandas dataframe contains valid genomic intervals.

Parameters: df – Dataframe to check
Returns: Whether the dataframe contains valid genomic intervals

grelu.sequence.format.check_string_dna(strings: str | List[str]) → bool[source]#

Check if an input string or list of strings contains only valid DNA bases.

Parameters: strings – string or list of strings
Returns: If all the provided strings are valid DNA sequences, returns True. Otherwise, returns False.
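
To illustrate the kind of check described here, the sketch below tests whether every character of every string belongs to ALLOWED_BASES. The function `is_valid_dna` is a hypothetical stand-in written for this example, not gReLU's implementation, and it assumes matching is case-sensitive.

```python
from typing import List, Union

# Values copied from the attribute listing on this page.
ALLOWED_BASES = ["A", "C", "G", "T", "N"]


def is_valid_dna(strings: Union[str, List[str]]) -> bool:
    """Hypothetical stand-in for check_string_dna (not gReLU's code)."""
    # Treat a single string as a list of one sequence.
    if isinstance(strings, str):
        strings = [strings]
    allowed = set(ALLOWED_BASES)
    # Assumption: only uppercase characters in ALLOWED_BASES are valid.
    return all(set(seq) <= allowed for seq in strings)
```

For example, `is_valid_dna("ACGTN")` is True, while `is_valid_dna(["ACGT", "ACGU"])` is False because 'U' is not an allowed base.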

grelu.sequence.format.check_indices(indices: numpy.ndarray) → bool[source]#

Check if an input array contains valid integer-encoded DNA sequences.

Parameters: indices – Numpy array.
Returns: If the array contains valid integer-encoded DNA sequences, returns True. Otherwise, returns False.

grelu.sequence.format.check_one_hot(one_hot: torch.Tensor) → bool[source]#

Check if an input tensor contains valid one-hot encoded DNA sequences.

Parameters: one_hot – torch tensor
Returns: Whether the tensor is a valid one-hot encoded DNA sequence or batch of sequences.

grelu.sequence.format.get_input_type(inputs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor)[source]#

Given one or more DNA sequences in any accepted format, return the sequence format.

Parameters:

inputs – Input sequences as intervals, strings, index-encoded, or one-hot encoded

Returns:

The input format, one of “intervals”, “strings”, “indices” or “one_hot”

Raises:

KeyError – If the input dataframe is missing one or more of the required columns chrom, start, end.
ValueError – If the input sequence has non-allowed characters.
TypeError – If the input is not of a supported type.

grelu.sequence.format.intervals_to_strings(intervals: pandas.DataFrame | pandas.Series | dict, genome: str) → str | List[str][source]#

Extract DNA sequences from the specified intervals in a genome.

Parameters:

intervals – A pandas DataFrame, Series or dictionary containing the genomic interval(s) to extract.
genome – Name of the genome to use.

Returns:

A DNA sequence, or a list of DNA sequences, extracted from the intervals.

grelu.sequence.format.strings_to_indices(strings: str | List[str], add_batch_axis: bool = False) → numpy.ndarray[source]#

Convert DNA sequence strings into integer encoded format.

Parameters:

strings – A DNA sequence or list of sequences. If a list of multiple sequences is provided, they must all have equal length.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 1-dimensional array.

Returns:

The integer-encoded sequences.
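
The effect of add_batch_axis is easiest to see on a small input. The sketch below mimics the documented behavior with nested Python lists in place of numpy arrays. The function `encode` and the mapping `BASE_TO_INDEX` are hypothetical. The mapping assumes bases are indexed in the order A, C, G, T, N, and raising ValueError for unequal lengths is a choice made for this example only.

```python
from typing import List, Union

# Assumed mapping (not confirmed on this page): order of ALLOWED_BASES.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode(strings: Union[str, List[str]], add_batch_axis: bool = False):
    """Hypothetical list-based analogue of strings_to_indices."""
    if isinstance(strings, str):
        row = [BASE_TO_INDEX[base] for base in strings]
        # A single sequence gains an outer (batch) axis only on request.
        return [row] if add_batch_axis else row
    # Multiple sequences must share one length to form a rectangular batch.
    if len({len(seq) for seq in strings}) > 1:
        raise ValueError("all sequences must have equal length")
    return [[BASE_TO_INDEX[base] for base in seq] for seq in strings]
```

Here `encode("ACGT")` has the 1-dimensional form `[0, 1, 2, 3]`, while `encode("ACGT", add_batch_axis=True)` wraps it as `[[0, 1, 2, 3]]`, matching the (length,) and (N, length) shapes described for the indices format.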

grelu.sequence.format.indices_to_one_hot(indices: numpy.ndarray, add_batch_axis: bool = False) → torch.Tensor[source]#

Convert integer-encoded DNA sequences to one-hot encoded format.

Parameters:

indices – Integer-encoded DNA sequences.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 2-dimensional tensor.

Returns:

The one-hot encoded sequences.
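
As an illustration of the (4, length) layout, the sketch below one-hot encodes a single integer-encoded sequence using nested lists instead of a torch tensor. The function `one_hot_encode` is hypothetical. Encoding any index outside 0 to 3 (for example, an 'N') as an all-zero column is an assumption of this sketch and is not stated on this page.

```python
from typing import List


def one_hot_encode(indices: List[int]) -> List[List[float]]:
    """Hypothetical list-based analogue of indices_to_one_hot."""
    # Row r holds 1.0 at every position whose base index equals r.
    # Assumption: indices outside 0..3 produce an all-zero column.
    return [
        [1.0 if idx == row else 0.0 for idx in indices]
        for row in range(4)
    ]
```

The result has 4 rows (one per standard base) and one column per sequence position, so `one_hot_encode([0, 1, 2, 3])` is the 4 x 4 identity pattern.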

grelu.sequence.format.strings_to_one_hot(strings: str | List[str], add_batch_axis: bool = False) → torch.Tensor[source]#

Convert a list of DNA sequences to one-hot encoded format.

Parameters:

strings – A DNA sequence or a list of DNA sequences.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 2-dimensional tensor.

Returns:

The one-hot encoded DNA sequence(s).

Raises:

AssertionError – If the input sequences are not of the same length, or if the input is not a string or a list of strings.

grelu.sequence.format.one_hot_to_indices(one_hot: torch.Tensor) → numpy.ndarray[source]#

Convert a one-hot encoded sequence to integer-encoded format.

Parameters: one_hot – A one-hot encoded DNA sequence or batch of sequences.
Returns: The integer-encoded sequences.
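
The reverse conversion can be pictured as taking, at each position, the row that holds the 1. The sketch below does this for a single (4, length) sequence stored as nested lists. The function `one_hot_decode` is hypothetical, and mapping an all-zero column to index 4 is an assumption of this sketch, not documented behavior.

```python
from typing import List


def one_hot_decode(one_hot: List[List[float]]) -> List[int]:
    """Hypothetical list-based analogue of one_hot_to_indices."""
    length = len(one_hot[0])
    indices = []
    for pos in range(length):
        # Collect the four channel values at this sequence position.
        column = [one_hot[row][pos] for row in range(4)]
        if sum(column) == 0:
            # Assumption: an all-zero column stands for an unknown base.
            indices.append(4)
        else:
            # Index of the largest value, i.e. an argmax over the channels.
            indices.append(column.index(max(column)))
    return indices
```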

grelu.sequence.format.one_hot_to_strings(one_hot: torch.Tensor) → List[str][source]#

Convert a one-hot encoded sequence to a list of strings.

Parameters: one_hot – A one-hot encoded DNA sequence or batch of sequences.
Returns: A list of DNA sequences.

grelu.sequence.format.indices_to_strings(indices: numpy.ndarray) → List[str][source]#

Convert integer-encoded sequences to strings. Any index outside the range 0 to 3 (inclusive) is converted to ‘N’.

Parameters: indices – Integer-encoded DNA sequence or batch of sequences.
Returns: The input sequences as a list of strings.
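
The fallback to 'N' can be sketched in a few lines. The function `decode` below is a hypothetical illustration for a single sequence, not gReLU's code. It assumes that indices 0 to 3 correspond to the bases in the order given by STANDARD_BASES.

```python
from typing import List

# Values copied from the attribute listing on this page.
STANDARD_BASES = ["A", "C", "G", "T"]


def decode(indices: List[int]) -> str:
    """Hypothetical analogue of indices_to_strings for one sequence."""
    # Assumption: index i in 0..3 maps to STANDARD_BASES[i].
    # Every other value falls back to 'N', as the description states.
    return "".join(
        STANDARD_BASES[i] if 0 <= i <= 3 else "N" for i in indices
    )
```

With this sketch, `decode([0, 1, 2, 3, 4])` gives `"ACGTN"`, and out-of-range values such as 7 or -1 also decode to 'N'.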

grelu.sequence.format.convert_input_type(inputs, output_type, genome, add_batch_axis, input_type)

Convert input DNA sequence data into the desired format.

Parameters:

inputs – DNA sequence(s) in one of the following formats: intervals, strings, indices, or one-hot encoded.
output_type – The desired output format.
genome – The name of the genome to use if genomic intervals are provided.
add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 2-dimensional tensor.
input_type – Format of the input sequence (optional)

Returns:

The converted DNA sequence(s) in the desired format.

Raises:

ValueError – If the conversion is not possible between the input and output formats.
