grelu.sequence.format

Functions related to checking the format of input DNA sequences and converting them between accepted sequence formats.

The following are accepted sequence formats:

1. intervals: A pd.DataFrame object containing valid genomic intervals
2. strings: A string or list of strings
3. indices: A numpy array of shape (L,) or (B, L) and dtype np.int8
4. one_hot: A torch tensor of shape (4, L) or (B, 4, L) and dtype torch.float32

Attributes

Functions

check_intervals(→ bool)

Check if a pandas dataframe contains valid genomic intervals.

check_string_dna(→ bool)

Check if an input string or list of strings contains only valid DNA bases.

check_indices(→ bool)

Check if an input array contains valid integer-encoded DNA sequences.

check_one_hot(→ bool)

Check if an input tensor contains valid one-hot encoded DNA sequences.

get_input_type(inputs)

Given one or more DNA sequences in any accepted format, return the sequence format.

intervals_to_strings(→ Union[str, List[str]])

Extract DNA sequences from the specified intervals in a genome.

strings_to_indices(→ numpy.ndarray)

Convert DNA sequence strings into integer encoded format.

indices_to_one_hot(→ torch.Tensor)

Convert integer-encoded DNA sequences to one-hot encoded format.

strings_to_one_hot(→ torch.Tensor)

Convert a DNA sequence or a list of DNA sequences to one-hot encoded format.

one_hot_to_indices(→ numpy.ndarray)

Convert a one-hot encoded sequence to integer-encoded format.

one_hot_to_strings(→ List[str])

Convert a one-hot encoded sequence to a list of strings.

indices_to_strings(→ List[str])

Convert indices to strings. Any index outside the range 0 to 3 is converted to 'N'.

convert_input_type(→ Union[pandas.DataFrame, str, ...)

Convert input DNA sequence data into the desired format.

Module Contents

grelu.sequence.format.ALLOWED_BASES: List[str] = ['A', 'C', 'G', 'T', 'N']
grelu.sequence.format.STANDARD_BASES: List[str] = ['A', 'C', 'G', 'T']
grelu.sequence.format.BASE_TO_INDEX_HASH: Dict[str, int]
grelu.sequence.format.INDEX_TO_BASE_HASH: Dict[int, str]
grelu.sequence.format.check_intervals(df: pandas.DataFrame) → bool

Check if a pandas dataframe contains valid genomic intervals.

Parameters:

df – Dataframe to check

Returns:

Whether the dataframe contains valid genomic intervals

grelu.sequence.format.check_string_dna(strings: str | List[str]) → bool

Check if an input string or list of strings contains only valid DNA bases.

Parameters:

strings – string or list of strings

Returns:

If all the provided strings are valid DNA sequences, returns True. Otherwise, returns False.
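To make the check concrete, here is a minimal sketch of the same idea in plain Python. It is not the library's implementation: `check_string_dna_sketch` is a hypothetical name, and treating the check as case-sensitive is an assumption, since this page does not say how lowercase bases are handled.

```python
from typing import List, Union

# Copied from the ALLOWED_BASES attribute documented on this page.
ALLOWED_BASES = ["A", "C", "G", "T", "N"]


def check_string_dna_sketch(strings: Union[str, List[str]]) -> bool:
    """Hypothetical helper: True only if every character is an allowed base."""
    # Treat a single string as a list of one sequence.
    if isinstance(strings, str):
        strings = [strings]
    allowed = set(ALLOWED_BASES)
    # Assumption: the comparison is case-sensitive.
    return all(set(seq) <= allowed for seq in strings)


print(check_string_dna_sketch("ACGTN"))           # True
print(check_string_dna_sketch(["ACGT", "ACGU"]))  # False, 'U' is not allowed
```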

grelu.sequence.format.check_indices(indices: numpy.ndarray) → bool

Check if an input array contains valid integer-encoded DNA sequences.

Parameters:

indices – Numpy array.

Returns:

If the array contains valid integer-encoded DNA sequences, returns True. Otherwise, returns False.

grelu.sequence.format.check_one_hot(one_hot: torch.Tensor) → bool

Check if an input tensor contains valid one-hot encoded DNA sequences.

Parameters:

one_hot – torch tensor

Returns:

Whether the tensor is a valid one-hot encoded DNA sequence or batch of sequences.

grelu.sequence.format.get_input_type(inputs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor)

Given one or more DNA sequences in any accepted format, return the sequence format.

Parameters:

inputs – Input sequences as intervals, strings, index-encoded, or one-hot encoded

Returns:

The input format, one of “intervals”, “strings”, “indices” or “one_hot”

Raises:
  • KeyError – If the input dataframe is missing one or more of the required columns chrom, start, end.

  • ValueError – If the input sequence has non-allowed characters.

  • TypeError – If the input is not of a supported type.

grelu.sequence.format.intervals_to_strings(intervals: pandas.DataFrame | pandas.Series | dict, genome: str) → str | List[str]

Extract DNA sequences from the specified intervals in a genome.

Parameters:
  • intervals – A pandas DataFrame, Series or dictionary containing the genomic interval(s) to extract.

  • genome – Name of the genome to use.

Returns:

A DNA sequence or a list of DNA sequences extracted from the intervals.

grelu.sequence.format.strings_to_indices(strings: str | List[str], add_batch_axis: bool = False) → numpy.ndarray

Convert DNA sequence strings into integer encoded format.

Parameters:
  • strings – A DNA sequence or list of sequences. If a list of multiple sequences is provided, they must all have equal length.

  • add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 1-dimensional array.

Returns:

The integer-encoded sequences.
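The effect of add_batch_axis is easiest to see on a small input. The sketch below mimics the documented behaviour with nested lists instead of a numpy array, so it runs without dependencies. `strings_to_indices_sketch` and `BASE_TO_INDEX` are hypothetical names, and the mapping (each base's position in ALLOWED_BASES) is an assumption, because the contents of BASE_TO_INDEX_HASH are not shown on this page.

```python
from typing import List, Union

# Assumed mapping: position of each base in ALLOWED_BASES.
# The real BASE_TO_INDEX_HASH values are not listed in this reference.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def strings_to_indices_sketch(strings: Union[str, List[str]], add_batch_axis: bool = False):
    """Hypothetical helper: integer-encode one sequence or a list of sequences."""
    if isinstance(strings, str):
        encoded = [BASE_TO_INDEX[base] for base in strings]
        # A single sequence stays 1-dimensional unless a batch axis is requested.
        return [encoded] if add_batch_axis else encoded
    # Multiple sequences must share one length to form a (B, L) array.
    if len({len(seq) for seq in strings}) > 1:
        raise ValueError("All sequences must have equal length.")
    return [[BASE_TO_INDEX[base] for base in seq] for seq in strings]


print(strings_to_indices_sketch("ACGT"))                       # [0, 1, 2, 3]
print(strings_to_indices_sketch("ACGT", add_batch_axis=True))  # [[0, 1, 2, 3]]
print(strings_to_indices_sketch(["AC", "GT"]))                 # [[0, 1], [2, 3]]
```

The choice of ValueError for unequal lengths belongs to the sketch; the exception the library raises in this function is not documented here.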

grelu.sequence.format.indices_to_one_hot(indices: numpy.ndarray) → torch.Tensor

Convert integer-encoded DNA sequences to one-hot encoded format.

Parameters:

indices – Integer-encoded DNA sequences.

Returns:

The one-hot encoded sequences.
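As an illustration of the (4, L) layout, the following sketch one-hot encodes a single integer-encoded sequence with nested lists rather than a torch tensor. `indices_to_one_hot_sketch` is a hypothetical name, and encoding an index outside 0 to 3 (such as N) as an all-zero column is an assumption that this page does not confirm.

```python
from typing import List


def indices_to_one_hot_sketch(indices: List[int]) -> List[List[float]]:
    """Hypothetical helper: L indices -> 4 rows (A, C, G, T) of L floats."""
    # Row `channel` holds 1.0 wherever the sequence has that base.
    # Assumption: indices outside 0..3 (e.g. N) match no row and stay all zero.
    return [
        [1.0 if idx == channel else 0.0 for idx in indices]
        for channel in range(4)
    ]


one_hot = indices_to_one_hot_sketch([0, 1, 2, 3, 4])  # A, C, G, T, N
for row in one_hot:
    print(row)
```

Each row corresponds to one base and each column to one position, which matches the (4, L) shape listed for the one_hot format.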

grelu.sequence.format.strings_to_one_hot(strings: str | List[str], add_batch_axis: bool = False) → torch.Tensor

Convert a DNA sequence or a list of DNA sequences to one-hot encoded format.

Parameters:
  • strings – A DNA sequence or a list of DNA sequences.

  • add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will be a 2-dimensional tensor.

Returns:

The one-hot encoded DNA sequence(s).

Raises:
  • AssertionError – If the input sequences are not of the same length, or if the input is not a string or a list of strings.

grelu.sequence.format.one_hot_to_indices(one_hot: torch.Tensor) → numpy.ndarray

Convert a one-hot encoded sequence to integer-encoded format.

Parameters:

one_hot – A one-hot encoded DNA sequence or batch of sequences.

Returns:

The integer-encoded sequences.

grelu.sequence.format.one_hot_to_strings(one_hot: torch.Tensor) → List[str]

Convert a one-hot encoded sequence to a list of strings.

Parameters:

one_hot – A one-hot encoded DNA sequence or batch of sequences.

Returns:

A list of DNA sequences.

grelu.sequence.format.indices_to_strings(indices: numpy.ndarray) → List[str]

Convert indices to strings. Any index outside the range 0 to 3 is converted to 'N'.

Parameters:

indices – Integer-encoded DNA sequences.

Returns:

The input sequences as a list of strings.
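The fallback to 'N' can be sketched in a few lines. This is an illustrative re-implementation on nested lists, not the library's code: `indices_to_strings_sketch` is a hypothetical name, and it only handles a batch of sequences, since the handling of a 1-dimensional input is not described here.

```python
from typing import List

# Copied from the STANDARD_BASES attribute documented on this page.
STANDARD_BASES = ["A", "C", "G", "T"]


def indices_to_strings_sketch(indices: List[List[int]]) -> List[str]:
    """Hypothetical helper: a (B, L) batch of indices -> list of B strings."""
    return [
        # Any index outside 0..3 decodes to 'N', as stated in the description.
        "".join(STANDARD_BASES[i] if 0 <= i <= 3 else "N" for i in row)
        for row in indices
    ]


print(indices_to_strings_sketch([[0, 1, 2, 3], [3, 4, 7, 0]]))  # ['ACGT', 'TNNA']
```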

grelu.sequence.format.convert_input_type(inputs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor, output_type: str = 'indices', genome: str | None = None, add_batch_axis: bool = False) → pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor

Convert input DNA sequence data into the desired format.

Parameters:
  • inputs – DNA sequence(s) in one of the following formats: intervals, strings, indices, or one-hot encoded.

  • output_type – The desired output format.

  • genome – The name of the genome to use if genomic intervals are provided.

  • add_batch_axis – If True, a batch axis will be included in the output for single sequences. If False, the output for a single sequence will have no batch axis, for example a 1-dimensional array for indices or a 2-dimensional tensor for one-hot output.

Returns:

The converted DNA sequence(s) in the desired format.

Raises:

ValueError – If the conversion is not possible between the input and output formats.
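One way to picture which conversions are possible is to treat the four formats as an ordered chain and step through the pairwise converters listed on this page. The sketch below only computes the names of the functions that such a chain would call. `conversion_path` is a hypothetical helper, and both the chaining strategy and the rule that sequences cannot be turned back into intervals are assumptions, not the documented behaviour of convert_input_type.

```python
from typing import List

# The four accepted formats, ordered from rawest to most encoded.
FORMATS = ["intervals", "strings", "indices", "one_hot"]


def conversion_path(input_type: str, output_type: str) -> List[str]:
    """Hypothetical helper: names of pairwise converters linking two formats."""
    if input_type not in FORMATS or output_type not in FORMATS:
        raise ValueError(f"Unknown format: {input_type!r} or {output_type!r}")
    if input_type == output_type:
        return []  # nothing to do
    # Assumption: coordinates cannot be recovered from a sequence.
    if output_type == "intervals":
        raise ValueError("Cannot convert sequences to intervals.")
    start, stop = FORMATS.index(input_type), FORMATS.index(output_type)
    step = 1 if stop > start else -1
    # Walk neighbour to neighbour, e.g. strings -> indices -> one_hot.
    return [f"{FORMATS[k]}_to_{FORMATS[k + step]}" for k in range(start, stop, step)]


print(conversion_path("intervals", "one_hot"))
# ['intervals_to_strings', 'strings_to_indices', 'indices_to_one_hot']
print(conversion_path("one_hot", "strings"))
# ['one_hot_to_indices', 'indices_to_strings']
```

Every name this sketch produces corresponds to a function documented above. The library may take shortcuts such as strings_to_one_hot instead of chaining through indices.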