grelu.sequence.metrics

Functions to calculate metrics based on the content of a sequence

Functions

gc(→ Union[float, List[float]])

Calculate the GC fraction of the given DNA sequence(s).

gc_distribution(→ numpy.ndarray)

Calculate the histogram of GC content in a set of DNA sequences.

Module Contents

grelu.sequence.metrics.gc(seqs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor, input_type: str | None = None, genome: str | None = None) → float | List[float]

Calculate the GC fraction of the given DNA sequence(s).

Parameters:
  • seqs – The DNA sequences whose GC content is to be calculated. These can be in any accepted format (intervals, strings, integer-encoded or one-hot encoded).

  • input_type – The format of the input sequences. Accepted values are “intervals”, “strings”, “indices” or “one_hot”. If not provided, it will be deduced from the data.

  • genome – Name of the genome to use if genomic intervals are provided.

Returns:

The fraction of bases in the sequence that are G or C, as a float. If multiple sequences are provided, the output is a list of values, one per sequence.
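To make the quantity concrete, here is a minimal standard-library sketch of a GC fraction calculation for string inputs only. It does not call gReLU and is not its implementation: the helper name gc_fraction is hypothetical, and the handling of lowercase bases and of characters other than A, C, G and T (counted in the length but not as G or C) is an assumption that may differ from what grelu.sequence.metrics.gc does.

```python
from typing import List, Union


def gc_fraction(seqs: Union[str, List[str]]) -> Union[float, List[float]]:
    """Hypothetical helper: fraction of G/C bases in one or more DNA strings."""

    def _one(seq: str) -> float:
        if not seq:
            raise ValueError("Cannot compute GC fraction of an empty sequence")
        seq = seq.upper()  # assumption: case-insensitive counting
        # Assumption: every character counts towards the length, including N
        return (seq.count("G") + seq.count("C")) / len(seq)

    # A single string gives a float; a list gives one float per sequence
    if isinstance(seqs, str):
        return _one(seqs)
    return [_one(s) for s in seqs]


print(gc_fraction("ACGT"))                    # 0.5
print(gc_fraction(["GGCC", "AATT", "ACGT"]))  # [1.0, 0.0, 0.5]
```

The return shape mirrors the description above: a single float for one sequence, a list of floats for several.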

grelu.sequence.metrics.gc_distribution(seqs: pandas.DataFrame | List[str] | numpy.ndarray | torch.Tensor, binwidth: float = 0.1, normalize: bool = False, input_type: str | None = None, genome: str | None = None) → numpy.ndarray

Calculate the histogram of GC content in a set of DNA sequences.

Parameters:
  • seqs – DNA sequences, as intervals, strings, indices or one-hot.

  • binwidth – Width of the bins to use when calculating the histogram. Default is 0.1.

  • normalize – Whether to normalize the histogram so that the values sum to 1.

  • input_type – The format of the input sequences. Accepted values are “intervals”, “strings”, “indices” or “one_hot”. If not provided, it will be deduced from the data.

  • genome – Name of the genome to use if genomic intervals are supplied.

Returns:

A numpy array containing the histogram of GC content, with one entry per bin. Its length equals 1/binwidth, e.g. 10 for the default binwidth of 0.1.
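The relationship between binwidth, normalize and the output length can be sketched with the standard library alone. This is an illustration, not gReLU's implementation: the helper name gc_histogram is hypothetical, it takes precomputed GC fractions rather than sequences, it returns a plain list where the real function returns a numpy array, and the bin edge convention (left-inclusive bins, with a GC fraction of exactly 1.0 placed in the last bin) is an assumption.

```python
from typing import List


def gc_histogram(
    gc_values: List[float], binwidth: float = 0.1, normalize: bool = False
) -> List[float]:
    """Hypothetical helper: histogram of GC fractions over [0, 1]."""
    n_bins = round(1 / binwidth)  # output length is 1/binwidth
    counts = [0.0] * n_bins
    for value in gc_values:
        # Assumption: bins are left-inclusive; clamp so 1.0 lands in the last bin.
        # Values lying exactly on a bin edge may shift by floating-point rounding.
        idx = min(int(value / binwidth), n_bins - 1)
        counts[idx] += 1
    if normalize:
        total = sum(counts)
        if total:
            counts = [c / total for c in counts]  # values now sum to 1
    return counts


values = [0.0, 0.5, 0.5, 1.0]
print(gc_histogram(values, binwidth=0.25))                  # [1.0, 0.0, 2.0, 1.0]
print(gc_histogram(values, binwidth=0.25, normalize=True))  # [0.25, 0.0, 0.5, 0.25]
```

With normalize=False the entries are raw counts per bin; with normalize=True they are fractions of the total number of sequences.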