grelu.sequence.metrics
Functions to calculate metrics based on the content of a sequence.
Functions
| gc(seqs[, input_type, genome]) | Calculate the GC fraction of the given DNA sequence(s). |
| gc_distribution(seqs[, binwidth, normalize, ...]) | Calculate the histogram of GC content in a set of DNA sequences. |
Module Contents
- grelu.sequence.metrics.gc(seqs: pandas.DataFrame | str | List[str] | numpy.ndarray | torch.Tensor, input_type: str | None = None, genome: str | None = None) → float | List[float]
Calculate the GC fraction of the given DNA sequence(s).
- Parameters:
seqs – The DNA sequences whose GC content is to be calculated. These can be in any accepted format (intervals, strings, integer-encoded or one-hot encoded).
input_type – The format of the input sequences. Accepted values are “intervals”, “strings”, “indices” or “one_hot”. If not provided, it will be deduced from the data.
genome – Name of the genome to use if genomic intervals are provided.
- Returns:
The fraction of the sequence comprised of G and C bases. If multiple sequences are provided, the output will be a list of values, one for each sequence.
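As an illustration of the quantity being returned, here is a minimal pure-Python sketch for string inputs only. The helper name `gc_fraction` is hypothetical and this is not gReLU's implementation; it does not handle intervals, integer-encoded or one-hot inputs, and it assumes every character counts toward the sequence length.

```python
from typing import List, Union


def gc_fraction(seqs: Union[str, List[str]]) -> Union[float, List[float]]:
    """Hypothetical helper: fraction of G/C bases in one string or a list of strings."""

    def _one(seq: str) -> float:
        if not seq:
            raise ValueError("Cannot compute GC fraction of an empty sequence")
        seq = seq.upper()  # treat lowercase (soft-masked) bases like uppercase
        return (seq.count("G") + seq.count("C")) / len(seq)

    # A single string yields a single float; a list yields one float per sequence.
    if isinstance(seqs, str):
        return _one(seqs)
    return [_one(s) for s in seqs]


single = gc_fraction("ACGT")                      # 2 of 4 bases are G or C
batch = gc_fraction(["AAAA", "GGCC", "ACGTAC"])   # one value per sequence
```

Going by the signature documented above, the corresponding library call for string input would be `grelu.sequence.metrics.gc(["AAAA", "GGCC", "ACGTAC"])`, optionally with `input_type="strings"`.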
- grelu.sequence.metrics.gc_distribution(seqs: pandas.DataFrame | List[str] | numpy.ndarray | torch.Tensor, binwidth: float = 0.1, normalize: bool = False, input_type: str | None = None, genome: str | None = None) → numpy.ndarray
Calculate the histogram of GC content in a set of DNA sequences.
- Parameters:
seqs – DNA sequences, as intervals, strings, indices or one-hot.
binwidth – Width of the bins to use when calculating the histogram. Default is 0.1.
normalize – Whether to normalize the histogram so that the values sum to 1.
input_type – The format of the input sequences. Accepted values are “intervals”, “strings”, “indices” or “one_hot”. If not provided, it will be deduced from the data.
genome – Name of the genome to use if genomic intervals are supplied.
- Returns:
The histogram of GC content, as an array of length 1/binwidth. If normalize is True, the values sum to 1.
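The binning described above can be sketched in pure Python for string inputs. This is an assumption-laden illustration, not gReLU's implementation: the names `gc_fraction` and `gc_histogram` are hypothetical, the result is a plain list rather than a numpy.ndarray, and the choice to place a GC fraction of exactly 1.0 in the last bin is an assumption of this sketch.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Hypothetical helper: fraction of G/C bases in a single DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(seqs: List[str], binwidth: float = 0.1, normalize: bool = False) -> List[float]:
    """Hypothetical helper: histogram of GC fractions over [0, 1] with 1/binwidth bins."""
    n_bins = round(1 / binwidth)
    counts = [0.0] * n_bins
    for seq in seqs:
        frac = gc_fraction(seq)
        # Map the fraction to a bin index; clamp so that 1.0 lands in the last bin.
        idx = min(int(frac * n_bins), n_bins - 1)
        counts[idx] += 1
    if normalize:
        total = sum(counts)
        if total > 0:
            counts = [c / total for c in counts]
    return counts


seqs = ["AAAT", "AAAG", "GGGA", "GGCC"]  # GC fractions: 0.0, 0.25, 0.75, 1.0
hist = gc_histogram(seqs, binwidth=0.25)
hist_norm = gc_histogram(seqs, binwidth=0.25, normalize=True)
```

With `binwidth=0.25` the sketch produces four bins, matching the documented output length of 1/binwidth. Values that fall exactly on a bin edge can be sensitive to floating-point rounding, so the library's assignment of such values may differ from this sketch.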