grelu.data.preprocess#

Functions to preprocess genomic datasets.

Functions#

filter_intervals(→ Optional[Union[pandas.DataFrame, ...)

Filter intervals by boolean mask

filter_obs(→ Optional[anndata.AnnData])

Filter the .obs dataframe in an anndata object using a boolean mask

filter_coverage(→ Optional[anndata.AnnData])

Filter genomic intervals based on their maximum or mean coverage

filter_cells(→ Optional[anndata.AnnData])

Drop cell types that are composed of few cells

filter_random(→ Optional[Union[pandas.DataFrame, ...)

Keep n randomly chosen intervals

filter_chromosomes(data[, include, exclude, inplace])

Filter to sequence elements in selected chromosomes.

clip_intervals(intervals[, start, end])

Clip the ends of intervals to the given boundaries.

filter_overlapping(data, ref_intervals[, window, ...])

Filter intervals based on their overlap with a set of reference intervals.

filter_blacklist(data, genome[, blacklist, inplace, ...])

Remove intervals that overlap with blacklist regions

filter_chrom_ends(data[, genome, pad, inplace])

Filter intervals that extend beyond the ends of the chromosome.

split(data[, train_chroms, val_chroms, test_chroms, ...])

Split a dataframe or Anndata object into training, validation and test sets by chromosome

get_gc_matched_intervals(→ pandas.DataFrame)

Get GC-matched intervals for a set of given intervals.

add_negatives(→ Optional[anndata.AnnData])

Append negative control intervals onto an anndata object containing positive intervals in .var.

extend_from_coord(→ pandas.DataFrame)

Create intervals centered on the given coordinates.

merge_intervals_by_column(→ pandas.DataFrame)

Merge intervals that have the same value in a given column.

make_insertion_bigwig(→ str)

Given a fragment file, create a bigwig of Tn5 insertion sites

Module Contents#

grelu.data.preprocess.filter_intervals(data: pandas.DataFrame | anndata.AnnData, keep: numpy.ndarray, inplace: bool = False) pandas.DataFrame | anndata.AnnData | None[source]#

Filter intervals by boolean mask

Parameters:
  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • keep – Boolean mask of same length as data

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format (if inplace = False)
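The following is a minimal, dependency-free sketch of what boolean-mask filtering means. It assumes intervals are held as plain dicts, whereas grelu itself takes a pandas dataframe or Anndata object; `apply_mask` is a hypothetical helper, not part of grelu.

```python
def apply_mask(intervals, keep):
    """Return the intervals whose mask entry is True (hypothetical helper)."""
    if len(intervals) != len(keep):
        raise ValueError("keep must have the same length as intervals")
    return [iv for iv, flag in zip(intervals, keep) if flag]


intervals = [
    {"chrom": "chr1", "start": 0, "end": 100},
    {"chrom": "chr1", "start": 200, "end": 300},
    {"chrom": "chr2", "start": 50, "end": 150},
]

# Build a mask that is True only for intervals on chr1
keep = [iv["chrom"] == "chr1" for iv in intervals]
filtered = apply_mask(intervals, keep)
```

With inplace = False, the documented behaviour is analogous: the input is left untouched and a filtered copy is returned.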

grelu.data.preprocess.filter_obs(adata: anndata.AnnData, keep: numpy.ndarray, inplace: bool = False) anndata.AnnData | None[source]#

Filter the .obs dataframe in an anndata object using a boolean mask

Parameters:
  • adata – anndata object

  • keep – boolean mask of same length as adata.obs

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.

Returns:

Filtered anndata object (if inplace = False)

grelu.data.preprocess.filter_coverage(adata: anndata.AnnData, aggfunc: str | Callable = np.mean, method: str = 'cutoff', cutoff: int = 1, negative_frac: float = 0.0, inplace: bool = False) anndata.AnnData | None[source]#

Filter genomic intervals based on their maximum or mean coverage across cell types

Parameters:
  • adata – An Anndata object containing genomic intervals in .var

  • aggfunc – Function to aggregate coverage values

  • method – Method to use for filtering intervals. The options are “cutoff” to apply a raw coverage cutoff, “top” to select the top n intervals or “percentile” to select a top percentile of intervals

  • cutoff – the raw cutoff value (if method = “cutoff”), number of intervals (if method = “top”) or the percentile to select (if method = “percentile”)

  • negative_frac – Select a number of intervals below the cutoff, equal to the given fraction of the number of above-cutoff intervals

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.

Returns:

Filtered anndata object (if inplace = False)
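As a rough illustration of the "cutoff" and "top" methods, the sketch below aggregates per-cell-type coverage for each interval and then selects intervals. It is a plain-Python approximation, not grelu's implementation: `select_by_coverage` is a hypothetical name, and the inclusive `>=` comparison is an assumption.

```python
from statistics import mean


def select_by_coverage(coverage, aggfunc=mean, method="cutoff", cutoff=1):
    """Return the indices of intervals that pass the filter.

    coverage holds one list of per-cell-type values per interval.
    Assumption: the cutoff is inclusive (>=); the library may differ.
    """
    scores = [aggfunc(row) for row in coverage]
    if method == "cutoff":
        return [i for i, s in enumerate(scores) if s >= cutoff]
    if method == "top":
        # Rank intervals by aggregated coverage, highest first
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[:cutoff])
    raise ValueError(f"unsupported method in this sketch: {method}")


coverage = [[0, 1], [5, 7], [2, 2], [10, 0]]  # 4 intervals x 2 cell types

# Mean coverage of at least 2
by_mean = select_by_coverage(coverage, aggfunc=mean, method="cutoff", cutoff=2)

# The 2 intervals with the highest maximum coverage
by_max_top2 = select_by_coverage(coverage, aggfunc=max, method="top", cutoff=2)
```

Note how the meaning of `cutoff` changes with `method`: a coverage threshold in the first call, a number of intervals in the second.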

grelu.data.preprocess.filter_cells(adata: anndata.AnnData, cutoff: int = 1000, count_key: str = 'n_cells', inplace: bool = False) anndata.AnnData | None[source]#

Drop cell types that are composed of few cells

Parameters:
  • adata – anndata object with intervals in .var and cell counts in .obs

  • cutoff – minimum cell count

  • count_key – key under which cell count is stored in adata.obs

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.

Returns:

Filtered anndata object (if inplace = False)

grelu.data.preprocess.filter_random(data: pandas.DataFrame | anndata.AnnData, n: int, seed: int | None = None, inplace: bool = False) pandas.DataFrame | anndata.AnnData | None[source]#

Keep n randomly chosen intervals

Parameters:
  • data – genomic intervals or anndata object with intervals in .var

  • n – Number of intervals to select

  • seed – Random seed for sampling

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format
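A seeded random subsample can be sketched with the standard library as follows. This only illustrates the idea of reproducible selection; `sample_intervals` is a hypothetical helper, and preserving the original order of the chosen intervals is an assumption rather than documented behaviour.

```python
import random


def sample_intervals(intervals, n, seed=None):
    """Pick n intervals at random, reproducibly when a seed is given."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(intervals)), n))  # keep input order
    return [intervals[i] for i in chosen]


intervals = [{"chrom": "chr1", "start": s, "end": s + 100} for s in range(0, 1000, 100)]
subset_a = sample_intervals(intervals, n=3, seed=0)
subset_b = sample_intervals(intervals, n=3, seed=0)
```

Passing the same seed twice yields the same subset, which is the reason to set `seed` when a filtered dataset needs to be regenerated.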

grelu.data.preprocess.filter_chromosomes(data: pandas.DataFrame | anndata.AnnData, include: List[str] | None = None, exclude: List[str] | None = None, inplace: bool = False)[source]#

Filter to sequence elements in selected chromosomes.

Parameters:
  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • include – list of chromosome names to keep

  • exclude – list of chromosome names to drop

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format
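The include and exclude lists can be thought of as two successive membership tests. The sketch below shows that logic on a list of chromosome names; `chrom_mask` is a hypothetical helper, and applying both lists together when both are given is an assumption.

```python
def chrom_mask(chroms, include=None, exclude=None):
    """Return a boolean mask over chromosome names (hypothetical helper)."""
    keep = [True] * len(chroms)
    if include is not None:
        keep = [k and c in include for k, c in zip(keep, chroms)]
    if exclude is not None:
        keep = [k and c not in exclude for k, c in zip(keep, chroms)]
    return keep


chroms = ["chr1", "chr2", "chrX", "chr1"]
only_chr1 = chrom_mask(chroms, include=["chr1"])
no_chrx = chrom_mask(chroms, exclude=["chrX"])
```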

grelu.data.preprocess.clip_intervals(intervals: pandas.DataFrame, start: int | None = None, end: int | None = None)[source]#

Clip the ends of intervals to the given boundaries.

Parameters:
  • intervals – Dataframe containing the genomic intervals to clip.

  • start – The minimum start coordinate. All start coordinates less than this will be clipped to this value.

  • end – The maximum end coordinate. All end coordinates greater than this will be clipped to this value.

Returns:

Dataframe containing clipped intervals.
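The clipping rule described above amounts to a `max` on start coordinates and a `min` on end coordinates. Here is a minimal sketch on plain dicts, with `clip` as a hypothetical stand-in for the dataframe-based function.

```python
def clip(intervals, start=None, end=None):
    """Clip starts up to `start` and ends down to `end` (hypothetical helper)."""
    clipped = []
    for iv in intervals:
        new = dict(iv)  # copy so the input is not modified
        if start is not None:
            new["start"] = max(iv["start"], start)
        if end is not None:
            new["end"] = min(iv["end"], end)
        clipped.append(new)
    return clipped


intervals = [
    {"chrom": "chr1", "start": -50, "end": 150},
    {"chrom": "chr1", "start": 900, "end": 1100},
]
clipped = clip(intervals, start=0, end=1000)
```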

grelu.data.preprocess.filter_overlapping(data: pandas.DataFrame | anndata.AnnData, ref_intervals: pandas.DataFrame, window: int = 0, invert: bool = False, inplace: bool = False, method: str = 'any')[source]#

Filter intervals based on their overlap with a set of reference intervals.

Parameters:
  • data – Intervals, variants or anndata object with intervals in .var.

  • ref_intervals – Reference intervals to filter the data against

  • window – Number of bases to extend the reference intervals

  • invert – if False, return intervals in data that overlap with ref_intervals. If True, return intervals in data that are non-overlapping with ref_intervals.

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

  • method – “any” or “all”. If “any”, any amount of overlap is counted. If “all”, the complete interval must fall within a reference interval.
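The interaction of window, invert and method can be illustrated with the sketch below. It assumes half-open intervals and that `window` extends each reference interval on both sides; both helper names are hypothetical and the code is not grelu's implementation.

```python
def overlaps(iv, ref, window=0, method="any"):
    """Test one interval against one reference interval."""
    if iv["chrom"] != ref["chrom"]:
        return False
    ref_start, ref_end = ref["start"] - window, ref["end"] + window
    if method == "any":
        # Any shared base counts (half-open coordinates assumed)
        return iv["start"] < ref_end and iv["end"] > ref_start
    if method == "all":
        # The interval must lie entirely inside the reference interval
        return iv["start"] >= ref_start and iv["end"] <= ref_end
    raise ValueError(f"unknown method: {method}")


def filter_by_overlap(intervals, refs, window=0, invert=False, method="any"):
    """Keep overlapping intervals, or non-overlapping ones if invert=True."""
    hits = [any(overlaps(iv, r, window, method) for r in refs) for iv in intervals]
    return [iv for iv, hit in zip(intervals, hits) if hit != invert]


refs = [{"chrom": "chr1", "start": 100, "end": 200}]
intervals = [
    {"name": "A", "chrom": "chr1", "start": 150, "end": 250},  # partial overlap
    {"name": "B", "chrom": "chr1", "start": 120, "end": 180},  # fully contained
    {"name": "C", "chrom": "chr1", "start": 300, "end": 400},  # no overlap
    {"name": "D", "chrom": "chr2", "start": 120, "end": 180},  # other chromosome
]

any_hits = [iv["name"] for iv in filter_by_overlap(intervals, refs)]
all_hits = [iv["name"] for iv in filter_by_overlap(intervals, refs, method="all")]
non_hits = [iv["name"] for iv in filter_by_overlap(intervals, refs, invert=True)]
widened = [iv["name"] for iv in filter_by_overlap(intervals, refs, window=150)]
```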

grelu.data.preprocess.filter_blacklist(data: pandas.DataFrame | anndata.AnnData, genome: str, blacklist: str | None = None, inplace: bool = False, window: int = 0)[source]#

Remove intervals that overlap with blacklist regions

Parameters:
  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • genome – name of the genome corresponding to intervals

  • blacklist (str) – path to blacklist file. If not given, will be extracted from the package resources.

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

  • window – Number of bases by which to extend the blacklist regions

Returns:

Filtered intervals in the same format

grelu.data.preprocess.filter_chrom_ends(data: pandas.DataFrame | anndata.AnnData, genome: str | None = None, pad: int = 0, inplace: bool = False)[source]#

Filter intervals that extend beyond the ends of the chromosome.

Parameters:
  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • genome – name of the genome corresponding to intervals

  • pad – Number of bases to ignore at each end of the chromosome

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format
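One way to read this check: an interval is kept only if it lies within `[pad, chromosome_size - pad]`. The sketch below assumes that reading and uses a made-up chromosome size table; in grelu the sizes come from the named genome, and `within_chrom` is a hypothetical helper.

```python
def within_chrom(iv, chrom_sizes, pad=0):
    """True if the interval stays inside the padded chromosome bounds."""
    size = chrom_sizes[iv["chrom"]]
    return iv["start"] >= pad and iv["end"] <= size - pad


chrom_sizes = {"chrA": 1000}  # illustrative size, not a real genome
intervals = [
    {"chrom": "chrA", "start": 0, "end": 100},
    {"chrom": "chrA", "start": 950, "end": 1050},  # runs past the chromosome end
    {"chrom": "chrA", "start": 100, "end": 900},
]

kept = [iv for iv in intervals if within_chrom(iv, chrom_sizes)]
kept_padded = [iv for iv in intervals if within_chrom(iv, chrom_sizes, pad=50)]
```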

grelu.data.preprocess.split(data: pandas.DataFrame | anndata.AnnData, train_chroms: List[str] | None = None, val_chroms: List[str] = ['chr10'], test_chroms: List[str] = ['chr11'], sample: List[int] = [], seed: int | None = None)[source]#

Split a dataframe or Anndata object into training, validation and test sets based on chromosomes

Parameters:
  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • train_chroms – chromosomes to use for training data. If None, all chromosomes except val_chroms and test_chroms will be used.

  • val_chroms – chromosomes to use for validation data. default [“chr10”]

  • test_chroms – chromosomes to use for test data. default [“chr11”].

  • sample – List of number of random intervals to subsample for each split. The order of the numbers should be [train_sample, val_sample, test_sample]. If any element of the list is None, the corresponding split will not be sampled.

  • seed – Random seed for sampling

Returns:

  • train_ad – Anndata object containing training samples

  • val_ad – Anndata object containing validation samples

  • test_ad – Anndata object containing test samples
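The chromosome-based split, including the rule that training data defaults to everything outside the validation and test chromosomes, can be sketched as follows. Subsampling via `sample` is left out, and `split_by_chrom` is a hypothetical name for this simplified version.

```python
def split_by_chrom(intervals, train_chroms=None,
                   val_chroms=("chr10",), test_chroms=("chr11",)):
    """Return (train, val, test) lists of intervals."""
    val = [iv for iv in intervals if iv["chrom"] in val_chroms]
    test = [iv for iv in intervals if iv["chrom"] in test_chroms]
    if train_chroms is None:
        # Default: every chromosome not held out for validation or testing
        held_out = set(val_chroms) | set(test_chroms)
        train = [iv for iv in intervals if iv["chrom"] not in held_out]
    else:
        train = [iv for iv in intervals if iv["chrom"] in train_chroms]
    return train, val, test


intervals = [
    {"chrom": "chr1", "start": 0, "end": 100},
    {"chrom": "chr2", "start": 0, "end": 100},
    {"chrom": "chr10", "start": 0, "end": 100},
    {"chrom": "chr11", "start": 0, "end": 100},
]
train, val, test = split_by_chrom(intervals)
```

Splitting by chromosome rather than at random keeps intervals from the same chromosome out of both the training and the held-out sets.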

grelu.data.preprocess.get_gc_matched_intervals(intervals: pandas.DataFrame, genome: str, binwidth: float = 0.1, chroms: str = 'autosomes', gc_bw_file: str = None, blacklist: str = 'hg38', seed: int | None = None) pandas.DataFrame[source]#

Get GC-matched intervals for a set of given intervals.

Parameters:
  • intervals – genomic intervals

  • genome – Name of the genome corresponding to intervals

  • binwidth – Resolution of GC content

  • chroms – Chromosomes to search for matched intervals

  • gc_bw_file – Path to a bigWig file of genomewide GC content. If None, will be created.

  • blacklist – Blacklist file of regions to exclude

  • seed – Random seed

Returns:

A pandas dataframe containing GC-matched negative intervals.
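The idea behind GC matching can be sketched on raw sequences: bin each sequence by GC fraction at the given `binwidth`, then draw one background sequence from the same bin for each input sequence. This is a simplified illustration under those assumptions; grelu works on genomic intervals and a genome-wide GC track, and all function names here are hypothetical.

```python
import random


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, binwidth=0.1):
    """Index of the GC bin; a GC fraction of 1.0 falls in the last bin."""
    n_bins = round(1 / binwidth)
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def match_gc(positives, candidates, binwidth=0.1, seed=None):
    """Draw one candidate per positive from the same GC bin, without replacement."""
    rng = random.Random(seed)
    pools = {}
    for seq in candidates:
        pools.setdefault(gc_bin(seq, binwidth), []).append(seq)
    matched = []
    for seq in positives:
        pool = pools.get(gc_bin(seq, binwidth), [])
        if pool:  # skip positives with no candidate left in their bin
            matched.append(pool.pop(rng.randrange(len(pool))))
    return matched


positives = ["GGCCGGCCAT", "AAAAATTTTT"]                # GC = 0.8 and 0.0
candidates = ["ATATATATAT", "GCGCGCGCAA", "GGGGCCCCTT"]  # GC = 0.0, 0.8, 0.8
negatives = match_gc(positives, candidates, seed=0)
```

A smaller `binwidth` gives a tighter GC match but leaves fewer candidates per bin.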

grelu.data.preprocess.add_negatives(adata: anndata.AnnData, negative_intervals: pandas.DataFrame, negative_labels: int = 0, inplace: bool = False) anndata.AnnData | None[source]#

Append negative control intervals onto an anndata object containing positive intervals in .var.

Parameters:
  • adata – AnnData containing positive intervals in .var

  • negative_intervals – negative intervals

  • negative_labels – Label to be assigned to all negative intervals

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.

grelu.data.preprocess.extend_from_coord(df: pandas.DataFrame, seq_len: int, center_col: str = 'summit') pandas.DataFrame[source]#

Create intervals centered on the given coordinates.

Parameters:
  • df – A pandas dataframe

  • seq_len – Length of the output intervals.

  • center_col – Name of the column that contains the position to be centered

Returns:

Summit-extended peak coordinates
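Centering a fixed-length interval on a coordinate is simple arithmetic, sketched below. The placement `start = center - seq_len // 2` is an assumption about how the window is positioned (it matters for odd lengths), and `center_intervals` is a hypothetical helper.

```python
def center_intervals(rows, seq_len, center_col="summit"):
    """Build intervals of length seq_len around each row's center column."""
    out = []
    for row in rows:
        start = row[center_col] - seq_len // 2
        out.append({"chrom": row["chrom"], "start": start, "end": start + seq_len})
    return out


peaks = [
    {"chrom": "chr1", "summit": 1000},
    {"chrom": "chr2", "summit": 5050},
]
extended = center_intervals(peaks, seq_len=200)
```

Every output interval has exactly `seq_len` bases, regardless of the width of the original peak.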

grelu.data.preprocess.merge_intervals_by_column(intervals: pandas.DataFrame, group_col: str) pandas.DataFrame[source]#

Merge intervals that have the same value in a given column. The output is a dataframe containing one interval per unique value, with the start corresponding to the minimum of all start positions for intervals with that value, and the end corresponding to the maximum of all end positions for intervals with that value.

Parameters:
  • intervals – Dataframe containing genomic intervals.

  • group_col – Column by which to group and merge intervals.

Returns:

A dataframe containing one merged interval for each value in group_col.
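The merge rule (minimum start and maximum end per group) is sketched below on plain dicts. The sketch assumes all intervals in a group lie on the same chromosome; `merge_by_column` and the `gene` column are illustrative names only.

```python
def merge_by_column(intervals, group_col):
    """Return one interval per unique value of group_col."""
    merged = {}
    for iv in intervals:
        key = iv[group_col]
        if key not in merged:
            merged[key] = {"chrom": iv["chrom"], "start": iv["start"],
                           "end": iv["end"], group_col: key}
        else:
            merged[key]["start"] = min(merged[key]["start"], iv["start"])
            merged[key]["end"] = max(merged[key]["end"], iv["end"])
    return list(merged.values())


exons = [
    {"chrom": "chr1", "start": 100, "end": 200, "gene": "g1"},
    {"chrom": "chr1", "start": 500, "end": 650, "gene": "g1"},
    {"chrom": "chr2", "start": 10, "end": 90, "gene": "g2"},
]
genes = merge_by_column(exons, group_col="gene")
```

Note that the merged interval also spans any gap between the original intervals, here bases 200 to 500 of g1.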

grelu.data.preprocess.make_insertion_bigwig(frag_file: str, genome: str, out_prefix: str | None = None, plus_shift: int = 0, minus_shift: int = 0, chroms: List[str] | str | None = None, tmp_dir: str = './', out_dir: str = './') str[source]#

Given a fragment file, create a bigwig of Tn5 insertion sites

Parameters:
  • frag_file – Path to fragment file

  • genome – Name of genome to load with genomepy

  • out_prefix – Prefix for output bigwig file

  • plus_shift – Additional shift to add to positive strand

  • minus_shift – Additional shift to add to negative strand

  • chroms – The chromosome name(s) or shortcut name(s).

  • tmp_dir – Directory for temporary file

  • out_dir – Directory for bigwig file

Returns:

bw_file – Path to the output bigWig file

Return type:

str
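To make the role of `plus_shift` and `minus_shift` concrete, the sketch below counts insertion sites from fragments. It assumes each fragment contributes one insertion at its start (plus strand) and one at its end (minus strand); the exact coordinate convention used by grelu is not stated here, and `insertion_counts` is a hypothetical helper that does not write a bigWig file.

```python
from collections import Counter


def insertion_counts(fragments, plus_shift=0, minus_shift=0):
    """Count insertion events per (chrom, position).

    Assumption: one insertion at the fragment start and one at its end.
    """
    counts = Counter()
    for chrom, start, end in fragments:
        counts[(chrom, start + plus_shift)] += 1   # plus-strand cut site
        counts[(chrom, end + minus_shift)] += 1    # minus-strand cut site
    return counts


fragments = [("chr1", 100, 250), ("chr1", 100, 300), ("chr2", 40, 90)]
counts = insertion_counts(fragments, plus_shift=4, minus_shift=-4)
```

Two fragments sharing a start position produce a count of 2 at the shifted site, which is the per-base signal a coverage track would store.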