grelu.data.preprocess

grelu.data.preprocess contains functions that preprocess genomic datasets in standard formats to produce data suitable for deep learning. This includes filtering and checking data, splitting data into training and validation sets, and converting between data formats.

Functions

`filter_intervals`(→ Optional[Union[pandas.DataFrame, ...)	Filter intervals by boolean mask
`filter_obs`(→ Optional[anndata.AnnData])	Filter the .obs dataframe in an anndata object using a boolean mask
`filter_coverage`(→ Optional[anndata.AnnData])	Filter genomic intervals based on their maximum or mean coverage
`filter_cells`(→ Optional[anndata.AnnData])	Drop cell types that are composed of few cells
`filter_random`(→ Optional[Union[pandas.DataFrame, ...)	Filter n randomly chosen intervals
`filter_chromosomes`(data[, include, exclude, inplace])	Filter to sequence elements in selected chromosomes.
`clip_intervals`(intervals[, start, end])	Clip the ends of intervals to the given boundaries.
`filter_overlapping`(data, ref_intervals[, window, ...])	Filter intervals based on their overlap with a set of reference intervals.
`filter_blacklist`(data[, genome, blacklist, inplace, ...])	Remove intervals that overlap with blacklist regions
`check_chrom_ends`(data[, genome])	Check that intervals do not exceed the ends of the chromosome.
`filter_chrom_ends`(data[, genome, pad, inplace])	Filter intervals that extend beyond the ends of the chromosome.
`split`(data[, train_chroms, val_chroms, test_chroms, ...])	Split Anndata object into training, validation and test samples
`get_gc_matched_intervals`(→ pandas.DataFrame)	Get GC-matched intervals for a set of given intervals.
`add_negatives`(→ Optional[anndata.AnnData])	Append negative control intervals onto an anndata object containing positive intervals in .var.
`extend_from_coord`(→ pandas.DataFrame)	Create intervals centered on the given coordinates.
`merge_intervals_by_column`(→ pandas.DataFrame)	Merge intervals that have the same value in a given column.
`make_insertion_bigwig`(→ str)	Given a fragment file, create a bigwig of Tn5 insertion sites

Module Contents

grelu.data.preprocess.filter_intervals(data: pandas.DataFrame | anndata.AnnData, keep: numpy.ndarray, inplace: bool = False) → pandas.DataFrame | anndata.AnnData | None

Filter intervals by boolean mask

Parameters:

data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var
keep – Boolean mask of same length as data
inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format (if inplace = False)
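The masking behavior described above can be illustrated with plain pandas. This is a minimal sketch of the idea, not gReLU's implementation; the `chrom`/`start`/`end` column names and the length threshold are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical intervals; the column names are an assumption for illustration.
intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 200],
    "end": [300, 550, 400],
})

# Boolean mask with one entry per interval: keep intervals of at least 200 bp.
keep = ((intervals["end"] - intervals["start"]) >= 200).to_numpy()

# Equivalent of inplace=False: the input is untouched and a new frame is returned.
filtered = intervals.loc[keep]
```

The mask must have the same length as the data, which matches the requirement stated for `keep`.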

grelu.data.preprocess.filter_obs(adata: anndata.AnnData, keep: numpy.ndarray, inplace: bool = False) → anndata.AnnData | None

Filter the .obs dataframe in an anndata object using a boolean mask

Parameters:

adata – anndata object
keep – boolean mask of same length as adata.obs
inplace – If True, the input is modified in place. If False, a new anndata object is returned.

Returns:

Filtered anndata object (if inplace = False)

grelu.data.preprocess.filter_coverage(adata: anndata.AnnData, aggfunc: str | Callable = np.mean, method: str = 'cutoff', cutoff: int = 1, negative_frac: float = 0.0, inplace: bool = False) → anndata.AnnData | None

Filter genomic intervals based on their maximum or mean coverage across cell types

Parameters:

adata – An Anndata object containing genomic intervals in .var
aggfunc – Function to aggregate coverage values
method – Method to use for filtering intervals. The options are “cutoff” to apply a raw coverage cutoff, “top” to select the top n intervals or “percentile” to select a top percentile of intervals
cutoff – the raw cutoff value (if method = “cutoff”), number of intervals (if method = “top”) or the percentile to select (if method = “percentile”)
negative_frac – Select a number of intervals below the cutoff, equal to the given fraction of the number of above-cutoff intervals
inplace – If True, the input is modified in place. If False, a new anndata object is returned.

Returns:

Filtered anndata object
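To make the three `method` options concrete, here is a small NumPy sketch that computes a keep-mask over intervals. It is only an illustration: `select_intervals` is a hypothetical helper, the use of `>=` at the threshold is an assumption, and the "percentile" branch shows just one possible reading of the parameter description.

```python
import numpy as np

# Hypothetical coverage matrix: rows are cell types, columns are intervals.
coverage = np.array([
    [0, 4, 10, 1],
    [2, 6, 30, 1],
])

def select_intervals(coverage, aggfunc=np.mean, method="cutoff", cutoff=1):
    """Return a boolean mask over intervals (illustrative helper, not gReLU code)."""
    agg = aggfunc(coverage, axis=0)  # one aggregated value per interval
    if method == "cutoff":
        # Assumption: intervals exactly at the cutoff are kept.
        return agg >= cutoff
    if method == "top":
        # Keep the `cutoff` intervals with the highest aggregated coverage.
        top_idx = np.argsort(-agg, kind="stable")[:cutoff]
        mask = np.zeros(agg.shape[0], dtype=bool)
        mask[top_idx] = True
        return mask
    if method == "percentile":
        # Assumption: keep intervals at or above the given percentile.
        return agg >= np.percentile(agg, cutoff)
    raise ValueError(f"unknown method: {method!r}")

by_cutoff = select_intervals(coverage, method="cutoff", cutoff=5)
by_top = select_intervals(coverage, method="top", cutoff=1)
by_max = select_intervals(coverage, aggfunc=np.max, method="cutoff", cutoff=2)
```

Swapping `aggfunc` between `np.mean` and `np.max` corresponds to filtering on mean or maximum coverage across cell types.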

grelu.data.preprocess.filter_cells(adata: anndata.AnnData, cutoff: int = 1000, count_key: str = 'n_cells', inplace: bool = False) → anndata.AnnData | None

Drop cell types that are composed of few cells

Parameters:

adata – anndata object with intervals in .var and cell counts in .obs
cutoff – minimum cell count
count_key – key under which cell count is stored in adata.obs
inplace – If True, the input is modified in place. If False, a new anndata object is returned.

Returns:

Filtered anndata object

grelu.data.preprocess.filter_random(data: pandas.DataFrame | anndata.AnnData, n: int, seed: int | None = None, inplace: bool = False) → pandas.DataFrame | anndata.AnnData | None

Filter n randomly chosen intervals

Parameters:

data – genomic intervals or anndata object with intervals in .var
n – Number of intervals to select
seed – Random seed for sampling
inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format

grelu.data.preprocess.filter_chromosomes(data: pandas.DataFrame | anndata.AnnData, include: List[str] | None = None, exclude: List[str] | None = None, inplace: bool = False)

Filter to sequence elements in selected chromosomes.

Parameters:

data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var
include – list of chromosome names to keep
exclude – list of chromosome names to drop
inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format
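The `include` and `exclude` semantics can be sketched with pandas as follows. `select_chromosomes` is a hypothetical helper written for this example, and the `chrom` column name is an assumption; it is not the library's code.

```python
import pandas as pd

intervals = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chrX", "chrM"],
    "start": [0, 10, 20, 30],
    "end": [100, 110, 120, 130],
})

def select_chromosomes(df, include=None, exclude=None):
    """Keep rows on `include` chromosomes and drop rows on `exclude` chromosomes."""
    keep = pd.Series(True, index=df.index)
    if include is not None:
        keep &= df["chrom"].isin(include)
    if exclude is not None:
        keep &= ~df["chrom"].isin(exclude)
    return df[keep]

autosomal = select_chromosomes(intervals, exclude=["chrX", "chrM"])
only_chr1 = select_chromosomes(intervals, include=["chr1"])
```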

grelu.data.preprocess.clip_intervals(intervals: pandas.DataFrame, start: int | None = None, end: int | None = None)

Clip the ends of intervals to the given boundaries.

Parameters:

intervals – Dataframe containing the genomic intervals to clip.
start – The minimum start coordinate. All start coordinates less than this will be clipped to this value.
end – The maximum end coordinate. All end coordinates greater than this will be clipped to this value.

Returns:

Dataframe containing clipped intervals.
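Clipping of this kind maps directly onto `Series.clip` in pandas. The sketch below assumes `start` and `end` columns and uses made-up boundaries of 0 and 1000; it illustrates the described behavior rather than reproducing the library's code.

```python
import pandas as pd

intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [-50, 900],
    "end": [200, 1200],
})

clipped = intervals.copy()
# Start coordinates below the minimum are raised to it.
clipped["start"] = clipped["start"].clip(lower=0)
# End coordinates above the maximum are lowered to it.
clipped["end"] = clipped["end"].clip(upper=1000)
```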

grelu.data.preprocess.filter_overlapping(data: pandas.DataFrame | anndata.AnnData, ref_intervals: pandas.DataFrame, window: int = 0, invert: bool = False, inplace: bool = False, method: str = 'any')

Filter intervals based on their overlap with a set of reference intervals.

Parameters:

data – Intervals, variants or anndata object with intervals in .var.
ref_intervals – Reference intervals to filter the data against
window – Number of bases to extend the reference intervals
invert – if False, return intervals in data that overlap with ref_intervals. If True, return intervals in data that are non-overlapping with ref_intervals.
inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.
method – “any” or “all”. If “any”, any amount of overlap is counted. If “all”, the complete interval must fall within a reference interval.
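The difference between "any" and "all", and the effect of `invert`, can be shown with a small brute-force sketch. The helpers `overlaps` and `filter_by_overlap` are hypothetical; half-open coordinates and symmetric extension of the reference intervals by `window` are assumptions, and a real implementation would likely use an interval library instead of nested loops.

```python
import pandas as pd

def overlaps(interval, ref, method="any"):
    """Compare two (chrom, start, end) tuples; half-open coordinates are assumed."""
    chrom, start, end = interval
    ref_chrom, ref_start, ref_end = ref
    if chrom != ref_chrom:
        return False
    if method == "any":
        return start < ref_end and ref_start < end
    if method == "all":
        return ref_start <= start and end <= ref_end
    raise ValueError(f"method must be 'any' or 'all', got {method!r}")

def filter_by_overlap(df, ref_df, window=0, invert=False, method="any"):
    # Assumption: each reference interval is extended by `window` on both sides.
    refs = [(r.chrom, r.start - window, r.end + window) for r in ref_df.itertuples()]
    hits = [
        any(overlaps((row.chrom, row.start, row.end), ref, method) for ref in refs)
        for row in df.itertuples()
    ]
    mask = pd.Series(hits, index=df.index, dtype=bool)
    return df[~mask] if invert else df[mask]

data = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [100, 200, 450, 100],
    "end": [200, 300, 550, 200],
})
ref = pd.DataFrame({"chrom": ["chr1"], "start": [150], "end": [500]})

any_overlap = filter_by_overlap(data, ref, method="any")
fully_inside = filter_by_overlap(data, ref, method="all")
no_overlap = filter_by_overlap(data, ref, invert=True)
```

Here only the second interval lies entirely within the reference interval, so "all" keeps one row while "any" keeps three.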

grelu.data.preprocess.filter_blacklist(data: pandas.DataFrame | anndata.AnnData, genome: str | None = None, blacklist: str | None = None, inplace: bool = False, window: int = 0)

Remove intervals that overlap with blacklist regions

Parameters:

data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var
genome – name of the genome corresponding to intervals
blacklist – path to blacklist file. If not given, it will be extracted from the package resources.
inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.
window – Number of bases by which to extend the blacklist regions

Returns:

Filtered intervals in the same format

grelu.data.preprocess.check_chrom_ends(data: pandas.DataFrame | anndata.AnnData, genome: str | None = None)

Check that intervals do not exceed the ends of the chromosome.

Parameters:

data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var
genome – name of the genome corresponding to intervals

Raises:

ValueError – If any interval exceeds the chromosome ends

grelu.data.preprocess.filter_chrom_ends(data: pandas.DataFrame | anndata.AnnData, genome: str | None = None, pad: int = 0, inplace: bool = False)

Filter intervals that extend beyond the ends of the chromosome.

Parameters:

data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var
genome – name of the genome corresponding to intervals
pad – Number of bases to ignore at each end of the chromosome
inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

Returns:

Filtered intervals in the same format
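A rough sketch of the chromosome-end test, including the `pad` margin, is shown below. The chromosome sizes are invented for the example (the documented functions take a genome name instead), and `within_chrom_ends` is a hypothetical helper, not part of gReLU.

```python
import pandas as pd

# Hypothetical chromosome sizes used in place of a real genome lookup.
chrom_sizes = {"chr1": 1000, "chr2": 500}

intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr2"],
    "start": [-10, 100, 0, 450],
    "end": [90, 200, 100, 550],
})

def within_chrom_ends(df, chrom_sizes, pad=0):
    """True for intervals lying at least `pad` bases inside both chromosome ends."""
    sizes = df["chrom"].map(chrom_sizes)
    return (df["start"] >= pad) & (df["end"] <= sizes - pad)

kept = intervals[within_chrom_ends(intervals, chrom_sizes)]
kept_padded = intervals[within_chrom_ends(intervals, chrom_sizes, pad=50)]
```

With `pad=0` the first and last intervals are dropped because they run past a chromosome end; with `pad=50` the interval starting at position 0 is dropped as well.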

grelu.data.preprocess.split(data: pandas.DataFrame | anndata.AnnData, train_chroms: List[str] | None = None, val_chroms: List[str] = ['chr10'], test_chroms: List[str] = ['chr11'], sample: List[int] = [], seed: int | None = None)

Split Anndata object into training, validation and test samples based on chromosomes

Parameters:

data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var
train_chroms – chromosomes to use for training data. If None, all chromosomes except val_chroms and test_chroms will be used.
val_chroms – chromosomes to use for validation data. default [“chr10”]
test_chroms – chromosomes to use for test data. default [“chr11”].
sample – List of number of random intervals to subsample for each split. The order of the numbers should be [train_sample, val_sample, test_sample]. If any element of the list is None, the corresponding split will not be sampled.
seed – Random seed for sampling

Returns:

train_ad: Anndata object containing training samples
val_ad: Anndata object containing validation samples
test_ad: Anndata object containing test samples
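Splitting by chromosome keeps all intervals from a given chromosome in the same split. The pandas sketch below illustrates this for a dataframe with an assumed `chrom` column, using the default validation and test chromosomes from the signature; the subsampling step with `DataFrame.sample` is an assumed stand-in for the `sample` and `seed` parameters.

```python
import pandas as pd

intervals = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chr10", "chr10", "chr11"],
    "start": [0, 0, 0, 500, 0],
    "end": [100, 100, 100, 600, 100],
})
val_chroms = ["chr10"]
test_chroms = ["chr11"]

is_val = intervals["chrom"].isin(val_chroms)
is_test = intervals["chrom"].isin(test_chroms)

# With no explicit training chromosomes, everything else goes to training.
train = intervals[~is_val & ~is_test]
val = intervals[is_val]
test = intervals[is_test]

# Optional subsampling of a split with a fixed seed for reproducibility.
val_sampled = val.sample(n=1, random_state=0)
```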

grelu.data.preprocess.get_gc_matched_intervals(intervals: pandas.DataFrame, genome: str, binwidth: float = 0.1, chroms: str = 'autosomes', blacklist: str | None = None, seed: int | None = None) → pandas.DataFrame

Get GC-matched intervals for a set of given intervals.

Parameters:

intervals – genomic intervals
genome – Name of the genome corresponding to intervals
binwidth – Resolution of GC content
chroms – Chromosomes to search for matched intervals
blacklist – Blacklist file of regions to exclude
seed – Random seed

Returns:

A pandas dataframe containing GC-matched negative intervals.
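The general idea of GC matching can be sketched as: compute the GC fraction of each sequence, assign it to a bin of width `binwidth`, and draw a candidate from the same bin for every input. The source does not describe how gReLU does this internally, so everything below (the helpers `gc_fraction`, `gc_bin` and `match_gc`, sampling without replacement, the toy sequences) is an assumption for illustration only.

```python
import math
import numpy as np

def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_bin(seq, binwidth=0.1):
    """Index of the GC-content bin that the sequence falls into."""
    return math.floor(gc_fraction(seq) / binwidth)

def match_gc(positive_seqs, candidate_seqs, binwidth=0.1, seed=None):
    """For each positive, pick the index of an unused candidate in the same GC bin."""
    rng = np.random.default_rng(seed)
    pools = {}
    for i, seq in enumerate(candidate_seqs):
        pools.setdefault(gc_bin(seq, binwidth), []).append(i)
    chosen = []
    for seq in positive_seqs:
        pool = pools.get(gc_bin(seq, binwidth), [])
        if pool:  # positives with no candidate in their bin are skipped
            chosen.append(pool.pop(int(rng.integers(len(pool)))))
    return chosen

positives = ["ATGCATAT", "GGCCGCAT"]  # GC fractions 0.25 and 0.75
candidates = ["AAAAAAAA", "ATATGCAT", "GCGCGCAT", "TTGCTTAA"]
matched = match_gc(positives, candidates, binwidth=0.1, seed=0)
```

A smaller `binwidth` gives a tighter GC match but leaves fewer candidates per bin.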

grelu.data.preprocess.add_negatives(adata: anndata.AnnData, negative_intervals: pandas.DataFrame, negative_labels: int = 0, inplace: bool = False) → anndata.AnnData | None

Append negative control intervals onto an anndata object containing positive intervals in .var.

Parameters:

adata – AnnData containing positive intervals in .var
negative_intervals – negative intervals
negative_labels – Label to be assigned to all negative intervals
inplace – If True, the input is modified in place. If False, a new anndata object is returned.

grelu.data.preprocess.extend_from_coord(df: pandas.DataFrame, seq_len: int, center_col: str = 'summit') → pandas.DataFrame

Create intervals centered on the given coordinates.

Parameters:

df – A pandas dataframe
seq_len – Length of the output intervals.
center_col – Name of the column that contains the position to be centered

Returns:

Summit-extended peak coordinates
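Centering fixed-length intervals on a coordinate can be sketched as follows. This assumes the center column holds absolute genomic positions and that the window starts at `center - seq_len // 2`; the library's exact rounding convention is not stated in the source.

```python
import pandas as pd

peaks = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [1000, 5000],
    "end": [1400, 5200],
    "summit": [1100, 5150],
})
seq_len = 200

centered = peaks.copy()
# Place a window of seq_len bases around each summit.
centered["start"] = peaks["summit"] - seq_len // 2
centered["end"] = centered["start"] + seq_len
```

Every output interval has the same length, which is what fixed-input sequence models generally require.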

grelu.data.preprocess.merge_intervals_by_column(intervals: pandas.DataFrame, group_col: str) → pandas.DataFrame

Merge intervals that have the same value in a given column. The output is a dataframe containing one interval per unique value, with the start corresponding to the minimum of all start positions for intervals with that value, and the end corresponding to the maximum of all end positions for intervals with that value.

Parameters:

intervals – Dataframe containing genomic intervals.
group_col – Column by which to group and merge intervals.

Returns:

A dataframe containing one merged interval for each value in group_col.
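The merge described above amounts to a group-by with a minimum over starts and a maximum over ends. Here is a pandas sketch of that logic; the `gene` grouping column and the assumption that all intervals in a group share one chromosome are illustrative choices, not taken from the library.

```python
import pandas as pd

exons = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 400, 50],
    "end": [200, 600, 90],
    "gene": ["A", "A", "B"],
})

# One row per unique value of the grouping column.
merged = (
    exons.groupby("gene")
    .agg(chrom=("chrom", "first"), start=("start", "min"), end=("end", "max"))
    .reset_index()
)
```

Note that the merged interval spans any gaps between the original intervals, for example the region between 200 and 400 for gene A.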

grelu.data.preprocess.make_insertion_bigwig(frag_file: str, genome: str, out_prefix: str | None = None, plus_shift: int = 0, minus_shift: int = 0, chroms: List[str] | str | None = None, tmp_dir: str = './', out_dir: str = './') → str

Given a fragment file, create a bigwig of Tn5 insertion sites

Parameters:

frag_file – Path to fragment file
genome – Name of genome to load with genomepy
out_prefix – Prefix for output bigwig file
plus_shift – Additional shift to add to positive strand
minus_shift – Additional shift to add to negative strand
chroms – The chromosome name(s) or shortcut name(s).
tmp_dir – Directory for temporary file
out_dir – Directory for bigwig file

Returns:

bw_file: Path to the output bigWig file

Return type:

str
