Functions to preprocess genomic datasets.


Module Contents

(data: pandas.DataFrame | anndata.AnnData, keep: numpy.ndarray, inplace: bool = False) -> pandas.DataFrame | anndata.AnnData | None

Filter intervals by boolean mask

  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • keep – Boolean mask of same length as data

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.
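The mask-based filtering described above can be sketched for the dataframe case as follows. This is a minimal illustration, not the library's implementation: the helper name `filter_by_mask` and the column names `chrom`, `start` and `end` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def filter_by_mask(intervals: pd.DataFrame, keep: np.ndarray) -> pd.DataFrame:
    """Hypothetical helper: return the rows for which the boolean mask is True."""
    keep = np.asarray(keep, dtype=bool)
    if len(keep) != len(intervals):
        raise ValueError("mask length must match the number of intervals")
    # Boolean indexing returns a new dataframe; the input is left unchanged.
    return intervals.loc[keep].reset_index(drop=True)


# Assumed column names: chrom, start, end
intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 200],
    "end": [300, 700, 400],
})
kept = filter_by_mask(intervals, np.array([True, False, True]))
```

Returning a new dataframe corresponds to the inplace = False behaviour described above.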


Returns: Filtered intervals in the same format (if inplace = False)

(adata: anndata.AnnData, keep: numpy.ndarray, inplace: bool = False) -> anndata.AnnData | None

Filter the .obs dataframe in an anndata object using a boolean mask

  • adata – anndata object

  • keep – boolean mask of same length as adata.obs

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.


Returns: Filtered anndata object (if inplace = False)

(adata: anndata.AnnData, aggfunc: str | Callable = np.mean, method: str = 'cutoff', cutoff: int = 1, negative_frac: float = 0.0, inplace: bool = False) -> anndata.AnnData | None

Filter genomic intervals based on their maximum or mean coverage across cell types

  • adata – An Anndata object containing genomic intervals in .var

  • aggfunc – Function to aggregate coverage values

  • method – Method to use for filtering intervals. The options are “cutoff” to apply a raw coverage cutoff, “top” to select the top n intervals or “percentile” to select a top percentile of intervals

  • cutoff – the raw cutoff value (if method = “cutoff”), number of intervals (if method = “top”) or the percentile to select (if method = “percentile”)

  • negative_frac – Select a number of intervals below the cutoff, equal to the given fraction of the number of above-cutoff intervals

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.
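To make the three filtering methods concrete, here is a rough sketch on a plain NumPy matrix with cell types as rows and intervals as columns. The helper name `coverage_mask` is hypothetical, the choice of `>=` at the threshold and the reading of "percentile" as "at or above that percentile of aggregated scores" are assumptions, and the negative_frac option is left out.

```python
import numpy as np


def coverage_mask(counts, aggfunc=np.mean, method="cutoff", cutoff=1):
    """Hypothetical helper: boolean mask over intervals (columns of counts)."""
    score = aggfunc(counts, axis=0)  # aggregate coverage over cell types
    if method == "cutoff":
        # Assumption: intervals exactly at the cutoff are kept.
        return score >= cutoff
    if method == "top":
        # Keep the `cutoff` highest-scoring intervals.
        mask = np.zeros(score.shape, dtype=bool)
        mask[np.argsort(-score, kind="stable")[:cutoff]] = True
        return mask
    if method == "percentile":
        # Assumption: keep intervals at or above the given percentile.
        return score >= np.percentile(score, cutoff)
    raise ValueError(f"unknown method: {method}")


# 2 cell types x 4 intervals; mean coverage per interval is [0, 4, 3, 5]
counts = np.array([[0, 5, 2, 9],
                   [0, 3, 4, 1]])

by_cutoff = coverage_mask(counts, method="cutoff", cutoff=3)
by_top = coverage_mask(counts, method="top", cutoff=2)
by_percentile = coverage_mask(counts, method="percentile", cutoff=50)
```

Passing `aggfunc=np.max` instead would filter on the maximum coverage across cell types rather than the mean.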


Returns: Filtered anndata object

(adata: anndata.AnnData, cutoff: int = 1000, count_key: str = 'n_cells', inplace: bool = False) -> anndata.AnnData | None

Drop cell types that are composed of few cells

  • adata – anndata object with intervals in .var and cell counts in .obs

  • cutoff – minimum cell count

  • count_key – key under which cell count is stored in adata.obs

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.


Returns: Filtered anndata object

(data: pandas.DataFrame | anndata.AnnData, n: int, seed: int | None = None, inplace: bool = False) -> pandas.DataFrame | anndata.AnnData | None

Select n randomly chosen intervals

  • data – genomic intervals or anndata object with intervals in .var

  • n – Number of intervals to select

  • seed – Random seed for sampling

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.
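Seeded random selection of this kind can be sketched as below for a dataframe. The helper name `sample_intervals` and the column names are assumptions for illustration, and sampling without replacement is also assumed.

```python
import numpy as np
import pandas as pd


def sample_intervals(intervals: pd.DataFrame, n: int, seed=None) -> pd.DataFrame:
    """Hypothetical helper: pick n distinct rows, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)
    # Assumption: sample without replacement, then restore genomic order.
    idx = rng.choice(len(intervals), size=n, replace=False)
    return intervals.iloc[np.sort(idx)]


intervals = pd.DataFrame({
    "chrom": ["chr1"] * 10,
    "start": range(0, 1000, 100),
    "end": range(100, 1100, 100),
})
first = sample_intervals(intervals, n=3, seed=0)
second = sample_intervals(intervals, n=3, seed=0)
```

With the same seed, both calls select the same rows, which is what makes a sampled subset reproducible.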


Returns: Filtered intervals in the same format

(data: pandas.DataFrame | anndata.AnnData, include: List[str] | None = None, exclude: List[str] | None = None, inplace: bool = False)

Filter to sequence elements in selected chromosomes.

  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • include – list of chromosome names to keep

  • exclude – list of chromosome names to drop

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.
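A minimal sketch of include/exclude filtering on a dataframe is shown below. The helper name `filter_chroms` and the `chrom` column name are assumptions, as is the choice to apply both lists when both are given.

```python
import pandas as pd


def filter_chroms(intervals: pd.DataFrame, include=None, exclude=None) -> pd.DataFrame:
    """Hypothetical helper: keep rows on `include` and drop rows on `exclude`."""
    keep = pd.Series(True, index=intervals.index)
    if include is not None:
        keep &= intervals["chrom"].isin(include)
    if exclude is not None:
        keep &= ~intervals["chrom"].isin(exclude)
    return intervals[keep]


intervals = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chrX", "chrM"],
    "start": [100, 200, 300, 400],
    "end": [150, 250, 350, 450],
})
autosomal = filter_chroms(intervals, include=["chr1", "chr2"])
no_mito = filter_chroms(intervals, exclude=["chrM"])
```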


Returns: Filtered intervals in the same format

(intervals: pandas.DataFrame, start: int | None = None, end: int | None = None)

Clip the ends of intervals to the given boundaries.

  • intervals – Dataframe containing the genomic intervals to clip.

  • start – The minimum start coordinate. All start coordinates less than this will be clipped to this value.

  • end – The maximum end coordinate. All end coordinates greater than this will be clipped to this value.
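The clipping rule can be illustrated with pandas' `Series.clip`. The helper name `clip_intervals` and the `start`/`end` column names are assumptions made for this sketch.

```python
import pandas as pd


def clip_intervals(intervals: pd.DataFrame, start=None, end=None) -> pd.DataFrame:
    """Hypothetical helper: clamp interval coordinates to [start, end]."""
    out = intervals.copy()
    if start is not None:
        # Start coordinates below the boundary are raised to it.
        out["start"] = out["start"].clip(lower=start)
    if end is not None:
        # End coordinates above the boundary are lowered to it.
        out["end"] = out["end"].clip(upper=end)
    return out


intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [-50, 100],
    "end": [200, 1200],
})
clipped = clip_intervals(intervals, start=0, end=1000)
```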


Returns: Dataframe containing clipped intervals.

(data: pandas.DataFrame | anndata.AnnData, ref_intervals: pandas.DataFrame, window: int = 0, invert: bool = False, inplace: bool = False, method: str = 'any')

Filter intervals based on their overlap with a set of reference intervals.

  • data – Intervals, variants or anndata object with intervals in .var.

  • ref_intervals – Reference intervals to filter the data against

  • window – Number of bases to extend the reference intervals

  • invert – if False, return intervals in data that overlap with ref_intervals. If True, return intervals in data that are non-overlapping with ref_intervals.

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

  • method – “any” or “all”. If “any”, any amount of overlap is counted. If “all”, the complete interval must fall within a reference interval.

(data: pandas.DataFrame | anndata.AnnData, genome: str, blacklist: str | None = None, inplace: bool = False, window: int = 0)

Remove intervals that overlap with blacklist regions

  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • genome – name of the genome corresponding to intervals

  • blacklist (str) – path to blacklist file. If not given, will be extracted from the package resources.

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.

  • window – Number of bases by which to extend the blacklist regions
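As an illustration of blacklist filtering with a window, the sketch below flags intervals that overlap any blacklist region after extending each region by `window` bases on both sides. It is a simple quadratic-time example, not the library's implementation; the helper name `overlaps_any`, the column names, and the use of half-open coordinates are all assumptions.

```python
import pandas as pd


def overlaps_any(intervals: pd.DataFrame, ref: pd.DataFrame, window: int = 0) -> pd.Series:
    """Hypothetical helper: True where an interval overlaps any extended ref region."""
    mask = []
    for row in intervals.itertuples(index=False):
        same_chrom = ref[ref["chrom"] == row.chrom]
        # Half-open overlap test against regions extended by `window` bases.
        hit = (
            (same_chrom["start"] - window < row.end)
            & (same_chrom["end"] + window > row.start)
        ).any()
        mask.append(bool(hit))
    return pd.Series(mask, index=intervals.index)


intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 100],
    "end": [200, 600, 200],
})
blacklist = pd.DataFrame({"chrom": ["chr1"], "start": [190], "end": [300]})

strict = overlaps_any(intervals, blacklist)
padded = overlaps_any(intervals, blacklist, window=250)
clean = intervals[~strict]  # intervals that do not touch the blacklist
```

A larger window removes more intervals: the second chr1 interval is only flagged once the blacklist region is extended.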


Returns: Filtered intervals in the same format

(data: pandas.DataFrame | anndata.AnnData, genome: str | None = None, pad: int = 0, inplace: bool = False)

Filter intervals that extend beyond the ends of the chromosome.

  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • genome – name of the genome corresponding to intervals

  • pad – Number of bases to ignore at each end of the chromosome

  • inplace – If True, the input is modified in place. If False, a new dataframe or anndata object is returned.
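The chromosome-boundary check can be sketched as follows. The chromosome sizes are supplied here as a made-up dictionary rather than looked up from a genome name, and the helper name `within_chrom` and the column names are assumptions.

```python
import pandas as pd


def within_chrom(intervals: pd.DataFrame, chrom_sizes: dict, pad: int = 0) -> pd.Series:
    """Hypothetical helper: True where an interval lies inside its chromosome."""
    sizes = intervals["chrom"].map(chrom_sizes)
    # `pad` bases at each chromosome end are treated as out of bounds.
    return (intervals["start"] >= pad) & (intervals["end"] <= sizes - pad)


chrom_sizes = {"chr1": 1000}  # illustrative size, not a real genome
intervals = pd.DataFrame({
    "chrom": ["chr1"] * 4,
    "start": [-10, 0, 100, 950],
    "end": [90, 100, 200, 1050],
})
in_bounds = within_chrom(intervals, chrom_sizes)
in_bounds_padded = within_chrom(intervals, chrom_sizes, pad=50)
```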


Returns: Filtered intervals in the same format

(data: pandas.DataFrame | anndata.AnnData, train_chroms: List[str] | None = None, val_chroms: List[str] = ['chr10'], test_chroms: List[str] = ['chr11'], sample: List[int] = [], seed: int | None = None)

Split intervals or an Anndata object into training, validation and test sets based on chromosomes

  • data – Either a pandas dataframe of genomic intervals or an Anndata object with intervals in .var

  • train_chroms – chromosomes to use for training data. If None, all chromosomes except val_chroms and test_chroms will be used.

  • val_chroms – chromosomes to use for validation data. default [“chr10”]

  • test_chroms – chromosomes to use for test data. default [“chr11”].

  • sample – Number of random intervals to subsample for each split, given as a list in the order [train_sample, val_sample, test_sample]. If any element of the list is None, the corresponding split will not be sampled.

  • seed – Random seed for sampling
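The chromosome-based split can be sketched for a dataframe as below. The helper name `split_by_chrom` and the `chrom` column name are assumptions, and the optional subsampling step is omitted.

```python
import pandas as pd


def split_by_chrom(intervals, train_chroms=None, val_chroms=("chr10",), test_chroms=("chr11",)):
    """Hypothetical helper: return (train, val, test) dataframes split by chromosome."""
    is_val = intervals["chrom"].isin(val_chroms)
    is_test = intervals["chrom"].isin(test_chroms)
    if train_chroms is None:
        # Everything not held out for validation or testing is used for training.
        is_train = ~(is_val | is_test)
    else:
        is_train = intervals["chrom"].isin(train_chroms)
    return intervals[is_train], intervals[is_val], intervals[is_test]


intervals = pd.DataFrame({
    "chrom": ["chr1", "chr10", "chr11", "chr2"],
    "start": [0, 0, 0, 0],
    "end": [100, 100, 100, 100],
})
train, val, test = split_by_chrom(intervals)
```

Holding out whole chromosomes keeps nearby, potentially correlated intervals from landing in different splits.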


Returns:

  • train_ad – Anndata object containing training samples

  • val_ad – Anndata object containing validation samples

  • test_ad – Anndata object containing test samples

(intervals: pandas.DataFrame, genome: str, binwidth: float = 0.1, chroms: str = 'autosomes', gc_bw_file: str = None, blacklist: str = 'hg38', seed: int | None = None) -> pandas.DataFrame

Get GC-matched intervals for a set of given intervals.

  • intervals – genomic intervals

  • genome – Name of the genome corresponding to intervals

  • binwidth – Resolution of GC content

  • chroms – Chromosomes to search for matched intervals

  • gc_bw_file – Path to a bigWig file of genome-wide GC content. If None, the file will be created.

  • blacklist – Blacklist file of regions to exclude

  • seed – Random seed


Returns: A pandas dataframe containing GC-matched negative intervals.

(adata: anndata.AnnData, negative_intervals: pandas.DataFrame, negative_labels: int = 0, inplace: bool = False) -> anndata.AnnData | None

Append negative control intervals onto an anndata object containing positive intervals in .var.

  • adata – AnnData containing positive intervals in .var

  • negative_intervals – negative intervals

  • negative_labels – Label to be assigned to all negative intervals

  • inplace – If True, the input is modified in place. If False, a new anndata object is returned.

(df: pandas.DataFrame, seq_len: int, center_col: str = 'summit') -> pandas.DataFrame

Create intervals centered on the given coordinates.

  • df – A pandas dataframe

  • seq_len – Length of the output intervals.

  • center_col – Name of the column that contains the position to be centered
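One way to build fixed-length intervals around a center position is sketched below. It assumes that the center column holds absolute genomic coordinates and that the interval starts `seq_len // 2` bases upstream of the center; both are assumptions for illustration, as is the helper name `center_on`.

```python
import pandas as pd


def center_on(df: pd.DataFrame, seq_len: int, center_col: str = "summit") -> pd.DataFrame:
    """Hypothetical helper: intervals of length seq_len centered on center_col."""
    out = df.copy()
    # Assumption: center_col is an absolute coordinate on the chromosome.
    out["start"] = out[center_col] - seq_len // 2
    out["end"] = out["start"] + seq_len
    return out


peaks = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [120, 900],
    "end": [180, 1300],
    "summit": [150, 1000],
})
centered = center_on(peaks, seq_len=100)
```

Every output interval has the same length regardless of the original peak width.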


Returns: Summit-extended peak coordinates

(intervals: pandas.DataFrame, group_col: str) -> pandas.DataFrame

Merge intervals that have the same value in a given column. The output is a dataframe containing one interval per unique value, with the start corresponding to the minimum of all start positions for intervals with that value, and the end corresponding to the maximum of all end positions for intervals with that value.

  • intervals – Dataframe containing genomic intervals.

  • group_col – Column by which to group and merge intervals.
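The merge rule described above (minimum start and maximum end per group) maps directly onto a pandas group-by with named aggregation. The helper name `merge_by_group`, the column names, and taking the chromosome from the first row of each group are assumptions for this sketch.

```python
import pandas as pd


def merge_by_group(intervals: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Hypothetical helper: one interval per unique value of group_col."""
    return intervals.groupby(group_col, as_index=False).agg(
        chrom=("chrom", "first"),  # assumes each group lies on one chromosome
        start=("start", "min"),
        end=("end", "max"),
    )


intervals = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 150, 50],
    "end": [200, 400, 60],
    "gene": ["A", "A", "B"],
})
merged = merge_by_group(intervals, group_col="gene")
```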


Returns: A dataframe containing one merged interval for each value in group_col.

(frag_file: str, genome: str, out_prefix: str | None = None, plus_shift: int = 0, minus_shift: int = 0, chroms: List[str] | str | None = None, tmp_dir: str = './', out_dir: str = './') -> str

Given a fragment file, create a bigwig of Tn5 insertion sites

  • frag_file – Path to fragment file

  • genome – Name of genome to load with genomepy

  • out_prefix – Prefix for output bigwig file

  • plus_shift – Additional shift to add to positive strand

  • minus_shift – Additional shift to add to negative strand

  • chroms – The chromosome name(s) or shortcut name(s).

  • tmp_dir – Directory for temporary file

  • out_dir – Directory for bigwig file


Returns: bw_file (str) – Path to bigWig file