polygraph package

Submodules

polygraph.classifier module

polygraph.classifier.groupwise_svm(ad, reference_group, group_col='Group', cv=5, is_kernel=True, max_iter=1000, use_pca=False)[source]

Train an SVM to distinguish between each non-reference group and the reference group

Parameters:
  • ad (anndata.AnnData) – Anndata object containing sequence embeddings of shape (n_seqs x n_vars)

  • reference_group (str) – ID of group to use as reference

  • group_col (str) – Name of column in .obs containing group ID

  • cv (int) – Number of cross-validation folds

  • is_kernel (bool) – Whether ad.X is a symmetric kernel matrix

  • max_iter (int) – Maximum number of iterations for SVM

  • use_pca (bool) – Whether to use PCA distances

Returns:

Modified anndata object containing each sequence’s predicted label in .obs, as well as SVM performance metrics in ad.uns[“svm_performance”]

Return type:

ad (anndata.AnnData)

polygraph.embedding module

polygraph.evolve module

polygraph.evolve.evolve(start_seq, reference_seqs, iter, model, k=None, drop_last_layers=None, batch_size=512, device='cpu', task=None, alpha=3)[source]

Directed evolution with an additional goal to increase similarity to reference sequences.

Parameters:
  • start_seq (str) – Start sequence

  • reference_seqs (list) – Reference sequences

  • iter (int) – Number of iterations

  • model (nn.Sequential) – Torch sequential model

  • k (int) – k-mer length for k-mer embedding.

  • drop_last_layers (int) – Number of terminal layers to drop from the model for model embedding.

  • batch_size (int) – Batch size for inference

  • device (int, str) – Index of device to use for inference

  • task (int) – Model output head. If None, average all heads.

  • alpha (int) – Relative weight for similarity

Returns:

Optimized sequence

Return type:

best_seq (str)

polygraph.input module

polygraph.input.download_gtex_tpm(download_dir='/home/runner/work/polygraph/polygraph/src/polygraph/resources/gtex')[source]

Download per-tissue TPM values from GTEx.

Parameters:

download_dir (str) – Path to directory in which to download file

Returns:

Path to downloaded local file

Return type:

(str)

polygraph.input.download_jaspar(family='vertebrates', download_dir='/home/runner/work/polygraph/polygraph/src/polygraph/resources/jaspar')[source]

Download and read the JASPAR database of TF motifs

Parameters:
  • family (str) – JASPAR family. One of “fungi”, “insects”, “nematodes”, “plants”, “urochordates”, “vertebrates”

  • download_dir (str) – Path to directory in which to download motifs

Returns:

Path to downloaded local file

Return type:

(str)

polygraph.input.load_gtex_tpm(download_dir='/home/runner/work/polygraph/polygraph/src/polygraph/resources/gtex')[source]

Load per-tissue TPM values from GTEx.

Parameters:

download_dir (str) – Path to directory in which to download file

Returns:

TPM matrix.

Return type:

(pd.DataFrame)

polygraph.input.read_meme_file(file)[source]

Read a motif database in MEME format

Parameters:

file (str) – path to MEME file

Returns:

List of pymemesuite.common.Motif objects, and the background distribution

Return type:

motifs (list), bg (pymemesuite.common.Background)

polygraph.input.read_seqs(file, sep='\t', incl_ids=False)[source]

Read sequences and group labels into a dataframe. This creates the input dataframe for all subsequent analyses.

Parameters:
  • file (str) – Path to a text file containing no header. If incl_ids=True, the first column should contain IDs and the next two columns should contain sequence and group label. If incl_ids=False, the first two columns should contain sequence and group label.

  • sep (str) – Column separator

  • incl_ids (bool) – Whether the first column corresponds to sequence IDs.

Returns:

Pandas dataframe with columns Sequence, Group and a unique index.

Return type:

df (pd.DataFrame)
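To illustrate the input layout described above, here is a minimal standard-library sketch that parses headerless, separator-delimited text with and without an ID column. It is not polygraph's implementation (which returns a pandas dataframe); the helper name `parse_seq_rows`, the generated `seq_<n>` IDs and the example rows are all hypothetical.

```python
import csv
import io


def parse_seq_rows(text, sep="\t", incl_ids=False):
    """Parse headerless rows into (id, sequence, group) tuples."""
    rows = []
    for i, fields in enumerate(csv.reader(io.StringIO(text), delimiter=sep)):
        if not fields:
            continue  # skip blank lines
        if incl_ids:
            # Three columns: ID, sequence, group label
            seq_id, seq, group = fields[:3]
        else:
            # Two columns: sequence, group label
            seq, group = fields[:2]
            seq_id = f"seq_{i}"  # hypothetical ID scheme, for illustration only
        rows.append((seq_id, seq, group))
    return rows


with_ids = parse_seq_rows("s1\tACGT\tnative\ns2\tGGCC\tdesigned\n", incl_ids=True)
without_ids = parse_seq_rows("ACGT\tnative\nGGCC\tdesigned\n")
```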

polygraph.likelihood module

class polygraph.likelihood.CharDataset(seqs)[source]

Bases: Dataset

encode(seq)[source]

polygraph.likelihood.compute_likelihood(seqs, model, batch_size=32, num_workers=1, device='cpu')[source]

Compute the log-likelihood of each sequence in the given list using the HyenaDNA model pretrained on the human genome.

Parameters:
  • seqs (str, list, pd.DataFrame) – DNA sequence, list of DNA sequences or a dataframe containing sequences in the column “Sequence”.

  • model (ConvLMHeadModel) – HyenaDNA model

  • batch_size (int) – Batch size for inference

  • num_workers (int) – Number of workers for inference dataloader

  • device (int, str) – Device ID for inference

Returns:

Log-likelihoods for each sequence

Return type:

LL (list)

polygraph.likelihood.load_hyenadna(hyena_path, ckpt_dir='.', model='hyenadna-small-32k-seqlen')[source]

Load the pretrained HyenaDNA foundation model.

Parameters:
Returns:

Pretrained HyenaDNA model

Return type:

model (ConvLMHeadModel)

polygraph.models module

polygraph.models.batch(sequences, batch_size)[source]

Pad sequences to a constant length and split them into batches to pass to a model

Parameters:
  • sequences (list) – List of DNA sequences

  • batch_size (int) – Batch size

Returns:

sequence batch generator
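The pad-then-batch idea can be sketched with a plain generator. This is only an illustration, not polygraph's code: the helper name `padded_batches` is hypothetical, and both the padding character (`N`) and the padding side (right) are assumptions.

```python
def padded_batches(sequences, batch_size, pad_char="N"):
    """Yield lists of equal-length sequences, batch_size at a time."""
    max_len = max(len(s) for s in sequences)
    # Assumed behaviour: pad on the right up to the longest sequence
    padded = [s.ljust(max_len, pad_char) for s in sequences]
    for start in range(0, len(padded), batch_size):
        yield padded[start:start + batch_size]


batches = list(padded_batches(["ACGT", "AC", "G"], batch_size=2))
```

Because the function is a generator, batches are produced lazily, so only one batch needs to be passed to the model at a time.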

polygraph.models.cell_type_specificity(seqs, on_target_col, off_target_cols)[source]

Calculate cell type specificity from predicted or measured output

Parameters:
  • seqs (pd.DataFrame) – Dataframe containing sequence predictions

  • on_target_col (str) – Column containing predictions in on-target cell type

  • off_target_cols (list) – Columns containing predictions in off-target cell types.

Returns:

seqs with additional columns mingap, maxgap and meangap, reporting 3 measures of cell type specificity for each sequence.

Return type:

(pd.DataFrame)
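The documentation does not spell out how mingap, maxgap and meangap are defined. One plausible reading, shown here purely as an assumption, is that each is the on-target value minus the maximum, minimum or mean of the off-target values. The helper `specificity_gaps` is hypothetical and works on one sequence at a time rather than on a dataframe.

```python
from statistics import mean


def specificity_gaps(on_target, off_target):
    """Return (mingap, maxgap, meangap) for one sequence.

    Assumed definitions (not confirmed by the documentation):
    on-target value minus the max, min or mean off-target value.
    """
    mingap = on_target - max(off_target)    # gap to the strongest off-target
    maxgap = on_target - min(off_target)    # gap to the weakest off-target
    meangap = on_target - mean(off_target)  # gap to the average off-target
    return mingap, maxgap, meangap


gaps = specificity_gaps(5.0, [1.0, 2.0, 3.0])
```

Under these assumed definitions, mingap is the most conservative measure: it is positive only if the on-target value exceeds every off-target value.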

polygraph.models.enformer_embed(sequences, model)[source]

Embed a batch of sequences using pretrained or fine-tuned enformer

Parameters:
  • sequences (list) – List of sequences

  • model (Enformer) – pre-trained or fine-tuned enformer model

Returns:

np.array of shape (n_seqs x 3072)

polygraph.models.get_embeddings(seqs, model, batch_size, drop_last_layers=1, device='cpu', swapaxes=False)[source]

Get model embeddings for all sequences in a dataframe

Parameters:
  • seqs (list, pd.DataFrame) – List of sequences or dataframe containing sequences in the column “Sequence”.

  • model (nn.Sequential) – trained model

  • batch_size (int) – Batch size for inference

  • drop_last_layers (int) – Number of terminal layers to drop to get embeddings

  • device (str, int) – ID of GPU to perform inference.

  • swapaxes (bool) – If true, batches will be of shape (N, 4, L). Otherwise, shape will be (N, L, 4).

Returns:

np.array of shape (n_seqs x n_features)

polygraph.models.ism_score(model, seqs, batch_size, device='cpu', task=None)[source]

Get base-level importance scores for given sequence(s) using ISM

Parameters:
  • seqs (list, pd.DataFrame) – List of sequences or dataframe containing sequences in the column “Sequence”.

  • model (nn.Sequential) – trained model

  • batch_size (int) – Batch size for inference

  • device (str, int) – ID of GPU to perform inference.

Returns:

DataFrame of shape (n_seqs x n_outputs)

Return type:

(pd.DataFrame)

polygraph.models.load_enformer()[source]

Load pre-trained enformer model

Returns:

Pretrained model

Return type:

(Enformer)

polygraph.models.load_nucleotide_transformer(model='InstaDeepAI/nucleotide-transformer-2.5b-multi-species')[source]

Load pre-trained nucleotide transformer model

Parameters:

model (str) – Name of pretrained model to download

Returns:

Pre-trained model, and the tokenizer class used to convert sequences to tokens

Return type:

model (EsmForMaskedLM), tokenizer

polygraph.models.nucleotide_transformer_embed(seqs, model, tokenizer)[source]

Embed a batch of sequences using the pre-trained nucleotide transformer model

Parameters:
  • seqs (list) – List of sequences

  • model – pre-trained nucleotide transformer model

Returns:

np.array of shape (n_seqs x n_features)

polygraph.models.predict(seqs, model, batch_size, device='cpu')[source]

Predict sequence properties using a sequence-to-function model.

Parameters:
  • seqs (list, pd.DataFrame) – List of sequences or dataframe containing sequences in the column “Sequence”.

  • model (nn.Sequential) – trained model

  • batch_size (int) – Batch size for inference

  • device (str, int) – ID of GPU to perform inference.

Returns:

Array of shape (n_seqs x n_outputs)

Return type:

(np.array)

polygraph.models.robustness(model, seqs, batch_size, device='cpu', task=None, aggfunc='mean')[source]

Get robustness scores for given sequence(s) using ISM

Parameters:
  • seqs (list, pd.DataFrame) – List of sequences or dataframe containing sequences in the column “Sequence”.

  • model (nn.Sequential) – trained model

  • batch_size (int) – Batch size for inference

  • device (str, int) – ID of GPU to perform inference.

  • aggfunc (str) – Either ‘mean’ or ‘max’. Determines how to aggregate the effect of all possible single-base mutations.

Returns:

DataFrame of shape (n_seqs x n_outputs)

Return type:

(pd.DataFrame)

polygraph.models.sequential_embed(seqs, model, drop_last_layers, swapaxes=False, device='cpu')[source]

Embed a batch of sequences using a torch.nn.Sequential model

Parameters:
  • seqs (list) – List of sequences

  • model (nn.Sequential) – trained model

  • drop_last_layers (int) – Number of terminal layers to drop to get embeddings

Returns:

np.array of shape (n_seqs x n_features)

polygraph.motifs module

polygraph.motifs.get_motif_pairs(sites)[source]

List the pairs of motifs present in each sequence.

Parameters:

sites (pd.DataFrame) – Pandas dataframe containing FIMO output.

Returns:

Dataframe containing all motif pairs in each sequence with their orientation and distance.

Return type:

pairs (pd.DataFrame)

polygraph.motifs.motif_frequencies(sites, normalize=False, seqs=None)[source]

Count frequency of occurrence of motifs in a list of sequences

Parameters:
  • sites (list) – Output of scan function

  • normalize (bool) – Whether to normalize the resulting count matrix to correct for sequence length

  • seqs (pd.DataFrame) – Pandas dataframe containing DNA sequences. Needed if normalize=True.

Returns:

Count matrix with rows = sequences and columns = motifs

Return type:

cts (pd.DataFrame)
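As a rough illustration of the counting step, the sketch below tallies motif occurrences per sequence from a list of site records. It uses plain dictionaries instead of the dataframes polygraph works with; the example records and sequence lengths are made up, and dividing by sequence length is only an assumed form of the normalization.

```python
from collections import Counter

# Hypothetical site records, using the 'MotifID' and 'SeqID' column names
sites = [
    {"MotifID": "GATA1", "SeqID": "seq_0"},
    {"MotifID": "GATA1", "SeqID": "seq_0"},
    {"MotifID": "SP1", "SeqID": "seq_0"},
    {"MotifID": "SP1", "SeqID": "seq_1"},
]
seq_lens = {"seq_0": 200, "seq_1": 100}  # hypothetical sequence lengths

# Raw counts keyed by (sequence, motif)
counts = Counter((s["SeqID"], s["MotifID"]) for s in sites)

# Assumed normalization: divide each count by the length of its sequence
normalized = {key: n / seq_lens[key[0]] for key, n in counts.items()}
```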

polygraph.motifs.motif_pair_differential_abundance(motif_pairs, seqs, reference_group, group_col='Group', max_prop_cutoff=0, min_prop_cutoff=0, ref_prop_cutoff=0)[source]

Compare the rate of occurrence of pairwise combinations of motifs between groups

Parameters:
  • motif_pairs (pd.DataFrame) – Pandas dataframe containing the output of get_motif_pairs.

  • seqs (pd.DataFrame) – Pandas dataframe containing sequences

  • reference_group (str) – ID of group to use as reference

  • group_col (str) – Name of column in seqs containing group IDs

  • max_prop_cutoff (int) – Limit to combinations with this proportion in at least one group.

  • min_prop_cutoff (float) – Limit to combinations with this proportion in all groups.

Returns:

Pandas dataframe containing FDR-corrected significance testing results for the occurrence of pairwise combinations between groups

Return type:

res (pd.DataFrame)

polygraph.motifs.motif_pair_differential_distance(motif_pairs, seqs, reference_group, group_col='Group', max_prop_cutoff=0, min_prop_cutoff=0, ref_prop_cutoff=0)[source]

Compare the distance between all motif pairs across groups.

Parameters:
  • motif_pairs (pd.DataFrame) – Pandas dataframe containing the output of get_motif_pairs.

  • seqs (pd.DataFrame) – Pandas dataframe containing sequences

  • reference_group (str) – ID of group to use as reference

  • group_col (str) – Name of column in seqs containing group IDs

  • max_prop_cutoff (int) – Limit to combinations with this proportion in at least one group.

  • min_prop_cutoff (float) – Limit to combinations with this proportion in all groups.

Returns:

Pandas dataframe containing FDR-corrected significance testing results for the distance between paired motifs, between groups

Return type:

res (pd.DataFrame)

polygraph.motifs.motif_pair_differential_orientation(motif_pairs, seqs, reference_group, group_col='Group', max_prop_cutoff=0, min_prop_cutoff=0, ref_prop_cutoff=0)[source]

Compare the mutual orientation of all motif pairs between groups.

Parameters:
  • motif_pairs (pd.DataFrame) – Pandas dataframe containing the output of get_motif_pairs.

  • seqs (pd.DataFrame) – Pandas dataframe containing sequences

  • reference_group (str) – ID of group to use as reference

  • group_col (str) – Name of column in seqs containing group IDs

  • max_prop_cutoff (int) – Limit to combinations with this proportion in at least one group.

  • min_prop_cutoff (float) – Limit to combinations with this proportion in all groups.

Returns:

Pandas dataframe containing FDR-corrected significance testing results for the mutual orientation of pairwise combinations between groups

Return type:

res (pd.DataFrame)

polygraph.motifs.nmf(counts, seqs, reference_group, group_col='Group', n_components=10)[source]

Perform NMF on motif count matrix

Parameters:
  • counts (pd.DataFrame) – motif count matrix where rows are sequences and columns are motifs.

  • seqs (pd.DataFrame) – pandas dataframe containing DNA sequences.

  • reference_group (str) – ID for the group to use as reference

  • group_col (str) – Name of the column in seqs containing group IDs

  • n_components (int) – Number of components or factors to extract using NMF

Returns:

W (pd.DataFrame): Pandas dataframe of size sequences x factors, containing the contribution of each factor to each sequence.

H (pd.DataFrame): Pandas dataframe of size factors x motifs, containing the contribution of each motif to each factor.

res (pd.DataFrame): Pandas dataframe containing the FDR-corrected significance testing results for factor contribution between groups.

Return type:

W (pd.DataFrame), H (pd.DataFrame), res (pd.DataFrame)

polygraph.motifs.scan(seqs, meme_file, group_col='Group', pthresh=0.001, rc=True)[source]

Scan a DNA sequence using motifs from a MEME file.

Parameters:
  • seqs (pd.DataFrame) – Dataframe containing DNA sequences

  • meme_file (str) – Path to MEME file

  • group_col (str) – Column containing group IDs

  • pthresh (float) – p-value cutoff for binding sites

  • rc (bool) – Whether to scan the sequence reverse complement as well

Returns:

pd.DataFrame containing columns ‘MotifID’, ‘SeqID’, ‘start’, ‘end’, ‘strand’.

polygraph.motifs.score_sites(sites, seqs, scores)[source]

Calculate the average score of each motif site given base-level importance scores.

Parameters:
  • sites (pd.DataFrame) – Dataframe containing site positions

  • seqs (pd.DataFrame) – Dataframe containing sequences

  • scores (np.array) – Numpy array of shape (sequences x length)

Returns:

sites (pd.DataFrame): ‘sites’ dataframe with an additional column ‘score’
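The per-site averaging can be sketched for a single site as follows. The helper `mean_site_score` is hypothetical, and the coordinate convention (0-based start, exclusive end) is an assumption; the convention polygraph uses for ‘start’ and ‘end’ is not stated here.

```python
def mean_site_score(scores, start, end):
    """Average the base-level scores over positions start..end-1."""
    window = scores[start:end]  # assumed 0-based, end-exclusive coordinates
    return sum(window) / len(window)


base_scores = [0.0, 0.5, 1.5, 1.0, 0.0]  # one importance score per base
site_score = mean_site_score(base_scores, start=1, end=4)
```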

polygraph.sequence module

polygraph.sequence.ISM(seqs, drop_ref=False)[source]

Perform in-silico mutagenesis on given DNA sequence(s)

Parameters:
  • seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

  • drop_ref (bool) – If True, do not return the original sequence.

Returns:

A list of all possible single-base mutated sequences derived from the original sequences.

Return type:

(list)
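In-silico mutagenesis of a single sequence can be sketched as below. This is an illustration rather than polygraph's implementation: the helper `single_base_mutants` is hypothetical, it never returns the original sequence (comparable to drop_ref=True), and the order of the mutants is not guaranteed to match the library's.

```python
def single_base_mutants(seq, alphabet="ACGT"):
    """Return every sequence that differs from seq at exactly one base."""
    mutants = []
    for pos, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:  # skip the reference base itself
                mutants.append(seq[:pos] + alt + seq[pos + 1:])
    return mutants


mutants = single_base_mutants("ACG")
```

A sequence of length L yields 3 × L mutants, so the number of model evaluations needed for ISM grows linearly with sequence length.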

polygraph.sequence.bleu_similarity(seqs, reference_seqs, max_k=4)[source]

Calculate the BLEU similarity score between two sets of sequences.

Parameters:
  • seqs (list) – List of DNA sequences

  • reference_seqs (list) – List of DNA sequences

  • max_k (int) – Highest k-mer length for calculation. All k-mers of length 1 to max_k inclusive will be considered.

polygraph.sequence.fastsk(seqs, k=5, m=2)[source]

Compute a gapped k-mer kernel matrix for the given sequences using FastSK.

Parameters:
  • seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

  • k (int) – k-mer length

  • m (int) – Number of mismatches allowed

Returns:

Array of shape (n_seqs, n_seqs) containing the gapped k-mer kernel.

Return type:

(np.array)

polygraph.sequence.gc(seqs)[source]

Calculate the GC fraction of a DNA sequence or list of sequences.

Parameters:

seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

Returns:

The fraction of each sequence comprised of G and C bases.

Return type:

(list, float)
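For reference, the GC fraction of a sequence is the number of G and C bases divided by its length. A minimal sketch, with a hypothetical helper name and no handling of empty sequences:

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


fractions = [gc_fraction(s) for s in ["GGCC", "ATGC", "AATT"]]
```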

polygraph.sequence.groupwise_mean_edit_dist(seqs, group_col='Group')[source]

Calculate average edit distances between all groups of sequences

polygraph.sequence.kmer_frequencies(seqs, k, normalize=False, genome='hg38')[source]

Get frequencies of all kmers of length k in a sequence or sequences.

Parameters:
  • seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

  • k (int) – The k-mer length.

  • normalize (bool, optional) – Whether to normalize the k-mer counts by sequence length. Default is False.

Returns:

A dataframe of shape (kmers x sequences), containing the frequency of each k-mer in the sequence.

Return type:

(pd.DataFrame)
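K-mer counting for one sequence can be sketched with a sliding window. The helper `kmer_counts` is hypothetical; unlike the dataframe described above, it reports only k-mers that actually occur, and dividing by sequence length is an assumed reading of the normalize option.

```python
from collections import Counter


def kmer_counts(seq, k, normalize=False):
    """Count overlapping k-mers in seq, optionally scaled by len(seq)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if normalize:
        # Assumed normalization: divide each count by the sequence length
        return {kmer: n / len(seq) for kmer, n in counts.items()}
    return dict(counts)


counts = kmer_counts("ACGACG", k=3)
freqs = kmer_counts("ACGACG", k=3, normalize=True)
```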

polygraph.sequence.kmer_positions(seq, kmer)[source]

Return all the locations of a given k-mer in a DNA sequence

Parameters:
  • seq (str) – the input DNA sequence

  • kmer (str) – the k-mer for which to search

Returns:

a numpy array containing the positions of the kmer

Return type:

(np.array)
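Locating a k-mer, including overlapping occurrences, can be sketched with repeated `str.find` calls. The helper `find_kmer` is hypothetical, returns a plain list rather than a numpy array, and reports 0-based start positions, which is an assumption about the indexing convention.

```python
def find_kmer(seq, kmer):
    """Return the 0-based start positions of every occurrence of kmer."""
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping matches are also found
        start = seq.find(kmer, start + 1)
    return positions


positions = find_kmer("AAAA", "AA")
missing = find_kmer("ACGT", "TT")
```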

polygraph.sequence.min_edit_distance(seqs, reference_seqs)[source]

For each sequence in a list, find the smallest edit distance between that sequence and a list of reference sequences

Parameters:
  • seqs (list) – List of sequences

  • reference_seqs (list) – List of sequences

Returns:

edit distance between each sequence in seqs and its closest reference sequence
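The nearest-reference computation can be sketched with a textbook Levenshtein distance. This assumes that "edit distance" means Levenshtein distance (insertions, deletions and substitutions each cost 1); the helper names are hypothetical and the pure-Python loop is far slower than a compiled implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def min_distances(seqs, reference_seqs):
    """For each sequence, the distance to its closest reference sequence."""
    return [min(edit_distance(s, r) for r in reference_seqs) for s in seqs]


dists = min_distances(["ACGT", "AAAA"], ["ACGA", "TTTT"])
```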

polygraph.sequence.min_edit_distance_from_reference(seqs, reference_group, group_col='Group')[source]

For each sequence in non-reference groups, find the smallest edit distance between that sequence and the sequences in the reference group.

Parameters:
  • seqs (pd.DataFrame) – Dataframe containing sequences in column “Sequence”

  • reference_group (str) – ID for the group to use as reference

  • group_col (str) – Name of the column containing group IDs

Returns:

List of edit distances between each sequence and its closest reference sequence. Set to 0 for reference sequences.

Return type:

edit (np.array)

polygraph.sequence.unique_kmers(seq, k)[source]

Get all unique kmers of length k that are present in a DNA sequence.

Parameters:
  • seq (str) – the input DNA sequence

  • k (int) – length of k-mers to extract

Returns:

a set containing the unique kmers extracted from the sequence.

Return type:

(set)

polygraph.stats module

polygraph.stats.groupwise_fishers(data, reference_group, val_col, reference_val=None, group_col='Group')[source]

Perform Fisher’s exact test for proportions between each non-reference group and the reference group.

Parameters:
  • data (pd.DataFrame, anndata.AnnData) – Pandas dataframe with group IDs and values to compare, or an AnnData object containing this dataframe in .obs

  • val_col (str) – Name of column with values to compare

  • reference_group (str) – ID of group to use as reference

  • reference_val (str) – A specific value whose proportion is to be compared between groups

  • group_col (str) – Name of column containing group IDs

Returns:

Dataframe containing group proportions and FDR-corrected p-values for each group.

Return type:

(pd.DataFrame)

polygraph.stats.groupwise_mann_whitney(data, val_col, reference_group, group_col='Group')[source]

Compare the values between each non-reference group and the reference group using the Mann-Whitney U test.

Parameters:
  • data (pd.DataFrame, anndata.AnnData) – Pandas dataframe containing group IDs and values to compare, or an AnnData object containing this dataframe in .obs

  • val_col (str) – Name of column with values to compare

  • reference_group (str) – ID of group to use as reference

  • group_col (str) – Name of column containing group IDs

Returns:

Dataframe containing FDR-corrected p-values for each group.

Return type:

(pd.DataFrame)

polygraph.stats.kruskal_dunn(data, val_col, group_col='Group')[source]

Compare the values between all groups using the Kruskal-Wallis test followed by Dunn’s post-hoc test

Parameters:
  • data (pd.DataFrame, anndata.AnnData) – Pandas dataframe with group IDs and values to compare, or an AnnData object containing this dataframe in .obs

  • val_col (str) – Name of column with values to compare

  • group_col (str) – Name of column containing group IDs

Returns:

Dictionary containing p-values for both Kruskal-Wallis and Dunn’s test.

Return type:

(dict)

polygraph.utils module

polygraph.utils.check_equal_lens(seqs)[source]

Given sequences, check whether they are all of equal length.

Parameters:

seqs (list, pd.DataFrame) – Either a list of DNA sequences, or a dataframe containing DNA sequences in the column “Sequence”.

Returns:

whether the sequences are all equal in length.

Return type:

(bool)

polygraph.utils.get_lens(seqs)[source]

Calculate the lengths of given DNA sequences.

Parameters:

seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

Returns:

length of each sequence

Return type:

(int, list)

polygraph.utils.integer_encode(seqs)[source]

Encode DNA sequence(s) as a numpy array of integers.

Parameters:

seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

Returns:

A 1-D or 2-D array containing the sequences encoded as integers.

Return type:

(np.array)
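Integer encoding maps each base to a small integer. The mapping polygraph uses is not documented here, so the table below (A=0, C=1, G=2, T=3, N=4) is purely an assumption, and the sketch returns nested lists where the library returns a numpy array.

```python
# Assumed mapping, for illustration only; the library's mapping may differ
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode(seq):
    """Encode one DNA sequence as a list of integers."""
    return [BASE_TO_INT[base] for base in seq.upper()]


encoded_one = encode("ACGTN")                     # 1-D result for one sequence
encoded_many = [encode(s) for s in ["AC", "GT"]]  # 2-D result for a list
```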

polygraph.utils.make_ids(seqs)[source]

Assign a unique index to each row of a dataframe

Parameters:

seqs (pd.DataFrame) – Pandas dataframe

Returns:

Modified dataframe containing unique indices.

Return type:

seqs (pd.DataFrame)

polygraph.utils.pad_with_Ns(seqs, seq_len=None, end='both')[source]

Pads a sequence with Ns at the desired end until it reaches seq_len in length.

If seq_len is not provided, it is set to the length of the longest sequence.

Parameters:
  • seqs (str, list, pd.DataFrame) – DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

  • seq_len (int) – Length up to which to pad each sequence

Returns:

Padded sequences of length seq_len

Return type:

(str, list)
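The padding behaviour can be sketched as follows. Only the default end='both' appears in the signature above; the other accepted values ('left', 'right') and the way an odd number of Ns is split between the two ends are assumptions, and `pad_seq` is a hypothetical helper.

```python
def pad_seq(seq, seq_len, end="both"):
    """Pad seq with Ns until it is seq_len long."""
    missing = max(seq_len - len(seq), 0)  # never truncate longer sequences
    if end == "left":
        return "N" * missing + seq
    if end == "right":
        return seq + "N" * missing
    # end == "both": assumed split, with any odd leftover N on the right
    left = missing // 2
    return "N" * left + seq + "N" * (missing - left)


padded = [pad_seq("ACG", 6), pad_seq("ACG", 5, end="left"), pad_seq("ACG", 5, end="right")]
```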

polygraph.utils.reverse_complement(seqs)[source]

Reverse complement DNA sequences

Parameters:

seqs (str, list, pd.DataFrame) – A DNA sequence, list of sequences or dataframe containing sequences in the column “Sequence”.

Returns:

reverse complemented sequences

Return type:

(str, list)
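Reverse complementation swaps A with T and C with G, then reverses the sequence. A minimal sketch with a hypothetical helper name; leaving N unchanged is the usual convention but is an assumption about how polygraph treats it.

```python
# Translation table: A<->T, C<->G, N stays N
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


rc = revcomp("AACGN")
```

Applying the function twice returns the original sequence, which makes a convenient sanity check.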

polygraph.visualize module

polygraph.visualize.boxplot(data, value_col, group_col='Group', fill_col=None)[source]

Plot boxplot of values in each group

Parameters:
  • data (pd.DataFrame, anndata.AnnData) – Pandas dataframe with group IDs and values to compare, or an AnnData object containing this dataframe in .obs

  • value_col (str) – Column containing values to plot

  • group_col (str) – Column containing group IDs

  • fill_col (str) – Column containing additional variable to split each group

polygraph.visualize.densityplot(data, value_col, group_col='Group')[source]

Plot density plot of values in each group

Parameters:
  • data (pd.DataFrame, anndata.AnnData) – Pandas dataframe with group IDs and values to compare, or an AnnData object containing this dataframe in .obs

  • value_col (str) – Column containing values to plot

  • group_col (str) – Column containing group IDs

polygraph.visualize.one_nn_frac_plot(ad, reference_group, group_col='Group')[source]

Plot a barplot showing the fraction of points in each group whose nearest neighbors are reference sequences.

Parameters:
  • ad (anndata.AnnData) – AnnData object containing sequence embedding.

  • reference_group (str) – Group to use as reference. This group will be plotted first.

  • group_col (str) – Column in ad.obs containing group IDs.

polygraph.visualize.pca_plot(ad, group_col='Group', components=[0, 1], size=0.1, show_ellipse=True, reference_group=None)[source]

Plot PCA embeddings of sequences, colored by group.

Parameters:
  • ad (anndata.AnnData) – AnnData object containing PCA components.

  • group_col (str) – Column containing group IDs.

  • components (list) – PCA components to plot

  • size (float) – Size of points

  • show_ellipse (bool) – Outline each group with an ellipse.

  • reference_group (str) – Group to use as reference. This group will be plotted first.

polygraph.visualize.plot_factors_nmf(H, n_features=50, **kwargs)[source]

Plot heatmap of contributions of features to NMF factors

Parameters:
  • H (pd.DataFrame) – Dataframe of shape (factors, features)

  • n_features (int) – Number of features to cluster

  • **kwargs – Additional arguments to pass to sns.clustermap

polygraph.visualize.plot_seqs_nmf(W, reorder=True)[source]

Plot stacked barplot of the distribution of NMF factors among sequences, split by group

Parameters:
  • W (pd.DataFrame) – Dataframe of shape n_seqs x (n_factors+1). The last column should contain group IDs.

  • reorder (bool)

polygraph.visualize.umap_plot(ad, group_col='Group', size=0.1, show_ellipse=True, reference_group=None)[source]

Plot UMAP embeddings of sequences, colored by group.

Parameters:
  • ad (anndata.AnnData) – AnnData object containing UMAP embedding.

  • group_col (str) – Column containing group IDs.

  • size (float) – Size of points

  • show_ellipse (bool) – Outline each group with an ellipse.

  • reference_group (str) – Group to use as reference. This group will be plotted first.

polygraph.visualize.upset_plot(ad, group_col='Group')[source]

Plot UpSet plot showing the overlap between features present in different groups.

Parameters:
  • ad (anndata.AnnData) – AnnData object containing sequence embedding.

  • group_col (str) – Column in ad.obs containing group IDs.

Module contents