polygraph package¶
Submodules¶
polygraph.classifier module¶
- polygraph.classifier.groupwise_svm(ad, reference_group, group_col='Group', cv=5, is_kernel=True, max_iter=1000, use_pca=False)[source]¶
Train an SVM to distinguish between each non-reference group and the reference group
- Parameters:
ad (anndata.AnnData) – Anndata object containing sequence embeddings of shape (n_seqs x n_vars)
reference_group (str) – ID of group to use as reference
group_col (str) – Name of column in .obs containing group ID
cv (int) – Number of cross-validation folds
is_kernel (bool) – Whether ad.X is a symmetric kernel matrix
max_iter (int) – Maximum number of iterations for SVM
use_pca (bool) – Whether to use PCA distances
- Returns:
Modified anndata object containing each sequence’s predicted label in .obs, as well as SVM performance metrics in ad.uns[“svm_performance”]
- Return type:
ad (anndata.AnnData)
polygraph.embedding module¶
polygraph.evolve module¶
- polygraph.evolve.evolve(start_seq, reference_seqs, iter, model, k=None, drop_last_layers=None, batch_size=512, device='cpu', task=None, alpha=3)[source]¶
Directed evolution with an additional goal to increase similarity to reference sequences.
- Parameters:
start_seq (str) – Start sequence
reference_seqs (list) – Reference sequences
iter (int) – Number of iterations
model (nn.Sequential) – Torch sequential model
k (int) – k-mer length for k-mer embedding.
drop_last_layers (int) – Number of terminal layers to drop from the model for model embedding.
batch_size (int) – Batch size for inference
task (int) – Model output head. If None, average all heads.
alpha (int) – Relative weight for similarity
- Returns:
Optimized sequence
- Return type:
best_seq (str)
polygraph.input module¶
- polygraph.input.download_gtex_tpm(download_dir='/home/runner/work/polygraph/polygraph/src/polygraph/resources/gtex')[source]¶
Download per-tissue TPM values from GTEx.
- polygraph.input.download_jaspar(family='vertebrates', download_dir='/home/runner/work/polygraph/polygraph/src/polygraph/resources/jaspar')[source]¶
Download and read the JASPAR database of TF motifs
- polygraph.input.load_gtex_tpm(download_dir='/home/runner/work/polygraph/polygraph/src/polygraph/resources/gtex')[source]¶
Load per-tissue TPM values from GTEx.
- Parameters:
download_dir (str) – Path to directory in which to download file
- Returns:
TPM matrix.
- Return type:
(pd.DataFrame)
- polygraph.input.read_seqs(file, sep='\t', incl_ids=False)[source]¶
Read sequences and group labels into a dataframe. This creates the input dataframe for all subsequent analyses.
- Parameters:
file (str) – Path to a text file containing no header. If incl_ids=True, the first column should contain IDs and the next two columns should contain sequence and group label. If incl_ids=False, the first two columns should contain sequence and group label.
sep (str) – Column separator
incl_ids (bool) – Whether the first column corresponds to sequence IDs.
- Returns:
Pandas dataframe with columns Sequence, Group and a unique index.
- Return type:
df (pd.DataFrame)
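The file layout described for read_seqs can be illustrated with a small standard-library sketch. It is not polygraph's implementation: the helper name `parse_seq_table` and the generated `seq_0`, `seq_1`, ... index labels are assumptions made for illustration, and a plain dictionary stands in for the dataframe.

```python
import csv
import io


def parse_seq_table(text, sep="\t", incl_ids=False):
    """Parse header-less rows into {index: {"Sequence": ..., "Group": ...}}.

    Hypothetical helper mirroring the documented layout of the input file.
    """
    rows = {}
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    for i, fields in enumerate(reader):
        if not fields:
            continue  # skip blank lines
        if incl_ids:
            # Layout: ID, sequence, group label
            seq_id, seq, group = fields[:3]
        else:
            # Layout: sequence, group label; the index label is made up here
            seq_id = f"seq_{i}"
            seq, group = fields[:2]
        rows[seq_id] = {"Sequence": seq, "Group": group}
    return rows


example = "ACGT\tnative\nGGCC\tdesigned\n"
table = parse_seq_table(example)
```

With incl_ids=True the same helper would expect three columns per row, for example "id1", "ACGT", "native".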
polygraph.likelihood module¶
- polygraph.likelihood.compute_likelihood(seqs, model, batch_size=32, num_workers=1, device='cpu')[source]¶
Compute the log-likelihood of each sequence in the given list using the HyenaDNA model pretrained on the human genome.
- Parameters:
- Returns:
Log-likelihoods for each sequence
- Return type:
LL (list)
- polygraph.likelihood.load_hyenadna(hyena_path, ckpt_dir='.', model='hyenadna-small-32k-seqlen')[source]¶
Load the pretrained HyenaDNA foundation model.
- Parameters:
hyena_path (str) – Path to the cloned hyenaDNA repo. The repo must be cloned with the recurse-submodules flag. See installation instructions at https://github.com/HazyResearch/hyena-dna/tree/main.
ckpt_dir (str) – Path to directory in which to download the model
model (str) – Name of the foundation model to download. See https://github.com/HazyResearch/hyena-dna/tree/main for options.
- Returns:
Pretrained HyenaDNA model
- Return type:
model (ConvLMHeadModel)
polygraph.models module¶
- polygraph.models.batch(sequences, batch_size)[source]¶
Pad sequences to a constant length and split them into batches to pass to a model
- polygraph.models.cell_type_specificity(seqs, on_target_col, off_target_cols)[source]¶
Calculate cell type specificity from predicted or measured output
- Parameters:
- Returns:
seqs with additional columns mingap, maxgap and meangap, reporting 3 measures of cell type specificity for each sequence.
- Return type:
(pd.DataFrame)
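The reference above names the three output columns but does not define them. The sketch below shows one plausible reading, purely as an assumption: each gap is the on-target value minus one off-target value, and mingap, maxgap and meangap are the minimum, maximum and mean of those gaps. The helper name `specificity_gaps` is hypothetical.

```python
from statistics import mean


def specificity_gaps(on_target, off_targets):
    """Return (mingap, maxgap, meangap) for one sequence.

    Assumed definition (not confirmed by the documentation above):
    each gap is on_target minus one off-target value.
    """
    gaps = [on_target - off for off in off_targets]
    return min(gaps), max(gaps), mean(gaps)


mingap, maxgap, meangap = specificity_gaps(5.0, [1.0, 2.0, 3.0])
```

Under this reading, a large mingap means the sequence is more active in the on-target cell type than in every off-target cell type.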
- polygraph.models.enformer_embed(sequences, model)[source]¶
Embed a batch of sequences using pretrained or fine-tuned enformer
- Parameters:
sequences (list) – List of sequences
model (Enformer) – pre-trained or fine-tuned enformer model
- Returns:
np.array of shape (n_seqs x 3072)
- polygraph.models.get_embeddings(seqs, model, batch_size, drop_last_layers=1, device='cpu', swapaxes=False)[source]¶
Get model embeddings for all sequences in a dataframe
- Parameters:
seqs (list, pd.DataFrame) – List of sequences or dataframe containing sequences in the column “Sequence”.
model (nn.Sequential) – trained model
batch_size (int) – Batch size for inference
drop_last_layers (int) – Number of terminal layers to drop to get embeddings
swapaxes (bool) – If true, batches will be of shape (N, 4, L). Otherwise, shape will be (N, L, 4).
- Returns:
np.array of shape (n_seqs x n_features)
- polygraph.models.ism_score(model, seqs, batch_size, device='cpu', task=None)[source]¶
Get base-level importance scores for given sequence(s) using ISM
- Parameters:
- Returns:
DataFrame of shape (n_seqs x n_outputs)
- Return type:
(pd.DataFrame)
- polygraph.models.load_enformer()[source]¶
Load pre-trained enformer model
- Returns:
Pretrained model
- Return type:
(Enformer)
- polygraph.models.load_nucleotide_transformer(model='InstaDeepAI/nucleotide-transformer-2.5b-multi-species')[source]¶
Load pre-trained nucleotide transformer model
- Parameters:
model (str) – Name of pretrained model to download
- Returns:
- model (EsmForMaskedLM): Pre-trained model
- tokenizer: Class to convert sequences to tokens
- polygraph.models.nucleotide_transformer_embed(seqs, model, tokenizer)[source]¶
Embed a batch of sequences using the pre-trained nucleotide transformer model
- Parameters:
seqs (list) – List of sequences
model – pre-trained nucleotide transformer model
tokenizer – Class to convert sequences to tokens
- Returns:
np.array of shape (n_seqs x n_features)
- polygraph.models.predict(seqs, model, batch_size, device='cpu')[source]¶
Predict sequence properties using a sequence-to-function model.
- Parameters:
- Returns:
Array of shape (n_seqs x n_outputs)
- Return type:
(np.array)
- polygraph.models.robustness(model, seqs, batch_size, device='cpu', task=None, aggfunc='mean')[source]¶
Get robustness scores for given sequence(s) using ISM
- Parameters:
seqs (list, pd.DataFrame) – List of sequences or dataframe containing sequences in the column “Sequence”.
model (nn.Sequential) – trained model
batch_size (int) – Batch size for inference
aggfunc (str) – Either ‘mean’ or ‘max’. Determines how to aggregate the effect of all possible single-base mutations.
- Returns:
DataFrame of shape (n_seqs x n_outputs)
- Return type:
(pd.DataFrame)
polygraph.motifs module¶
- polygraph.motifs.get_motif_pairs(sites)[source]¶
List the pairs of motifs present in each sequence.
- Parameters:
sites (pd.DataFrame) – Pandas dataframe containing FIMO output.
- Returns:
Dataframe containing all motif pairs in each sequence with their orientation and distance.
- Return type:
pairs (pd.DataFrame)
- polygraph.motifs.motif_frequencies(sites, normalize=False, seqs=None)[source]¶
Count frequency of occurrence of motifs in a list of sequences
- Parameters:
- Returns:
Count matrix with rows = sequences and columns = motifs
- Return type:
cts (pd.DataFrame)
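As a rough illustration of the count matrix described above (rows = sequences, columns = motifs), the following sketch tallies sites with the standard library. It is not polygraph's code: `count_motifs` is a hypothetical helper, dictionaries stand in for dataframes, and passing a list of sequence IDs to keep all-zero rows is an assumption about why sequences might be supplied.

```python
from collections import Counter


def count_motifs(sites, seq_ids):
    """Build a count table: one row per sequence, one column per motif.

    `sites` is a list of dicts with 'MotifID' and 'SeqID' keys, standing in
    for rows of the scan output; `seq_ids` lists every sequence so that
    sequences without any site still get a row of zeros.
    """
    motifs = sorted({site["MotifID"] for site in sites})
    per_seq = Counter((site["SeqID"], site["MotifID"]) for site in sites)
    return {
        seq_id: {motif: per_seq[(seq_id, motif)] for motif in motifs}
        for seq_id in seq_ids
    }


sites = [
    {"MotifID": "GATA1", "SeqID": "seq_0"},
    {"MotifID": "GATA1", "SeqID": "seq_0"},
    {"MotifID": "SP1", "SeqID": "seq_1"},
]
counts = count_motifs(sites, ["seq_0", "seq_1", "seq_2"])
```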
- polygraph.motifs.motif_pair_differential_abundance(motif_pairs, seqs, reference_group, group_col='Group', max_prop_cutoff=0, min_prop_cutoff=0, ref_prop_cutoff=0)[source]¶
Compare the rate of occurrence of pairwise combinations of motifs between groups
- Parameters:
motif_pairs (pd.DataFrame) – Pandas dataframe containing the output of get_motif_pairs.
seqs (pd.DataFrame) – Pandas dataframe containing sequences
reference_group (str) – ID of group to use as reference
group_col (str) – Name of column in seqs containing group IDs
max_prop_cutoff (int) – Limit to combinations with this proportion in at least one group.
min_prop_cutoff (float) – Limit to combinations with this proportion in all groups.
- Returns:
Pandas dataframe containing FDR-corrected significance testing results for the occurrence of pairwise combinations between groups
- Return type:
res (pd.DataFrame)
- polygraph.motifs.motif_pair_differential_distance(motif_pairs, seqs, reference_group, group_col='Group', max_prop_cutoff=0, min_prop_cutoff=0, ref_prop_cutoff=0)[source]¶
Compare the distance between all motif pairs across groups.
- Parameters:
motif_pairs (pd.DataFrame) – Pandas dataframe containing the output of get_motif_pairs.
seqs (pd.DataFrame) – Pandas dataframe containing sequences
reference_group (str) – ID of group to use as reference
group_col (str) – Name of column in seqs containing group IDs
max_prop_cutoff (int) – Limit to combinations with this proportion in at least one group.
min_prop_cutoff (float) – Limit to combinations with this proportion in all groups.
- Returns:
Pandas dataframe containing FDR-corrected significance testing results for the distance between paired motifs, between groups
- Return type:
res (pd.DataFrame)
- polygraph.motifs.motif_pair_differential_orientation(motif_pairs, seqs, reference_group, group_col='Group', max_prop_cutoff=0, min_prop_cutoff=0, ref_prop_cutoff=0)[source]¶
Compare the mutual orientation of all motif pairs between groups.
- Parameters:
motif_pairs (pd.DataFrame) – Pandas dataframe containing the output of get_motif_pairs.
seqs (pd.DataFrame) – Pandas dataframe containing sequences
reference_group (str) – ID of group to use as reference
group_col (str) – Name of column in seqs containing group IDs
max_prop_cutoff (int) – Limit to combinations with this proportion in at least one group.
min_prop_cutoff (float) – Limit to combinations with this proportion in all groups.
- Returns:
Pandas dataframe containing FDR-corrected significance testing results for the mutual orientation of pairwise combinations between groups
- Return type:
res (pd.DataFrame)
- polygraph.motifs.nmf(counts, seqs, reference_group, group_col='Group', n_components=10)[source]¶
Perform NMF on motif count matrix
- Parameters:
counts (pd.DataFrame) – motif count matrix where rows are sequences and columns are motifs.
seqs (pd.DataFrame) – pandas dataframe containing DNA sequences.
reference_group (str) – ID for the group to use as reference
group_col (str) – Name of the column in seqs containing group IDs
n_components (int) – Number of components or factors to extract using NMF
- Returns:
- W (pd.DataFrame): Pandas dataframe of size sequences x factors, containing the contribution of each factor to each sequence.
- H (pd.DataFrame): Pandas dataframe of size factors x motifs, containing the contribution of each motif to each factor.
- res (pd.DataFrame): Pandas dataframe containing the FDR-corrected significance testing results for factor contribution between groups.
- polygraph.motifs.scan(seqs, meme_file, group_col='Group', pthresh=0.001, rc=True)[source]¶
Scan a DNA sequence using motifs from a MEME file.
- Parameters:
- Returns:
pd.DataFrame containing columns ‘MotifID’, ‘SeqID’, ‘start’, ‘end’, ‘strand’.
- polygraph.motifs.score_sites(sites, seqs, scores)[source]¶
Calculate the average score of each motif site given base-level importance scores.
- Parameters:
sites (pd.DataFrame) – Dataframe containing site positions
seqs (pd.DataFrame) – Dataframe containing sequences
scores (np.array) – Numpy array of shape (sequences x length)
- Returns:
sites (pd.DataFrame): ‘sites’ dataframe with an additional column ‘score’
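The averaging step in score_sites can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `mean_site_scores` is a hypothetical helper, 'start' is taken as 0-based and 'end' as exclusive, and a dictionary keyed by SeqID stands in for the (sequences x length) score array.

```python
def mean_site_scores(sites, scores):
    """Attach the mean importance score over each site's positions.

    Assumptions for this sketch: 'start' is 0-based and 'end' is exclusive,
    and `scores` maps each SeqID to a list of per-base scores.
    """
    scored = []
    for site in sites:
        window = scores[site["SeqID"]][site["start"]:site["end"]]
        scored.append({**site, "score": sum(window) / len(window)})
    return scored


sites = [{"MotifID": "SP1", "SeqID": "seq_0", "start": 1, "end": 4}]
scores = {"seq_0": [0.0, 1.0, 2.0, 3.0, 0.0]}
result = mean_site_scores(sites, scores)
```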
polygraph.sequence module¶
- polygraph.sequence.ISM(seqs, drop_ref=False)[source]¶
Perform in-silico mutagenesis on given DNA sequence(s)
- Parameters:
- Returns:
A list of all possible single-base mutated sequences derived from the original sequences.
- Return type:
(list)
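In-silico mutagenesis of a single sequence can be sketched in a few lines. The helper below is hypothetical (`single_base_mutants` is not part of polygraph), and its reading of drop_ref is an assumption: when True, substitutions that reproduce the original base are skipped, giving 3 mutants per position instead of 4.

```python
def single_base_mutants(seq, drop_ref=False, alphabet="ACGT"):
    """Substitute each position of `seq` with every base in `alphabet`.

    Assumption: with drop_ref=True, substitutions equal to the original
    base are skipped, so the unmodified sequence never appears in the output.
    """
    mutants = []
    for pos, ref_base in enumerate(seq):
        for base in alphabet:
            if drop_ref and base == ref_base:
                continue
            mutants.append(seq[:pos] + base + seq[pos + 1:])
    return mutants


mutants = single_base_mutants("AC", drop_ref=True)
```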
- polygraph.sequence.bleu_similarity(seqs, reference_seqs, max_k=4)[source]¶
Calculate the BLEU similarity score between two sets of sequences.
- polygraph.sequence.fastsk(seqs, k=5, m=2)[source]¶
Compute a gapped k-mer kernel matrix for the given sequences using FastSK.
- Parameters:
- Returns:
Array of shape (n_seqs, n_seqs) containing the gapped k-mer kernel.
- Return type:
(np.array)
- polygraph.sequence.gc(seqs)[source]¶
Calculate the GC fraction of a DNA sequence or list of sequences.
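For reference, the GC fraction of one sequence can be computed as below. This is a minimal sketch with a hypothetical helper name; how polygraph treats ambiguous bases such as N is not stated, so here every character counts toward the sequence length.

```python
def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C (case-insensitive)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


fractions = [gc_fraction(s) for s in ["ATGC", "GGCC", "atat"]]
```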
- polygraph.sequence.groupwise_mean_edit_dist(seqs, group_col='Group')[source]¶
Calculate average edit distances between all groups of sequences
- polygraph.sequence.kmer_frequencies(seqs, k, normalize=False, genome='hg38')[source]¶
Get frequencies of all kmers of length k in a sequence or sequences.
- Parameters:
- Returns:
A dataframe of shape (kmers x sequences), containing the frequency of each k-mer in the sequence.
- Return type:
(pd.DataFrame)
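Counting k-mers in a single sequence can be sketched with `collections.Counter`. The helper name `kmer_counts` is hypothetical, and the meaning of normalize is an assumption (division by the total number of k-mers in the sequence); the `genome` argument is not covered by this sketch.

```python
from collections import Counter


def kmer_counts(seq, k, normalize=False):
    """Count overlapping k-mers in `seq`; optionally convert to fractions.

    Assumption: normalize=True divides by the total number of k-mers in the
    sequence.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if normalize:
        total = sum(counts.values())
        return {kmer: n / total for kmer, n in counts.items()}
    return dict(counts)


counts = kmer_counts("ACGACG", 3)
fractions = kmer_counts("ACGACG", 3, normalize=True)
```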
- polygraph.sequence.kmer_positions(seq, kmer)[source]¶
Return all the locations of a given k-mer in a DNA sequence
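A simple way to locate every occurrence of a k-mer, including overlapping ones, is repeated `str.find`. The helper below is a hypothetical sketch; whether polygraph reports 0-based positions or overlapping matches is not stated above.

```python
def find_kmer(seq, kmer):
    """Return 0-based start positions of `kmer` in `seq`, overlaps included."""
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        # Resume one base further so overlapping matches are found too.
        start = seq.find(kmer, start + 1)
    return positions


hits = find_kmer("AAAA", "AA")
```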
- polygraph.sequence.min_edit_distance(seqs, reference_seqs)[source]¶
For each sequence in a list, find the smallest edit distance between that sequence and a list of reference sequences
- polygraph.sequence.min_edit_distance_from_reference(seqs, reference_group, group_col='Group')[source]¶
For each sequence in non-reference groups, find the smallest edit distance between that sequence and the sequences in the reference group.
- Parameters:
- Returns:
List of edit distances between each sequence and its closest reference sequence. Set to 0 for reference sequences.
- Return type:
edit (np.array)
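The minimum-edit-distance idea used by min_edit_distance and min_edit_distance_from_reference can be sketched in pure Python. The sketch assumes "edit distance" means Levenshtein distance (insertions, deletions and substitutions, each costing 1), which the reference above does not state; both helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def min_distance_to_references(seqs, reference_seqs):
    """For each sequence, the smallest edit distance to any reference."""
    return [min(edit_distance(s, r) for r in reference_seqs) for s in seqs]


dists = min_distance_to_references(["ACGT", "AAAA"], ["ACGA", "TTTT"])
```

The pairwise loop costs time proportional to the number of sequence pairs times the product of their lengths, so a dedicated library is likely preferable for large sequence sets.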
polygraph.stats module¶
- polygraph.stats.groupwise_fishers(data, reference_group, val_col, reference_val=None, group_col='Group')[source]¶
Perform Fisher’s exact test for proportions between each non-reference group and the reference group.
- Parameters:
data (pd.DataFrame, anndata.AnnData) – Pandas dataframe with group IDs and values to compare, or an AnnData object containing this dataframe in .obs
val_col (str) – Name of column with values to compare
reference_group (str) – ID of group to use as reference
reference_val (str) – A specific value whose proportion is to be compared between groups
group_col (str) – Name of column containing group IDs
- Returns:
Dataframe containing group proportions and FDR-corrected p-values for each group.
- Return type:
(pd.DataFrame)
- polygraph.stats.groupwise_mann_whitney(data, val_col, reference_group, group_col='Group')[source]¶
Compare values between each non-reference group and the reference group using the Mann-Whitney U test.
- Parameters:
data (pd.DataFrame, anndata.AnnData) – Pandas dataframe containing group IDs and values to compare, or an AnnData object containing this dataframe in .obs
val_col (str) – Name of column with values to compare
reference_group (str) – ID of group to use as reference
group_col (str) – Name of column containing group IDs
- Returns:
Dataframe containing FDR-corrected p-values for each group.
- Return type:
(pd.DataFrame)
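Both groupwise tests report FDR-corrected p-values. The reference does not say which correction procedure is used; the sketch below shows the Benjamini-Hochberg adjustment as one common choice, purely as an assumption, with a hypothetical helper name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.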
polygraph.utils module¶
- polygraph.utils.check_equal_lens(seqs)[source]¶
Given sequences, check whether they are all of equal length.
- polygraph.utils.make_ids(seqs)[source]¶
Assign a unique index to each row of a dataframe
- Parameters:
seqs (pd.DataFrame) – Pandas dataframe
- Returns:
Modified dataframe containing unique indices.
- Return type:
seqs (pd.DataFrame)
- polygraph.utils.pad_with_Ns(seqs, seq_len=None, end='both')[source]¶
Pads a sequence with Ns at the desired end until it reaches seq_len in length.
If seq_len is not provided, it is set to the length of the longest sequence.
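The padding behaviour described above can be sketched as follows. This is not polygraph's code: `pad_with_n` is a hypothetical helper, the accepted `end` values ('left', 'right', 'both') are assumed, and for 'both' the sketch places any odd leftover base on the right.

```python
def pad_with_n(seqs, seq_len=None, end="both"):
    """Pad each sequence with 'N' up to seq_len (default: longest input).

    Assumption: for end='both' the padding is split evenly, with any extra
    base going to the right.
    """
    if seq_len is None:
        seq_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        extra = max(seq_len - len(s), 0)
        if end == "left":
            padded.append("N" * extra + s)
        elif end == "right":
            padded.append(s + "N" * extra)
        elif end == "both":
            left = extra // 2
            padded.append("N" * left + s + "N" * (extra - left))
        else:
            raise ValueError(f"unknown end: {end!r}")
    return padded


padded = pad_with_n(["ACGT", "A"])
```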
polygraph.visualize module¶
- polygraph.visualize.boxplot(data, value_col, group_col='Group', fill_col=None)[source]¶
Plot boxplot of values in each group
- Parameters:
data (pd.DataFrame, anndata.AnnData) – Pandas dataframe with group IDs and values to compare, or an AnnData object containing this dataframe in .obs
value_col (str) – Column containing values to plot
group_col (str) – Column containing group IDs
fill_col (str) – Column containing additional variable to split each group
- polygraph.visualize.densityplot(data, value_col, group_col='Group')[source]¶
Plot density plot of values in each group
- polygraph.visualize.one_nn_frac_plot(ad, reference_group, group_col='Group')[source]¶
Plot a barplot showing the fraction of points in each group whose nearest neighbors are reference sequences.
- polygraph.visualize.pca_plot(ad, group_col='Group', components=[0, 1], size=0.1, show_ellipse=True, reference_group=None)[source]¶
Plot PCA embeddings of sequences, colored by group.
- Parameters:
ad (anndata.AnnData) – AnnData object containing PCA components.
group_col (str) – Column containing group IDs.
components (list) – PCA components to plot
size (float) – Size of points
show_ellipse (bool) – Outline each group with an ellipse.
reference_group (str) – Group to use as reference. This group will be plotted first.
- polygraph.visualize.plot_factors_nmf(H, n_features=50, **kwargs)[source]¶
Plot heatmap of contributions of features to NMF factors
- Parameters:
H (pd.DataFrame) – Dataframe of shape (factors, features)
n_features (int) – Number of features to cluster
**kwargs – Additional arguments to pass to sns.clustermap
- polygraph.visualize.plot_seqs_nmf(W, reorder=True)[source]¶
Plot stacked barplot of the distribution of NMF factors among sequences, split by group
- Parameters:
W (pd.DataFrame) – Dataframe of shape n_seqs x (n_factors+1). The last column should contain group IDs.
reorder (bool)