scimilarity.cell_query

class scimilarity.cell_query.CellQuery(model_path, use_gpu=False, filenames=None, metadata_tiledb_uri='cell_metadata', embedding_tiledb_uri='cell_embedding', knn_type='hnswlib', load_knn=True)

Bases: CellSearchKNN

A class that searches for similar cells using a cell embedding.

Parameters:

model_path (str) – Path to the model directory.
use_gpu (bool, default: False) – Use GPU instead of CPU.
filenames (dict, optional, default: None) – A dictionary of custom filenames to use for model files instead of the defaults.
metadata_tiledb_uri (str, default: "cell_metadata") – Relative path to the directory containing the tiledb cell metadata storage.
embedding_tiledb_uri (str, default: "cell_embedding") – Relative path to the directory containing the tiledb cell embedding storage.
knn_type (str, default: "hnswlib") – The type of knn index to use. Options are "hnswlib" and "tiledb_vector_search".
load_knn (bool, default: True) – Load the knn index. Set to False if knn is not needed.

Examples

>>> cq = CellQuery(model_path="/opt/data/model")

annotate_cell_index(metadata)

Annotate a metadata dataframe with the cell index in datasets at the sample level. The cell index is the cell number within the sample dataset and is not related to obs.index.

Parameters:

metadata (pandas.DataFrame) – A pandas dataframe containing the columns study, sample, and index, where index is the cell query index (i.e. from cq.cell_metadata).

Returns:

A pandas dataframe containing the “cell_index” column, which is the cell index per sample dataset.

Return type:

pandas.DataFrame

Examples

>>> metadata = cq.annotate_cell_index(metadata)
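
The relationship between the global cell query index and a per-sample cell index can be illustrated with a small pandas sketch. This is not the library's implementation. It assumes that the cells of each sample occupy a contiguous block of the global index, so the offset from the sample's first global index gives the cell number within that sample. The names reference and annotate_cell_index_sketch, and all values, are hypothetical.

```python
import pandas as pd

# Toy stand-in for the reference metadata (hypothetical values).
# "index" plays the role of the global cell query index.
reference = pd.DataFrame({
    "study": ["S1", "S1", "S1", "S2", "S2"],
    "sample": ["a", "a", "b", "c", "c"],
    "index": [0, 1, 2, 3, 4],
})


def annotate_cell_index_sketch(metadata, reference):
    """Add a per-sample "cell_index" column (illustrative only)."""
    # First global index of every (study, sample) pair in the reference.
    first_index = (
        reference.groupby(["study", "sample"])["index"]
        .min()
        .rename("first_index")
        .reset_index()
    )
    out = metadata.merge(first_index, on=["study", "sample"], how="left")
    # Assumption: cells of a sample are stored contiguously, so the offset
    # from the sample's first global index is the cell number in that sample.
    out["cell_index"] = out["index"] - out["first_index"]
    return out.drop(columns="first_index")


hits = pd.DataFrame({"study": ["S1", "S2"], "sample": ["b", "c"], "index": [2, 4]})
annotated = annotate_cell_index_sketch(hits, reference)
```

Under this assumption, a cell_index value can be used to locate the cell by position in the original sample dataset, independent of that dataset's obs.index labels.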

compile_sample_metadata(nn_idxs, levels=['study', 'sample', 'tissue', 'disease'])

Compile sample metadata for nearest neighbors.

Parameters:

nn_idxs (numpy.ndarray) – A 2D numpy array of nearest neighbor indices [num_cells x k].
levels (list, default: ["study", "sample", "tissue", "disease"]) – Levels for aggregation. Requires “study” and “sample” in order to calculate the fraction of cells that are similar to the query in the relevant studies and samples.

Returns:

A pandas dataframe containing sample metadata for nearest neighbors.

Return type:

pandas.DataFrame

Examples

>>> embeddings = cq.get_embeddings(align_dataset(data, cq.gene_order).X)
>>> nn_idxs, nn_dists = cq.get_nearest_neighbors(embeddings, k=50)
>>> sample_metadata = cq.compile_sample_metadata(nn_idxs)
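
The aggregation described above can be sketched with pandas on toy data. This is an illustration of the idea, not the library's code: it counts the neighbor cells per aggregation level and divides by the total number of reference cells in the same study and sample. The cell_metadata table, the output column names cells, total and fraction, and the choice to count each reference cell only once are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy reference metadata, one row per reference cell (hypothetical values).
cell_metadata = pd.DataFrame({
    "study": ["S1", "S1", "S1", "S2", "S2", "S2"],
    "sample": ["a", "a", "b", "c", "c", "c"],
    "tissue": ["lung", "lung", "lung", "liver", "liver", "liver"],
    "disease": ["healthy", "healthy", "healthy", "fibrosis", "fibrosis", "fibrosis"],
})

# Neighbor indices with shape [num_cells x k]; here 2 query cells and k=3.
nn_idxs = np.array([[0, 1, 3], [0, 3, 4]])
levels = ["study", "sample", "tissue", "disease"]

# Assumption: each reference cell counts once, even if several query cells hit it.
hit_cells = np.unique(nn_idxs)
hits = (
    cell_metadata.iloc[hit_cells]
    .groupby(levels)
    .size()
    .rename("cells")
    .reset_index()
)

# Total cells per (study, sample) is the denominator of the fraction,
# which is why "study" and "sample" must be among the levels.
totals = (
    cell_metadata.groupby(["study", "sample"]).size().rename("total").reset_index()
)
sample_metadata = hits.merge(totals, on=["study", "sample"])
sample_metadata["fraction"] = sample_metadata["cells"] / sample_metadata["total"]
```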

get_precomputed_embeddings(idx)

Quickly retrieve embeddings from the cell_embedding tiledb array.

Parameters:

idx (slice, List[int]) – Cell indices.

Returns:

A 2D numpy array of embeddings for the listed cells.

Return type:

numpy.ndarray

Examples

>>> array = cq.get_precomputed_embeddings([0, 1, 100])

search_centroid_exhaustive(adata, centroid_key, max_dist=0.03, metadata_filter=None, qc=True, qc_params={'k_clusters': 10}, buffer_size=100000, random_seed=4)

Performs a nearest neighbors search for a centroid constructed from marked cells.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
centroid_key (str) – The obs column key that marks the cells to include in the centroid with 1 and all other cells with 0.
max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.
metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.
qc (bool, default: True) – Whether to perform QC on the query
qc_params (dict, default: {'k_clusters': 10}) – Parameters for the QC: k_clusters: the number of clusters in kmeans clustering
buffer_size (int, default: 100000) – Batch size for processing cells.
random_seed (int, default: 4) – Random seed for k-means clustering.

Returns:

centroid_embedding (numpy.ndarray) – A 2D numpy array of the centroid embedding.
nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.
qc_stats (dict) – A dictionary of stats for QC.

Return type:

Tuple[numpy.ndarray, List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame, dict]

Examples

>>> cells_used_in_query = adata.obs["celltype_name"] == "macrophage"
>>> adata.obs["used_in_query"] = cells_used_in_query.astype(int)
>>> centroid_embedding, nn_idxs, nn_dists, metadata, qc_stats = cq.search_centroid_exhaustive(adata, 'used_in_query')
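
As a rough illustration of how marked cells can be pooled into a single centroid profile before it is embedded, the sketch below sums raw counts over the cells marked with 1, scales the result to 1e4 and applies log1p. The pooling by sum, the helper name log_normalized_centroid and the toy values are assumptions for illustration. The embedding step itself is not shown.

```python
import numpy as np

# Toy raw counts: 4 cells x 3 genes (hypothetical values).
counts = np.array(
    [
        [10, 0, 0],
        [0, 10, 0],
        [5, 5, 0],
        [0, 0, 20],
    ],
    dtype=float,
)

# 1 marks cells that contribute to the centroid, 0 excludes them,
# mirroring the obs column passed as centroid_key.
used_in_query = np.array([1, 1, 1, 0])


def log_normalized_centroid(counts, marks, target_sum=1e4):
    """Sum counts over marked cells, scale to target_sum, then apply log1p."""
    pooled = counts[marks == 1].sum(axis=0)
    return np.log1p(pooled / pooled.sum() * target_sum)


centroid = log_normalized_centroid(counts, used_in_query)
```

The fourth cell is excluded by the mark, so the third gene contributes nothing to the centroid.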

search_centroid_nearest(adata, centroid_key, k=10000, ef=None, max_dist=None, qc=True, qc_params={'k_clusters': 10}, random_seed=4)

Performs a nearest neighbors search for a centroid constructed from marked cells.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
centroid_key (str) – The obs column key that marks the cells to include in the centroid with 1 and all other cells with 0.
k (int, default: 10000) – The number of nearest neighbors.
ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See nmslib/hnswlib
max_dist (float, optional) – If set, assumes k=1000000 and then filters for cells that are within the max distance to the query. Overrides the k parameter.
qc (bool, default: True) – Whether to perform QC on the query
qc_params (dict, default: {'k_clusters': 10}) – Parameters for the QC: k_clusters: the number of clusters in kmeans clustering
random_seed (int, default: 4) – Random seed for k-means clustering.

Returns:

centroid_embedding (numpy.ndarray) – A 2D numpy array of the centroid embedding.
nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.
qc_stats (dict) – A dictionary of stats for QC.

Return type:

Tuple[numpy.ndarray, List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame, dict]

Examples

>>> cells_used_in_query = adata.obs["celltype_name"] == "macrophage"
>>> adata.obs["used_in_query"] = cells_used_in_query.astype(int)
>>> centroid_embedding, nn_idxs, nn_dists, metadata, qc_stats = cq.search_centroid_nearest(adata, 'used_in_query')

search_cluster_centroids_exhaustive(adata, cluster_key, cluster_label=None, max_dist=0.03, metadata_filter=None, buffer_size=100000, skip_null=True)

Performs a nearest neighbors search for cluster centroids against the knn.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
cluster_key (str) – The obs column key that contains cluster labels.
cluster_label (str, optional, default: None) – The cluster label of interest. If None, then get the centroids of all clusters, otherwise get only the centroid for the cluster of interest
max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.
metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.
buffer_size (int, default: 100000) – Batch size for processing cells.
skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.

Returns:

centroid_embeddings (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroid embeddings.
cluster_idx (list) – A list of cluster labels in the same order as the rows of centroid_embeddings.
nn_idxs (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor indices [num_cells x k].
nn_dists (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor distances [num_cells x k].
all_metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors for all centroids.

Return type:

Tuple[numpy.ndarray, list, Dict[str, numpy.ndarray], Dict[str, numpy.ndarray], pandas.DataFrame]

Examples

>>> centroid_embeddings, cluster_idx, nn_idxs, nn_dists, all_metadata = cq.search_cluster_centroids_exhaustive(adata, "leiden")
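
The roles of cluster_key, cluster_label and skip_null can be illustrated with a toy sketch that builds one log normalized (1e4) centroid per cluster from raw counts. It is a simplified stand-in, not the library's code: the helper cluster_centroids, the pooling by sum and the toy values are assumptions, and the sketch always behaves as if skip_null=True.

```python
import numpy as np
import pandas as pd

# Toy raw counts: 4 cells x 2 genes, plus one cluster label per cell.
counts = np.array([[4, 0], [0, 4], [3, 1], [1, 0]], dtype=float)
labels = pd.Series(["0", "0", "1", None])  # the last cell has no label


def cluster_centroids(counts, labels, cluster_label=None, target_sum=1e4):
    """Return (centroids, cluster_idx) for one cluster or for all clusters."""
    # Mirrors skip_null=True: cells without a cluster label are ignored.
    valid = labels.notna().to_numpy()
    if cluster_label is None:
        wanted = sorted(labels[valid].unique())
    else:
        wanted = [cluster_label]
    centroids = []
    for label in wanted:
        mask = valid & (labels == label).to_numpy()
        pooled = counts[mask].sum(axis=0)
        centroids.append(np.log1p(pooled / pooled.sum() * target_sum))
    return np.vstack(centroids), wanted


centroids, cluster_idx = cluster_centroids(counts, labels)
one_centroid, one_idx = cluster_centroids(counts, labels, cluster_label="1")
```

The order of cluster_idx matches the row order of centroids, which is what makes it possible to look up per-cluster results by label.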

search_cluster_centroids_nearest(adata, cluster_key, cluster_label=None, k=10000, ef=None, skip_null=True, max_dist=None)

Performs a nearest neighbors search for cluster centroids against the knn.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
cluster_key (str) – The obs column key that contains cluster labels.
cluster_label (str, optional, default: None) – The cluster label of interest. If None, then get the centroids of all clusters, otherwise get only the centroid for the cluster of interest
k (int, default: 10000) – The number of nearest neighbors.
ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See nmslib/hnswlib
skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.
max_dist (float, optional) – If set, assumes k=1000000 and then filters for cells that are within the max distance to the query. Overrides the k parameter.

Returns:

centroid_embeddings (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroid embeddings.
cluster_idx (list) – A list of cluster labels in the same order as the rows of centroid_embeddings.
nn_idxs (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor indices [num_cells x k].
nn_dists (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor distances [num_cells x k].
all_metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors for all centroids.

Return type:

Tuple[numpy.ndarray, list, Dict[str, numpy.ndarray], Dict[str, numpy.ndarray], pandas.DataFrame]

Examples

>>> centroid_embeddings, cluster_idx, nn_idxs, nn_dists, all_metadata = cq.search_cluster_centroids_nearest(adata, "leiden")

search_exhaustive(embeddings, max_dist=0.03, metadata_filter=None, buffer_size=100000)

Performs an exhaustive search.

Parameters:

embeddings (numpy.ndarray) – Embeddings as a numpy array.
max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.
metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.
buffer_size (int, default: 100000) – Batch size for processing cells.

Returns:

nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of cell indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of cell distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata.

Return type:

Tuple[List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame]

Examples

>>> nn_idxs, nn_dists, metadata = cq.search_exhaustive(embeddings)
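
To make the roles of max_dist and buffer_size concrete, here is a minimal numpy sketch of an exhaustive, batched distance scan. It is not the library's implementation: the function exhaustive_search_sketch, the use of cosine distance and the toy data are assumptions, and each result entry is a 1D array for simplicity.

```python
import numpy as np


def exhaustive_search_sketch(queries, reference, max_dist=0.03, buffer_size=100000):
    """Scan the reference in batches; keep cells within max_dist of each query."""
    # Unit-normalize so that 1 - dot product is the cosine distance (assumed metric).
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    idx_parts = [[] for _ in range(len(q))]
    dist_parts = [[] for _ in range(len(q))]
    for start in range(0, len(reference), buffer_size):
        block = reference[start:start + buffer_size]
        block = block / np.linalg.norm(block, axis=1, keepdims=True)
        dists = 1.0 - q @ block.T  # shape [num_queries x batch_size]
        for row in range(len(q)):
            hit = np.flatnonzero(dists[row] <= max_dist)
            idx_parts[row].append(hit + start)  # convert back to global indices
            dist_parts[row].append(dists[row, hit])
    nn_idxs = [np.concatenate(parts) for parts in idx_parts]
    nn_dists = [np.concatenate(parts) for parts in dist_parts]
    return nn_idxs, nn_dists


# Toy data: one query and four reference cells (hypothetical values).
queries = np.array([[1.0, 0.0]])
reference = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 0.2]])
nn_idxs, nn_dists = exhaustive_search_sketch(
    queries, reference, max_dist=0.03, buffer_size=2
)
```

Because the number of cells within max_dist differs per query, the results are naturally a list with one array per query rather than a single rectangular array.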

search_nearest(embeddings, k=10000, ef=None, max_dist=None)

Performs a nearest neighbors search against the knn.

Parameters:

embeddings (numpy.ndarray) – Embeddings as a numpy array.
k (int, default: 10000) – The number of nearest neighbors.
ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See nmslib/hnswlib
max_dist (float, optional) – If set, assumes k=1000000 and then filters for cells that are within the max distance to the query. Overrides the k parameter.

Returns:

nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.

Return type:

Tuple[List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame]

Examples

>>> nn_idxs, nn_dists, metadata = cq.search_nearest(embeddings)
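
The interaction between k and max_dist can be sketched with a brute-force numpy stand-in for the knn index. This is not how the library searches (it uses an approximate index such as hnswlib). The function nearest_with_max_dist, the cosine distance and the toy data are assumptions. Only the documented rule is mirrored: when max_dist is given, k is set to 1000000 and the results are then filtered by distance.

```python
import numpy as np


def nearest_with_max_dist(queries, reference, k=10000, max_dist=None):
    """Brute-force k nearest neighbors, optionally filtered by max_dist."""
    if max_dist is not None:
        k = 1000000  # documented behavior: max_dist overrides k
    k = min(k, len(reference))
    # Unit-normalize so that 1 - dot product is the cosine distance (assumed metric).
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    dists = 1.0 - q @ r.T
    order = np.argsort(dists, axis=1)[:, :k]  # k closest cells per query
    sorted_dists = np.take_along_axis(dists, order, axis=1)
    nn_idxs, nn_dists = [], []
    for idx_row, dist_row in zip(order, sorted_dists):
        if max_dist is not None:
            keep = dist_row <= max_dist
            idx_row, dist_row = idx_row[keep], dist_row[keep]
        nn_idxs.append(idx_row)
        nn_dists.append(dist_row)
    return nn_idxs, nn_dists


# Toy data: one query and four reference cells (hypothetical values).
queries = np.array([[1.0, 0.0]])
reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]])

top2_idxs, top2_dists = nearest_with_max_dist(queries, reference, k=2)
near_idxs, near_dists = nearest_with_max_dist(queries, reference, k=2, max_dist=0.3)
```

In the second call the value of k is ignored, so three cells are returned because three lie within the distance threshold.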