scimilarity.cell_query

class scimilarity.cell_query.CellQuery(model_path, use_gpu=False, filenames=None, metadata_tiledb_uri='cell_metadata', embedding_tiledb_uri='cell_embedding', load_knn=True)

Bases: CellSearchKNN

A class that searches for similar cells using a cell embedding.

Parameters:
  • model_path (str) –

  • use_gpu (bool) –

  • filenames (Optional[dict]) –

  • metadata_tiledb_uri (str) –

  • embedding_tiledb_uri (str) –

  • load_knn (bool) –

annotate_cell_index(metadata)

Annotate a metadata dataframe with the cell index in sample datasets.

Parameters:

metadata (pandas.DataFrame) – A pandas dataframe containing the columns study, sample, and index, where index is the cell query index (i.e. from cq.cell_metadata).

Returns:

A pandas dataframe containing the “cell_index” column which is the cell index per sample dataset.

Return type:

pandas.DataFrame

Examples

>>> metadata = cq.annotate_cell_index(metadata)
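
One way to read "cell index per sample dataset" is as the offset of a cell from the first cell of its sample. The sketch below illustrates that reading in plain Python, assuming each sample's cells occupy a contiguous range of cell query indices. The helper `per_sample_cell_index` and the toy records are hypothetical and are not part of scimilarity.

```python
from typing import Dict, List, Tuple


def per_sample_cell_index(all_cells: List[dict], queried: List[dict]) -> List[int]:
    """Offset of each queried cell from the first cell of its (study, sample)."""
    # Assumption: the smallest global index in a sample marks that sample's first cell.
    first_index: Dict[Tuple[str, str], int] = {}
    for cell in all_cells:
        key = (cell["study"], cell["sample"])
        first_index[key] = min(first_index.get(key, cell["index"]), cell["index"])
    return [c["index"] - first_index[(c["study"], c["sample"])] for c in queried]


# Toy stand-in for cq.cell_metadata: two samples stored back to back.
all_cells = [
    {"study": "S1", "sample": "a", "index": 0},
    {"study": "S1", "sample": "a", "index": 1},
    {"study": "S1", "sample": "b", "index": 2},
    {"study": "S1", "sample": "b", "index": 3},
    {"study": "S1", "sample": "b", "index": 4},
]
queried = [all_cells[1], all_cells[4]]
cell_index = per_sample_cell_index(all_cells, queried)
```

Under this assumption the second cell of sample "a" and the third cell of sample "b" map to per-sample positions 1 and 2.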
compile_sample_metadata(nn_idxs, levels=['study', 'sample', 'tissue', 'disease'])

Compile sample metadata for nearest neighbors.

Parameters:
  • nn_idxs (numpy.ndarray) – A 2D numpy array of nearest neighbor indices [num_cells x k].

  • levels (list, default: ["study", "sample", "tissue", "disease"]) – Levels for aggregation. Requires “study” and “sample” in order to calculate the fraction of cells that are similar to the query in the relevant studies and samples.

Returns:

A pandas dataframe containing sample metadata for nearest neighbors.

Return type:

pandas.DataFrame

Examples

>>> embeddings = cq.get_embeddings(align_dataset(data, cq.gene_order).X)
>>> nn_idxs, nn_dists = cq.get_nearest_neighbors(embeddings, k=50)
>>> sample_metadata = cq.compile_sample_metadata(nn_idxs)
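
The fraction mentioned under levels can be pictured as the number of distinct neighbor hits from a sample divided by that sample's total cell count. The following is a minimal sketch under that assumption, using a toy lookup from cell index to (study, sample). The helper `sample_fractions` is hypothetical and does not reproduce the library implementation.

```python
from collections import Counter

import numpy as np


def sample_fractions(nn_idxs, cell_to_sample):
    """Fraction of each (study, sample)'s cells that appear among the neighbors."""
    totals = Counter(cell_to_sample)               # cells per sample in the reference
    hits = np.unique(np.asarray(nn_idxs).ravel())  # count each reference cell once
    counts = Counter(cell_to_sample[i] for i in hits)
    return {key: counts[key] / totals[key] for key in counts}


# Toy reference: position in the list is the cell index.
cell_to_sample = [("S1", "a")] * 4 + [("S1", "b")] * 2
nn_idxs = np.array([[0, 1, 4], [1, 2, 4]])  # [num_cells x k]
fractions = sample_fractions(nn_idxs, cell_to_sample)
```

Here three of the four cells in sample "a" and one of the two cells in sample "b" are neighbors of at least one query cell.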
create_embeddings_tiledb(tiledb_uri, arr, batch_size=100000)

Create TileDB array from a numpy array of embeddings.

Parameters:
  • tiledb_uri (str) – URI for the TileDB array.

  • arr (numpy.ndarray) – The numpy array of embeddings.

  • batch_size (int, default: 100000) – Batch size for the tiles.
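
If batch_size is read as the number of rows handled per chunk, the chunking itself can be sketched without any TileDB calls. The generator `iter_row_batches` below is a hypothetical illustration of splitting a 2D array into consecutive row ranges. How the method actually writes those ranges to TileDB is not shown here.

```python
import numpy as np


def iter_row_batches(arr, batch_size=100000):
    """Yield (start, stop, block) for consecutive row batches of a 2D array."""
    for start in range(0, arr.shape[0], batch_size):
        stop = min(start + batch_size, arr.shape[0])
        yield start, stop, arr[start:stop]


arr = np.arange(25, dtype=np.float32).reshape(5, 5)
bounds = [(start, stop) for start, stop, _ in iter_row_batches(arr, batch_size=2)]
```

The last batch is shorter whenever the number of rows is not a multiple of batch_size.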

get_precomputed_embeddings(idx)

Quickly retrieves embeddings from the cell_embedding TileDB array.

Parameters:

idx (slice, List[int]) – Cell indices.

Returns:

A 2D numpy array for the listed cells.

Return type:

numpy.ndarray

Examples

>>> array = cq.get_precomputed_embeddings([0, 1, 100])
search_centroid_exhaustive(adata, centroid_key, max_dist=0.03, metadata_filter=None, qc=True, qc_params={'k_clusters': 10}, buffer_size=100000, random_seed=4)

Performs a nearest neighbors search for a centroid constructed from marked cells.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].

  • centroid_key (str) – The obs column key that marks cells to include in the centroid as 1 and all other cells as 0.

  • max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.

  • metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.

  • qc (bool, default: True) – Whether to perform QC on the query.

  • qc_params (dict, default: {'k_clusters': 10}) – Parameters for the QC. k_clusters: the number of clusters in k-means clustering.

  • buffer_size (int, default: 100000) – Batch size for processing cells.

  • random_seed (int, default: 4) – Random seed for k-means clustering.

Returns:

  • centroid_embedding (numpy.ndarray) – A 2D numpy array of the centroid embedding.

  • nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices. One entry for every cell (row) in embeddings.

  • nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances. One entry for every cell (row) in embeddings.

  • metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.

  • qc_stats (dict) – A dictionary of stats for QC.

Return type:

Tuple[numpy.ndarray, List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame, dict]

Examples

>>> cells_used_in_query = adata.obs["celltype_name"] == "macrophage"
>>> adata.obs["used_in_query"] = cells_used_in_query.astype(int)
>>> centroid_embedding, nn_idxs, nn_dists, metadata, qc_stats = cq.search_centroid_exhaustive(adata, 'used_in_query')
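
The centroid is built from the cells marked with 1 in centroid_key, using layers["counts"]. A plausible reading, sketched below, is that the counts of the marked cells are summed per gene, scaled to a total of 1e4, and log-transformed before being embedded. The helper `centroid_from_counts` and the `target_sum` parameter are assumptions for illustration, and the embedding step through the model is omitted.

```python
import numpy as np


def centroid_from_counts(counts, marked, target_sum=1e4):
    """Sum raw counts over marked cells, scale to target_sum, then apply log1p."""
    marked = np.asarray(marked).astype(bool)
    if not marked.any():
        raise ValueError("no cells are marked for the centroid")
    summed = np.asarray(counts)[marked].sum(axis=0).astype(float)
    return np.log1p(summed / summed.sum() * target_sum)


counts = np.array([[1, 0, 3], [2, 2, 0], [9, 9, 9]])  # cells x genes
marked = np.array([1, 1, 0])                          # e.g. adata.obs["used_in_query"]
centroid = centroid_from_counts(counts, marked)
```

The third cell is unmarked, so it does not contribute to the centroid.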
search_centroid_nearest(adata, centroid_key, k=10000, ef=None, max_dist=None, qc=True, qc_params={'k_clusters': 10}, random_seed=4)

Performs a nearest neighbors search for a centroid constructed from marked cells.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].

  • centroid_key (str) – The obs column key that marks cells to include in the centroid as 1 and all other cells as 0.

  • k (int, default: 10000) – The number of nearest neighbors.

  • ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md

  • max_dist (float, optional) – If set, assumes k=1000000 and then filters for cells that are within the max distance to the query. Overrides the k parameter.

  • qc (bool, default: True) – Whether to perform QC on the query.

  • qc_params (dict, default: {'k_clusters': 10}) – Parameters for the QC. k_clusters: the number of clusters in k-means clustering.

  • random_seed (int, default: 4) – Random seed for k-means clustering.

Returns:

  • centroid_embedding (numpy.ndarray) – A 2D numpy array of the centroid embedding.

  • nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices. One entry for every cell (row) in embeddings.

  • nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances. One entry for every cell (row) in embeddings.

  • metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.

  • qc_stats (dict) – A dictionary of stats for QC.

Return type:

Tuple[numpy.ndarray, List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame, dict]

Examples

>>> cells_used_in_query = adata.obs["celltype_name"] == "macrophage"
>>> adata.obs["used_in_query"] = cells_used_in_query.astype(int)
>>> centroid_embedding, nn_idxs, nn_dists, metadata, qc_stats = cq.search_centroid_nearest(adata, 'used_in_query')
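
When max_dist is given, the neighbors of each query are filtered by distance, so different queries can end up with different numbers of hits. That is one reason the results are returned as lists rather than a single rectangular array. A minimal sketch of such a filter follows. The helper `filter_by_max_dist` is hypothetical, and treating the threshold as inclusive is an assumption.

```python
import numpy as np


def filter_by_max_dist(nn_idxs, nn_dists, max_dist):
    """Keep, per query row, only neighbors whose distance is <= max_dist."""
    kept_idxs, kept_dists = [], []
    for idx_row, dist_row in zip(nn_idxs, nn_dists):
        mask = dist_row <= max_dist
        kept_idxs.append(idx_row[mask])
        kept_dists.append(dist_row[mask])
    return kept_idxs, kept_dists


nn_idxs = np.array([[7, 3, 9], [4, 8, 1]])
nn_dists = np.array([[0.01, 0.02, 0.05], [0.04, 0.06, 0.09]])
kept_idxs, kept_dists = filter_by_max_dist(nn_idxs, nn_dists, max_dist=0.03)
```

The first query keeps two neighbors and the second keeps none, so the per-query arrays have different lengths.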
search_cluster_centroids_exhaustive(adata, cluster_key, cluster_label=None, max_dist=0.03, metadata_filter=None, buffer_size=100000, skip_null=True)

Performs a nearest neighbors search for cluster centroids against the knn.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].

  • cluster_key (str) – The obs column key that contains cluster labels.

  • cluster_label (str, optional, default: None) – The cluster label of interest. If None, centroids are computed for all clusters; otherwise only the centroid for the cluster of interest is computed.

  • max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.

  • metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.

  • buffer_size (int, default: 100000) – Batch size for processing cells.

  • skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.

Returns:

  • centroid_embeddings (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroid embeddings.

  • cluster_idx (list) – A list of cluster labels corresponding to the order returned in centroid_embeddings.

  • nn_idxs (Dict[str, numpy.ndarray]) – A dictionary of 2D numpy arrays of nearest neighbor indices [num_cells x k].

  • nn_dists (Dict[str, numpy.ndarray]) – A dictionary of 2D numpy arrays of nearest neighbor distances [num_cells x k].

  • all_metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors for all centroids.

Return type:

Tuple[numpy.ndarray, list, Dict[str, numpy.ndarray], Dict[str, numpy.ndarray], pandas.DataFrame]

Examples

>>> centroid_embeddings, cluster_idx, nn_idxs, nn_dists, all_metadata = cq.search_cluster_centroids_exhaustive(adata, "leiden")
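
The grouping step behind cluster centroids can be sketched as follows: one log-normalized (1e4) centroid per cluster label, with unlabeled cells skipped when skip_null is true. The helper `cluster_centroids` is hypothetical, it treats only None as a null label, and it stops before the embedding and search steps.

```python
import numpy as np


def cluster_centroids(counts, labels, skip_null=True, target_sum=1e4):
    """Return (centroids, cluster_idx) with one log-normalized centroid per label."""
    counts = np.asarray(counts, dtype=float)
    cluster_idx, centroids = [], []
    for label in dict.fromkeys(labels):  # unique labels in first-seen order
        if label is None and skip_null:
            continue
        rows = [i for i, other in enumerate(labels) if other == label]
        summed = counts[rows].sum(axis=0)
        centroids.append(np.log1p(summed / summed.sum() * target_sum))
        cluster_idx.append(label)
    return np.vstack(centroids), cluster_idx


labels = ["0", "1", None, "0"]                  # e.g. adata.obs[cluster_key]
counts = [[1, 1], [0, 4], [5, 5], [3, 1]]       # cells x genes
centroids, cluster_idx = cluster_centroids(counts, labels)
```

The order of `cluster_idx` matches the row order of `centroids`, mirroring how cluster_idx relates to centroid_embeddings in the return values.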
search_cluster_centroids_nearest(adata, cluster_key, cluster_label=None, k=10000, ef=None, skip_null=True, max_dist=None)

Performs a nearest neighbors search for cluster centroids against the knn.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].

  • cluster_key (str) – The obs column key that contains cluster labels.

  • cluster_label (str, optional, default: None) – The cluster label of interest. If None, centroids are computed for all clusters; otherwise only the centroid for the cluster of interest is computed.

  • k (int, default: 10000) – The number of nearest neighbors.

  • ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md

  • skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.

  • max_dist (float, optional) – If set, assumes k=1000000 and then filters for cells that are within the max distance to the query. Overrides the k parameter.

Returns:

  • centroid_embeddings (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroid embeddings.

  • cluster_idx (list) – A list of cluster labels corresponding to the order returned in centroid_embeddings.

  • nn_idxs (Dict[str, numpy.ndarray]) – A dictionary of 2D numpy arrays of nearest neighbor indices [num_cells x k].

  • nn_dists (Dict[str, numpy.ndarray]) – A dictionary of 2D numpy arrays of nearest neighbor distances [num_cells x k].

  • all_metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors for all centroids.

Return type:

Tuple[numpy.ndarray, list, Dict[str, numpy.ndarray], Dict[str, numpy.ndarray], pandas.DataFrame]

Examples

>>> centroid_embeddings, cluster_idx, nn_idxs, nn_dists, all_metadata = cq.search_cluster_centroids_nearest(adata, "leiden")
search_exhaustive(embeddings, max_dist=0.03, metadata_filter=None, buffer_size=100000)

Performs an exhaustive search.

Parameters:
  • embeddings (numpy.ndarray) – Embeddings as a numpy array.

  • max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.

  • metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.

  • buffer_size (int, default: 100000) – Batch size for processing cells.

Returns:

  • nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of cell indices. One entry for every cell (row) in embeddings.

  • nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of cell distances. One entry for every cell (row) in embeddings.

  • metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata.

Return type:

Tuple[List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame]

Examples

>>> nn_idxs, nn_dists, metadata = cq.search_exhaustive(embeddings)
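
An exhaustive search compares each query against every reference cell and keeps those within max_dist, processing the reference in batches of buffer_size to bound memory. The sketch below shows that pattern on in-memory arrays. The function `exhaustive_search`, the use of cosine distance, and the inclusive threshold are assumptions for illustration, and metadata filtering is left out.

```python
import numpy as np


def exhaustive_search(queries, reference, max_dist=0.03, buffer_size=100000):
    """Scan the reference in batches; keep cells with cosine distance <= max_dist."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    queries = unit(np.asarray(queries, dtype=float))
    nn_idxs = [[] for _ in range(len(queries))]
    nn_dists = [[] for _ in range(len(queries))]
    for start in range(0, len(reference), buffer_size):
        block = unit(np.asarray(reference[start:start + buffer_size], dtype=float))
        dists = 1.0 - queries @ block.T  # [num_queries x block_size]
        for q, row in enumerate(dists):
            hits = np.flatnonzero(row <= max_dist)
            nn_idxs[q].extend((hits + start).tolist())  # shift to global indices
            nn_dists[q].extend(row[hits].tolist())
    return [np.array(i) for i in nn_idxs], [np.array(d) for d in nn_dists]


reference = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.01]])
queries = np.array([[1.0, 0.0]])
nn_idxs, nn_dists = exhaustive_search(queries, reference, max_dist=0.03, buffer_size=2)
```

With a buffer of two rows the reference is scanned in two passes, and the orthogonal reference cell at index 2 is the only one rejected.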
search_nearest(embeddings, k=10000, ef=None, max_dist=None)

Performs a nearest neighbors search against the knn.

Parameters:
  • embeddings (numpy.ndarray) – Embeddings as a numpy array.

  • k (int, default: 10000) – The number of nearest neighbors.

  • ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md

  • max_dist (float, optional) – If set, assumes k=1000000 and then filters for cells that are within the max distance to the query. Overrides the k parameter.

Returns:

  • nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices. One entry for every cell (row) in embeddings.

  • nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances. One entry for every cell (row) in embeddings.

  • metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.

Return type:

Tuple[List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame]

Examples

>>> nn_idxs, nn_dists, metadata = cq.search_nearest(embeddings)
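
The ef parameter points to hnswlib, which performs approximate search. For intuition about what the k nearest neighbors are, an exact brute-force version is easy to write for small arrays and can serve as a baseline when checking approximate results. The function `brute_force_nearest` is a hypothetical sketch that assumes cosine distance. It is not how the index is queried.

```python
import numpy as np


def brute_force_nearest(queries, reference, k):
    """Exact k nearest neighbors by cosine distance, sorted nearest first."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    dists = 1.0 - unit(queries) @ unit(reference).T  # [num_queries x num_reference]
    nn_idxs = np.argsort(dists, axis=1)[:, :k]
    nn_dists = np.take_along_axis(dists, nn_idxs, axis=1)
    return nn_idxs, nn_dists


reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[1.0, 0.1]])
nn_idxs, nn_dists = brute_force_nearest(queries, reference, k=2)
```

This costs a full distance matrix per call, which is why an approximate index is the practical choice for large references.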