scimilarity.cell_query
- class scimilarity.cell_query.CellQuery(model_path, use_gpu=False, filenames=None, metadata_tiledb_uri='cell_metadata', embedding_tiledb_uri='cell_embedding', load_knn=True)
Bases: CellSearchKNN
A class that searches for similar cells using a cell embedding.
- Parameters:
model_path (str) –
use_gpu (bool) –
filenames (Optional[dict]) –
metadata_tiledb_uri (str) –
embedding_tiledb_uri (str) –
load_knn (bool) –
- annotate_cell_index(metadata)
Annotate a metadata dataframe with the cell index in sample datasets.
- Parameters:
metadata (pandas.DataFrame) – A pandas dataframe containing the columns study, sample, and index, where index is the cell query index (i.e., from cq.cell_metadata).
- Returns:
A pandas dataframe containing the “cell_index” column which is the cell index per sample dataset.
- Return type:
pandas.DataFrame
Examples
>>> metadata = cq.annotate_cell_index(metadata)
- compile_sample_metadata(nn_idxs, levels=['study', 'sample', 'tissue', 'disease'])
Compile sample metadata for nearest neighbors.
- Parameters:
nn_idxs (numpy.ndarray) – A 2D numpy array of nearest neighbor indices [num_cells x k].
levels (list, default: ["study", "sample", "tissue", "disease"]) – Levels for aggregation. Requires “study” and “sample” in order to calculate the fraction of cells that are similar to the query in the relevant studies and samples.
- Returns:
A pandas dataframe containing sample metadata for nearest neighbors.
- Return type:
pandas.DataFrame
Examples
>>> embeddings = cq.get_embeddings(align_dataset(data, cq.gene_order).X)
>>> nn_idxs, nn_dists = cq.get_nearest_neighbors(embeddings, k=50)
>>> sample_metadata = cq.compile_sample_metadata(nn_idxs)
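The aggregation described for compile_sample_metadata can be pictured with a small pandas sketch. This is only an illustration under stated assumptions: summarize_hits_by_sample is a hypothetical helper, not part of scimilarity, and it assumes the fraction is the number of distinct neighbor cells found in a sample divided by that sample's total number of cells.

```python
import numpy as np
import pandas as pd


def summarize_hits_by_sample(cell_metadata, nn_idxs, levels=("study", "sample")):
    """Hypothetical helper: count neighbor hits per level combination.

    Assumption: "fraction" is hits in a sample / all cells in that sample.
    """
    levels = list(levels)
    # Distinct reference cells that appear as a neighbor of any query cell.
    hit_cells = np.unique(np.asarray(nn_idxs).ravel())
    hits = (
        cell_metadata.iloc[hit_cells]
        .groupby(levels)
        .size()
        .rename("cells")
        .reset_index()
    )
    totals = cell_metadata.groupby(levels).size().rename("total").reset_index()
    # Keep only samples with at least one hit, then compute the fraction.
    out = hits.merge(totals, on=levels, how="inner")
    out["fraction"] = out["cells"] / out["total"]
    return out


# Toy reference metadata: six cells across three samples in two studies.
cell_metadata = pd.DataFrame(
    {
        "study": ["A", "A", "A", "B", "B", "B"],
        "sample": ["s1", "s1", "s2", "s3", "s3", "s3"],
    }
)
nn_idxs = np.array([[0, 1], [1, 3]])  # [num_cells x k]
summary = summarize_hits_by_sample(cell_metadata, nn_idxs)
```

In this toy input, sample s1 is fully covered by the neighbors while s3 is only partly covered, which is the kind of contrast the fraction is meant to surface.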
- create_embeddings_tiledb(tiledb_uri, arr, batch_size=100000)
Create TileDB array from a numpy array of embeddings.
- Parameters:
tiledb_uri (str) – URI for the TileDB array.
arr (numpy.ndarray) – The numpy array of embeddings to store.
batch_size (int, default: 100000) – Batch size for the tiles.
- get_precomputed_embeddings(idx)
Quickly retrieves embeddings from the cell_embedding TileDB array.
- Parameters:
idx (slice, List[int]) – Cell indices.
- Returns:
A 2D numpy array for the listed cells.
- Return type:
numpy.ndarray
Examples
>>> array = cq.get_precomputed_embeddings([0, 1, 100])
- search_centroid_exhaustive(adata, centroid_key, max_dist=0.03, metadata_filter=None, qc=True, qc_params={'k_clusters': 10}, buffer_size=100000, random_seed=4)
Performs a nearest neighbors search for a centroid constructed from marked cells.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
centroid_key (str) – The obs column key that marks the cells used to build the centroid with 1 and all other cells with 0.
max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.
metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.
qc (bool, default: True) – Whether to perform QC on the query.
qc_params (dict, default: {'k_clusters': 10}) – Parameters for the QC. k_clusters: the number of clusters in k-means clustering.
buffer_size (int, default: 100000) – Batch size for processing cells.
random_seed (int, default: 4) – Random seed for k-means clustering.
- Returns:
centroid_embedding (numpy.ndarray) – A 2D numpy array of the centroid embedding.
nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.
qc_stats (dict) – A dictionary of stats for QC.
- Return type:
Tuple[numpy.ndarray, List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame, dict]
Examples
>>> cells_used_in_query = adata.obs["celltype_name"] == "macrophage"
>>> adata.obs["used_in_query"] = cells_used_in_query.astype(int)
>>> centroid_embedding, nn_idxs, nn_dists, metadata, qc_stats = cq.search_centroid_exhaustive(adata, 'used_in_query')
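How a centroid might be built from the marked cells can be sketched with numpy. This is an assumption-laden illustration, not the library's implementation: centroid_profile is a hypothetical helper, and it assumes the raw counts of the marked cells are pooled and then log-normalized to a total of 1e4, as the "log normalized (1e4)" wording used for cluster centroids in this reference suggests. The step that turns this profile into an embedding is left out.

```python
import numpy as np


def centroid_profile(counts, marks, target_sum=1e4):
    """Hypothetical helper: pool counts of marked cells, then log-normalize.

    Assumption: pooling followed by log1p of counts scaled to `target_sum`.
    """
    marks = np.asarray(marks).astype(bool)
    if not marks.any():
        raise ValueError("no cells are marked with 1")
    # Sum raw counts per gene over the marked cells only.
    pooled = counts[marks].sum(axis=0)
    # Scale so the pooled profile sums to target_sum, then apply log1p.
    scaled = pooled / pooled.sum() * target_sum
    # Return a 2D array with a single row, one column per gene.
    return np.log1p(scaled)[np.newaxis, :]


# Toy counts matrix: three cells by two genes; the first two cells are marked.
counts = np.array([[1.0, 3.0], [2.0, 2.0], [10.0, 0.0]])
marks = np.array([1, 1, 0])
centroid = centroid_profile(counts, marks)
print(centroid.shape)  # (1, 2)
```

The third cell has no influence on the result because its marker is 0, which mirrors the role of the centroid_key column.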
- search_centroid_nearest(adata, centroid_key, k=10000, ef=None, max_dist=None, qc=True, qc_params={'k_clusters': 10}, random_seed=4)
Performs a nearest neighbors search for a centroid constructed from marked cells.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
centroid_key (str) – The obs column key that marks the cells used to build the centroid with 1 and all other cells with 0.
k (int, default: 10000) – The number of nearest neighbors.
ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
max_dist (float, optional) – If set, search with k=1000000 and then keep only the cells within this distance of the query. Overrides the k parameter.
qc (bool, default: True) – Whether to perform QC on the query.
qc_params (dict, default: {'k_clusters': 10}) – Parameters for the QC. k_clusters: the number of clusters in k-means clustering.
random_seed (int, default: 4) – Random seed for k-means clustering.
- Returns:
centroid_embedding (numpy.ndarray) – A 2D numpy array of the centroid embedding.
nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.
qc_stats (dict) – A dictionary of stats for QC.
- Return type:
Tuple[numpy.ndarray, List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame, dict]
Examples
>>> cells_used_in_query = adata.obs["celltype_name"] == "macrophage"
>>> adata.obs["used_in_query"] = cells_used_in_query.astype(int)
>>> centroid_embedding, nn_idxs, nn_dists, metadata, qc_stats = cq.search_centroid_nearest(adata, 'used_in_query')
- search_cluster_centroids_exhaustive(adata, cluster_key, cluster_label=None, max_dist=0.03, metadata_filter=None, buffer_size=100000, skip_null=True)
Performs a nearest neighbors search for cluster centroids against the knn.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
cluster_key (str) – The obs column key that contains cluster labels.
cluster_label (str, optional, default: None) – The cluster label of interest. If None, then get the centroids of all clusters, otherwise get only the centroid for the cluster of interest
max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.
metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.
buffer_size (int, default: 100000) – Batch size for processing cells.
skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.
- Returns:
centroid_embeddings (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroid embeddings.
cluster_idx (list) – A list of cluster labels, in the same order as the rows of centroid_embeddings.
nn_idxs (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor indices [num_cells x k].
nn_dists (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor distances [num_cells x k].
all_metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors for all centroids.
- Return type:
Tuple[numpy.ndarray, list, Dict[str, numpy.ndarray], Dict[str, numpy.ndarray], pandas.DataFrame]
Examples
>>> centroid_embeddings, cluster_idx, nn_idxs, nn_dists, all_metadata = cq.search_cluster_centroids_exhaustive(adata, "leiden")
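The metadata_filter argument is described as a dictionary whose keys are column names and whose values are valid terms. One plausible reading is sketched below with pandas. apply_metadata_filter is a hypothetical helper, and the choices to accept either a single term or a list of terms and to require every column to match are assumptions rather than documented behavior.

```python
import pandas as pd


def apply_metadata_filter(metadata, metadata_filter=None):
    """Hypothetical helper: keep rows whose columns match the allowed terms.

    Assumption: conditions on different columns are combined with AND.
    """
    if not metadata_filter:
        return metadata
    mask = pd.Series(True, index=metadata.index)
    for column, terms in metadata_filter.items():
        # Accept a single term or a collection of terms.
        if isinstance(terms, str):
            terms = [terms]
        mask &= metadata[column].isin(terms)
    return metadata[mask]


# Toy cell metadata with two columns that could be filtered on.
metadata = pd.DataFrame(
    {
        "tissue": ["lung", "lung", "blood", "liver"],
        "disease": ["healthy", "COVID-19", "healthy", "healthy"],
    }
)
filtered = apply_metadata_filter(
    metadata, {"tissue": ["lung", "blood"], "disease": "healthy"}
)
```

Only rows 0 and 2 survive here, since they are the only rows whose tissue is in the allowed list and whose disease is "healthy".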
- search_cluster_centroids_nearest(adata, cluster_key, cluster_label=None, k=10000, ef=None, skip_null=True, max_dist=None)
Performs a nearest neighbors search for cluster centroids against the knn.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes. Requires layers["counts"].
cluster_key (str) – The obs column key that contains cluster labels.
cluster_label (str, optional, default: None) – The cluster label of interest. If None, then get the centroids of all clusters, otherwise get only the centroid for the cluster of interest
k (int, default: 10000) – The number of nearest neighbors.
ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.
max_dist (float, optional) – If set, search with k=1000000 and then keep only the cells within this distance of the query. Overrides the k parameter.
- Returns:
centroid_embeddings (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroid embeddings.
cluster_idx (list) – A list of cluster labels, in the same order as the rows of centroid_embeddings.
nn_idxs (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor indices [num_cells x k].
nn_dists (Dict[str, numpy.ndarray]) – A dictionary mapping each cluster label to a 2D numpy array of nearest neighbor distances [num_cells x k].
all_metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors for all centroids.
- Return type:
Tuple[numpy.ndarray, list, Dict[str, numpy.ndarray], Dict[str, numpy.ndarray], pandas.DataFrame]
Examples
>>> centroid_embeddings, cluster_idx, nn_idxs, nn_dists, all_metadata = cq.search_cluster_centroids_nearest(adata, "leiden")
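The relationship between cluster_key, cluster_label, and skipped null labels can be illustrated with a simplified grouping sketch. cluster_centroids is a hypothetical helper, not the library's code. It takes a plain per-cell mean as the centroid, always drops null labels (the skip_null=True case), and orders clusters by sorted label. All three choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def cluster_centroids(values, labels, cluster_label=None):
    """Hypothetical helper: one mean vector per cluster, nulls skipped.

    Returns (centroids, cluster_idx) with rows ordered like cluster_idx.
    """
    labels = pd.Series(labels)
    if cluster_label is not None:
        wanted = [cluster_label]  # only the cluster of interest
    else:
        wanted = sorted(labels.dropna().unique())  # all non-null clusters
    centroids = [
        values[(labels == name).to_numpy()].mean(axis=0) for name in wanted
    ]
    return np.vstack(centroids), list(wanted)


# Four cells by two features; the third cell has no cluster label.
values = np.array([[0.0, 0.0], [6.0, 6.0], [9.0, 9.0], [4.0, 4.0]])
labels = ["a", "b", None, "a"]
centroids, cluster_idx = cluster_centroids(values, labels)
only_b, only_b_idx = cluster_centroids(values, labels, cluster_label="b")
```

The unlabeled third cell contributes to no centroid, and passing cluster_label restricts the output to a single row.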
- search_exhaustive(embeddings, max_dist=0.03, metadata_filter=None, buffer_size=100000)
Performs an exhaustive search.
- Parameters:
embeddings (numpy.ndarray) – Embeddings as a numpy array.
max_dist (float, default: 0.03) – Filter for cells that are within the max distance to the query.
metadata_filter (dict, optional, default: None) – A dictionary where keys represent column names and values represent valid terms in the columns.
buffer_size (int, default: 100000) – Batch size for processing cells.
- Returns:
nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of cell indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of cell distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata.
- Return type:
Tuple[List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame]
Examples
>>> nn_idxs, nn_dists, metadata = cq.search_exhaustive(embeddings)
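The idea of an exhaustive search, with max_dist as the cutoff and buffer_size as the batch size, can be sketched as a brute-force scan in numpy. This is not the library's implementation. exhaustive_search is a hypothetical helper, the reference embeddings are held in memory instead of being read from TileDB, and cosine distance is an assumption, as is the inclusive comparison against max_dist.

```python
import numpy as np


def exhaustive_search(query, reference, max_dist=0.03, buffer_size=100000):
    """Hypothetical helper: brute-force cosine-distance scan in batches.

    Assumes no zero vectors in `query` or `reference`.
    """
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    idx_parts = [[] for _ in range(len(query))]
    dist_parts = [[] for _ in range(len(query))]
    # Walk over the reference in blocks of at most buffer_size rows.
    for start in range(0, reference.shape[0], buffer_size):
        block = reference[start:start + buffer_size]
        block = block / np.linalg.norm(block, axis=1, keepdims=True)
        dists = 1.0 - query @ block.T  # cosine distance, [n_query x n_block]
        for row in range(len(query)):
            keep = np.flatnonzero(dists[row] <= max_dist)
            idx_parts[row].append(keep + start)  # back to global indices
            dist_parts[row].append(dists[row, keep])
    nn_idxs = [np.concatenate(parts) for parts in idx_parts]
    nn_dists = [np.concatenate(parts) for parts in dist_parts]
    return nn_idxs, nn_dists


# One query vector against four reference vectors, scanned two at a time.
reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [-1.0, 0.0]])
query = np.array([[1.0, 0.0]])
nn_idxs, nn_dists = exhaustive_search(query, reference, buffer_size=2)
```

Because the number of hits differs per query, the results come back as one array per query row, not as a single rectangular array. That is consistent with the List[numpy.ndarray] return types documented above.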
- search_nearest(embeddings, k=10000, ef=None, max_dist=None)
Performs a nearest neighbors search against the knn.
- Parameters:
embeddings (numpy.ndarray) – Embeddings as a numpy array.
k (int, default: 10000) – The number of nearest neighbors.
ef (int, default: None) – The size of the dynamic list for the nearest neighbors. Defaults to k if None. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
max_dist (float, optional) – If set, search with k=1000000 and then keep only the cells within this distance of the query. Overrides the k parameter.
- Returns:
nn_idxs (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor indices, one entry for every cell (row) in embeddings.
nn_dists (List[numpy.ndarray]) – A list of 2D numpy arrays of nearest neighbor distances, one entry for every cell (row) in embeddings.
metadata (pandas.DataFrame) – A pandas dataframe containing cell metadata for nearest neighbors.
- Return type:
Tuple[List[numpy.ndarray], List[numpy.ndarray], pandas.DataFrame]
Examples
>>> nn_idxs, nn_dists, metadata = cq.search_nearest(embeddings)
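The max_dist behavior described above, where a large fixed-k result is trimmed to the cells within a distance cutoff, can be sketched as a post-filter in numpy. filter_by_max_dist is a hypothetical helper. The inclusive comparison and the 1D shape of each per-query result are assumptions, and the k-NN lookup itself is not shown.

```python
import numpy as np


def filter_by_max_dist(nn_idxs, nn_dists, max_dist):
    """Hypothetical helper: trim rectangular k-NN results by distance.

    Input arrays are [num_cells x k]; output is one array per query cell.
    """
    kept_idxs, kept_dists = [], []
    for idxs, dists in zip(nn_idxs, nn_dists):
        keep = dists <= max_dist  # boolean mask over the k neighbors
        kept_idxs.append(idxs[keep])
        kept_dists.append(dists[keep])
    return kept_idxs, kept_dists


# Two query cells with k=3 neighbors each.
raw_idxs = np.array([[5, 2, 9], [1, 7, 3]])
raw_dists = np.array([[0.01, 0.02, 0.50], [0.04, 0.05, 0.06]])
kept_idxs, kept_dists = filter_by_max_dist(raw_idxs, raw_dists, max_dist=0.03)
```

The second query keeps no neighbors at this cutoff, so callers should be prepared for empty arrays in the returned lists.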