scimilarity.cell_annotation
- class scimilarity.cell_annotation.CellAnnotation(model_path, use_gpu=False, filenames=None)
Bases: CellSearchKNN
A class that annotates cells using a cell embedding and then knn search.
- Parameters:
model_path (str) –
use_gpu (bool) –
filenames (Optional[dict]) –
- annotate_dataset(data)
Annotate dataset with celltype predictions.
- Parameters:
data (anndata.AnnData) – The annotated data matrix with rows for cells and columns for genes. This function assumes the data has already been log normalized (e.g. via lognorm_counts).
- Returns:
- A data object where:
celltype predictions are in obs["celltype_hint"]
embeddings are in obsm["X_scimilarity"].
- Return type:
anndata.AnnData
Examples
>>> ca = CellAnnotation(model_path="/opt/data/model")
>>> data = ca.annotate_dataset(data)
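The log-normalization assumption can be illustrated without the library. The sketch below is not scimilarity's lognorm_counts; it assumes the common convention of scaling each cell to a fixed total and then applying log(1 + x). The helper name lognorm_cell, the target_sum of 10,000, and the toy counts are all assumptions made for illustration.

```python
import math

def lognorm_cell(counts, target_sum=1e4):
    """Scale one cell's raw counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to scale
    return [math.log1p(c * target_sum / total) for c in counts]

# Hypothetical raw counts: 2 cells x 3 genes.
raw = [[0, 5, 15], [10, 0, 0]]
normalized = [lognorm_cell(cell) for cell in raw]
```

Data that has already been through a step like this should not be normalized a second time before calling annotate_dataset.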
- blocklist_celltypes(labels)
Blocklist celltypes.
- Parameters:
labels (List[str], Set[str]) – A list or set containing blocklist labels.
Notes
Blocking a celltype persists for this instance of the class, and all subsequent predictions will use this blocklist. Blocklists and safelists are mutually exclusive: setting one clears the other.
Examples
>>> ca.blocklist_celltypes(["T cell"])
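The persistence and mutual-exclusion behaviour described in the Notes can be sketched with a small stand-alone class. This is only an illustration of the stated semantics, not scimilarity's implementation; the class LabelFilter and its method and attribute names are hypothetical.

```python
class LabelFilter:
    """Hypothetical holder for blocklist/safelist state (not scimilarity code)."""

    def __init__(self):
        self.blocklist = None
        self.safelist = None

    def set_blocklist(self, labels):
        self.blocklist = set(labels)
        self.safelist = None  # setting a blocklist clears any safelist

    def set_safelist(self, labels):
        self.safelist = set(labels)
        self.blocklist = None  # setting a safelist clears any blocklist

f = LabelFilter()
f.set_safelist(["B cell"])
f.set_blocklist(["T cell"])  # the earlier safelist is discarded here
```

Because the state lives on the instance, it stays in effect for every later prediction until one of the two setters is called again.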
- build_knn(input_data, knn_filename='labelled_kNN.bin', celltype_labels_filename='reference_labels.tsv', obs_field='celltype_name', ef_construction=1000, M=80, target_labels=None)
Build and save a knn index from an h5ad data file or directory of aligned.zarr stores.
- Parameters:
input_data (Union[anndata.AnnData, List[str]]) – Either an annotated data matrix with rows for cells and columns for genes, or a list of zarr store locations (zarr format saved by anndata). Zarr data should contain cells that are already log normalized and aligned to the gene space. NOTE: The data should be curated to contain only valid cell ontology labels.
knn_filename (str, default: "labelled_kNN.bin") – Filename of the knn index.
celltype_labels_filename (str, default: "reference_labels.tsv") – Filename of the cell type reference labels.
obs_field (str, default: "celltype_name") – The obs column name of celltype labels.
ef_construction (int, default: 1000) – The size of the dynamic list for the nearest neighbors. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
M (int, default: 80) – The number of bi-directional links created for every new element during construction. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
target_labels (Optional[List[str]], default: None) – Optional list of cell type names to filter the data.
Examples
>>> data = anndata.read_h5ad("/opt/data/train/train.h5ad")
>>> ca.build_knn(input_data=data)
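How build_knn applies target_labels internally is not stated here. One plausible reading is that only cells whose obs_field value appears in target_labels are kept for the index; the sketch below illustrates that reading with plain lists. The function filter_by_labels and the example labels are hypothetical.

```python
def filter_by_labels(cell_labels, target_labels=None):
    """Return indices of cells to keep; target_labels=None keeps every cell."""
    if target_labels is None:
        return list(range(len(cell_labels)))
    wanted = set(target_labels)
    return [i for i, label in enumerate(cell_labels) if label in wanted]

# Hypothetical contents of the obs column named by obs_field.
labels = ["T cell", "B cell", "monocyte", "B cell"]
keep = filter_by_labels(labels, target_labels=["B cell"])
```

Under this reading, the rows selected by keep are the only ones that would be embedded and added to the index.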
- get_predictions_knn(embeddings, k=50, ef=100, weighting=False, disable_progress=False)
Get predictions from knn search results.
- Parameters:
embeddings (numpy.ndarray) – Embeddings as a numpy array.
k (int, default: 50) – The number of nearest neighbors.
ef (int, default: 100) – The size of the dynamic list for the nearest neighbors. See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
weighting (bool, default: False) – Use distance weighting when getting the consensus prediction.
disable_progress (bool, default: False) – Disable the tqdm progress bar.
- Returns:
predictions (pandas.Series) – A pandas series containing celltype label predictions.
nn_idxs (numpy.ndarray) – A 2D numpy array of nearest neighbor indices [num_cells x k].
nn_dists (numpy.ndarray) – A 2D numpy array of nearest neighbor distances [num_cells x k].
stats (pandas.DataFrame) – Prediction statistics dataframe with columns:
"hits" is a json string with the count for every class in k cells.
"min_dist" is the minimum distance.
"max_dist" is the maximum distance.
"vs2nd" is sum(best) / sum(best + 2nd best).
"vsAll" is sum(best) / sum(all hits).
"hits_weighted" is a json string with the weighted count for every class in k cells.
"vs2nd_weighted" is weighted sum(best) / sum(best + 2nd best).
"vsAll_weighted" is weighted sum(best) / sum(all hits).
- Return type:
Tuple[pandas.Series, numpy.ndarray, numpy.ndarray, pandas.DataFrame]
Examples
>>> ca = CellAnnotation(model_path="/opt/data/model")
>>> embeddings = ca.get_embeddings(align_dataset(data, ca.gene_order).X)
>>> predictions, nn_idxs, nn_dists, stats = ca.get_predictions_knn(embeddings)
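The statistics columns listed under Returns can be reproduced for a single cell from its neighbor labels and distances. The sketch below follows the column definitions given above; it is not the library's code. The function name vote_stats is hypothetical, and the weighting scheme (each neighbor weighted by the inverse of its distance, with a small floor to avoid division by zero) is an assumption, since the weighting formula is not specified here.

```python
import json
from collections import Counter

def vote_stats(neighbor_labels, neighbor_dists):
    """Summarize one cell's k nearest neighbors with the documented statistics."""
    hits = Counter(neighbor_labels)
    ranked = hits.most_common()
    best_label, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0

    # Assumption: each neighbor's weight is the inverse of its distance.
    weighted = Counter()
    for label, dist in zip(neighbor_labels, neighbor_dists):
        weighted[label] += 1.0 / max(dist, 1e-6)
    w_ranked = weighted.most_common()
    w_best = w_ranked[0][1]
    w_second = w_ranked[1][1] if len(w_ranked) > 1 else 0.0

    return {
        "prediction": best_label,  # unweighted consensus label
        "hits": json.dumps(dict(hits)),
        "min_dist": min(neighbor_dists),
        "max_dist": max(neighbor_dists),
        "vs2nd": best / (best + second),
        "vsAll": best / sum(hits.values()),
        "hits_weighted": json.dumps(dict(weighted)),
        "vs2nd_weighted": w_best / (w_best + w_second),
        "vsAll_weighted": w_best / sum(weighted.values()),
    }

# Hypothetical neighbors for one cell with k = 5.
stats = vote_stats(
    ["T cell", "T cell", "B cell", "T cell", "NK cell"],
    [0.1, 0.2, 0.25, 0.4, 0.5],
)
```

Values of "vs2nd" and "vsAll" near 1 indicate that the neighbors agree strongly on one label, while values near 0.5 for "vs2nd" indicate a close call between the top two labels.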
- safelist_celltypes(labels)
Safelist celltypes.
- Parameters:
labels (List[str], Set[str]) – A list or set containing safelist labels.
Notes
Safelisting a celltype persists for this instance of the class, and all subsequent predictions will use this safelist. Blocklists and safelists are mutually exclusive: setting one clears the other.
Examples
>>> ca.safelist_celltypes(["CD4-positive, alpha-beta T cell"])
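To show the effect a safelist or blocklist has on a prediction, the sketch below filters a cell's candidate neighbor labels before any vote is taken. Whether the library filters during the knn query or afterwards is not documented here, so this is an assumption about the observable effect only; the function allowed_labels is hypothetical.

```python
def allowed_labels(neighbor_labels, blocklist=None, safelist=None):
    """Keep only neighbor labels permitted by the active list, if any."""
    if blocklist is not None and safelist is not None:
        raise ValueError("blocklist and safelist are mutually exclusive")
    if blocklist is not None:
        return [lab for lab in neighbor_labels if lab not in blocklist]
    if safelist is not None:
        return [lab for lab in neighbor_labels if lab in safelist]
    return list(neighbor_labels)

# Hypothetical neighbor labels for one cell.
neighbors = ["T cell", "CD4-positive, alpha-beta T cell", "B cell"]
kept = allowed_labels(neighbors, safelist={"CD4-positive, alpha-beta T cell"})
```

With a safelist, only listed labels can ever be predicted; with a blocklist, every label except the listed ones remains available.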