scimilarity.utils
- scimilarity.utils.align_dataset(data, target_gene_order, keep_obsm=True, gene_overlap_threshold=5000)
Align the gene space to the target gene order.
- Parameters:
data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
target_gene_order (numpy.ndarray) – An array containing the gene space.
keep_obsm (bool, default: True) – Retain the original data’s obsm matrices in output.
gene_overlap_threshold (int, default: 5000) – The minimum number of genes in common between data and target_gene_order to be valid.
- Returns:
A data object with aligned gene space ready to be used for embedding cells.
- Return type:
anndata.AnnData
Examples
>>> ca = CellAnnotation(model_path="/opt/data/model")
>>> align_dataset(data, ca.gene_order)
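To make the alignment step concrete, here is a minimal NumPy sketch of the idea, not SCimilarity's implementation. It assumes that genes missing from the input are zero-filled and that too little overlap raises an error; `align_matrix` and the toy gene names are hypothetical.

```python
import numpy as np

def align_matrix(X, genes, target_gene_order, gene_overlap_threshold=5000):
    """Reorder the columns of X (cells x genes) to follow target_gene_order."""
    position = {str(gene): i for i, gene in enumerate(genes)}
    shared = [g for g in target_gene_order if str(g) in position]
    if len(shared) < gene_overlap_threshold:
        raise ValueError(
            f"Only {len(shared)} genes overlap; need {gene_overlap_threshold}."
        )
    # Assumption: genes absent from the input are filled with zeros.
    aligned = np.zeros((X.shape[0], len(target_gene_order)), dtype=X.dtype)
    for j, gene in enumerate(target_gene_order):
        if str(gene) in position:
            aligned[:, j] = X[:, position[str(gene)]]
    return aligned

X = np.array([[1, 2, 3], [4, 5, 6]])
genes = ["B", "A", "C"]
target = np.array(["A", "B", "D"])
aligned = align_matrix(X, genes, target, gene_overlap_threshold=2)
```

Gene "C" is dropped because it is not in the target space, and "D" becomes an all-zero column.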
- scimilarity.utils.categorize_and_sort_by_score(df, name_column, score_column, ascending=False, topn=None)
Transform a column into a category, sort it by score, and optionally keep the top n.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
name_column (str) – Name of column to sort.
score_column (str) – Name of score column to sort name_column by.
ascending (bool, default: False) – Whether to sort in ascending order.
topn (Optional[int], default: None) – Subset to the top n values of name_column (for example, the top n diseases).
- Returns:
A sorted dataframe that is optionally subsetted to top n.
- Return type:
pandas.DataFrame
Examples
>>> df = categorize_and_sort_by_score(df, "disease", "Hit Percentage", topn=10)
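The following pandas sketch shows one way such a transformation could work. It is an illustration, not the library's code: `categorize_and_sort` is a hypothetical helper, and ranking names by their mean score is an assumption.

```python
import pandas as pd

def categorize_and_sort(df, name_column, score_column, ascending=False, topn=None):
    # Assumption: names are ranked by their mean score.
    order = (
        df.groupby(name_column)[score_column]
        .mean()
        .sort_values(ascending=ascending)
        .index.tolist()
    )
    if topn is not None:
        order = order[:topn]
    out = df[df[name_column].isin(order)].copy()
    # An ordered categorical makes sort_values follow the score ranking.
    out[name_column] = pd.Categorical(out[name_column], categories=order, ordered=True)
    return out.sort_values(name_column)

df = pd.DataFrame({
    "disease": ["a", "b", "c", "a", "b"],
    "score": [1.0, 5.0, 3.0, 2.0, 6.0],
})
top2 = categorize_and_sort(df, "disease", "score", topn=2)
```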
- scimilarity.utils.clean_diseases(diseases)
Mapper to clean disease names.
- Parameters:
diseases (pandas.Series) – A pandas Series containing disease names.
- Returns:
A pandas Series containing cleaned disease names.
- Return type:
pandas.Series
Examples
>>> data.obs["disease_simple"] = clean_diseases(data.obs["disease"]).fillna("healthy")
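The `.fillna("healthy")` in the example suggests that names without a mapping come back as missing values. The sketch below illustrates that pattern with a plain `Series.map`; the mapping table is hypothetical and far smaller than whatever the library actually uses.

```python
import pandas as pd

# Hypothetical mapping table, for illustration only.
disease_map = {
    "covid19": "COVID-19",
    "IPF": "idiopathic pulmonary fibrosis",
}

raw = pd.Series(["covid19", "IPF", "normal"])
cleaned = raw.map(disease_map)       # names without a mapping become NaN
simple = cleaned.fillna("healthy")   # then receive a default label
```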
- scimilarity.utils.clean_tissues(tissues)
Mapper to clean tissue names.
- Parameters:
tissues (pandas.Series) – A pandas Series containing tissue names.
- Returns:
A pandas Series containing cleaned tissue names.
- Return type:
pandas.Series
Examples
>>> data.obs["tissue_simple"] = clean_tissues(data.obs["tissue"]).fillna("other tissue")
- scimilarity.utils.consolidate_duplicate_symbols(adata)
Consolidate duplicate gene symbols with sum.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
- Returns:
AnnData object with duplicate gene symbols consolidated.
- Return type:
anndata.AnnData
Examples
>>> adata = consolidate_duplicate_symbols(adata)
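As a rough picture of what consolidation by sum means, the NumPy sketch below adds together the columns that share a gene symbol. `sum_duplicate_columns` is a hypothetical helper operating on a dense matrix; the sorted output order comes from `np.unique` and is not a claim about the library's ordering.

```python
import numpy as np

def sum_duplicate_columns(X, symbols):
    """Sum the columns of X (cells x genes) that share a gene symbol."""
    unique, inverse = np.unique(np.asarray(symbols), return_inverse=True)
    out = np.zeros((X.shape[0], len(unique)), dtype=X.dtype)
    # Add each original column into the column of its unique symbol.
    for j, k in enumerate(inverse):
        out[:, k] += X[:, j]
    return out, unique

X = np.array([[1, 2, 3], [4, 5, 6]])
merged, symbols = sum_duplicate_columns(X, ["TP53", "ACTB", "TP53"])
```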
- scimilarity.utils.filter_cells(data, min_genes=400, mito_prefix=None, mito_percent=30.0)
QC filter cells in the dataset from gene expression raw counts.
- Parameters:
data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
min_genes (int, default: 400) – The minimum number of expressed genes in order not to be filtered out.
mito_prefix (str, optional, default: None) – The prefix that identifies mitochondrial genes, typically "MT-" or "mt-". If None, the function tries to infer which of the two applies.
mito_percent (float, default: 30.0) – The maximum percentage of mitochondrial gene expression allowed for a cell not to be filtered out.
- Returns:
A data object with cells filtered out based on QC metrics that is ready to be used in further processes.
- Return type:
anndata.AnnData
Examples
>>> data = filter_cells(data)
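The QC logic can be pictured with a small NumPy sketch. This is an illustration under assumptions, not the library's code: `qc_mask` is a hypothetical helper, the mitochondrial percentage is taken as the share of a cell's total counts, and both thresholds are treated as inclusive.

```python
import numpy as np

def qc_mask(counts, genes, min_genes=400, mito_prefix="MT-", mito_percent=30.0):
    """Return a boolean mask of cells that pass both QC thresholds."""
    counts = np.asarray(counts, dtype=float)
    n_genes = (counts > 0).sum(axis=1)            # expressed genes per cell
    is_mito = np.array([g.startswith(mito_prefix) for g in genes])
    total = counts.sum(axis=1)
    # Guard against division by zero for cells with no counts.
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)
    return (n_genes >= min_genes) & (pct_mito <= mito_percent)

counts = [[5, 0, 5, 0], [1, 1, 1, 7], [0, 0, 0, 3]]
genes = ["A", "B", "C", "MT-CO1"]
mask = qc_mask(counts, genes, min_genes=2, mito_percent=30.0)
```

Here the second cell fails on mitochondrial content (70%) and the third on the number of expressed genes.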
- scimilarity.utils.get_centroid(counts)
Get the centroid for a raw counts matrix.
- Parameters:
counts (scipy.sparse.csr_matrix, numpy.ndarray) – Raw gene expression counts.
- Returns:
A 2D numpy array of the log-normalized (per 1e4) expression values for the centroid.
- Return type:
numpy.ndarray
Examples
>>> centroid = get_centroid(data.get_matrix("counts"))
>>> centroid = get_centroid(data.layers["counts"])
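One plausible reading of this computation is sketched below with NumPy: pool the raw counts over all cells, scale the pooled profile to 1e4, and apply log1p. Pooling before normalizing is an assumption, and `centroid_from_counts` is a hypothetical name.

```python
import numpy as np

def centroid_from_counts(counts, target_sum=1e4):
    """Pool raw counts over cells, then log-normalize the pooled profile."""
    summed = np.asarray(counts, dtype=float).sum(axis=0).reshape(1, -1)
    # Assumption: normalization happens after pooling, not per cell.
    return np.log1p(summed * (target_sum / summed.sum()))

counts = np.array([[1, 0], [1, 2]])
centroid = centroid_from_counts(counts)
```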
- scimilarity.utils.get_cluster_centroids(data, target_gene_order, cluster_key, cluster_label=None, skip_null=True)
Get centroids of clusters based on raw read counts.
- Parameters:
data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
target_gene_order (numpy.ndarray) – An array containing the gene space.
cluster_key (str) – The obs column name that contains cluster labels.
cluster_label (str, optional, default: None) – The cluster label of interest. If None, get the centroids of all clusters; otherwise get only the centroid of the cluster of interest.
skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.
- Returns:
centroids (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroids.
cluster_idx (list) – A list of cluster labels corresponding to the order returned in centroids.
- Return type:
Tuple[numpy.ndarray, list]
Examples
>>> centroids, cluster_idx = get_cluster_centroids(data, gene_order, "cluster_label")
- scimilarity.utils.get_dist2centroid(centroid_embedding, X)
Get the mean distance of cells in X to the centroid embedding.
- Parameters:
centroid_embedding (numpy.ndarray) – The embedding of the centroid.
X (scipy.sparse.csr_matrix, numpy.ndarray) – Either the embeddings of SCimilarity log normalized gene expression values, or the SCimilarity log normalized gene expression values themselves.
embed (bool, default: False) – Whether to embed X.
- Returns:
The mean distance of cells in X to the centroid embedding.
- Return type:
float
Examples
>>> distances = cq.get_dist2centroid(centroid_embedding, X)
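A mean distance to a centroid can be sketched as follows. The choice of cosine distance is an assumption made for illustration (the text above does not name the metric), and `mean_cosine_distance` is a hypothetical helper that expects X to already be in embedding space.

```python
import numpy as np

def mean_cosine_distance(centroid_embedding, X):
    """Mean cosine distance between each row of X and the centroid."""
    c = np.asarray(centroid_embedding, dtype=float).reshape(-1)
    X = np.asarray(X, dtype=float)
    cos_sim = (X @ c) / (np.linalg.norm(X, axis=1) * np.linalg.norm(c))
    return float(np.mean(1.0 - cos_sim))

centroid_embedding = np.array([[1.0, 0.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
dist = mean_cosine_distance(centroid_embedding, X)
```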
- scimilarity.utils.lognorm_counts(data)
Log normalize the gene expression raw counts (per 10k).
- Parameters:
data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
- Returns:
A data object with normalized data that is ready to be used in further processes.
- Return type:
anndata.AnnData
Examples
>>> data = lognorm_counts(data)
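Per-10k log normalization is commonly computed as log1p of each cell's counts scaled to a total of 1e4. The NumPy sketch below shows that arithmetic on a dense matrix; `lognorm` is a hypothetical helper, not the library function, which operates on AnnData objects.

```python
import numpy as np

def lognorm(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)

normed = lognorm([[1, 1], [0, 4]])
```

Undoing the log with `np.expm1` recovers rows that each sum to 1e4.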
- scimilarity.utils.optimize_tiledb_array(tiledb_array_uri, steps=100000, step_max_frags=10, buffer_size=1000000000, total_budget=200000000000, verbose=True)
Optimize TileDB Array.
- Parameters:
tiledb_array_uri (str) – URI for the TileDB array.
verbose (bool) – Boolean indicating whether to use verbose printing.
step_max_frags (int) –
buffer_size (int) –
total_budget (int) –
- scimilarity.utils.pseudobulk_anndata(adata, groupby_labels, qc_filters=None, min_num_cells=1, only_orig_genes=False)
Pseudobulk an AnnData and return a new AnnData.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
groupby_labels (Union[str, list]) – Label or list of labels to group by before pseudobulking. For example, ["sample", "tissue", "disease", "celltype_name"] groups by these columns and pseudobulks each resulting group.
qc_filters (dict, optional, default: None) –
- Dictionary containing cell filters to apply before pseudobulking:
"mito_percent": maximum percent of reads in mitochondrial genes
"min_counts": minimum read count per cell
"min_genes": minimum number of genes with reads per cell
"max_nn_dist": maximum nearest neighbor distance to a reference label for predicted labels
min_num_cells (int, default: 1) – The minimum number of cells in a pseudobulk in order to be considered.
only_orig_genes (bool, default: False) – Account for an aligned gene space by masking genes that were not in the original dataset, setting their pseudobulk values to NaN. Assumes the original gene list is in adata.uns["orig_genes"].
- Returns:
A data object where pseudobulk counts are in layers["counts"] and the detection rate is in layers["detection"].
- Return type:
anndata.AnnData
Examples
>>> groupby_labels = ["sample", "tissue_raw", "celltype_name"]
>>> qc_filters = {"mito_percent": 20.0, "min_counts": 1000, "min_genes": 500, "max_nn_dist": 0.03, "max_nn_dist_col": "min_dist"}
>>> pseudobulk = pseudobulk_anndata(adata, groupby_labels, qc_filters=qc_filters, only_orig_genes=True)
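The grouping and aggregation can be pictured with the sketch below, which works on a dense count matrix and a pandas `obs` table rather than an AnnData. It is an illustration under assumptions: `pseudobulk_sketch` is hypothetical, counts are summed per group, and the detection rate is taken to be the fraction of cells in the group with a nonzero count for each gene.

```python
import numpy as np
import pandas as pd

def pseudobulk_sketch(counts, obs, groupby_labels, min_num_cells=1):
    """Sum counts per group and compute the per-gene detection rate."""
    counts = np.asarray(counts, dtype=float)
    groups = {}
    # Collect row positions for every combination of group labels.
    for row, key in enumerate(zip(*(obs[label] for label in groupby_labels))):
        groups.setdefault(key, []).append(row)
    keys, sums, detection = [], [], []
    for key in sorted(groups):
        rows = groups[key]
        if len(rows) < min_num_cells:
            continue  # skip pseudobulks with too few cells
        block = counts[rows]
        keys.append(key)
        sums.append(block.sum(axis=0))
        detection.append((block > 0).mean(axis=0))
    return keys, np.vstack(sums), np.vstack(detection)

obs = pd.DataFrame({"sample": ["s1", "s1", "s2"], "celltype_name": ["T", "T", "B"]})
counts = [[1, 0], [3, 2], [0, 5]]
keys, sums, detection = pseudobulk_sketch(counts, obs, ["sample", "celltype_name"])
```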
- scimilarity.utils.query_tiledb_df(tdb, query_condition, attrs=None)
Query TileDB DataFrame.
- Parameters:
tdb (tiledb.libtiledb.DenseArrayImpl) – TileDB dataframe.
query_condition (str) – Query condition.
attrs (list, optional, default: None) – Columns to return in the results.
- Return type:
pandas.DataFrame
- scimilarity.utils.subset_by_frequency(df, group_columns, n)
Subset the DataFrame to only rows whose group appears at least n times.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
group_columns (Union[List[str], str]) – Columns to group by.
n (int) – Minimum number of times a group must appear to be included.
- Returns:
A subsetted dataframe.
- Return type:
pandas.DataFrame
Examples
>>> df = subset_by_frequency(df, ["disease", "prediction"], 10)
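An equivalent filter can be written with pandas `GroupBy.filter`, as in the sketch below. `keep_frequent_groups` is a hypothetical helper shown for illustration; it is not a claim about how the library implements the subset.

```python
import pandas as pd

def keep_frequent_groups(df, group_columns, n):
    """Keep rows whose combination of group_columns occurs at least n times."""
    return df.groupby(group_columns).filter(lambda g: len(g) >= n)

df = pd.DataFrame({
    "disease": ["a", "a", "b", "a", "b", "c"],
    "prediction": ["x", "x", "y", "x", "y", "z"],
})
frequent = keep_frequent_groups(df, ["disease", "prediction"], 2)
```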
- scimilarity.utils.subset_by_unique_values(df, group_columns, value_column, n)
Subset a pandas dataframe to only include rows where there are at least n unique values from value_column, for each grouping of group_columns.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
group_columns (Union[List[str], str]) – Columns to group by.
value_column (str) – Column whose unique values are counted within each group.
n (int) – Minimum number of unique values a group must have to be included.
- Returns:
A subsetted dataframe.
- Return type:
pandas.DataFrame
Examples
>>> df = subset_by_unique_values(df, "disease", "sample", 10)
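The same idea, counting distinct values instead of rows, is sketched below with pandas. `keep_groups_with_unique_values` is a hypothetical helper for illustration only.

```python
import pandas as pd

def keep_groups_with_unique_values(df, group_columns, value_column, n):
    """Keep groups that contain at least n distinct values in value_column."""
    return df.groupby(group_columns).filter(
        lambda g: g[value_column].nunique() >= n
    )

df = pd.DataFrame({
    "disease": ["a", "a", "a", "b", "b"],
    "sample": ["s1", "s2", "s2", "s3", "s3"],
})
kept = keep_groups_with_unique_values(df, "disease", "sample", 2)
```

Disease "b" is dropped because both of its rows come from the same sample.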
- scimilarity.utils.write_array_to_tiledb(tdb, arr, value_type, row_start=0, batch_size=100000)
Write numpy array to TileDB.
- Parameters:
tdb (tiledb.libtiledb.DenseArrayImpl) – TileDB array.
arr (numpy.ndarray) – Dense numpy array.
value_type (type) – The type of the value, typically np.float32.
row_start (int, default: 0) – The starting row in the TileDB array.
batch_size (int, default: 100000) – Batch size for the tiles.
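How row_start and batch_size might interact is sketched below without TileDB: a plain NumPy array stands in for the target, and rows are copied in batches at an offset. `batch_ranges` is a hypothetical helper, and nothing here reflects the actual TileDB write calls.

```python
import numpy as np

def batch_ranges(n_rows, row_start=0, batch_size=100000):
    """Yield (source_slice, target_slice) pairs covering n_rows in batches."""
    for start in range(0, n_rows, batch_size):
        end = min(start + batch_size, n_rows)
        yield slice(start, end), slice(row_start + start, row_start + end)

arr = np.arange(10, dtype=np.float32).reshape(5, 2)
target = np.zeros((8, 2), dtype=np.float32)  # stand-in for the TileDB array
for src, dst in batch_ranges(arr.shape[0], row_start=3, batch_size=2):
    target[dst] = arr[src]  # the last batch holds the single remaining row
```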
- scimilarity.utils.write_csr_to_tiledb(tdb, matrix, value_type, row_start=0, batch_size=25000)
Write csr_matrix to TileDB.
- Parameters:
tdb (tiledb.libtiledb.SparseArrayImpl) – TileDB array.
matrix (scipy.sparse.csr_matrix) – Sparse csr_matrix to write.
value_type (type) – The type of the value, typically np.float32.
row_start (int, default: 0) – The starting row in the TileDB array.
batch_size (int, default: 25000) – Batch size for the tiles.