scimilarity.utils

scimilarity.utils.adata_from_tiledb(cell_idx, tiledb_base_path, gene_order=None, sample_uri='sample_metadata', gene_uri='gene_annotation', cell_uri='cell_metadata', counts_uri='counts', lognorm=True, target_sum=10000.0, config=None)

Constructs an AnnData object from cells in tiledb.

Parameters:

cell_idx (Union[list, "numpy.ndarray"]) – Cell indices in the tiledb.
tiledb_base_path (str) – Base path of tiledb store
gene_order (List[str], optional, default: None) – Gene order
sample_uri (str, default: "sample_metadata") – Relative path of sample metadata store
gene_uri (str, default: "gene_annotation") – Relative path of gene annotation store
cell_uri (str, default: "cell_metadata") – Relative path of cell metadata store
counts_uri (str, default: "counts") – Relative path of count matrix store
lognorm (bool, default: True) – Whether to return log normalized expression instead of raw counts.
target_sum (float, default: 1e4) – Target sum for log normalization.
config (tiledb.ctx.Config, optional, default: None) – Custom tiledb config

Returns:

A data object where counts are in layers["counts"] and X is the log-normalized expression.

Return type:

anndata.AnnData

Examples

>>> adata = adata_from_tiledb(cell_idx, tiledb_base_path, gene_order=gene_order)

scimilarity.utils.align_dataset(data, target_gene_order, keep_obsm=True, gene_overlap_threshold=5000)

Align the gene space to the target gene order.

Parameters:

data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
target_gene_order (list) – A list containing the gene space.
keep_obsm (bool, default: True) – Retain the original data’s obsm matrices in output.
gene_overlap_threshold (int, default: 5000) – The minimum number of genes in common between data and target_gene_order to be valid.

Returns:

A data object with aligned gene space ready to be used for embedding cells.

Return type:

anndata.AnnData

Examples

>>> ca = CellAnnotation(model_path="/opt/data/model")
>>> data = align_dataset(data, ca.gene_order)
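
To make the idea of gene-space alignment concrete, here is a minimal sketch on a plain pandas DataFrame rather than an AnnData object. It assumes that genes missing from the data are zero-filled and that genes outside the target order are dropped; the helper name align_columns and the toy gene names are hypothetical, and this is not SCimilarity's implementation.

```python
import pandas as pd

def align_columns(expr, target_gene_order, gene_overlap_threshold=1):
    """Hypothetical helper: reorder gene columns to match a target gene order."""
    # Count how many target genes are actually present in the data.
    overlap = [g for g in target_gene_order if g in expr.columns]
    if len(overlap) < gene_overlap_threshold:
        raise ValueError("Too few genes in common with the target gene order.")
    # Assumption: genes absent from the data become zero-filled columns,
    # and genes not in the target order are dropped.
    return expr.reindex(columns=target_gene_order, fill_value=0)

expr = pd.DataFrame(
    {"GENE_B": [1, 0], "GENE_A": [3, 2], "GENE_X": [5, 5]},
    index=["cell1", "cell2"],
)
aligned = align_columns(expr, ["GENE_A", "GENE_B", "GENE_C"])
```

The overlap check mirrors the role of gene_overlap_threshold: a dataset that shares too few genes with the target space is rejected rather than silently padded.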

scimilarity.utils.categorize_and_sort_by_score(df, name_column, score_column, ascending=False, topn=None)

Transform column into category, sort, and choose top n

Parameters:

df ("pandas.DataFrame") – Pandas dataframe.
name_column (str) – Name of column to sort.
score_column (str) – Name of score column to sort name_column by.
ascending (bool, default: False) – Whether to sort in ascending order.
topn (Optional[int], default: None) – Subset to the top n entries (for example, the top n diseases).

Returns:

A sorted dataframe that is optionally subsetted to top n.

Return type:

pandas.DataFrame

Examples

>>> df = categorize_and_sort_by_score(df, "disease", "Hit Percentage", topn=10)

scimilarity.utils.clean_diseases(diseases)

Mapper to clean disease names.

Parameters:

diseases (pandas.Series) – A pandas Series containing disease names.

Returns:

A pandas Series containing cleaned disease names.

Return type:

pandas.Series

Examples

>>> data.obs["disease_simple"] = clean_diseases(data.obs["disease"]).fillna("healthy")

scimilarity.utils.clean_tissues(tissues)

Mapper to clean tissue names.

Parameters:

tissues (pandas.Series) – A pandas Series containing tissue names.

Returns:

A pandas Series containing cleaned tissue names.

Return type:

pandas.Series

Examples

>>> data.obs["tissue_simple"] = clean_tissues(data.obs["tissue"]).fillna("other tissue")

scimilarity.utils.consolidate_duplicate_symbols(adata)

Consolidate duplicate gene symbols with sum.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

Returns:

AnnData object with duplicate gene symbols consolidated.

Return type:

anndata.AnnData

Examples

>>> adata = consolidate_duplicate_symbols(adata)
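
As an illustration of what "consolidate with sum" means, the following sketch sums columns that share a gene symbol in a plain pandas DataFrame. The toy data and gene names are made up, and the real function operates on an AnnData object, so treat this only as a picture of the operation.

```python
import pandas as pd

# Two columns share the symbol "GENE_A" (toy data).
expr = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["GENE_A", "GENE_B", "GENE_A"],
    index=["cell1", "cell2"],
)

# Transpose so gene symbols form the index, sum rows with the same symbol,
# then transpose back to cells x genes.
consolidated = expr.T.groupby(level=0).sum().T
```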

scimilarity.utils.convert_id2symbol(adata, mapping_table)

Convert Ensembl IDs to gene symbols via a mapping table.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
mapping_table (str) – A tsv file in Ensembl format with gene names in column "Gene stable ID".

Returns:

AnnData object with gene symbols as the adata.var.index.

Return type:

anndata.AnnData

Examples

>>> adata = convert_id2symbol(adata, mapping_table)

scimilarity.utils.embedding_from_tiledb(cell_idx, embedding_tdb_uri, config=None)

Get embeddings from a precomputed tiledb.

Parameters:

cell_idx (Union[list, "numpy.ndarray"]) – Cell indices in the tiledb.
embedding_tdb_uri (str) – Path of tiledb store
config (tiledb.ctx.Config, optional, default: None) – Custom tiledb config

Returns:

Array of embeddings

Return type:

numpy.ndarray

Examples

>>> embedding = embedding_from_tiledb(cell_idx, embedding_tdb_uri)

scimilarity.utils.filter_cells(data, min_genes=400, mito_prefix=None, mito_percent=30.0)

QC filter cells in the dataset from gene expression raw counts.

Parameters:

data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
min_genes (int, default: 400) – The minimum number of expressed genes in order not to be filtered out.
mito_prefix (str, optional, default: None) – The prefix that identifies mitochondrial genes, typically "MT-" or "mt-". If None, the function tries to infer which of "MT-" or "mt-" applies.
mito_percent (float, default: 30.0) – The maximum allowed percentage of expression from mitochondrial genes; cells above it are filtered out.

Returns:

A data object with cells filtered out based on QC metrics that is ready to be used in further processes.

Return type:

anndata.AnnData

Examples

>>> data = filter_cells(data)
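
The two QC criteria described above can be sketched with NumPy on a dense counts array. This is a minimal illustration, not the library's code: the helper name qc_mask is hypothetical, the thresholds are assumed to be inclusive, and the mitochondrial prefix is passed explicitly instead of being inferred.

```python
import numpy as np

def qc_mask(counts, gene_names, min_genes=400, mito_prefix="MT-", mito_percent=30.0):
    """Hypothetical helper: boolean mask of cells that pass both QC criteria."""
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([g.startswith(mito_prefix) for g in gene_names])
    # Number of genes with at least one count in each cell.
    n_genes = (counts > 0).sum(axis=1)
    # Percentage of each cell's counts that come from mitochondrial genes.
    total = counts.sum(axis=1)
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)
    # Assumption: both thresholds are inclusive.
    return (n_genes >= min_genes) & (pct_mito <= mito_percent)

genes = ["MT-CO1", "GENE_A", "GENE_B"]
counts = np.array([
    [1, 5, 4],  # 10% mitochondrial, 3 genes expressed
    [8, 1, 1],  # 80% mitochondrial, 3 genes expressed
    [0, 7, 0],  # 0% mitochondrial, 1 gene expressed
])
mask = qc_mask(counts, genes, min_genes=2)
```

Only the first toy cell passes: the second exceeds the mitochondrial threshold and the third expresses too few genes.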

scimilarity.utils.get_centroid(counts)

Get the centroid for a raw counts matrix.

Parameters:

counts (scipy.sparse.csr_matrix, numpy.ndarray) – Raw gene expression counts.

Returns:

A 2D numpy array of the log-normalized (1e4) centroid.

Return type:

numpy.ndarray

Examples

>>> centroid = get_centroid(data.get_matrix("counts"))
>>> centroid = get_centroid(data.layers["counts"])

scimilarity.utils.get_cluster_centroids(data, target_gene_order, cluster_key, cluster_label=None, skip_null=True)

Get centroids of clusters based on raw read counts.

Parameters:

data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
target_gene_order (numpy.ndarray) – An array containing the gene space.
cluster_key (str) – The obs column name that contains cluster labels.
cluster_label (str, optional, default: None) – The cluster label of interest. If None, then get the centroids of all clusters, otherwise get only the centroid for the cluster of interest
skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.

Returns:

centroids (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroids.
cluster_idx (list) – A list of cluster labels corresponding to the order returned in centroids.

Return type:

Tuple[numpy.ndarray, list]

Examples

>>> centroids, cluster_idx = get_cluster_centroids(data, gene_order, "cluster_label")

scimilarity.utils.get_dist2centroid(centroid_embedding, X)

Get the mean distance of cells to a centroid embedding.

Parameters:

centroid_embedding (numpy.ndarray) – The embedding of the centroid.
X (scipy.sparse.csr_matrix, numpy.ndarray) – Either the embeddings of SCimilarity log-normalized gene expression values, or the log-normalized gene expression values themselves.
embed (bool, default: False) – Whether to embed X.

Returns:

The mean distance of cells in X to the centroid embedding.

Return type:

float

Examples

>>> distance = get_dist2centroid(centroid_embedding, X)
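
The distance metric is not stated on this page. Purely as an illustration, the sketch below assumes cosine distance and computes the mean distance of a few embedded cells to a centroid with NumPy; the helper name mean_cosine_distance is hypothetical and the metric is an assumption, not a documented detail of get_dist2centroid.

```python
import numpy as np

def mean_cosine_distance(centroid_embedding, X):
    """Hypothetical helper: mean cosine distance from rows of X to a centroid."""
    c = np.asarray(centroid_embedding, dtype=float).reshape(1, -1)
    X = np.asarray(X, dtype=float)
    # Cosine similarity between every row of X and the centroid.
    sims = (X @ c.T).ravel() / (np.linalg.norm(X, axis=1) * np.linalg.norm(c))
    # Cosine distance is 1 - similarity; average over cells.
    return float(np.mean(1.0 - sims))

centroid = np.array([1.0, 0.0])
X = np.array([[1.0, 0.0],   # identical direction, distance 0
              [0.0, 1.0]])  # orthogonal, distance 1
d = mean_cosine_distance(centroid, X)
```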

scimilarity.utils.lognorm_counts(data)

Log normalize the gene expression raw counts (per 10k).

Parameters:

data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

Returns:

A data object with normalized data that is ready to be used in further processes.

Return type:

anndata.AnnData

Examples

>>> data = lognorm_counts(data)
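
For readers unfamiliar with "per 10k" log normalization, here is a minimal NumPy sketch on a dense counts array. It assumes the common convention of scaling each cell to a total of 1e4 and then applying the natural log of one plus the value; the helper name lognorm is hypothetical, and the library function works on AnnData objects.

```python
import numpy as np

def lognorm(counts, target_sum=1e4):
    """Hypothetical helper: scale each cell to target_sum, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for cells with no counts.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)

counts = np.array([[1, 3],
                   [0, 10]])
X = lognorm(counts)
```

Inverting the log with np.expm1 recovers values that sum to 1e4 per cell, which is a quick sanity check on the normalization.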

scimilarity.utils.optimize_tiledb_array(tiledb_array_uri, config=None, verbose=True)

Optimize TileDB Array.

Parameters:

tiledb_array_uri (str) – URI for the TileDB array.
config (tiledb.ctx.Config, optional, default: None) – Custom tiledb config
verbose (bool, default: True) – Whether to print verbose output.

scimilarity.utils.pseudobulk_anndata(adata, groupby_labels, qc_filters=None, min_num_cells=1, only_orig_genes=False)

Pseudobulk an AnnData and return a new AnnData.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
groupby_labels (Union[str, list]) – List of labels to group by prior to pseudobulking. For example, ["sample", "tissue", "disease", "celltype_name"] groups cells by these columns and pseudobulks each group.
qc_filters (dict, optional, default: None) – Dictionary containing cell filters to perform prior to pseudobulking:
    "mito_percent": max percent of reads in mitochondrial genes
    "min_counts": min read count for cell
    "min_genes": min number of genes with reads for cell
    "max_nn_dist": max nearest neighbor distance to a reference label for predicted labels
min_num_cells (int, default: 1) – The minimum number of cells in a pseudobulk in order to be considered.
only_orig_genes (bool, default: False) – Account for an aligned gene space and mask genes that were not in the original dataset with NaN as their pseudobulk. Assumes the original gene list is in adata.uns["orig_genes"].

Returns:

A data object where pseudobulk counts are in layers["counts"] and detection rate is in layers["detection"].

Return type:

anndata.AnnData

Examples

>>> groupby_labels = ["sample", "tissue_raw", "celltype_name"]
>>> qc_filters = {"mito_percent": 20.0, "min_counts": 1000, "min_genes": 500, "max_nn_dist": 0.03, "max_nn_dist_col": "min_dist"}
>>> pseudobulk = pseudobulk_anndata(adata, groupby_labels, qc_filters=qc_filters, only_orig_genes=True)
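
The grouping step can be pictured with a small pandas sketch. It assumes that pseudobulk counts are the per-group sums of raw counts and that the detection rate is the fraction of cells in a group with a nonzero count for a gene; both are assumptions made for illustration, the toy data are invented, and QC filtering is omitted.

```python
import pandas as pd

# Toy raw counts (cells x genes) and matching cell metadata.
counts = pd.DataFrame(
    {"GENE_A": [1, 0, 4], "GENE_B": [2, 2, 0]},
    index=["c1", "c2", "c3"],
)
obs = pd.DataFrame(
    {"sample": ["s1", "s1", "s2"],
     "celltype_name": ["T cell", "T cell", "B cell"]},
    index=counts.index,
)

# Group cells by the chosen metadata columns.
keys = [obs["sample"], obs["celltype_name"]]
# Assumption: pseudobulk counts are summed raw counts per group.
pseudobulk_counts = counts.groupby(keys).sum()
# Assumption: detection rate is the fraction of cells with a nonzero count.
detection = (counts > 0).groupby(keys).mean()
```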

scimilarity.utils.query_tiledb_df(tdb, query_condition, attrs=None)

Query TileDB DataFrame.

Parameters:

tdb (tiledb.libtiledb.DenseArrayImpl) – TileDB dataframe.
query_condition (str) – Query condition.
attrs (list, optional, default: None) – Columns to return in results

Return type:

pandas.DataFrame

scimilarity.utils.subset_by_frequency(df, group_columns, n)

Subset the DataFrame to only rows where the group appears at least n times.

Parameters:

df ("pandas.DataFrame") – Pandas dataframe
group_columns (Union[List[str], str]) – Columns to group by.
n (int) – Minimum number of values to be included.

Returns:

A subsetted dataframe.

Return type:

pandas.DataFrame

Examples

>>> df = subset_by_frequency(df, ["disease", "prediction"], 10)
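
One way to express this kind of filter in plain pandas is a groupby followed by a size transform. The sketch below uses toy data and is an illustration of the idea, not the function's source.

```python
import pandas as pd

df = pd.DataFrame({
    "disease": ["a", "a", "a", "b"],
    "prediction": ["x", "x", "x", "y"],
})

# Size of the group each row belongs to, aligned with the original rows.
group_sizes = df.groupby(["disease", "prediction"])["disease"].transform("size")
# Keep only rows whose group appears at least n = 3 times.
subset = df[group_sizes >= 3]
```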

scimilarity.utils.subset_by_unique_values(df, group_columns, value_column, n)

Subset a pandas dataframe to only include rows where there are at least n unique values from value_column, for each grouping of group_columns.

Parameters:

df ("pandas.DataFrame") – Pandas dataframe.
group_columns (Union[List[str], str]) – Columns to group by.
value_column (str) – Column value from which to check the number of instances.
n (int) – Minimum number of values to be included.

Returns:

A subsetted dataframe.

Return type:

pandas.DataFrame

Examples

>>> df = subset_by_unique_values(df, "disease", "sample", 10)
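
The same pattern works for unique-value counts by swapping the size transform for nunique. Again this is a toy illustration in plain pandas rather than the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "disease": ["a", "a", "a", "b", "b"],
    "sample": ["s1", "s2", "s2", "s3", "s3"],
})

# Number of distinct samples in the group each row belongs to.
n_unique = df.groupby("disease")["sample"].transform("nunique")
# Keep only rows from groups with at least n = 2 unique samples.
subset = df[n_unique >= 2]
```

Disease "a" has two distinct samples and is kept; disease "b" has only one and is dropped.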

scimilarity.utils.write_array_to_tiledb(tdb, arr, value_type, row_start=0, batch_size=100000)

Write numpy array to TileDB.

Parameters:

tdb (tiledb.libtiledb.DenseArrayImpl) – TileDB array.
arr (numpy.ndarray) – Dense 2D numpy array.
value_type (type) – The type of the value, typically np.float32.
row_start (int, default: 0) – The starting row in the TileDB array.
batch_size (int, default: 100000) – Batch size for the tiles.
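
The roles of row_start and batch_size can be illustrated without TileDB by writing into a preallocated NumPy array, which stands in here for an opened dense array that accepts slice assignment. The helper name write_in_batches is hypothetical, and this sketch only shows the batching arithmetic, not the library's write logic.

```python
import numpy as np

def write_in_batches(target, arr, value_type=np.float32, row_start=0, batch_size=100000):
    """Hypothetical helper: copy arr into target in row batches, offset by row_start."""
    for start in range(0, arr.shape[0], batch_size):
        stop = min(start + batch_size, arr.shape[0])
        # Each batch lands at its own offset relative to row_start.
        target[row_start + start:row_start + stop, :] = arr[start:stop].astype(value_type)

# The target stands in for the on-disk array; rows 0-1 are left untouched.
target = np.zeros((5, 2), dtype=np.float32)
arr = np.arange(6).reshape(3, 2)
write_in_batches(target, arr, row_start=2, batch_size=2)
```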

scimilarity.utils.write_csr_to_tiledb(tdb, matrix, value_type, row_start=0, batch_size=25000)

Write csr_matrix to TileDB.

Parameters:

tdb (tiledb.libtiledb.SparseArrayImpl) – TileDB array.
matrix (scipy.sparse.csr_matrix) – Sparse matrix to write.
value_type (type) – The type of the value, typically np.float32.
row_start (int, default: 0) – The starting row in the TileDB array.
batch_size (int, default: 25000) – Batch size for the tiles.