scimilarity.utils

scimilarity.utils.align_dataset(data, target_gene_order, keep_obsm=True, gene_overlap_threshold=5000)

Align the gene space to the target gene order.

Parameters:
  • data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

  • target_gene_order (numpy.ndarray) – An array containing the gene space.

  • keep_obsm (bool, default: True) – Retain the original data’s obsm matrices in output.

  • gene_overlap_threshold (int, default: 5000) – The minimum number of genes in common between data and target_gene_order to be valid.

Returns:

A data object with aligned gene space ready to be used for embedding cells.

Return type:

anndata.AnnData

Examples

>>> ca = CellAnnotation(model_path="/opt/data/model")
>>> align_dataset(data, ca.gene_order)
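
The gene-space alignment described above can be illustrated on a plain NumPy matrix. This is a minimal sketch, not the library's implementation: the helper name align_gene_space is hypothetical, and filling genes that are absent from the input with zeros is an assumption.

```python
import numpy as np

def align_gene_space(X, genes, target_gene_order, gene_overlap_threshold=2):
    """Reorder the columns of X (cells x genes) to follow target_gene_order."""
    genes = np.asarray(genes)
    target = np.asarray(target_gene_order)

    # Refuse to align when too few genes are shared with the target space.
    shared = np.intersect1d(genes, target)
    if shared.size < gene_overlap_threshold:
        raise ValueError(
            f"Only {shared.size} genes overlap; need at least {gene_overlap_threshold}."
        )

    # Assumption: genes missing from the input are filled with zeros.
    column_of = {gene: i for i, gene in enumerate(genes)}
    aligned = np.zeros((X.shape[0], target.size), dtype=X.dtype)
    for j, gene in enumerate(target):
        if gene in column_of:
            aligned[:, j] = X[:, column_of[gene]]
    return aligned

X = np.array([[1, 2, 3], [4, 5, 6]])
aligned = align_gene_space(X, ["B", "A", "C"], ["A", "B", "D"])
```

Here gene "C" is dropped because it is not in the target space, and gene "D" becomes an all-zero column.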
scimilarity.utils.categorize_and_sort_by_score(df, name_column, score_column, ascending=False, topn=None)

Transform a column into a category, sort it by score, and optionally keep the top n entries.

Parameters:
  • df ("pandas.DataFrame") – Pandas dataframe.

  • name_column (str) – Name of column to sort.

  • score_column (str) – Name of score column to sort name_column by.

  • ascending (bool, default: False) – Whether to sort in ascending order.

  • topn (Optional[int], default: None) – Subset to the top n entries of name_column (for example, the top n diseases).

Returns:

A sorted dataframe that is optionally subsetted to top n.

Return type:

pandas.DataFrame

Examples

>>> df = categorize_and_sort_by_score(df, "disease", "Hit Percentage", topn=10)
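
One plausible reading of this behaviour, sketched with pandas. The helper name categorize_sort_topn is hypothetical, and ranking each category by its mean score is an assumption; the library may aggregate scores differently.

```python
import pandas as pd

def categorize_sort_topn(df, name_column, score_column, ascending=False, topn=None):
    # Assumption: categories are ranked by their mean score.
    order = (
        df.groupby(name_column)[score_column]
        .mean()
        .sort_values(ascending=ascending)
        .index
    )
    if topn is not None:
        order = order[:topn]

    # Keep only the retained categories and make the column an ordered categorical.
    out = df[df[name_column].isin(order)].copy()
    out[name_column] = pd.Categorical(
        out[name_column], categories=list(order), ordered=True
    )
    return out.sort_values(name_column)

toy = pd.DataFrame(
    {"disease": ["a", "b", "a", "c"], "score": [1.0, 5.0, 3.0, 4.0]}
)
top2 = categorize_sort_topn(toy, "disease", "score", topn=2)
```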
scimilarity.utils.clean_diseases(diseases)

Mapper to clean disease names.

Parameters:

diseases (pandas.Series) – A pandas Series containing disease names.

Returns:

A pandas Series containing cleaned disease names.

Return type:

pandas.Series

Examples

>>> data.obs["disease_simple"] = clean_diseases(data.obs["disease"]).fillna("healthy")
scimilarity.utils.clean_tissues(tissues)

Mapper to clean tissue names.

Parameters:

tissues (pandas.Series) – A pandas Series containing tissue names.

Returns:

A pandas Series containing cleaned tissue names.

Return type:

pandas.Series

Examples

>>> data.obs["tissue_simple"] = clean_tissues(data.obs["tissue"]).fillna("other tissue")
scimilarity.utils.consolidate_duplicate_symbols(adata)

Consolidate duplicate gene symbols by summing their values.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

Returns:

AnnData object with duplicate gene symbols consolidated.

Return type:

anndata.AnnData

Examples

>>> adata = consolidate_duplicate_symbols(adata)
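
The idea of summing columns that share a gene symbol can be shown on a small pandas DataFrame. This is only an illustration with made-up values; the library function operates on an AnnData object.

```python
import numpy as np
import pandas as pd

# Toy cells x genes table in which the symbol GENE1 appears twice.
counts = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6]]),
    columns=["GENE1", "GENE2", "GENE1"],
)

# Transpose so genes are rows, sum rows with the same symbol, transpose back.
consolidated = counts.T.groupby(level=0).sum().T
```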
scimilarity.utils.filter_cells(data, min_genes=400, mito_prefix=None, mito_percent=30.0)

QC filter cells in the dataset from gene expression raw counts.

Parameters:
  • data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

  • min_genes (int, default: 400) – The minimum number of expressed genes a cell must have to avoid being filtered out.

  • mito_prefix (str, optional, default: None) – The prefix that identifies mitochondrial genes. Typically “MT-” or “mt-”. If None, the function tries to infer whether it is “MT-” or “mt-”.

  • mito_percent (float, default: 30.0) – The maximum percentage of counts from mitochondrial genes a cell may have without being filtered out.

Returns:

A data object with cells filtered out based on QC metrics that is ready to be used in further processes.

Return type:

anndata.AnnData

Examples

>>> data = filter_cells(data)
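
The two QC criteria above can be sketched as a boolean mask over a dense counts matrix. The helper name qc_mask is hypothetical, and whether the thresholds are inclusive (>= and <=) is an assumption rather than documented behaviour.

```python
import numpy as np

def qc_mask(counts, gene_names, min_genes=400, mito_prefix="MT-", mito_percent=30.0):
    """Return True for cells that pass both QC criteria."""
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([name.startswith(mito_prefix) for name in gene_names])

    # Number of genes with at least one count in each cell.
    n_genes = (counts > 0).sum(axis=1)

    # Percentage of each cell's counts that come from mitochondrial genes.
    total = counts.sum(axis=1)
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)

    # Assumption: both thresholds are inclusive.
    return (n_genes >= min_genes) & (pct_mito <= mito_percent)

toy_counts = [[5, 3, 1], [0, 1, 9], [4, 0, 0]]
mask = qc_mask(toy_counts, ["A", "B", "MT-X"], min_genes=2, mito_percent=30.0)
```

In this toy case the second cell fails on mitochondrial percentage and the third fails on the number of expressed genes.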
scimilarity.utils.get_centroid(counts)

Get the centroid for a raw counts matrix.

Parameters:

counts (scipy.sparse.csr_matrix, numpy.ndarray) – Raw gene expression counts.

Returns:

A 2D numpy array containing the log normalized (1e4) centroid.

Return type:

numpy.ndarray

Examples

>>> centroid = get_centroid(data.get_matrix("counts"))
>>> centroid = get_centroid(data.layers["counts"])
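
A minimal sketch of a log normalized centroid for a dense matrix. The helper name centroid_lognorm is hypothetical, and pooling counts across cells before scaling to 1e4 is an assumption about the order of operations.

```python
import numpy as np

def centroid_lognorm(counts, target_sum=1e4):
    counts = np.asarray(counts, dtype=float)

    # Assumption: pool raw counts over all cells first, giving shape (1, n_genes).
    summed = counts.sum(axis=0, keepdims=True)

    # Scale the pooled profile to target_sum total counts, then apply log(1 + x).
    scaled = summed / summed.sum() * target_sum
    return np.log1p(scaled)

centroid = centroid_lognorm([[1, 0], [1, 2]])
```

A scipy.sparse.csr_matrix input would need to be converted first, for example with its toarray() method.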
scimilarity.utils.get_cluster_centroids(data, target_gene_order, cluster_key, cluster_label=None, skip_null=True)

Get centroids of clusters based on raw read counts.

Parameters:
  • data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

  • target_gene_order (numpy.ndarray) – An array containing the gene space.

  • cluster_key (str) – The obs column name that contains cluster labels.

  • cluster_label (str, optional, default: None) – The cluster label of interest. If None, get the centroids of all clusters; otherwise get only the centroid of the cluster of interest.

  • skip_null (bool, default: True) – Whether to skip cells with null/nan cluster labels.

Returns:

  • centroids (numpy.ndarray) – A 2D numpy array of the log normalized (1e4) cluster centroids.

  • cluster_idx (list) – A list of cluster labels corresponding to the order returned in centroids.

Return type:

Tuple[numpy.ndarray, list]

Examples

>>> centroids, cluster_idx = get_cluster_centroids(data, gene_order, "cluster_label")
scimilarity.utils.get_dist2centroid(centroid_embedding, X)

Get the mean distance of cells to a centroid embedding.

Parameters:
  • centroid_embedding (numpy.ndarray) – The embedding of the centroid.

  • X (scipy.sparse.csr_matrix, numpy.ndarray) – Either the embeddings of the cells or their SCimilarity log normalized gene expression values.

  • embed (bool, default: False) – Whether to embed X.

Returns:

The mean distance of cells in X to the centroid embedding.

Return type:

float

Examples

>>> distances = cq.get_dist2centroid(centroid_embedding, X)
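
The returned quantity, a mean distance from each cell to the centroid, can be sketched as follows. The helper name mean_cosine_distance is hypothetical, and the use of cosine distance is an assumption; this page does not state which metric is used.

```python
import numpy as np

def mean_cosine_distance(centroid_embedding, X):
    c = np.asarray(centroid_embedding, dtype=float).ravel()
    X = np.asarray(X, dtype=float)

    # Cosine similarity between every row of X and the centroid vector.
    cos_sim = (X @ c) / (np.linalg.norm(X, axis=1) * np.linalg.norm(c))

    # Cosine distance is 1 - similarity; average it over all cells.
    return float(np.mean(1.0 - cos_sim))

mean_dist = mean_cosine_distance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```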
scimilarity.utils.lognorm_counts(data)

Log normalize the gene expression raw counts (per 10k).

Parameters:

data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

Returns:

A data object with normalized data that is ready to be used in further processes.

Return type:

anndata.AnnData

Examples

>>> data = lognorm_counts(data)
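
Per-10k log normalization is commonly computed as below. This sketch works on a dense NumPy matrix rather than an AnnData object; the helper name lognorm and the handling of cells with zero total counts are assumptions.

```python
import numpy as np

def lognorm(counts, target_sum=1e4):
    counts = np.asarray(counts, dtype=float)

    # Total counts per cell; guard against division by zero for empty cells.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0

    # Scale each cell to target_sum total counts, then apply log(1 + x).
    return np.log1p(counts / totals * target_sum)

normalized = lognorm([[1, 3], [0, 0]])
```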
scimilarity.utils.optimize_tiledb_array(tiledb_array_uri, steps=100000, step_max_frags=10, buffer_size=1000000000, total_budget=200000000000, verbose=True)

Optimize TileDB Array.

Parameters:
  • tiledb_array_uri (str) – URI for the TileDB array.

  • verbose (bool, default: True) – Whether to print verbose output.

  • step_max_frags (int) –

  • buffer_size (int) –

  • total_budget (int) –

scimilarity.utils.pseudobulk_anndata(adata, groupby_labels, qc_filters=None, min_num_cells=1, only_orig_genes=False)

Pseudobulk an AnnData and return a new AnnData.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

  • groupby_labels (Union[str, list]) – List of labels to group by prior to pseudobulking. For example, [“sample”, “tissue”, “disease”, “celltype_name”] groups cells by these columns and pseudobulks each group.

  • qc_filters (dict, optional, default: None) –

    Dictionary containing cell filters to perform prior to pseudobulking:

      • “mito_percent”: max percent of reads in mitochondrial genes

      • “min_counts”: min read count for cell

      • “min_genes”: min number of genes with reads for cell

      • “max_nn_dist”: max nearest neighbor distance to a reference label for predicted labels

  • min_num_cells (int, default: 1) – The minimum number of cells in a pseudobulk in order to be considered.

  • only_orig_genes (bool, default: False) – Account for an aligned gene space by setting the pseudobulk of genes that were not in the original dataset to NaN. Assumes the original gene list is in adata.uns[“orig_genes”].

Returns:

A data object where pseudobulk counts are in layers[“counts”] and the detection rate is in layers[“detection”].

Return type:

anndata.AnnData

Examples

>>> groupby_labels = ["sample", "tissue_raw", "celltype_name"]
>>> qc_filters = {"mito_percent": 20.0, "min_counts": 1000, "min_genes": 500, "max_nn_dist": 0.03, "max_nn_dist_col": "min_dist"}
>>> pseudobulk = pseudobulk_anndata(adata, groupby_labels, qc_filters=qc_filters, only_orig_genes=True)
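
The grouping step can be illustrated with plain pandas. This is a sketch under stated assumptions, not the library's code: the toy values are made up, pseudobulk counts are taken to be per-group sums, and the detection rate is taken to be the fraction of cells in a group with a nonzero count for each gene.

```python
import pandas as pd

# Toy inputs: 3 cells x 2 genes, plus per-cell metadata.
counts = pd.DataFrame([[1, 0], [3, 2], [0, 5]], columns=["GENE1", "GENE2"])
obs = pd.DataFrame(
    {"sample": ["s1", "s1", "s2"], "celltype_name": ["T", "T", "B"]}
)

# Group cells by the chosen labels.
keys = [obs["sample"], obs["celltype_name"]]
pb_counts = counts.groupby(keys).sum()          # summed counts per group
detection = (counts > 0).groupby(keys).mean()   # fraction of cells detecting each gene
n_cells = counts.groupby(keys).size()

# Drop pseudobulks built from too few cells.
min_num_cells = 2
keep = (n_cells >= min_num_cells).to_numpy()
pb_counts = pb_counts[keep]
detection = detection[keep]
```

Only the ("s1", "T") group survives here, since the ("s2", "B") group contains a single cell.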
scimilarity.utils.query_tiledb_df(tdb, query_condition, attrs=None)

Query TileDB DataFrame.

Parameters:
  • tdb (tiledb.libtiledb.DenseArrayImpl) – TileDB dataframe.

  • query_condition (str) – Query condition.

  • attrs (list, optional, default: None) – Columns to return in the results.

Return type:

pandas.DataFrame

scimilarity.utils.subset_by_frequency(df, group_columns, n)

Subset the DataFrame to only rows whose group appears at least n times.

Parameters:
  • df ("pandas.DataFrame") – Pandas dataframe.

  • group_columns (Union[List[str], str]) – Columns to group by.

  • n (int) – Minimum number of times a group must appear to be included.

Returns:

A subsetted dataframe.

Return type:

pandas.DataFrame

Examples

>>> df = subset_by_frequency(df, ["disease", "prediction"], 10)
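
A minimal pandas sketch of frequency-based subsetting. The helper name keep_frequent_groups is hypothetical, and this is an illustration of the idea rather than the library's implementation.

```python
import pandas as pd

def keep_frequent_groups(df, group_columns, n):
    # Keep rows whose combination of group_columns occurs at least n times.
    return df.groupby(group_columns).filter(lambda group: len(group) >= n)

toy = pd.DataFrame(
    {"disease": ["a", "a", "b"], "prediction": ["x", "x", "y"]}
)
frequent = keep_frequent_groups(toy, ["disease", "prediction"], 2)
```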
scimilarity.utils.subset_by_unique_values(df, group_columns, value_column, n)

Subset a pandas dataframe to only include rows where there are at least n unique values from value_column for each grouping of group_columns.

Parameters:
  • df ("pandas.DataFrame") – Pandas dataframe.

  • group_columns (Union[List[str], str]) – Columns to group by.

  • value_column (str) – Column whose unique values are counted within each group.

  • n (int) – Minimum number of unique values a group must have to be included.

Returns:

A subsetted dataframe.

Return type:

pandas.DataFrame

Examples

>>> df = subset_by_unique_values(df, "disease", "sample", 10)
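
The same idea with unique-value counts, sketched in pandas. The helper name keep_groups_with_unique_values is hypothetical, and the code illustrates the described behaviour rather than reproducing the library's implementation.

```python
import pandas as pd

def keep_groups_with_unique_values(df, group_columns, value_column, n):
    # Count distinct value_column entries per group and broadcast back to rows.
    n_unique = df.groupby(group_columns)[value_column].transform("nunique")
    return df[n_unique >= n]

toy = pd.DataFrame(
    {
        "disease": ["a", "a", "a", "b", "b"],
        "sample": ["s1", "s2", "s2", "s3", "s3"],
    }
)
subset = keep_groups_with_unique_values(toy, "disease", "sample", 2)
```

Disease "a" is kept because it spans two distinct samples, while "b" is dropped because both of its rows come from one sample.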
scimilarity.utils.write_array_to_tiledb(tdb, arr, value_type, row_start=0, batch_size=100000)

Write numpy array to TileDB.

Parameters:
  • tdb (tiledb.libtiledb.DenseArrayImpl) – TileDB array.

  • arr (numpy.ndarray) – Dense numpy array.

  • value_type (type) – The type of the value, typically np.float32.

  • row_start (int, default: 0) – The starting row in the TileDB array.

  • batch_size (int, default: 100000) – Batch size for the tiles.

scimilarity.utils.write_csr_to_tiledb(tdb, matrix, value_type, row_start=0, batch_size=25000)

Write csr_matrix to TileDB.

Parameters:
  • tdb (tiledb.libtiledb.SparseArrayImpl) – TileDB array.

  • matrix (scipy.sparse.csr_matrix) – Sparse matrix in CSR format to write.

  • value_type (type) – The type of the value, typically np.float32.

  • row_start (int, default: 0) – The starting row in the TileDB array.

  • batch_size (int, default: 25000) – Batch size for the tiles.