SIGnature.utils

SIGnature.utils.align_dataset(data, target_gene_order, keep_obsm=True, gene_overlap_threshold=5000)

Align the gene space to the target gene order.

Parameters:
  • data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

  • target_gene_order (list) – Ordered list of genes that defines the target gene space.

  • keep_obsm (bool, default: True) – Retain the original data’s obsm matrices in output.

  • gene_overlap_threshold (int, default: 5000) – The minimum number of genes that data and target_gene_order must have in common for the alignment to be valid.

Returns:

A data object with aligned gene space ready to be used for embedding cells.

Return type:

anndata.AnnData

Examples

>>> data = align_dataset(data, gene_order)
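As a rough illustration of what aligning a gene space involves, the sketch below reorders the columns of a dense matrix to match a target gene order. The function name `align_genes_sketch`, the zero-filling of genes missing from the input, and the `ValueError` on low overlap are assumptions made for this example; they are not confirmed behavior of `align_dataset`, which operates on `anndata.AnnData` objects.

```python
import numpy as np


def align_genes_sketch(matrix, source_genes, target_gene_order, gene_overlap_threshold=5000):
    """Hypothetical stand-in for the idea behind align_dataset, on a dense array."""
    # Map each gene in the input to its column position.
    source_index = {gene: i for i, gene in enumerate(source_genes)}

    # Count the genes shared with the target gene space.
    shared = [gene for gene in target_gene_order if gene in source_index]
    if len(shared) < gene_overlap_threshold:
        raise ValueError(
            f"Only {len(shared)} genes overlap; at least {gene_overlap_threshold} are required."
        )

    # Assumption: genes absent from the input are filled with zeros.
    aligned = np.zeros((matrix.shape[0], len(target_gene_order)), dtype=matrix.dtype)
    for j, gene in enumerate(target_gene_order):
        i = source_index.get(gene)
        if i is not None:
            aligned[:, j] = matrix[:, i]
    return aligned


counts = np.array([[1, 2, 3], [4, 5, 6]])
aligned = align_genes_sketch(counts, ["A", "B", "C"], ["C", "X", "A"], gene_overlap_threshold=2)
```

Here the output columns follow the target order `["C", "X", "A"]`, with the unseen gene `X` left as zeros.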
SIGnature.utils.categorize_and_sort_by_score(df, name_column, score_column, ascending=False, topn=None)

Transform a column into a categorical, sort by score, and optionally keep the top n.

Parameters:
  • df (pandas.DataFrame) – Pandas dataframe.

  • name_column (str) – Name of column to sort.

  • score_column (str) – Name of score column to sort name_column by.

  • ascending (bool, default: False) – Sort in ascending order.

  • topn (Optional[int], default: None) – Subset to the top n diseases.

Returns:

A sorted dataframe that is optionally subsetted to top n.

Return type:

pandas.DataFrame

Examples

>>> df = categorize_and_sort_by_score(df, "disease", "Hit Percentage", topn=10)
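One plausible reading of "categorize and sort" is sketched below with plain pandas: sort rows by the score column, optionally keep the first n rows, and turn the name column into an ordered categorical whose category order follows the sorted scores. The helper `categorize_and_sort_sketch` is hypothetical, and whether the library's `topn` applies to rows or to distinct categories is an assumption here.

```python
import pandas as pd


def categorize_and_sort_sketch(df, name_column, score_column, ascending=False, topn=None):
    """Hypothetical illustration; not the SIGnature implementation."""
    # Sort rows by score (descending by default, as in the documented signature).
    out = df.sort_values(score_column, ascending=ascending)

    # Assumption: topn keeps the first n rows after sorting.
    if topn is not None:
        out = out.head(topn)
    out = out.copy()

    # Make the name column an ordered categorical that follows the sorted order.
    order = out[name_column].drop_duplicates().tolist()
    out[name_column] = pd.Categorical(out[name_column], categories=order, ordered=True)
    return out


scores = pd.DataFrame({"disease": ["a", "b", "c"], "Hit Percentage": [10, 30, 20]})
top = categorize_and_sort_sketch(scores, "disease", "Hit Percentage", topn=2)
```

An ordered categorical is convenient when the result feeds a plot, since most plotting libraries respect the category order on the axis.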
SIGnature.utils.lognorm_counts(data)

Log normalize the gene expression raw counts (per 10k).

Parameters:

data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.

Returns:

A data object with normalized data that is ready to be used in further processes.

Return type:

anndata.AnnData

Examples

>>> data = lognorm_counts(data)
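The arithmetic behind per-10k log normalization can be sketched on a dense NumPy array. The helper `lognorm_sketch` is hypothetical, and the use of `log1p` (natural log of one plus the scaled value) is an assumption about the transform; the source only states that counts are log normalized per 10k.

```python
import numpy as np


def lognorm_sketch(counts, target_sum=1e4):
    """Hypothetical illustration of per-cell scaling followed by a log transform."""
    counts = np.asarray(counts, dtype=np.float64)

    # Total counts per cell (rows are cells, columns are genes).
    totals = counts.sum(axis=1, keepdims=True)

    # Avoid division by zero for empty cells; their rows stay all zero.
    totals[totals == 0] = 1.0

    # Scale each cell to target_sum total counts, then apply log(1 + x).
    return np.log1p(counts / totals * target_sum)


raw = np.array([[1, 1], [0, 4], [0, 0]])
norm = lognorm_sketch(raw)
```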
SIGnature.utils.optimize_tiledb_array(tiledb_array_uri, steps=100000, step_max_frags=10, buffer_size=1000000000, total_budget=200000000000, verbose=True)

Optimize TileDB Array.

Parameters:
  • tiledb_array_uri (str) – URI for the TileDB array.

  • steps (default: 100000)

  • step_max_frags (int, default: 10)

  • buffer_size (int, default: 1000000000)

  • total_budget (int, default: 200000000000)

  • verbose (bool, default: True) – Whether to print verbose output.

SIGnature.utils.subset_by_frequency(df, group_columns, n)

Subset the DataFrame to only the rows whose group appears at least n times.

Parameters:
  • df (pandas.DataFrame) – Pandas dataframe.

  • group_columns (Union[List[str], str]) – Columns to group by.

  • n (int) – Minimum number of times a group must appear to be included.

Returns:

A subsetted dataframe.

Return type:

pandas.DataFrame

Examples

>>> df = subset_by_frequency(df, ["disease", "prediction"], 10)
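The filtering described above can be approximated with a pandas group-by filter. This is a minimal sketch, assuming that "appears at least n times" means the group has at least n rows; `subset_by_frequency_sketch` is a hypothetical name, not the library function.

```python
import pandas as pd


def subset_by_frequency_sketch(df, group_columns, n):
    """Hypothetical illustration: keep rows belonging to groups with at least n rows."""
    return df.groupby(group_columns).filter(lambda group: len(group) >= n)


hits = pd.DataFrame(
    {
        "disease": ["a", "a", "b"],
        "prediction": ["x", "x", "y"],
    }
)
frequent = subset_by_frequency_sketch(hits, ["disease", "prediction"], 2)
```

Only the two rows of the `("a", "x")` group survive, because the `("b", "y")` group has a single row.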
SIGnature.utils.subset_by_unique_values(df, group_columns, value_column, n)

Subset a pandas dataframe to only include rows whose group, as defined by group_columns, contains at least n unique values in value_column.

Parameters:
  • df (pandas.DataFrame) – Pandas dataframe.

  • group_columns (Union[List[str], str]) – Columns to group by.

  • value_column (str) – Column whose unique values are counted within each group.

  • n (int) – Minimum number of unique values a group must have to be included.

Returns:

A subsetted dataframe.

Return type:

pandas.DataFrame

Examples

>>> df = subset_by_unique_values(df, "disease", "sample", 10)
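A comparable filter on unique values can be sketched with a grouped `nunique` transform. The helper `subset_by_unique_values_sketch` is hypothetical and only illustrates the documented behavior; the library's own implementation may differ.

```python
import pandas as pd


def subset_by_unique_values_sketch(df, group_columns, value_column, n):
    """Hypothetical illustration: keep groups with at least n unique values."""
    # Number of distinct values of value_column within each row's group.
    unique_counts = df.groupby(group_columns)[value_column].transform("nunique")
    return df[unique_counts >= n]


cells = pd.DataFrame(
    {
        "disease": ["a", "a", "a", "b", "b"],
        "sample": ["s1", "s2", "s2", "s3", "s3"],
    }
)
supported = subset_by_unique_values_sketch(cells, "disease", "sample", 2)
```

Disease `a` is kept because it spans two distinct samples, while `b` is dropped because both of its rows come from the same sample.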
SIGnature.utils.title_name(s)

Return the string unchanged if it is all upper case; otherwise return a title-cased version.

Examples

>>> s = title_name('multiple myeloma') # outputs Multiple Myeloma
Parameters:

s (str)

Return type:

str
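The documented rule fits in one line of plain Python. The name `title_name_sketch` is hypothetical, and the use of `str.isupper` and `str.title` is an assumption about how the rule is applied.

```python
def title_name_sketch(s: str) -> str:
    """Hypothetical illustration: leave all-upper-case strings alone, title-case the rest."""
    return s if s.isupper() else s.title()


disease = title_name_sketch("multiple myeloma")
acronym = title_name_sketch("COPD")
```

Leaving all-upper-case input untouched keeps acronyms such as `COPD` from being turned into `Copd`.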

SIGnature.utils.write_csr_to_tiledb(tdb, matrix, value_type, row_start=0, batch_size=25000)

Write csr_matrix to TileDB.

Parameters:
  • tdb (tiledb.libtiledb.SparseArrayImpl) – TileDB array.

  • matrix (scipy.sparse.csr_matrix) – Sparse matrix to write.

  • value_type (type) – The type of the value, typically np.float32.

  • row_start (int, default: 0) – The starting row in the TileDB array.

  • batch_size (int, default: 25000) – Batch size for the tiles.
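To show how `row_start` and `batch_size` might interact, the sketch below splits a `csr_matrix` into row batches and produces coordinate triples with rows offset by `row_start`. The generator `iter_csr_batches` is hypothetical and does not touch TileDB; how the library actually writes each batch is not stated in the source.

```python
import numpy as np
from scipy.sparse import csr_matrix


def iter_csr_batches(matrix, value_type=np.float32, row_start=0, batch_size=25000):
    """Hypothetical illustration: yield (rows, cols, values) per batch of rows."""
    n_rows = matrix.shape[0]
    for start in range(0, n_rows, batch_size):
        # Slice a block of rows and convert it to coordinate (COO) form.
        block = matrix[start:start + batch_size].tocoo()

        # Shift block-local row indices to their position in the target array.
        rows = block.row + start + row_start
        yield rows, block.col, block.data.astype(value_type)
        # Assumption: a real writer would store each batch in the TileDB array here.


matrix = csr_matrix(np.array([[0, 1], [2, 0], [0, 3]]))
batches = [
    (rows.tolist(), cols.tolist(), values.tolist())
    for rows, cols, values in iter_csr_batches(matrix, row_start=10, batch_size=2)
]
```

Writing in batches bounds the memory needed for the coordinate arrays, which is presumably why the function exposes `batch_size`.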