SIGnature.utils
- SIGnature.utils.align_dataset(data, target_gene_order, keep_obsm=True, gene_overlap_threshold=5000)
Align the gene space to the target gene order.
- Parameters:
data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
target_gene_order (list) – A list containing the gene space.
keep_obsm (bool, default: True) – Retain the original data’s obsm matrices in output.
gene_overlap_threshold (int, default: 5000) – The minimum number of genes in common between data and target_gene_order to be valid.
- Returns:
A data object with aligned gene space ready to be used for embedding cells.
- Return type:
anndata.AnnData
Examples
>>> data = align_dataset(data, gene_order)
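To illustrate what aligning a gene space can involve, here is a minimal sketch on a dense NumPy matrix rather than an AnnData object. The function name `align_columns_sketch` is hypothetical, and two behaviors are assumptions rather than documented facts about align_dataset: genes missing from the input are zero-filled, and an overlap below the threshold raises a ValueError.

```python
import numpy as np

def align_columns_sketch(X, genes, target_gene_order, gene_overlap_threshold=5000):
    """Hypothetical stand-in: reorder the columns of X to follow target_gene_order."""
    # Map each input gene to its column index.
    position = {gene: j for j, gene in enumerate(genes)}
    shared = [gene for gene in target_gene_order if gene in position]
    # Assumption: too little overlap is treated as an error.
    if len(shared) < gene_overlap_threshold:
        raise ValueError(
            f"Only {len(shared)} shared genes; need at least {gene_overlap_threshold}."
        )
    # Assumption: genes absent from the input stay at zero.
    aligned = np.zeros((X.shape[0], len(target_gene_order)), dtype=X.dtype)
    for j, gene in enumerate(target_gene_order):
        if gene in position:
            aligned[:, j] = X[:, position[gene]]
    return aligned

X = np.array([[1, 2, 3], [4, 5, 6]])
aligned = align_columns_sketch(X, ["A", "B", "C"], ["C", "X", "A"], gene_overlap_threshold=2)
```

The toy call lowers the threshold to 2 because the example only has three genes; real gene spaces are far larger, which is why the documented default is 5000.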
- SIGnature.utils.categorize_and_sort_by_score(df, name_column, score_column, ascending=False, topn=None)
Convert name_column to a categorical, sort by score_column, and optionally keep the top n.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
name_column (str) – Name of column to sort.
score_column (str) – Name of score column to sort name_column by.
ascending (bool, default: False) – Whether to sort in ascending order.
topn (Optional[int], default: None) – Subset to the top n diseases.
- Returns:
A sorted dataframe that is optionally subsetted to top n.
- Return type:
pandas.DataFrame
Examples
>>> df = categorize_and_sort_by_score(df, "disease", "Hit Percentage", topn=10)
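The description above leaves the exact ordering rule open, so the following is only one possible reading, written as a standalone pandas sketch. The name `categorize_and_sort_sketch` is hypothetical, and the choice to rank each name by its best-scoring row is an assumption, not the library's confirmed behavior.

```python
import pandas as pd

def categorize_and_sort_sketch(df, name_column, score_column, ascending=False, topn=None):
    """Hypothetical reading: order names by score, keep top n, sort by that order."""
    # Assumption: a name is ranked by its first row after sorting on the score.
    ordered = df.sort_values(score_column, ascending=ascending)
    names = ordered[name_column].drop_duplicates().tolist()
    if topn is not None:
        names = names[:topn]
    out = df[df[name_column].isin(names)].copy()
    # An ordered categorical makes later sorts and plots follow the score order.
    out[name_column] = pd.Categorical(out[name_column], categories=names, ordered=True)
    return out.sort_values([name_column, score_column], ascending=[True, ascending])

scores = pd.DataFrame(
    {"disease": ["flu", "asthma", "copd"], "Hit Percentage": [0.2, 0.9, 0.5]}
)
top = categorize_and_sort_sketch(scores, "disease", "Hit Percentage", topn=2)
```

Using a categorical rather than plain strings means downstream plotting code can preserve the score-based order instead of falling back to alphabetical order.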
- SIGnature.utils.lognorm_counts(data)
Log normalize the gene expression raw counts (per 10k).
- Parameters:
data (anndata.AnnData) – Annotated data matrix with rows for cells and columns for genes.
- Returns:
A data object with normalized data that is ready to be used in further processes.
- Return type:
anndata.AnnData
Examples
>>> data = lognorm_counts(data)
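As a rough illustration of what "log normalize per 10k" usually means, the sketch below scales each cell's counts to a total of 10,000 and then applies log(1 + x). It works on a dense NumPy array, not an AnnData object, and the name `lognorm_counts_sketch`, the `target_sum` parameter, and the handling of all-zero cells are assumptions made for this example.

```python
import numpy as np

def lognorm_counts_sketch(counts, target_sum=1e4):
    """Hypothetical stand-in: per-cell scaling to target_sum followed by log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    # Total counts per cell (rows are cells, columns are genes).
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells with no counts are left as zeros instead of dividing by zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)

counts = np.array([[1, 3], [0, 0], [5, 5]])
normalized = lognorm_counts_sketch(counts)
```

After this transform, undoing the log with `np.expm1` gives rows that sum to 10,000 for every cell that had at least one count.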
- SIGnature.utils.optimize_tiledb_array(tiledb_array_uri, steps=100000, step_max_frags=10, buffer_size=1000000000, total_budget=200000000000, verbose=True)
Optimize TileDB Array.
- Parameters:
tiledb_array_uri (str) – URI for the TileDB array.
steps (int, default: 100000)
step_max_frags (int, default: 10)
buffer_size (int, default: 1000000000)
total_budget (int, default: 200000000000)
verbose (bool, default: True) – Whether to print verbose output.
- SIGnature.utils.subset_by_frequency(df, group_columns, n)
Subset the DataFrame to only the rows whose group appears at least n times.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
group_columns (Union[List[str], str]) – Columns to group by.
n (int) – Minimum number of occurrences for a group to be included.
- Returns:
A subsetted dataframe.
- Return type:
pandas.DataFrame
Examples
>>> df = subset_by_frequency(df, ["disease", "prediction"], 10)
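The filtering described above can be sketched with a plain pandas group filter. This is an illustrative equivalent under the assumption that "appears at least n times" means the group has at least n rows; the name `subset_by_frequency_sketch` is hypothetical and the library's own implementation may differ.

```python
import pandas as pd

def subset_by_frequency_sketch(df, group_columns, n):
    """Hypothetical stand-in: keep rows whose group has at least n rows."""
    return df.groupby(group_columns).filter(lambda group: len(group) >= n)

calls = pd.DataFrame(
    {
        "disease": ["a", "a", "a", "b", "b", "c"],
        "prediction": ["x", "x", "x", "y", "y", "z"],
    }
)
frequent = subset_by_frequency_sketch(calls, ["disease", "prediction"], 2)
```

In this toy table the ("c", "z") pair occurs only once, so its row is dropped while the other five rows are kept.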
- SIGnature.utils.subset_by_unique_values(df, group_columns, value_column, n)
Subset a pandas dataframe to only the rows whose group, as defined by group_columns, has at least n unique values in value_column.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
group_columns (Union[List[str], str]) – Columns to group by.
value_column (str) – Column whose unique values are counted within each group.
n (int) – Minimum number of unique values for a group to be included.
- Returns:
A subsetted dataframe.
- Return type:
pandas.DataFrame
Examples
>>> df = subset_by_unique_values(df, "disease", "sample", 10)
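A minimal pandas sketch of the same idea follows: count the distinct values of value_column within each group and keep the rows of groups that reach n. The name `subset_by_unique_values_sketch` is hypothetical, and this is an illustration of the described behavior rather than the library's code.

```python
import pandas as pd

def subset_by_unique_values_sketch(df, group_columns, value_column, n):
    """Hypothetical stand-in: keep rows whose group has >= n distinct values."""
    # transform("nunique") broadcasts each group's distinct count back to its rows.
    unique_counts = df.groupby(group_columns)[value_column].transform("nunique")
    return df[unique_counts >= n]

hits = pd.DataFrame(
    {
        "disease": ["a", "a", "a", "b", "b"],
        "sample": ["s1", "s2", "s2", "s3", "s3"],
    }
)
supported = subset_by_unique_values_sketch(hits, "disease", "sample", 2)
```

Disease "a" is seen in two distinct samples and is kept, while disease "b" has two rows but only one sample and is removed. That difference is what separates this function from a plain frequency filter.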
- SIGnature.utils.title_name(s)
Return the string unchanged if it is all upper case; otherwise return a title-cased version.
Examples
>>> s = title_name('multiple myeloma') # outputs Multiple Myeloma
- Parameters:
s (str)
- Return type:
str
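The rule described above fits in one line of standard-library Python. The sketch below uses the hypothetical name `title_name_sketch` and assumes "all upper case" corresponds to `str.isupper()`, which is typical for acronyms.

```python
def title_name_sketch(s: str) -> str:
    """Hypothetical stand-in: keep acronyms as-is, title-case everything else."""
    return s if s.isupper() else s.title()

titled = title_name_sketch("multiple myeloma")
acronym = title_name_sketch("COPD")
```

Leaving upper-case strings untouched avoids turning an acronym such as "COPD" into "Copd".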
- SIGnature.utils.write_csr_to_tiledb(tdb, matrix, value_type, row_start=0, batch_size=25000)
Write csr_matrix to TileDB.
- Parameters:
tdb (tiledb.libtiledb.SparseArrayImpl) – TileDB array.
matrix (scipy.sparse.csr_matrix) – Sparse matrix to write.
value_type (type) – The type of the value, typically np.float32.
row_start (int, default: 0) – The starting row in the TileDB array.
batch_size (int, default: 25000) – Batch size for the tiles.
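To show how row_start and batch_size might interact, the sketch below only computes the row ranges of each batch and does not touch TileDB. The helper name `batch_row_ranges` is hypothetical, and it assumes that batches are taken over matrix rows and that row_start offsets the destination rows in the array.

```python
def batch_row_ranges(n_rows, row_start=0, batch_size=25000):
    """Hypothetical helper: yield (local_start, local_end, target_start) per batch."""
    for local_start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        local_end = min(local_start + batch_size, n_rows)
        # Assumption: target rows are the matrix rows shifted by row_start.
        yield local_start, local_end, row_start + local_start

ranges = list(batch_row_ranges(60000, row_start=100000))
```

Under these assumptions, a 60,000-row matrix written at row_start=100000 is split into three batches, the last one covering only 10,000 rows.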