scimilarity.tiledb_data_models

class scimilarity.tiledb_data_models.CellMultisetDataModule(*args, **kwargs)

Bases: LightningDataModule

A class to encapsulate cells in TileDB to train the model.

Parameters:

dataset_path (str) – Path to the directory containing the TileDB stores.
cell_metadata_uri (str, default: "cell_metadata") – Relative path to the cell metadata store.
gene_annotation_uri (str, default: "gene_annotation") – Relative path to the gene annotation store.
counts_uri (str, default: "counts") – Relative path to the counts matrix store.
gene_order (str) – Use a given gene order as described in the specified file. One gene symbol per line.
val_studies (List[str], optional, default: None) – List of studies to use as validation and test.
exclude_studies (List[str], optional, default: None) – List of studies to exclude.
exclude_samples (Dict[str, List[str]], optional, default: None) – Dict of samples to exclude in the form {study: [list of samples]}.
label_id_column (str, default: "cellTypeOntologyID") – Cell ontology ID column name.
study_column (str, default: "datasetID") – Study column name.
sample_column (str, default: "sampleID") – Sample column name.
filter_condition (str, optional, default: None) – A TileDB query string that describes the conditions for selecting valid cells to use in training. If None, it defaults to: "{self.label_id_column}!='{self.nan_string}'"
batch_size (int, default: 1000) – Batch size.
num_workers (int, default: 5) – The number of worker processes for the dataloaders.
n_batches (int, default: 100) – Number of batches to create in batch sampler. Should correspond to number of batches per epoch, as we are sampling with replacement.
lognorm (bool, default: True) – Whether to return log normalized expression instead of raw counts.
target_sum (float, default: 1e4) – Target sum for log normalization.
sparse (bool, default: False) – Use sparse matrices.
remove_singleton_classes (bool, default: True) – Exclude cells with classes that exist in only one study.
nan_string (str, default: "nan") – A string representing NaN.
sampler_cls (Sampler, default: scSampler) – Sampler class to use for batching.
dataset_cls (Dataset, default: scDataset) – Base Dataset class to use.
collator_cls (type, default: scCollator) – Collator class to use to collect data from tiledb.
pin_memory (bool, default: False) – If True, uses pin memory in the DataLoaders.
persistent_workers (bool, default: False) – If True, uses persistent workers in the DataLoaders. Treated as False if num_workers is 0.
multiprocessing_context (str, default: "fork") – Multiprocessing context for dataloaders: ["spawn", "fork"].

Examples

>>> datamodule = CellMultisetDataModule(
...     dataset_path="/opt/cellarr_dataset",
...     label_id_column="id",
...     study_column="study",
...     batch_size=1000,
...     num_workers=1,
... )
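The study-level options (val_studies, exclude_studies, exclude_samples) can be pictured as a partition of the cell metadata. The following is a minimal sketch in plain Python, not the library's implementation: the helper `split_cells` and the record layout are hypothetical, and it assumes that cells from validation studies are held out of training entirely.

```python
from typing import Dict, List, Optional, Tuple

Cell = Dict[str, str]


def split_cells(
    cells: List[Cell],
    val_studies: Optional[List[str]] = None,
    exclude_studies: Optional[List[str]] = None,
    exclude_samples: Optional[Dict[str, List[str]]] = None,
    study_column: str = "datasetID",
    sample_column: str = "sampleID",
) -> Tuple[List[Cell], List[Cell]]:
    """Hypothetical helper: return (train_cells, val_cells)."""
    val_set = set(val_studies or [])
    dropped_studies = set(exclude_studies or [])
    dropped_samples = exclude_samples or {}

    train, val = [], []
    for cell in cells:
        study, sample = cell[study_column], cell[sample_column]
        if study in dropped_studies:
            continue  # the whole study is excluded
        if sample in dropped_samples.get(study, []):
            continue  # this sample is excluded within its study
        # Assumption: validation studies never contribute to training.
        (val if study in val_set else train).append(cell)
    return train, val


cells = [
    {"datasetID": "study_a", "sampleID": "s1"},
    {"datasetID": "study_a", "sampleID": "s2"},
    {"datasetID": "study_b", "sampleID": "s1"},
    {"datasetID": "study_c", "sampleID": "s1"},
]
train_cells, val_cells = split_cells(
    cells,
    val_studies=["study_b"],
    exclude_studies=["study_c"],
    exclude_samples={"study_a": ["s2"]},
)
```

Note the shape of exclude_samples: keys are study names and values are lists of sample names within that study, so the same sample name in another study is unaffected.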

get_data(filter_condition)

Filter the TileDB cell metadata according to a filter condition and return the valid cells.

Parameters: filter_condition (str) – A string that describes the filter condition according to TileDB search syntax.
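To make the documented default filter concrete, the sketch below builds the default condition string from label_id_column and nan_string, and applies the equivalent test to in-memory records. Both helper names are hypothetical, and the list-of-dicts input stands in for the TileDB cell metadata store; this is an illustration of the condition's meaning, not the method's actual code path.

```python
def default_filter_condition(label_id_column="cellTypeOntologyID", nan_string="nan"):
    # Mirrors the documented default: "{label_id_column}!='{nan_string}'"
    return f"{label_id_column}!='{nan_string}'"


def keep_labeled(cells, label_id_column="cellTypeOntologyID", nan_string="nan"):
    # Plain-Python equivalent of the default condition: keep only cells
    # whose label column is not the NaN placeholder string.
    return [c for c in cells if c[label_id_column] != nan_string]


condition = default_filter_condition()
labeled = keep_labeled(
    [
        {"cellTypeOntologyID": "CL:0000084"},
        {"cellTypeOntologyID": "nan"},
    ]
)
```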

get_sampler_weights(data_df, use_study=True, class_target_sum=10000.0, study_target_sum=1000000.0)

Get sampling weights and add to dataframe.

Parameters:

data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.
use_study (bool, default: True) – Incorporate studies in sampler weights.
class_target_sum (float, default: 1e4) – Target sum for normalization of class counts.
study_target_sum (float, default: 1e6) – Target sum for normalization of study counts.
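This reference does not state the weighting formula, so the sketch below is only one plausible reading of the parameters. It assumes inverse-frequency weighting in which each count is first divided by its target sum and damped with log1p. The function name, the record layout, and the formula itself are all assumptions made for illustration.

```python
import math
from collections import Counter


def sampler_weights(
    cells,
    use_study=True,
    class_target_sum=1e4,
    study_target_sum=1e6,
    label_id_column="cellTypeOntologyID",
    study_column="datasetID",
):
    """Hypothetical weighting: rarer classes (and studies) get larger weights."""
    class_counts = Counter(c[label_id_column] for c in cells)
    study_counts = Counter(c[study_column] for c in cells)

    weights = []
    for c in cells:
        # Assumed formula: inverse of the log1p-damped, normalized class count.
        w = 1.0 / math.log1p(class_counts[c[label_id_column]] / class_target_sum)
        if use_study:
            # Assumed formula: same damping applied to the study count.
            w /= math.log1p(study_counts[c[study_column]] / study_target_sum)
        weights.append(w)
    return weights


cells = [
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "study_a"},
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "study_a"},
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "study_a"},
    {"cellTypeOntologyID": "CL:0000236", "datasetID": "study_a"},
]
weights = sampler_weights(cells)
```

Under this assumed scheme, the single CL:0000236 cell receives a larger weight than each of the three CL:0000084 cells, which is the general effect a class-balancing sampler aims for.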

harmonize_cell_types(data_df)

Manual harmonization of some cell types.

Parameters: data_df (pandas.DataFrame) – DataFrame with a label id column.

map_cell_type_id2name(data_df)

Map cell type ontology ID to name.

Parameters: data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.

remove_singleton_label_ids(data_df, n_studies=2)

Ensure labels exist in at least a minimum number of studies.

Parameters:

data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.
n_studies (int, default: 2) – The number of studies a label must exist in to be valid.
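The rule described here (a label is valid only if it occurs in at least n_studies distinct studies) can be sketched in plain Python as follows. The helper name `keep_multi_study_labels` and the list-of-dicts input are hypothetical stand-ins for the method and its DataFrame argument.

```python
from collections import defaultdict


def keep_multi_study_labels(
    cells,
    n_studies=2,
    label_id_column="cellTypeOntologyID",
    study_column="datasetID",
):
    """Hypothetical helper: drop cells whose label spans fewer than n_studies studies."""
    # Collect the set of distinct studies in which each label appears.
    studies_per_label = defaultdict(set)
    for c in cells:
        studies_per_label[c[label_id_column]].add(c[study_column])

    return [
        c for c in cells
        if len(studies_per_label[c[label_id_column]]) >= n_studies
    ]


cells = [
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "study_a"},
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "study_b"},
    {"cellTypeOntologyID": "CL:0000236", "datasetID": "study_a"},
    {"cellTypeOntologyID": "CL:0000236", "datasetID": "study_a"},
]
kept = keep_multi_study_labels(cells)
```

Here CL:0000236 is removed even though it has two cells, because both cells come from the same study.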

test_dataloader()

Load the test dataset.

Returns: A DataLoader object containing the test dataset.
Return type: DataLoader

train_dataloader()

Load the training dataset.

Returns: A DataLoader object containing the training dataset.
Return type: DataLoader

val_dataloader()

Load the validation dataset.

Returns: A DataLoader object containing the validation dataset.
Return type: DataLoader

class scimilarity.tiledb_data_models.scCollator(counts_uri, gene_indices, study_column, lognorm=True, target_sum=10000.0, sparse=False)

Bases: object

A class to collate data by retrieving from tiledb.

Parameters:

counts_uri (str) – Path to the counts matrix store.
gene_indices (List[int]) – The indices of a gene order relative to gene annotation store.
study_column (str) – Study column name.
lognorm (bool, default: True) – Whether to return log normalized expression instead of raw counts.
target_sum (float, default: 1e4) – Target sum for log normalization.
sparse (bool, default: False) – Use sparse matrices.

cfg

TileDB configuration to increase the memory budget and turn off TileDB multithreading.

Type: tiledb.ctx.Config
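The lognorm and target_sum parameters of scCollator refer to log normalization of expression counts. The sketch below illustrates the common convention, which is assumed here rather than confirmed by this reference: each cell's counts are scaled so they sum to target_sum, then transformed with log1p (natural log with a pseudo-count of 1). The function name is hypothetical.

```python
import math


def lognorm_cell(counts, target_sum=1e4):
    """Hypothetical helper: log-normalize one cell's raw counts."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros; avoids division by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    # Assumed convention: scale to target_sum, then apply log(1 + x).
    return [math.log1p(x * scale) for x in counts]


normalized = lognorm_cell([1, 1, 2], target_sum=4)
```

With target_sum=4 the counts [1, 1, 2] already sum to the target, so the result is simply log1p of each count.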

class scimilarity.tiledb_data_models.scDataset(data_df)

Bases: Dataset

A class that represents cells in TileDB.

Parameters: data_df (pandas.DataFrame) – Pandas dataframe of valid cells.

class scimilarity.tiledb_data_models.scSampler(data_df, batch_size, n_batches, dynamic_weights=False, weight_decay=0.5, **kwargs)

Bases: Sampler[int]

Sampler class for composition of cells in minibatch.

Parameters:

data_df (pandas.DataFrame) – DataFrame with column "sampling_weight".
batch_size (int) – Batch size.
n_batches (int) – Number of batches to create. Should correspond to number of batches per epoch, as we are sampling with replacement.
dynamic_weights (bool, default: False) – Dynamically lower the sampling weight of seen cells.
weight_decay (float, default: 0.5) – Weight decay factor.
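The parameters above describe weighted sampling with replacement, optionally lowering the weight of cells that have already been drawn. The following is a minimal sketch of that idea using only the standard library; it is not the scSampler implementation. The function name and seed argument are hypothetical, and it assumes the decay is applied multiplicatively, once per batch, to every cell seen in that batch.

```python
import random


def sample_batches(weights, batch_size, n_batches,
                   dynamic_weights=False, weight_decay=0.5, seed=0):
    """Hypothetical generator: yield n_batches lists of cell indices."""
    rng = random.Random(seed)
    weights = list(weights)  # copy so the caller's weights are not mutated
    indices = list(range(len(weights)))
    for _ in range(n_batches):
        # Weighted sampling with replacement.
        batch = rng.choices(indices, weights=weights, k=batch_size)
        if dynamic_weights:
            # Assumption: each cell seen in this batch has its weight
            # multiplied by weight_decay once, however often it was drawn.
            for i in set(batch):
                weights[i] *= weight_decay
        yield batch


batches = list(
    sample_batches([1.0, 1.0, 2.0, 4.0], batch_size=3, n_batches=5,
                   dynamic_weights=True)
)
```

Because sampling is with replacement, the number of batches per epoch is set by n_batches and is independent of the number of cells.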