scimilarity.tiledb_data_models

class scimilarity.tiledb_data_models.CellMultisetDataModule(*args, **kwargs)

Bases: LightningDataModule

A class to encapsulate cells in TileDB to train the model.

Parameters:
  • dataset_path (str) – Path to the directory containing the TileDB stores.

  • cell_metadata_uri (str, default: "cell_metadata") – Relative path to the cell metadata store.

  • gene_annotation_uri (str, default: "gene_annotation") – Relative path to the gene annotation store.

  • counts_uri (str, default: "counts") – Relative path to the counts matrix store.

  • gene_order (str) – Use a given gene order as described in the specified file. One gene symbol per line.

  • val_studies (List[str], optional, default: None) – List of studies to use as validation and test.

  • exclude_studies (List[str], optional, default: None) – List of studies to exclude.

  • exclude_samples (Dict[str, List[str]], optional, default: None) – Dict of samples to exclude in the form {study: [list of samples]}.

  • label_id_column (str, default: "cellTypeOntologyID") – Cell ontology ID column name.

  • study_column (str, default: "datasetID") – Study column name.

  • sample_column (str, default: "sampleID") – Sample column name.

  • batch_size (int, default: 1000) – Batch size.

  • num_workers (int, default: 1) – The number of workers for the DataLoaders.

  • lognorm (bool, default: True) – Whether to return log normalized expression instead of raw counts.

  • target_sum (float, default: 1e4) – Target sum for log normalization.

  • sparse (bool, default: False) – Use sparse matrices.

  • remove_singleton_classes (bool, default: True) – Exclude cells with classes that exist in only one study.

  • nan_string (str, default: "nan") – A string representing NaN.

  • sampler_cls (Sampler, default: CellSampler) – Sampler class to use for batching.

  • dataset_cls (Dataset, default: scDataset) – Base Dataset class to use.

  • n_batches (int, default: 100) – Number of batches to create in batch sampler. Should correspond to number of batches per epoch, as we are sampling with replacement.

  • pin_memory (bool, default: False) – If True, uses pin memory in the DataLoaders.

  • persistent_workers (bool, default: False) – If True, uses persistent workers in the DataLoaders.

  • filter_condition (str | None) – Filter condition applied to the cell metadata, in TileDB search syntax (see get_data).
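The lognorm and target_sum parameters refer to log normalization of raw counts. The following is a minimal sketch of the common convention (scale each cell to target_sum total counts, then apply log(1 + x)), assuming this is what the module does internally. The function name log_normalize is hypothetical and is not part of scimilarity.

```python
import math


def log_normalize(counts, target_sum=1e4):
    """Scale one cell's raw counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # A cell with no counts stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


cell = [0, 3, 1, 6]  # raw counts for four genes in one cell (made-up values)
normalized = log_normalize(cell)
```

Undoing the log with math.expm1 recovers values that sum to target_sum, which is a quick way to check the scaling.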

Examples

>>> datamodule = CellMultisetDataModule(
...     dataset_path="/opt/cellarr_dataset",
...     label_id_column="id",
...     study_column="study",
...     batch_size=1000,
...     num_workers=1,
... )

collate(batch)

Collate tensors.

Parameters:

batch – Batch to collate.

Returns:

Gene expression, labels, and studies.

Return type:

tuple
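As an illustration of the returned tuple, here is a minimal sketch of a collate step. It assumes each batch item is an (expression, label, study) triple and uses plain lists in place of tensors; the real item layout is not specified on this page, and collate_sketch is a hypothetical name.

```python
def collate_sketch(batch):
    """Transpose a list of (expression, label, study) items into three parallel lists."""
    expression, labels, studies = zip(*batch)
    return list(expression), list(labels), list(studies)


# Two made-up cells with three genes each.
batch = [
    ([0.0, 1.2, 0.4], "CL:0000084", "study_a"),
    ([2.1, 0.0, 0.0], "CL:0000236", "study_b"),
]
X, labels, studies = collate_sketch(batch)
```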

get_data(filter_condition)

Filter the TileDB cell metadata according to some filter condition and return the valid cells.

Parameters:

filter_condition (str) – A string that describes the filter condition according to tiledb search syntax.
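A filter condition is a plain string, so it can be assembled from column names and values. The sketch below assumes the comparison-and-boolean form used by TileDB-Py query conditions (for example `attr != 'value' and other == 'x'`). The helper exclude_values and the study names are hypothetical.

```python
def exclude_values(column, values):
    """Build "column != 'v1' and column != 'v2' ..." for the given values."""
    return " and ".join(f"{column} != '{v}'" for v in values)


# Drop cells without a label and cells from two made-up studies.
cond = " and ".join(
    [
        exclude_values("cellTypeOntologyID", ["nan"]),
        exclude_values("datasetID", ["study_x", "study_y"]),
    ]
)

# With an existing data module instance (not constructed here):
# valid_cells = datamodule.get_data(cond)
```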

get_sampler_weights(data_df, use_study=True, class_target_sum=10000.0, study_target_sum=1000000.0)

Get sampling weights and add to dataframe.

Parameters:
  • data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.

  • use_study (bool, default: True) – Incorporate studies in sampler weights.

  • class_target_sum (float, default: 1e4) – Target sum for normalization of class counts.

  • study_target_sum (float, default: 1e6) – Target sum for normalization of study counts.
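This page does not give the weighting formula. The sketch below shows one plausible scheme, offered as an assumption rather than as the library's method: rescale the class (and optionally study) counts to the target sum, log-scale them, and weight each cell by the inverse. It uses plain lists instead of a DataFrame, and sampler_weights_sketch is a hypothetical name.

```python
import math
from collections import Counter


def _inverse_log_frequency(values, target_sum):
    """Map each distinct value to 1 / log1p(count / total * target_sum)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: 1.0 / math.log1p(n / total * target_sum) for v, n in counts.items()}


def sampler_weights_sketch(labels, studies=None, class_target_sum=1e4, study_target_sum=1e6):
    """Return one sampling weight per cell; rarer classes and studies weigh more."""
    class_w = _inverse_log_frequency(labels, class_target_sum)
    weights = [class_w[label] for label in labels]
    if studies is not None:
        study_w = _inverse_log_frequency(studies, study_target_sum)
        weights = [w * study_w[s] for w, s in zip(weights, studies)]
    return weights


labels = ["T cell", "T cell", "T cell", "B cell"]
studies = ["s1", "s1", "s2", "s2"]
weights = sampler_weights_sketch(labels, studies)
```

The log scaling keeps rare classes from dominating: the weight ratio between a rare and a common class grows much more slowly than the inverse of their raw counts.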

harmonize_cell_types(data_df)

Manual harmonization of some cell types.

Parameters:

data_df (pandas.DataFrame) – DataFrame with a label id column.

map_cell_type_id2name(data_df)

Map cell type ontology ID to name.

Parameters:

data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.

remove_singleton_label_ids(data_df, n_studies=2)

Ensure labels exist in at least a minimum number of studies.

Parameters:
  • data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.

  • n_studies (int, default: 2) – The number of studies a label must exist in to be valid.
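The filtering rule described above can be sketched as follows: count the distinct studies in which each label appears, then keep only cells whose label reaches n_studies. This version works on a list of dicts rather than a DataFrame; keep_multi_study_labels is a hypothetical name, and the default column names are taken from the parameter defaults listed for the data module.

```python
from collections import defaultdict


def keep_multi_study_labels(rows, n_studies=2, label_key="cellTypeOntologyID", study_key="datasetID"):
    """Keep rows whose label occurs in at least `n_studies` distinct studies."""
    studies_per_label = defaultdict(set)
    for row in rows:
        studies_per_label[row[label_key]].add(row[study_key])
    return [row for row in rows if len(studies_per_label[row[label_key]]) >= n_studies]


rows = [
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "s1"},
    {"cellTypeOntologyID": "CL:0000084", "datasetID": "s2"},
    {"cellTypeOntologyID": "CL:0000236", "datasetID": "s1"},  # seen in one study only
]
kept = keep_multi_study_labels(rows)
```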

test_dataloader()

Load the test dataset.

Returns:

A DataLoader object containing the test dataset.

Return type:

DataLoader

train_dataloader()

Load the training dataset.

Returns:

A DataLoader object containing the training dataset.

Return type:

DataLoader

val_dataloader()

Load the validation dataset.

Returns:

A DataLoader object containing the validation dataset.

Return type:

DataLoader
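The three loaders depend on how cells are partitioned by study. A minimal sketch of such a partition, assuming that studies listed in val_studies are held out for validation and testing, that everything else is used for training, and that excluded studies and samples are dropped first. The function split_by_study is hypothetical and operates on a list of dicts rather than on TileDB data.

```python
def split_by_study(rows, val_studies=None, exclude_studies=None, exclude_samples=None,
                   study_key="datasetID", sample_key="sampleID"):
    """Return (train_rows, val_rows) after applying study and sample exclusions."""
    val_studies = set(val_studies or [])
    exclude_studies = set(exclude_studies or [])
    # exclude_samples has the form {study: [list of samples]}.
    exclude_samples = {k: set(v) for k, v in (exclude_samples or {}).items()}

    train, val = [], []
    for row in rows:
        study, sample = row[study_key], row[sample_key]
        if study in exclude_studies:
            continue
        if sample in exclude_samples.get(study, set()):
            continue
        (val if study in val_studies else train).append(row)
    return train, val


rows = [
    {"datasetID": "s1", "sampleID": "a"},
    {"datasetID": "s1", "sampleID": "b"},
    {"datasetID": "s2", "sampleID": "c"},
    {"datasetID": "s3", "sampleID": "d"},
]
train, val = split_by_study(
    rows,
    val_studies=["s2"],
    exclude_studies=["s3"],
    exclude_samples={"s1": ["b"]},
)
```

Splitting at the study level rather than the cell level keeps cells from one study out of both sets at once, so validation measures generalization to unseen studies.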

class scimilarity.tiledb_data_models.CellSampler(data_df, batch_size, n_batches, dynamic_weights=False, weight_decay=0.5, **kwargs)

Bases: Sampler[int]

Sampler class for composition of cells in minibatch.

Parameters:
  • data_df (pandas.DataFrame) – DataFrame with a "sampling_weight" column.

  • batch_size (int) – Batch size.

  • n_batches (int) – Number of batches to create. Should correspond to number of batches per epoch, as we are sampling with replacement.

  • dynamic_weights (bool, default: False) – Dynamically lower the sampling weight of seen cells.

  • weight_decay (float, default: 0.5) – Weight decay factor.
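The parameters above suggest weighted sampling with replacement, with an optional penalty on cells that have already been drawn. Below is a minimal standard-library sketch of that idea, not the actual CellSampler implementation: WeightedBatchSampler and its seed argument are hypothetical, it takes a list of weights instead of a DataFrame, and it yields whole batches of indices.

```python
import random


class WeightedBatchSampler:
    """Yield `n_batches` lists of cell indices drawn with replacement by weight."""

    def __init__(self, weights, batch_size, n_batches,
                 dynamic_weights=False, weight_decay=0.5, seed=None):
        self.weights = list(weights)
        self.batch_size = batch_size
        self.n_batches = n_batches
        self.dynamic_weights = dynamic_weights
        self.weight_decay = weight_decay
        self.rng = random.Random(seed)

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        indices = range(len(self.weights))
        for _ in range(self.n_batches):
            # random.choices samples with replacement, proportionally to weights.
            batch = self.rng.choices(indices, weights=self.weights, k=self.batch_size)
            if self.dynamic_weights:
                # Lower the weight of every cell seen in this batch (once per batch).
                for i in set(batch):
                    self.weights[i] *= self.weight_decay
            yield batch


sampler = WeightedBatchSampler([1.0, 1.0, 0.0], batch_size=4, n_batches=3, seed=0)
batches = list(sampler)
```

Because sampling is with replacement, an epoch has no natural end, which is why n_batches fixes the number of batches per epoch.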

class scimilarity.tiledb_data_models.scDataset(data_df)

Bases: Dataset

A class that represents cells in TileDB.

Parameters:

data_df (pandas.DataFrame) – Pandas dataframe of valid cells.