scimilarity.tiledb_data_models
- class scimilarity.tiledb_data_models.CellMultisetDataModule(*args, **kwargs)
Bases:
LightningDataModule
A class to encapsulate cells in TileDB to train the model.
- Parameters:
dataset_path (str) – Path to the directory containing the TileDB stores.
cell_metadata_uri (str, default: "cell_metadata") – Relative path to the cell metadata store.
gene_annotation_uri (str, default: "gene_annotation") – Relative path to the gene annotation store.
counts_uri (str, default: "counts") – Relative path to the counts matrix store.
gene_order (str) – Use a given gene order as described in the specified file. One gene symbol per line.
val_studies (List[str], optional, default: None) – List of studies to use as validation and test.
exclude_studies (List[str], optional, default: None) – List of studies to exclude.
exclude_samples (Dict[str, List[str]], optional, default: None) – Dict of samples to exclude in the form {study: [list of samples]}.
label_id_column (str, default: "cellTypeOntologyID") – Cell ontology ID column name.
study_column (str, default: "datasetID") – Study column name.
sample_column (str, default: "sampleID") – Sample column name.
batch_size (int, default: 1000) – Batch size.
num_workers (int, default: 1) – The number of worker processes for the DataLoaders.
lognorm (bool, default: True) – Whether to return log normalized expression instead of raw counts.
target_sum (float, default: 1e4) – Target sum for log normalization.
sparse (bool, default: False) – Use sparse matrices.
remove_singleton_classes (bool, default: True) – Exclude cells with classes that exist in only one study.
nan_string (str, default: "nan") – A string representing NaN.
sampler_cls (Sampler, default: CellSampler) – Sampler class to use for batching.
dataset_cls (Dataset, default: scDataset) – Base Dataset class to use.
n_batches (int, default: 100) – Number of batches to create in batch sampler. Should correspond to number of batches per epoch, as we are sampling with replacement.
pin_memory (bool, default: False) – If True, uses pin memory in the DataLoaders.
persistent_workers (bool, default: False) – If True, uses persistent workers in the DataLoaders.
filter_condition (str | None) – Filter condition for the cell metadata, in TileDB search syntax (see get_data).
Examples
>>> datamodule = CellMultisetDataModule(
...     dataset_path="/opt/cellarr_dataset",
...     label_id_column="id",
...     study_column="study",
...     batch_size=1000,
...     num_workers=1,
... )
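The `lognorm` and `target_sum` parameters point to the common count-normalization convention: scale each cell's raw counts so they sum to `target_sum`, then apply log(1 + x). The following is a minimal sketch of that convention in plain Python, assuming this is what the data module does; it is not the library's implementation, and the function name `lognorm_cell` is hypothetical.

```python
import math


def lognorm_cell(counts, target_sum=1e4):
    """Scale one cell's raw counts to `target_sum`, then apply log1p.

    Hypothetical helper for illustration; not part of scimilarity.
    """
    total = sum(counts)
    if total == 0:
        # A cell with no counts has nothing to scale; return zeros.
        return [0.0] * len(counts)
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


raw = [1, 1, 2]
# With target_sum equal to the cell total, the scale factor is 1.
normalized = lognorm_cell(raw, target_sum=4)
```

With `lognorm=False`, the data module is documented to return raw counts instead.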
- collate(batch)
Collate tensors.
- Parameters:
batch – Batch to collate.
- Returns:
Gene expression, labels, and studies
- Return type:
tuple
- get_data(filter_condition)
Filter the TileDB cell metadata according to a filter condition and return the valid cells.
- Parameters:
filter_condition (str) – A string that describes the filter condition in TileDB search syntax.
- get_sampler_weights(data_df, use_study=True, class_target_sum=10000.0, study_target_sum=1000000.0)
Get sampling weights and add to dataframe.
- Parameters:
data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.
use_study (bool, default: True) – Incorporate studies in the sampler weights.
class_target_sum (float, default: 1e4) – Target sum for normalization of class counts.
study_target_sum (float, default: 1e6) – Target sum for normalization of study counts.
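To make the role of `class_target_sum` and `study_target_sum` concrete, here is one plausible weighting scheme, sketched in plain Python. The formula (inverse of `log1p(count / target_sum)`) is an assumption chosen for illustration and may differ from what `get_sampler_weights` actually computes; the function name `sampler_weights` is hypothetical, and lists stand in for the DataFrame columns.

```python
import math
from collections import Counter


def sampler_weights(labels, studies=None,
                    class_target_sum=1e4, study_target_sum=1e6):
    """Return one sampling weight per cell.

    Assumed scheme: rarer classes (and, optionally, rarer studies)
    receive larger weights.
    """
    class_counts = Counter(labels)
    # Dampen raw counts so very large classes are not over-penalized.
    class_score = {k: math.log1p(n / class_target_sum)
                   for k, n in class_counts.items()}
    weights = [1.0 / class_score[label] for label in labels]

    if studies is not None:
        study_counts = Counter(studies)
        study_score = {k: math.log1p(n / study_target_sum)
                       for k, n in study_counts.items()}
        weights = [w / study_score[s] for w, s in zip(weights, studies)]
    return weights


labels = ["T cell", "T cell", "T cell", "B cell"]
studies = ["s1", "s1", "s2", "s2"]
weights = sampler_weights(labels, studies)
```

In this toy input both studies contribute two cells, so the only difference in weight comes from the class counts: the single "B cell" gets a larger weight than any "T cell".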
- harmonize_cell_types(data_df)
Manual harmonization of some cell types.
- Parameters:
data_df (pandas.DataFrame) – DataFrame with a label id column.
- map_cell_type_id2name(data_df)
Map cell type ontology ID to name.
- Parameters:
data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.
- remove_singleton_label_ids(data_df, n_studies=2)
Ensure labels exist in at least a minimum number of studies.
- Parameters:
data_df (pandas.DataFrame) – DataFrame with a label id column and optionally a study column.
n_studies (int, default: 2) – The number of studies a label must exist in to be valid.
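The filtering rule described above can be sketched without any dependencies. This is an illustration of the rule only, not the library's code: the function name `remove_singleton_labels` is hypothetical, and plain `(label, study)` tuples stand in for the DataFrame rows.

```python
from collections import defaultdict


def remove_singleton_labels(rows, n_studies=2):
    """Keep (label, study) rows whose label occurs in >= `n_studies` studies."""
    studies_per_label = defaultdict(set)
    for label, study in rows:
        studies_per_label[label].add(study)
    return [
        (label, study)
        for label, study in rows
        if len(studies_per_label[label]) >= n_studies
    ]


rows = [
    ("T cell", "s1"),
    ("T cell", "s2"),
    ("B cell", "s1"),
    ("B cell", "s1"),  # same study twice still counts as one study
]
kept = remove_singleton_labels(rows)
```

On a pandas DataFrame, the per-label study count is usually obtained with a group-by on the label column followed by `nunique()` on the study column.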
- test_dataloader()
Load the test dataset.
- Returns:
A DataLoader object containing the test dataset.
- Return type:
DataLoader
- class scimilarity.tiledb_data_models.CellSampler(data_df, batch_size, n_batches, dynamic_weights=False, weight_decay=0.5, **kwargs)
Bases:
Sampler[int]
Sampler class for composition of cells in minibatch.
- Parameters:
data_df (pandas.DataFrame) – DataFrame with column “sampling_weight”
batch_size (int) – Batch size
n_batches (int) – Number of batches to create. Should correspond to number of batches per epoch, as we are sampling with replacement.
dynamic_weights (bool, default: False) – Dynamically lower the sampling weight of seen cells.
weight_decay (float, default: 0.5) – Weight decay factor.
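The parameters above describe weighted sampling with replacement, where `n_batches` fixes the number of batches per epoch. A minimal standard-library sketch of that idea follows. It is not the `CellSampler` implementation: the class name `WeightedBatchSampler` and the `seed` argument are hypothetical, a plain list of weights stands in for the "sampling_weight" column, and the exact way `weight_decay` is applied to seen cells is an assumption.

```python
import random


class WeightedBatchSampler:
    """Yield `n_batches` lists of `batch_size` indices drawn with replacement."""

    def __init__(self, weights, batch_size, n_batches,
                 dynamic_weights=False, weight_decay=0.5, seed=None):
        self.weights = list(weights)
        self.batch_size = batch_size
        self.n_batches = n_batches
        self.dynamic_weights = dynamic_weights
        self.weight_decay = weight_decay
        self.rng = random.Random(seed)  # hypothetical: for reproducibility

    def __len__(self):
        # Batches per epoch are fixed, independent of the dataset size.
        return self.n_batches

    def __iter__(self):
        indices = range(len(self.weights))
        for _ in range(self.n_batches):
            batch = self.rng.choices(
                indices, weights=self.weights, k=self.batch_size
            )
            if self.dynamic_weights:
                # Assumed behaviour: every distinct cell seen in this batch
                # has its weight multiplied by `weight_decay` once.
                for i in set(batch):
                    self.weights[i] *= self.weight_decay
            yield batch


sampler = WeightedBatchSampler(
    [1.0, 1.0, 8.0], batch_size=4, n_batches=3, dynamic_weights=True, seed=0
)
batches = list(sampler)
```

Because sampling is with replacement, the same cell can appear more than once in a batch; lowering the weight of seen cells makes later batches more likely to include cells that have not yet been drawn.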