Preprocessing API Reference
This module provides comprehensive data preprocessing functions for spatial transcriptomics data analysis in SPEX.
Functions
preprocess
preprocess(adata, scale_max=10, size_factor=None, do_QC=False)
Comprehensive preprocessing pipeline for AnnData objects.
Performs quality control, normalization, feature selection, and scaling in a single function call.
Parameters:
Parameter | Type | Default | Description |
---|---|---|---|
adata | AnnData | - | AnnData object to preprocess |
scale_max | float | 10 | Maximum value for scaling (clips larger values) |
size_factor | float | None | Size factor for normalization. If None, uses median library size |
do_QC | bool | False | Whether to perform quality control filtering |
Returns:
Type | Description |
---|---|
AnnData | Preprocessed AnnData object |
Notes:
The preprocessing pipeline includes the following steps:
- Quality Control (if do_QC=True):
  - Filter cells with low gene counts using MAD threshold
  - Filter genes with low cell counts
  - Remove cells with low total counts
- Normalization:
  - Calculate size factors (median library size if not provided)
  - Normalize total counts per cell
  - Apply log1p transformation
- Feature Selection:
  - Identify highly variable genes
  - Account for batch effects if present
- Scaling:
  - Z-transform data
  - Clip values to scale_max
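The normalization and scaling steps listed above can be illustrated with a small standard-library sketch. It is not the SPEX implementation: the function name `normalize_log_scale` is hypothetical, the input is a plain list of per-cell count rows rather than an AnnData object, quality control and feature selection are skipped, and the population standard deviation is an assumption.

```python
import math
from statistics import mean, median, pstdev


def normalize_log_scale(counts, size_factor=None, scale_max=10.0):
    """Normalize, log1p-transform, z-score, and clip a cells x genes table.

    `counts` is a list of rows (one per cell), each a list of per-gene counts.
    Hypothetical helper for illustration only.
    """
    totals = [sum(row) for row in counts]
    if size_factor is None:
        # Default described in the reference: the median library size.
        size_factor = median(totals)

    # Scale every cell to the same total, then apply log1p.
    logged = [
        [math.log1p(value / total * size_factor) if total > 0 else 0.0
         for value in row]
        for row, total in zip(counts, totals)
    ]

    # Z-transform each gene (column), then clip values above scale_max.
    n_genes = len(counts[0]) if counts else 0
    scaled = [[0.0] * n_genes for _ in counts]
    for g in range(n_genes):
        column = [row[g] for row in logged]
        mu, sd = mean(column), pstdev(column)  # population std: an assumption
        for i, value in enumerate(column):
            z = (value - mu) / sd if sd > 0 else 0.0
            scaled[i][g] = min(z, scale_max)
    return scaled


# Three cells, three genes; the third gene is constant across cells.
toy_counts = [[10, 0, 5], [0, 10, 5], [5, 5, 5]]
scaled = normalize_log_scale(toy_counts, scale_max=0.5)
```

In this sketch a gene with zero variance is mapped to 0 rather than dividing by zero, and only values above scale_max are clipped, matching the parameter description.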
Examples:
import spex as sp
import scanpy as sc
# Load data
adata = sc.read_h5ad("data.h5ad")
# Basic preprocessing
adata = sp.preprocess(adata)
# Preprocessing with quality control
adata = sp.preprocess(
    adata,
    scale_max=10,
    do_QC=True
)
# Custom size factor
adata = sp.preprocess(
    adata,
    size_factor=10000,
    do_QC=True
)
Storage:
Results are stored in the AnnData object:
- adata.raw: Original data backup
- adata.uns['prepro']: Preprocessing parameters
- adata.var.highly_variable: Highly variable gene flags
- adata.layers['counts']: Normalized count data
MAD_threshold
MAD_threshold(variable, ndevs=2.5)
Calculate threshold using Median Absolute Deviation (MAD).
This function calculates a robust threshold value using the MAD method, which is less sensitive to outliers than standard deviation-based methods.
Parameters:
Parameter | Type | Default | Description |
---|---|---|---|
variable | array-like | - | Input data array |
ndevs | float | 2.5 | Number of MAD deviations from the median |
Returns:
Type | Description |
---|---|
float | Threshold value calculated as median - ndevs * MAD |
Notes:
- MAD is calculated as median(|x - median(x)|)
- The threshold is computed as median - ndevs * MAD
- This method is robust to outliers and non-normal distributions
- Commonly used for quality control filtering in single-cell data
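The formula in the notes above can be checked by hand with a few lines of standard-library Python. This is a minimal sketch of the stated formula, not the SPEX source. The name `mad_threshold` and the toy values are illustrative, and no scaling constant is applied to the MAD because the notes do not mention one.

```python
from statistics import median


def mad_threshold(values, ndevs=2.5):
    """Return median - ndevs * MAD for a sequence of numbers."""
    values = list(values)
    center = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median(abs(v - center) for v in values)
    return center - ndevs * mad


# Toy per-cell total counts with one obviously low-quality cell (90).
total_counts = [1200, 1500, 1350, 1420, 90, 1480]
threshold = mad_threshold(total_counts, ndevs=2.5)

# Keep only cells strictly above the lower threshold.
kept = [c for c in total_counts if c > threshold]
```

For these values the median is 1385 and the MAD is 105, so the threshold is 1385 - 2.5 * 105 = 1122.5 and only the cell with 90 counts is removed.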
Examples:
import spex as sp
import numpy as np
# Calculate threshold for total counts
total_counts = adata.obs['total_counts']
threshold = sp.MAD_threshold(total_counts, ndevs=2.5)
# Filter cells below threshold
filtered_adata = adata[adata.obs['total_counts'] > threshold, :]
# Use different threshold for genes
n_genes = adata.obs['n_genes_by_counts']
gene_threshold = sp.MAD_threshold(n_genes, ndevs=1.5)
should_batch_correct
should_batch_correct(adata)
Check if batch correction should be performed.
This function checks if batch correction is needed by looking for batch information in the AnnData object.
Parameters:
Parameter | Type | Description |
---|---|---|
adata | AnnData | AnnData object to check for batch information |
Returns:
Type | Description |
---|---|
bool | True if batch correction should be performed, False otherwise |
Notes:
- Checks for 'batch_key' in adata.uns
- Returns True if batch_key exists and is not None
- Used internally by preprocessing functions to determine batch correction strategy
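The rule described in these notes amounts to a single dictionary lookup. The sketch below assumes that behavior and uses `SimpleNamespace` stand-ins that expose only an `.uns` mapping. The function name `has_batch_key` is hypothetical, and the actual SPEX function may perform additional checks.

```python
from types import SimpleNamespace


def has_batch_key(adata):
    """Return True when adata.uns holds a non-None 'batch_key' entry."""
    return adata.uns.get("batch_key") is not None


# Stand-in objects exposing only the `.uns` mapping used by the check.
with_batch = SimpleNamespace(uns={"batch_key": "batch"})
unset_batch = SimpleNamespace(uns={"batch_key": None})
no_batch = SimpleNamespace(uns={})

results = [has_batch_key(a) for a in (with_batch, unset_batch, no_batch)]
```

Only the first stand-in qualifies: a key that is present but set to None is treated the same as a missing key.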
Examples:
import spex as sp
# Check if batch correction is needed
needs_batch_correction = sp.should_batch_correct(adata)
if needs_batch_correction:
    print(f"Batch correction will be performed using: {adata.uns['batch_key']}")
else:
    print("No batch correction needed")
reduce_dimensionality
reduce_dimensionality(adata, prefilter=False, method='pca', mdist=0.5, n_neighbors=None, latent_dim=None)
Reduce dimensionality of AnnData object.
This function performs dimensionality reduction using one of PCA, UMAP, t-SNE, scVI, or diffusion maps. It also constructs a neighborhood graph for downstream analysis.
Parameters:
Parameter | Type | Default | Description |
---|---|---|---|
adata | AnnData | - | AnnData object to reduce dimensionality |
prefilter | bool | False | Whether to prefilter highly variable genes |
method | str | 'pca' | Dimensionality reduction method. Options: 'pca', 'umap', 'tsne', 'scvi', 'diff_map' |
mdist | float | 0.5 | Minimum distance for UMAP |
n_neighbors | int | None | Number of neighbors for UMAP/t-SNE. If None, uses sqrt(n_cells) |
latent_dim | int | None | Number of latent dimensions. If None, uses automatic estimation |
Returns:
Type | Description |
---|---|
AnnData | Updated AnnData object with dimensionality reduction results |
Notes:
Supported Methods:
- PCA: Principal Component Analysis (fastest, most common)
- UMAP: Uniform Manifold Approximation and Projection (good for visualization)
- t-SNE: t-Distributed Stochastic Neighbor Embedding (good for visualization)
- scVI: Single-Cell Variational Inference (good for batch correction)
- diff_map: Diffusion maps (good for trajectory analysis)
Automatic Parameter Estimation:
- n_neighbors: If None, uses sqrt(n_cells)
- latent_dim: If None, uses elbow method on singular values
Batch Correction:
- If batch information is present, applies Harmony integration
- Results stored in adata.obsm['X_pca_harmony']
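The two automatic defaults described above can be sketched as follows. The reference does not say how sqrt(n_cells) is rounded or which elbow criterion is applied, so both choices here are assumptions: the value is rounded to the nearest integer, and the elbow is taken as the point farthest from the straight line joining the first and last singular values. The function names are hypothetical.

```python
import math


def default_n_neighbors(n_cells):
    """sqrt(n_cells), rounded to the nearest integer (rounding is assumed)."""
    return max(1, round(math.sqrt(n_cells)))


def elbow_n_components(singular_values):
    """Number of components at the elbow of a decreasing spectrum.

    Uses the maximum-distance-to-chord heuristic, one common elbow
    criterion; the criterion used by SPEX may differ.
    """
    n = len(singular_values)
    if n < 3:
        return n
    y0, y1 = singular_values[0], singular_values[-1]
    dx, dy = n - 1, y1 - y0
    norm = math.hypot(dx, dy)
    # Perpendicular distance of each point (i, y) from the chord.
    distances = [
        abs(dx * (y - y0) - dy * i) / norm
        for i, y in enumerate(singular_values)
    ]
    return distances.index(max(distances)) + 1


n_neighbors = default_n_neighbors(10_000)
latent_dim = elbow_n_components([10, 5, 2, 1.9, 1.8, 1.7])
```

With 10,000 cells the neighbor count is 100. For the toy spectrum the curve flattens after the third value, so three components are kept.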
Examples:
import spex as sp
# Basic PCA reduction
adata = sp.reduce_dimensionality(adata, method='pca')
# UMAP with custom parameters
adata = sp.reduce_dimensionality(
    adata,
    method='umap',
    n_neighbors=30,
    latent_dim=50,
    mdist=0.3
)
# scVI for batch correction
adata = sp.reduce_dimensionality(
    adata,
    method='scvi',
    latent_dim=20
)
Storage:
Results are stored in the AnnData object:
- adata.obsm['X_pca']: PCA coordinates
- adata.obsm['X_umap']: UMAP coordinates
- adata.obsm['X_tsne']: t-SNE coordinates (if computed)
- adata.obsm['X_scvi']: scVI latent space (if computed)
- adata.obsp['connectivities']: Neighborhood graph
- adata.obsp['distances']: Distance matrix
load_anndata
load_anndata(path=None, files=None)
Load AnnData objects from file(s).
This function loads AnnData objects from single or multiple files. When loading multiple files, they are combined into a single AnnData object.
Parameters:
Parameter | Type | Default | Description |
---|---|---|---|
path | str | None | Path to a single AnnData file (.h5ad) |
files | list | None | List of file paths to load and combine |
Returns:
Type | Description |
---|---|
dict | Dictionary whose 'adata' key holds the loaded (or combined) AnnData object |
Notes:
- If both path and files are provided, files takes precedence
- When combining multiple files, a 'filename' column is added to adata.obs
- Files are combined using scanpy.concat with outer join
- Supports .h5ad format files
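The argument precedence and the outer join in these notes can be illustrated without reading any files. The sketch below is a pure-Python stand-in, not the SPEX loader: `resolve_input_files` and `outer_join` are hypothetical names, cells are plain dictionaries mapping gene to value, an empty `files` list falls back to `path`, and missing genes are filled with NaN. The last two points are assumptions, and the fill value of the real concatenation may differ.

```python
import math


def resolve_input_files(path=None, files=None):
    """Return the list of files to load; `files` wins over `path`."""
    if files:  # assumption: an empty list falls back to `path`
        return list(files)
    if path is not None:
        return [path]
    raise ValueError("Provide either `path` or `files`.")


def outer_join(samples):
    """Combine cells from several samples over the union of their genes.

    `samples` maps a filename to a list of cell records (gene -> value).
    """
    genes = sorted({g for cells in samples.values() for c in cells for g in c})
    combined = []
    for filename, cells in samples.items():
        for cell in cells:
            # Genes absent from this sample are filled with NaN (assumed).
            record = {gene: cell.get(gene, math.nan) for gene in genes}
            record["filename"] = filename  # mirrors the added obs column
            combined.append(record)
    return genes, combined


samples = {
    "sample1.h5ad": [{"CD3E": 2.0, "CD8A": 1.0}],
    "sample2.h5ad": [{"CD3E": 0.5, "MS4A1": 3.0}],
}
genes, combined = outer_join(samples)
```

An outer join keeps every gene seen in any file, so the combined table has three genes even though each sample measured only two.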
Examples:
import spex as sp
# Load single file
result = sp.load_anndata(path="data.h5ad")
adata = result['adata']
# Load multiple files
files = ["sample1.h5ad", "sample2.h5ad", "sample3.h5ad"]
result = sp.load_anndata(files=files)
combined_adata = result['adata']
# Check combined data
print(f"Combined shape: {combined_adata.shape}")
print(f"Files included: {combined_adata.obs['filename'].unique()}")
Best Practices
Quality Control
- Always perform QC before analysis:
    adata = sp.preprocess(adata, do_QC=True)
- Check data quality metrics:
    import scanpy as sc
    sc.pp.calculate_qc_metrics(adata, inplace=True)
    print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")
- Use appropriate thresholds:
    # Conservative thresholds for high-quality data
    threshold_counts = sp.MAD_threshold(adata.obs['total_counts'], ndevs=2.0)
    threshold_genes = sp.MAD_threshold(adata.obs['n_genes_by_counts'], ndevs=1.5)
Dimensionality Reduction
- Choose appropriate method:
  - PCA: For fast preprocessing and clustering
  - UMAP: For visualization and exploration
  - scVI: For batch correction and integration
- Estimate parameters automatically:
    # Let the function estimate optimal parameters
    adata = sp.reduce_dimensionality(adata, method='pca')
- Check results:
    # Verify dimensionality reduction worked
    print(f"PCA components: {adata.obsm['X_pca'].shape}")
    print(f"UMAP coordinates: {adata.obsm['X_umap'].shape}")
Batch Correction
- Set up batch information:
    adata.uns['batch_key'] = 'batch'
- Use batch-aware preprocessing:
    # Preprocessing will automatically detect and handle batches
    adata = sp.preprocess(adata, do_QC=True)
- Verify batch correction:
    # Check if batch correction was applied
    if 'X_pca_harmony' in adata.obsm:
        print("Batch correction applied successfully")
Troubleshooting
Common Issues
- Memory errors during preprocessing:
  - Reduce scale_max parameter
  - Use data subsampling
  - Increase available RAM
- Slow dimensionality reduction:
  - Use prefilter=True to reduce gene number
  - Choose faster methods (PCA over UMAP)
  - Reduce latent_dim parameter
- Batch correction not working:
  - Ensure adata.uns['batch_key'] is set
  - Check that batch column exists in adata.obs
  - Verify batch column contains valid values
- File loading errors:
  - Check file paths are correct
  - Ensure files are in .h5ad format
  - Verify file permissions
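The three batch-correction checks listed above can be combined into one diagnostic helper. The sketch below is hypothetical and not part of SPEX. It takes `uns` and `obs` as plain mappings so that it runs without AnnData, but the same membership and indexing operations also work on `adata.uns` and on the `adata.obs` DataFrame. Treating "valid values" as "no missing entries and at least two distinct batches" is an assumption.

```python
import math


def _is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def diagnose_batch_setup(uns, obs):
    """Return a list of problems; an empty list means all checks passed."""
    key = uns.get("batch_key")
    if key is None:
        return ["adata.uns['batch_key'] is not set"]
    if key not in obs:
        return [f"column {key!r} not found in adata.obs"]

    problems = []
    values = list(obs[key])
    n_missing = sum(_is_missing(v) for v in values)
    if n_missing:
        problems.append(f"{n_missing} cells have a missing value in {key!r}")
    # Assumption: correction is only meaningful with two or more batches.
    if len({v for v in values if not _is_missing(v)}) < 2:
        problems.append(f"column {key!r} has fewer than two distinct batches")
    return problems


# Toy stand-ins for adata.uns and adata.obs.
uns = {"batch_key": "batch"}
obs = {"batch": ["run1", "run2", None, "run1"]}
report = diagnose_batch_setup(uns, obs)
```

Running the helper before preprocessing shows which of the three conditions fails. In the toy input, one cell lacks a batch label.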
Performance Optimization
- Use appropriate data types:
    # Convert to float32 for memory efficiency
    adata.X = adata.X.astype('float32')
- Subsample large datasets:
    # Random subsampling
    import scanpy as sc
    sc.pp.subsample(adata, n_obs=10000, random_state=42)
- Save intermediate results:
    # Save after preprocessing
    adata.write("preprocessed_data.h5ad")
Integration with Other Modules
With Clustering
# Complete preprocessing for clustering
adata = sp.preprocess(adata, do_QC=True)
adata = sp.reduce_dimensionality(adata, method='pca')
adata = sp.cluster(adata, method='leiden')
With Spatial Analysis
# Preprocessing for spatial analysis
adata = sp.preprocess(adata, do_QC=True)
adata = sp.reduce_dimensionality(adata, method='pca')
# Spatial analysis functions can now use preprocessed data
With Segmentation Results
# Load and preprocess feature extraction results
adata = sp.load_anndata(path="segmentation_features.h5ad")['adata']
adata = sp.preprocess(adata, do_QC=True)