Preprocessing API Reference¶
This module provides comprehensive data preprocessing functions for spatial omics data analysis (proteomics and transcriptomics) in SPEX.
Functions¶
preprocess¶
preprocess(adata, scale_max=10, size_factor=None, do_QC=False)
Comprehensive preprocessing pipeline for AnnData objects.
Performs quality control, normalization, feature selection, and scaling in a single function call.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| adata | AnnData | - | AnnData object to preprocess |
| scale_max | float | 10 | Maximum value for scaling (clips larger values) |
| size_factor | float | None | Size factor for normalization. If None, uses median library size |
| do_QC | bool | False | Whether to perform quality control filtering |
Returns:
| Type | Description |
|---|---|
| AnnData | Preprocessed AnnData object |
Notes:
The preprocessing pipeline includes the following steps:
- Quality Control (if do_QC=True):
  - Filter cells with low gene counts using MAD threshold
  - Filter genes with low cell counts
  - Remove cells with low total counts
- Normalization:
  - Calculate size factors (median library size if not provided)
  - Normalize total counts per cell
  - Apply log1p transformation
- Feature Selection:
  - Identify highly variable genes
  - Account for batch effects if present
- Scaling:
  - Z-transform data
  - Clip values to scale_max
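The normalization and scaling steps listed above can be illustrated without SPEX or scanpy. The following is a minimal pure-Python sketch, not SPEX's implementation: `normalize_log_scale` is a hypothetical helper, the median library size default follows the parameter table, and the use of the population standard deviation for the z-transform is an assumption.

```python
import math
from statistics import mean, median, pstdev


def normalize_log_scale(counts, size_factor=None, scale_max=10.0):
    """Normalize, log1p-transform, z-scale, and clip a cells x genes count table.

    Hypothetical helper for illustration only; `counts` is a list of rows,
    one row of raw counts per cell.
    """
    totals = [sum(row) for row in counts]
    # Assumption: with no size factor given, use the median library size.
    target = median(totals) if size_factor is None else size_factor

    # Rescale each cell to the target total, then apply log1p.
    logged = [
        [math.log1p(v * target / total) if total > 0 else 0.0 for v in row]
        for row, total in zip(counts, totals)
    ]

    # Z-transform each gene (column) and clip large values to scale_max.
    n_genes = len(counts[0])
    scaled = [[0.0] * n_genes for _ in counts]
    for g in range(n_genes):
        column = [row[g] for row in logged]
        mu, sd = mean(column), pstdev(column)
        for c, value in enumerate(column):
            z = (value - mu) / sd if sd > 0 else 0.0
            scaled[c][g] = min(z, scale_max)
    return scaled


# Three cells, three genes of made-up counts.
counts = [[1, 0, 3], [2, 2, 0], [0, 5, 5]]
scaled = normalize_log_scale(counts, scale_max=10)
```

Only the upper tail is clipped here, matching the "clips larger values" description of scale_max; whether SPEX also clips large negative values is not stated in this reference.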
Examples:
import spex as sp
import scanpy as sc
# Load data
adata = sc.read_h5ad("data.h5ad")
# Basic preprocessing
adata = sp.preprocess(adata)
# Preprocessing with quality control
adata = sp.preprocess(
    adata,
    scale_max=10,
    do_QC=True
)
# Custom size factor
adata = sp.preprocess(
    adata,
    size_factor=10000,
    do_QC=True
)
Storage:
Results are stored in the AnnData object:
- adata.raw: Original data backup
- adata.uns['prepro']: Preprocessing parameters
- adata.var.highly_variable: Highly variable gene flags
- adata.layers['counts']: Normalized count data
MAD_threshold¶
MAD_threshold(variable, ndevs=2.5)
Calculate threshold using Median Absolute Deviation (MAD).
This function calculates a robust threshold value using the MAD method, which is less sensitive to outliers than standard deviation-based methods.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| variable | array-like | - | Input data array |
| ndevs | float | 2.5 | Number of MAD deviations from the median |
Returns:
| Type | Description |
|---|---|
| float | Threshold value calculated as median - ndevs * MAD |
Notes:
- MAD is calculated as median(|x - median(x)|)
- The threshold is computed as median - ndevs * MAD
- This method is robust to outliers and non-normal distributions
- Commonly used for quality control filtering in single-cell data
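The formula in the notes above can be checked by hand on a small example. This is a standard-library sketch of the stated formula, not the SPEX source; `mad_threshold` is a hypothetical stand-in for `MAD_threshold`, and the count values are made up.

```python
from statistics import median


def mad_threshold(values, ndevs=2.5):
    """Lower threshold computed as median - ndevs * MAD (illustrative only)."""
    values = list(values)
    med = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    return med - ndevs * mad


# Five typical cells and one low-count outlier (made-up totals).
total_counts = [980, 1010, 1000, 990, 1020, 150]
threshold = mad_threshold(total_counts, ndevs=2.5)

# Keep only cells above the threshold, as in the filtering example.
kept = [v for v in total_counts if v > threshold]
```

Here the median is 995 and the MAD is 15, so the threshold is 995 - 2.5 * 15 = 957.5 and only the outlier at 150 is dropped. The outlier barely moves the median or the MAD, which is what makes the threshold robust.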
Examples:
import spex as sp
import numpy as np
# Calculate threshold for total counts
total_counts = adata.obs['total_counts']
threshold = sp.MAD_threshold(total_counts, ndevs=2.5)
# Filter cells below threshold
filtered_adata = adata[adata.obs['total_counts'] > threshold, :]
# Use a stricter threshold for the number of genes detected per cell
n_genes = adata.obs['n_genes_by_counts']
gene_threshold = sp.MAD_threshold(n_genes, ndevs=1.5)
should_batch_correct¶
should_batch_correct(adata)
Check if batch correction should be performed.
This function checks if batch correction is needed by looking for batch information in the AnnData object.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| adata | AnnData | AnnData object to check for batch information |
Returns:
| Type | Description |
|---|---|
| bool | True if batch correction should be performed, False otherwise |
Notes:
- Checks for 'batch_key' in adata.uns
- Returns True if batch_key exists and is not None
- Used internally by preprocessing functions to determine batch correction strategy
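The rule described in these notes amounts to a single dictionary lookup. A minimal sketch, assuming a plain dict in place of adata.uns (the name `has_batch_key` is hypothetical and this is not the SPEX source):

```python
def has_batch_key(uns):
    """Return True when `uns` holds a 'batch_key' entry that is not None."""
    return uns.get("batch_key") is not None


# Plain dicts standing in for adata.uns.
no_key = has_batch_key({})
empty_key = has_batch_key({"batch_key": None})
with_key = has_batch_key({"batch_key": "batch"})
```

Only the last case reports that batch correction should run; a missing key and a key set to None are treated the same way.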
Examples:
import spex as sp
# Check if batch correction is needed
needs_batch_correction = sp.should_batch_correct(adata)
if needs_batch_correction:
    print(f"Batch correction will be performed using: {adata.uns['batch_key']}")
else:
    print("No batch correction needed")
reduce_dimensionality¶
reduce_dimensionality(adata, prefilter=False, method='pca', mdist=0.5, n_neighbors=None, latent_dim=None)
Reduce dimensionality of AnnData object.
This function performs dimensionality reduction using one of several methods: PCA, UMAP, t-SNE, scVI, or diffusion maps. It also constructs neighborhood graphs for downstream analysis.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| adata | AnnData | - | AnnData object to reduce dimensionality |
| prefilter | bool | False | Whether to prefilter highly variable genes |
| method | str | 'pca' | Dimensionality reduction method. Options: 'pca', 'umap', 'tsne', 'scvi', 'diff_map' |
| mdist | float | 0.5 | Minimum distance for UMAP |
| n_neighbors | int | None | Number of neighbors for UMAP/t-SNE. If None, uses sqrt(n_cells) |
| latent_dim | int | None | Number of latent dimensions. If None, uses automatic estimation |
Returns:
| Type | Description |
|---|---|
| AnnData | Updated AnnData object with dimensionality reduction results |
Notes:
Supported Methods:
- PCA: Principal Component Analysis (fastest, most common)
- UMAP: Uniform Manifold Approximation and Projection (good for visualization)
- t-SNE: t-Distributed Stochastic Neighbor Embedding (good for visualization)
- scVI: Single-Cell Variational Inference (good for batch correction)
- diff_map: Diffusion maps (good for trajectory analysis)
Automatic Parameter Estimation:
- n_neighbors: If None, uses sqrt(n_cells)
- latent_dim: If None, uses elbow method on singular values
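Both defaults can be sketched in a few lines. This reference does not say how SPEX rounds sqrt(n_cells) or which elbow criterion it applies, so everything below is an assumption: `default_n_neighbors` rounds to the nearest integer, and `elbow_index` uses one common heuristic (the point lying furthest below the straight line joining the first and last singular values). Both names are hypothetical.

```python
import math


def default_n_neighbors(n_cells):
    """sqrt(n_cells), rounded; the rounding and the floor of 2 are assumptions."""
    return max(2, round(math.sqrt(n_cells)))


def elbow_index(singular_values):
    """Index of the value furthest below the chord from first to last value.

    One common elbow heuristic, assumed here for illustration.
    Expects values sorted in decreasing order.
    """
    n = len(singular_values)
    if n < 3:
        return n - 1
    first, last = singular_values[0], singular_values[-1]
    slope = (last - first) / (n - 1)
    # Vertical gap between the chord and each value; the largest gap is the elbow.
    gaps = [first + slope * i - v for i, v in enumerate(singular_values)]
    return max(range(n), key=gaps.__getitem__)


n_neighbors = default_n_neighbors(10_000)
# Made-up singular values that flatten out after the fourth component.
elbow = elbow_index([10.0, 6.0, 3.0, 1.5, 1.2, 1.0, 0.9, 0.8])
```

With 10,000 cells the neighbor count comes out at 100, and the made-up spectrum bends at index 3. Passing n_neighbors or latent_dim explicitly bypasses these estimates.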
Batch Correction:
- If batch information is present, applies Harmony integration
- Results stored in adata.obsm['X_pca_harmony']
Examples:
import spex as sp
# Basic PCA reduction
adata = sp.reduce_dimensionality(adata, method='pca')
# UMAP with custom parameters
adata = sp.reduce_dimensionality(
    adata,
    method='umap',
    n_neighbors=30,
    latent_dim=50,
    mdist=0.3
)
# scVI for batch correction
adata = sp.reduce_dimensionality(
    adata,
    method='scvi',
    latent_dim=20
)
Storage:
Results are stored in the AnnData object:
- adata.obsm['X_pca']: PCA coordinates
- adata.obsm['X_umap']: UMAP coordinates
- adata.obsm['X_tsne']: t-SNE coordinates (if computed)
- adata.obsm['X_scvi']: scVI latent space (if computed)
- adata.obsp['connectivities']: Neighborhood graph
- adata.obsp['distances']: Distance matrix
load_anndata¶
load_anndata(path=None, files=None)
Load AnnData objects from file(s).
This function loads AnnData objects from single or multiple files. When loading multiple files, they are combined into a single AnnData object.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | None | Path to a single AnnData file (.h5ad) |
| files | list | None | List of file paths to load and combine |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with the key 'adata', holding the loaded (or combined) AnnData object |
Notes:
- If both path and files are provided, files takes precedence
- When combining multiple files, a 'filename' column is added to adata.obs
- Files are combined using scanpy.concat with outer join
- Supports .h5ad format files
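The precedence rule and the outer join described above can be illustrated with plain Python containers. This is a sketch of the behavior, not the SPEX loader: `resolve_inputs` and `concat_outer` are hypothetical, raising an error when neither argument is given is an assumption, and the fill value of 0.0 for features missing from a file is likewise chosen only for the example.

```python
def resolve_inputs(path=None, files=None):
    """Pick the inputs to load; `files` wins when both are given."""
    if files:
        return list(files)
    if path is not None:
        return [path]
    # Assumption: this reference does not say what happens with no input.
    raise ValueError("Provide either `path` or `files`")


def concat_outer(tables, fill_value=0.0):
    """Outer-join per-file tables of {feature: value} rows.

    Returns the union of feature names, the combined matrix, and one
    filename per row (mirroring the 'filename' column in adata.obs).
    """
    features = []
    for rows in tables.values():
        for row in rows:
            for name in row:
                if name not in features:
                    features.append(name)
    matrix, filenames = [], []
    for filename, rows in tables.items():
        for row in rows:
            matrix.append([row.get(name, fill_value) for name in features])
            filenames.append(filename)
    return features, matrix, filenames


chosen = resolve_inputs(path="data.h5ad", files=["sample1.h5ad", "sample2.h5ad"])
features, matrix, filenames = concat_outer({
    "sample1.h5ad": [{"CD3": 1.0, "CD8": 2.0}],
    "sample2.h5ad": [{"CD3": 3.0, "CD20": 4.0}],
})
```

An outer join keeps every feature seen in any file, so features measured in only some samples come back with filled-in values that may need handling before analysis.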
Examples:
import spex as sp
# Load single file
result = sp.load_anndata(path="data.h5ad")
adata = result['adata']
# Load multiple files
files = ["sample1.h5ad", "sample2.h5ad", "sample3.h5ad"]
result = sp.load_anndata(files=files)
combined_adata = result['adata']
# Check combined data
print(f"Combined shape: {combined_adata.shape}")
print(f"Files included: {combined_adata.obs['filename'].unique()}")
Best Practices¶
Quality Control¶
- Always perform QC before analysis:
  adata = sp.preprocess(adata, do_QC=True)
- Check data quality metrics:
  import scanpy as sc
  sc.pp.calculate_qc_metrics(adata, inplace=True)
  print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")
- Use appropriate thresholds:
  # Conservative thresholds for high-quality data
  threshold_counts = sp.MAD_threshold(adata.obs['total_counts'], ndevs=2.0)
  threshold_genes = sp.MAD_threshold(adata.obs['n_genes_by_counts'], ndevs=1.5)
Dimensionality Reduction¶
- Choose appropriate method:
  - PCA: For fast preprocessing and clustering
  - UMAP: For visualization and exploration
  - scVI: For batch correction and integration
- Estimate parameters automatically:
  # Let the function estimate optimal parameters
  adata = sp.reduce_dimensionality(adata, method='pca')
- Check results:
  # Verify dimensionality reduction worked
  print(f"PCA components: {adata.obsm['X_pca'].shape}")
  print(f"UMAP coordinates: {adata.obsm['X_umap'].shape}")
Batch Correction¶
- Set up batch information:
  adata.uns['batch_key'] = 'batch'
- Use batch-aware preprocessing:
  # Preprocessing will automatically detect and handle batches
  adata = sp.preprocess(adata, do_QC=True)
- Verify batch correction:
  # Check if batch correction was applied
  if 'X_pca_harmony' in adata.obsm:
      print("Batch correction applied successfully")
Troubleshooting¶
Common Issues¶
- Memory errors during preprocessing:
  - Reduce scale_max parameter
  - Use data subsampling
  - Increase available RAM
- Slow dimensionality reduction:
  - Use prefilter=True to reduce gene number
  - Choose faster methods (PCA over UMAP)
  - Reduce latent_dim parameter
- Batch correction not working:
  - Ensure adata.uns['batch_key'] is set
  - Check that batch column exists in adata.obs
  - Verify batch column contains valid values
- File loading errors:
  - Check file paths are correct
  - Ensure files are in .h5ad format
  - Verify file permissions
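The three checks listed under "Batch correction not working" can be run in one pass. The sketch below uses plain dicts in place of adata.uns and adata.obs; `diagnose_batch_setup` is a hypothetical helper, and reading "valid values" as "no missing entries and at least two distinct batches" is an assumption.

```python
def diagnose_batch_setup(uns, obs):
    """Return a list of problems with the batch configuration (empty if none).

    `uns` stands in for adata.uns and `obs` maps column names to lists of
    per-cell values; both are plain dicts in this illustration.
    """
    key = uns.get("batch_key")
    if key is None:
        return ["uns['batch_key'] is not set"]
    if key not in obs:
        return [f"obs has no column named {key!r}"]

    problems = []
    values = obs[key]
    if any(v is None for v in values):
        problems.append(f"obs[{key!r}] contains missing values")
    # Assumption: correction needs at least two distinct batches.
    if len({v for v in values if v is not None}) < 2:
        problems.append(f"obs[{key!r}] has fewer than two distinct batches")
    return problems


ok = diagnose_batch_setup({"batch_key": "batch"}, {"batch": ["a", "b", "a"]})
missing_column = diagnose_batch_setup({"batch_key": "batch"}, {"sample": ["a"]})
```

With a real AnnData object the same questions would be asked of adata.uns and the columns of adata.obs.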
Performance Optimization¶
- Use appropriate data types:
  # Convert to float32 for memory efficiency
  adata.X = adata.X.astype('float32')
- Subsample large datasets:
  # Random subsampling
  import scanpy as sc
  sc.pp.subsample(adata, n_obs=10000, random_state=42)
- Save intermediate results:
  # Save after preprocessing
  adata.write("preprocessed_data.h5ad")
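The float32 suggestion above halves the storage per value at the cost of some precision. A small standard-library sketch shows both effects; it uses the `array` module instead of NumPy only so that it runs without dependencies, and the sample values are made up.

```python
from array import array

# Values chosen to be exactly representable in both precisions.
values = [0.5, 1.25, 3.0, 7.75]
as_float64 = array("d", values)  # 8 bytes per value
as_float32 = array("f", values)  # 4 bytes per value

bytes64 = as_float64.itemsize * len(as_float64)
bytes32 = as_float32.itemsize * len(as_float32)

# 0.1 has no exact binary representation, so float32 stores a coarser approximation.
rounded = array("f", [0.1])[0]
precision_lost = rounded != 0.1
```

For log-normalized and scaled expression values this loss of precision is usually small relative to measurement noise, but it is worth keeping raw counts in their original type.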
Integration with Other Modules¶
With Clustering¶
# Complete preprocessing for clustering
adata = sp.preprocess(adata, do_QC=True)
adata = sp.reduce_dimensionality(adata, method='pca')
adata = sp.cluster(adata, method='leiden')
With Spatial Analysis¶
# Preprocessing for spatial analysis
adata = sp.preprocess(adata, do_QC=True)
adata = sp.reduce_dimensionality(adata, method='pca')
# Spatial analysis functions can now use preprocessed data
With Segmentation Results¶
# Load and preprocess feature extraction results
adata = sp.load_anndata(path="segmentation_features.h5ad")['adata']
adata = sp.preprocess(adata, do_QC=True)