Example application of quipcell to human lung cell atlas¶

In this vignette, we demonstrate the application of quipcell to bulk deconvolution, replicating some analyses from the manuscript.

For our dataset, we use the Human Lung Cell Atlas, which consists of several studies. We hold out one of the studies as a test set, which we use to generate (pseudo)bulk samples with known ground truth.

To make this analysis runnable on a laptop, the data has been preprocessed into a smaller form (just the raw counts of the highly variable genes). See the vignette readme for instructions on how to download or generate the preprocessed data before running this notebook.

[1]:

from collections import Counter

import logging
logging.basicConfig(level=logging.INFO)

import numpy as np
import pandas as pd

import plotnine as gg
import mizani
import mizani.transforms as scales

import anndata as ad
import scanpy as sc

import scipy.sparse

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import OneHotEncoder

import quipcell as qpc

WARNING:matplotlib.font_manager:Matplotlib is building the font cache; this may take a moment.
INFO:matplotlib.font_manager:Failed to extract font properties from /System/Library/Fonts/Supplemental/NISC18030.ttf: In FT2Font: Could not set the fontsize (invalid pixel size; error code 0x17)
INFO:matplotlib.font_manager:Failed to extract font properties from /System/Library/Fonts/LastResort.otf: tuple indices must be integers or slices, not str
INFO:matplotlib.font_manager:Failed to extract font properties from /System/Library/Fonts/Apple Color Emoji.ttc: In FT2Font: Could not set the fontsize (invalid pixel size; error code 0x17)
INFO:matplotlib.font_manager:generated new fontManager

Load single cell data¶

First, we load the (preprocessed) Human Lung Cell Atlas data.

[2]:

adata = ad.read_h5ad("data/hlca_hvgs.h5ad")

Next, we subset to 1/8th of the dataset to improve speed. We also exclude one of the studies as a validation test set to benchmark bulk deconvolution.

[3]:

validation_study = 'Krasnow_2020'

rng = np.random.default_rng(12345)
n = adata.obs.shape[0]

keep = adata.obs['study'] != validation_study
keep = keep & (rng.random(size=n) < .125)

adata_ref = adata[keep,:]
print(f"{adata_ref.shape[0]} cells in reference set")

65531 cells in reference set

Normalize and transform single cell data¶

Next, normalize the raw counts. The preprocessing script saved the total UMIs per cell before filtering to highly variable genes, so we divide it out, and normalize to counts per 1000 UMIs.

[4]:

# Normalize
normalize_sum = 1000
adata_ref.X = scipy.sparse.diags(normalize_sum / adata_ref.obs['n_umi'].values) @ adata_ref.X

Next, we run PCA, and then perform Linear Discriminant Analysis on the PCs and celltype labels. This will be the embedding space we use for quipcell.

[5]:

sc.pp.pca(adata_ref, n_comps=100)

# since the PCA is zero-centered, we need to save the gene-wise means
# to apply PCA rotation to pseudobulks
gene_means = adata_ref.X.mean(axis=0)
gene_means = np.squeeze(np.asarray(gene_means))
adata_ref.var['x_mean'] = gene_means

lda = LinearDiscriminantAnalysis(n_components=15)

X = adata_ref.obsm['X_pca']
y = adata_ref.obs['ann_finest_level']

lda.fit(X, y)

adata_ref.obsm['X_lda'] = lda.transform(adata_ref.obsm['X_pca'])

/Users/kammj2/miniforge3/envs/quipcell/lib/python3.12/site-packages/scanpy/preprocessing/_pca.py:314: ImplicitModificationWarning: Setting element `.obsm['X_pca']` of view, initializing view as actual.

Load and transform (pseudo)bulk data¶

Next, we read in the pseudobulk counts (generated in preprocessing.py), and subset to the samples in the validation/test set.

We then normalize the counts, and subset to the same genes as the reference single cell set.

Then, we apply a linear transformation to map the bulk samples to the embedding space.

[6]:

adata_pseudobulk = ad.read_h5ad("data/pseudobulks.h5ad")

# Subset to validation set
keep = adata_pseudobulk.obs['study'] == validation_study
adata_pseudobulk = adata_pseudobulk[keep,:]

# Normalize on the same scale as the single cell reference set
adata_pseudobulk.raw = adata_pseudobulk
sc.pp.normalize_total(adata_pseudobulk, target_sum=normalize_sum)

# Subset to same genes as the single cell reference set
adata_pseudobulk = adata_pseudobulk[:, adata.var.index]

# Apply PCA rotation to pseudobulks
X = adata_pseudobulk.X - adata_ref.var['x_mean'].values
X = np.asarray(X @ adata_ref.varm['PCs'])
adata_pseudobulk.obsm['X_pca'] = X
# Apply LDA rotation to pseudobulks
adata_pseudobulk.obsm['X_lda'] = lda.transform(X)

/var/folders/rw/hpynwn2500q6m7_3rdsw2zcc0000gp/T/ipykernel_23386/4126115140.py:17: ImplicitModificationWarning: Setting element `.obsm['X_pca']` of view, initializing view as actual.

Estimate weights for read-level abundances¶

Now that the single cell reference set and the bulk samples are in the same embedding space, we can estimate the weights with quipcell.

[7]:

w_umi = qpc.estimate_weights_multisample(adata_ref.obsm['X_lda'],
                                         adata_pseudobulk.obsm['X_lda'])

INFO:quipcell._solvers:Solving for sample i=0
INFO:quipcell._solvers:objective=2.208068696366285, optimal
INFO:quipcell._solvers:Solving for sample i=1
INFO:quipcell._solvers:objective=1.3962415069956384, optimal
INFO:quipcell._solvers:Solving for sample i=2
INFO:quipcell._solvers:objective=2.1952911740312, optimal
INFO:quipcell._solvers:Solving for sample i=3
INFO:quipcell._solvers:objective=1.071730756157023, optimal
INFO:quipcell._solvers:Solving for sample i=4
INFO:quipcell._solvers:objective=0.4401995928507547, optimal

Convert read-level abundance to cell-level abundance¶

Note that the gene counts of bulk samples don’t normalize for the fact that different cell types have different number of reads. Therefore, the weights represent a probability distribution of a random read drawn from the bulk sample, rather than a probability distribution over a random cell from the bulk sample.

We can convert the read-level weights to cell-level weights by estimating size factors for each of the reference cells, and normalizing the weights by those.

If the reference set doesn’t have too much technical variation, then we can just use the number of UMIs per cell as the normalizing factor. However, the HLCA is a highly heterogeneous dataset, consisting of many samples from different studies, sequenced to different depths, and containing different celltypes. Therefore, we need to control for sample when estimating the size factors. quipcell provides a method to estimate size factors while controlling for sample by using Poisson regression.

[8]:

# Estimate size factors using Poisson regression
size_factors = qpc.estimate_size_factors(
    adata_ref.obsm['X_lda'],
    adata_ref.obs['n_umi'].values,
    adata_ref.obs['sample'].values,
    #verbose=True
)

# Convert read-level weights to cell-level weights
w_cell = qpc.renormalize_weights(w_umi, size_factors)

/Users/kammj2/miniforge3/envs/quipcell/lib/python3.12/site-packages/sklearn/linear_model/_linear_loss.py:294: RuntimeWarning: invalid value encountered in matmul
/Users/kammj2/miniforge3/envs/quipcell/lib/python3.12/site-packages/sklearn/linear_model/_linear_loss.py:294: RuntimeWarning: invalid value encountered in matmul

Plot inferred weights¶

Below, we plot the read-level and cell-level weights on the reference UMAP for each of the 5 validation bulk samples.

[9]:

# Dataframe for plotting UMI-level weights on UMAP
df = pd.DataFrame(
    np.hstack([adata_ref.obsm['X_umap'], w_umi]),
    columns = ['UMAP1', 'UMAP2'] + list(adata_pseudobulk.obs.index)
).melt( # pivot to long format
    id_vars=['UMAP1', 'UMAP2'],
    var_name='sample',
    value_name='weight'
)

# set very small weights to 0 for prettier plotting
weight_trunc = df['weight'].values.copy()
weight_trunc[weight_trunc < 1e-9] = 0
df['weight_trunc'] = weight_trunc

(gg.ggplot(df, gg.aes(x="UMAP1", y="UMAP2", color="weight_trunc")) +
    gg.geom_point(size=.25, alpha=.5) +
    gg.facet_wrap("~sample") +
    gg.scale_color_cmap(trans=scales.log_trans(base=10)) +
    gg.ggtitle("Read-level weights"))

/Users/kammj2/miniforge3/envs/quipcell/lib/python3.12/site-packages/pandas/core/arraylike.py:399: RuntimeWarning: divide by zero encountered in log10
INFO:matplotlib.font_manager:Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt

[10]:

 # Dataframe for plotting cell-level weights on UMAP
df = pd.DataFrame(
    np.hstack([adata_ref.obsm['X_umap'], w_cell]),
    columns = ['UMAP1', 'UMAP2'] + list(adata_pseudobulk.obs.index)
).melt( # pivot to long format
    id_vars=['UMAP1', 'UMAP2'],
    var_name='sample',
    value_name='weight'
)

# set very small weights to 0 for prettier plotting
weight_trunc = df['weight'].values.copy()
weight_trunc[weight_trunc < 1e-9] = 0
df['weight_trunc'] = weight_trunc

(gg.ggplot(df, gg.aes(x="UMAP1", y="UMAP2", color="weight_trunc")) +
    gg.geom_point(size=.25, alpha=.5) +
    gg.facet_wrap("~sample") +
    gg.scale_color_cmap(trans=scales.log_trans(base=10)) +
    gg.ggtitle("Cell-level weights"))

/Users/kammj2/miniforge3/envs/quipcell/lib/python3.12/site-packages/pandas/core/arraylike.py:399: RuntimeWarning: divide by zero encountered in log10
INFO:matplotlib.font_manager:Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt

Plot true vs estimated abundances¶

Next, we compare the estimated weights to the true celltype proportions. In the preprocessing script, we precomputed the fraction of reads and cells coming from each celltype, at different levels of celltype annotation. We read in these precomputed abundances below.

[11]:

df_abundance = pd.read_csv('data/abundances.csv')
# filter to samples in the validation set
df_abundance = df_abundance[df_abundance['sample'].isin(adata_pseudobulk.obs.index)]
# convert sample name to string to prevent missing categorical levels
df_abundance['sample'] = df_abundance['sample'].astype(str)
df_abundance

[11]:

	sample	ann_level	celltype	n_umi	frac_umi	n_cell	frac_cell
644	distal 1a	lvl1	Endothelial	4461681	0.060048	1495	0.198724
645	distal 1a	lvl1	Epithelial	8394719	0.112981	853	0.113386
646	distal 1a	lvl1	Immune	59114660	0.795600	4719	0.627276
647	distal 1a	lvl1	Stroma	2330951	0.031371	456	0.060614
648	distal 2	lvl1	Endothelial	30078564	0.205619	8449	0.455693
...	...	...	...	...	...	...	...
25559	proximal 3	lvl5	Subpleural fibroblasts	0	0.000000	0	0.000000
25560	proximal 3	lvl5	Suprabasal	10395727	0.130773	482	0.062041
25561	proximal 3	lvl5	T cells proliferating	34048	0.000428	1	0.000129
25562	proximal 3	lvl5	Tuft	0	0.000000	0	0.000000
25563	proximal 3	lvl5	pre-TB secretory	1892948	0.023812	144	0.018535

770 rows × 7 columns

Next, we compute estimated celltype abundances, by summing over the weights fit by quipcell. We add the estimated number of reads and cells coming from each celltype to the dataframe of abundances.

[12]:

est_frac_umi = []
est_frac_cells = []

# TODO use a more descriptive variable name than i
for i in range(1, 6):
    enc = OneHotEncoder()
    mat_onehot = enc.fit_transform(
        adata_ref.obs[f'celltype_lvl{i}'].values.reshape(-1,1)
    )

    # Function to sum weights by celltype, then pivot to long dataframe
    def aggregate_and_reshape(w):
        return pd.DataFrame(
            w.T @ mat_onehot,
            columns = enc.categories_[0],
            index = adata_pseudobulk.obs.index
        ).reset_index(names='sample').melt(
            id_vars=['sample'], var_name='celltype', value_name='weight'
        ).assign(ann_level=f'lvl{i}')

    est_frac_umi.append(aggregate_and_reshape(w_umi))
    est_frac_cells.append(aggregate_and_reshape(w_cell))


# Function to combine the summed weights across annotation levels,
# returning a vector to be added to df_abundance
def concat_aggregated_weights(agg_weight_list):
    idx_keys = ['ann_level', 'sample', 'celltype']
    return (pd.concat(agg_weight_list)
            .set_index(idx_keys)
            .loc[zip(*[df_abundance[k] for k in idx_keys])]
            .values)

df_abundance['est_frac_umi'] = concat_aggregated_weights(est_frac_umi)
df_abundance['est_frac_cell'] = concat_aggregated_weights(est_frac_cells)

df_abundance

[12]:

	sample	ann_level	celltype	n_umi	frac_umi	n_cell	frac_cell	est_frac_umi	est_frac_cell
644	distal 1a	lvl1	Endothelial	4461681	0.060048	1495	0.198724	0.055587	0.184202
645	distal 1a	lvl1	Epithelial	8394719	0.112981	853	0.113386	0.085862	0.085786
646	distal 1a	lvl1	Immune	59114660	0.795600	4719	0.627276	0.843633	0.710039
647	distal 1a	lvl1	Stroma	2330951	0.031371	456	0.060614	0.014919	0.019973
648	distal 2	lvl1	Endothelial	30078564	0.205619	8449	0.455693	0.177143	0.379683
...	...	...	...	...	...	...	...	...	...
25559	proximal 3	lvl5	Subpleural fibroblasts	0	0.000000	0	0.000000	0.000464	0.000488
25560	proximal 3	lvl5	Suprabasal	10395727	0.130773	482	0.062041	0.036868	0.029626
25561	proximal 3	lvl5	T cells proliferating	34048	0.000428	1	0.000129	0.000636	0.001184
25562	proximal 3	lvl5	Tuft	0	0.000000	0	0.000000	0.000277	0.000255
25563	proximal 3	lvl5	pre-TB secretory	1892948	0.023812	144	0.018535	0.007603	0.006743

770 rows × 9 columns

Finally, we plot the estimated vs actual abundances, at both the read and cell level.

At the finest resolution (annotation level 5), there is some inaccuracy because a common celltype in the validation set (CCL3+ Alveolar macrophages) is very rare in the reference samples, and cannot be distinguished from other Alveolar macrophages in the embedding space. See the manuscript for more details.

[13]:

 (gg.ggplot(df_abundance.sample(frac=1, random_state=42),
           gg.aes(x="frac_umi", y="est_frac_umi", color="ann_level", shape="sample")) +
    gg.geom_point() +
    gg.geom_abline(linetype='dashed') +
    gg.scale_x_sqrt(breaks=[.01,.1,.2,.4,.6,.8]) +
    gg.scale_y_sqrt(breaks=[.01,.1,.2,.4,.6,.8]) +
    gg.theme_bw(base_size=16) +
    gg.ggtitle("Read-level accuracy"))

INFO:matplotlib.font_manager:Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt

[14]:

 (gg.ggplot(df_abundance.sample(frac=1, random_state=42),
           gg.aes(x="frac_cell", y="est_frac_cell", color="ann_level", shape="sample")) +
    gg.geom_point() +
    gg.geom_abline(linetype='dashed') +
    gg.scale_x_sqrt(breaks=[.01,.1,.2,.4,.6,.8]) +
    gg.scale_y_sqrt(breaks=[.01,.1,.2,.4,.6,.8]) +
    gg.theme_bw(base_size=16) +
    gg.ggtitle("Cell-level accuracy"))

INFO:matplotlib.font_manager:Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt

[ ]: