Tutorial 1: Querying Attributions#

SIGnature enables querying gene set activation across multiple cell types.

This tutorial goes through the steps for generating cell-level gene set scores for any gene signature of interest using our pre-defined cell type atlases.

Before running this tutorial, you need to:

  1. Download the SCimilarity model from https://zenodo.org/records/15729925

  2. Download the cell type-level attributions you are interested in from https://zenodo.org/communities/signature/. For this tutorial, you will need two attribution folders from Zenodo (https://zenodo.org/records/15794176):

    1. celltype_attributions_lung: contains data for the first part of the tutorial showing a simple querying example. This folder can be extracted with the command tar -xzf celltype_attributions_lung.tar.gz

    2. celltype_attributions: contains data from myeloid cells to replicate the MS1 signature analysis from the paper. This folder can be extracted with the command tar -xzf celltype_attributions.tar.gz

[1]:
import matplotlib.pyplot as plt
import numpy as np
import os
from os.path import join
import pandas as pd
import seaborn as sns
from SIGnature import SIGnature, SCimilarity, Meta
import tiledb
import time
from tqdm import tqdm
import warnings

warnings.filterwarnings("ignore")

Lung Example#

Compare type I pneumocyte marker genes in that cell type and compare to ionocytes.

Load SCimilarity Model and Attributions TileDB#

[2]:
## Path to Scimilarity Model Downloaded from Zenodo (https://zenodo.org/records/15729925)
scim_model_path = "/home/scimilarity_model"
## Set use_gpu false when querying attributions
scim = SCimilarity(model_path=scim_model_path, use_gpu=False)
[3]:
## Folder where the cell type attributions are stored
## Downloaded from Zenodo (https://zenodo.org/records/15794176)
ct_base = "/home/celltype_attributions_lung"
all_ct = os.listdir(ct_base)

Check Genes#

[4]:
## Check genes (AT1 associated genes from CellMarker)
goi = ['CAV1', 'CAV2', 'CLIC5']
sig = SIGnature(gene_order=scim.gene_order, model=scim)
gene_list = sig.check_genes(goi)

Calculate Mean Attribution Scores#

[5]:
ctoi = ["type I pneumocyte", "pulmonary ionocyte"]
score_name = "AT1"
[6]:
meta_list = []
for ct in tqdm(ctoi):
    ## Get specific tiledb path for location of interest
    ct_path = join(ct_base, ct)
    ## Load attributions tiledb path for given cell type
    att_tdb_path = join(ct_path, "attributions")
    ## Calculate
    results = sig.query_attributions(
        gene_list,
        return_aggregate=True,
        aggregate_type="mean",
        attribution_tiledb_uri=att_tdb_path,
    )

    ## Example including weights
    # np.random.seed(114)
    # weight_list = np.random.choice([0.5, 1, 2], len(gene_list)).tolist()
    # results = sig.query_attributions(gene_list, return_aggregate=True,
    #               aggregate_type='mean', weights = weight_list,
    #               attribution_tiledb_uri=att_tdb_path)

    ## Example of getting individual genes
    # results = sig.query_attributions(gene_list, return_aggregate=False)

    ## Load corresponding cell metadata
    ct_path = join(ct_base, ct)
    meta = pd.read_csv(join(ct_path, "cell_metadata.csv.gz"), index_col=0)

    ## Put score
    meta[score_name] = results
    meta_list.append(meta.copy())
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.88it/s]

Analyze scores#

Compare mean attribution scores for AT1 epithelial genes between AT1 cells and pulmonary ionocytes.

[7]:
big_meta = pd.concat(meta_list)
[8]:
fig, ax = plt.subplots(figsize=(4, 4))
sns.violinplot(
    data=big_meta,
    x="prediction",
    y="AT1",
    palette=["tab:purple", "tab:green"],
    density_norm='width'
)
ax.set_ylabel("Mean Attribution Score")
ax.set_xlabel("Cell Type")
[8]:
Text(0.5, 0, 'Cell Type')
../_images/notebooks_1-QueryingAttributions_13_1.png

MS1 Example#

Replicate analysis in paper by calculating mean attributions for MS1 gene set and comparing across disease.

Load SCimilarity Model and Attributions TileDB#

[9]:
## Path to Scimilarity Model Downloaded from Zenodo (https://zenodo.org/records/15794176)
scim_model_path = "/home/scimilarity_model"
## Set use_gpu false when querying attributions
scim = SCimilarity(model_path=scim_model_path, use_gpu=False)
[10]:
## This should be the folder where the attributions are stored
## Path to cell type attributions downloaded from Zenodo (https://zenodo.org/records/15794176)
ct_base = "/home/celltype_attributions"
all_ct = os.listdir(ct_base)
[11]:
# Analyze all monocyte/macrophage cell populations in database
ctoi = [
    "alveolar macrophage",
    "classical monocyte",
    "intermediate monocyte",
    "macrophage",
    "non-classical monocyte",
]
[12]:
ms1_genes = ['S100A8', 'S100A12', 'RETN', 'CLU', 'MCEMP1', 'IL1R2', 'CYP1B1', 'SELL',
             'ALOX5AP', 'SLC39A8', 'PLAC8', 'ACSL1', 'CD163', 'VCAN', 'HP', 'CTSD',
             'LGALS1', 'THBS1', 'CES1', 'S100P', 'ANXA6', 'VNN2', 'NAMPT', 'HAMP',
             'DYSF', 'SDF2L1', 'NFE2', 'SLC2A3', 'BASP1', 'ADGRG3', 'SOD2', 'CTSA',
             'PADI4', 'CALR', 'SOCS3', 'NKG7', 'FLOT1', 'IL1RN', 'ZDHHC19', 'LILRA5',
             'ASGR2', 'FAM65B', 'MNDA', 'STEAP4', 'NCF4', 'LBR', 'RP11-295G20.2',
             'UBR4', 'PADI2', 'NCF1', 'LINC00482', 'RUNX1', 'RRP12', 'HSPA1A',
             'FLOT2', 'ANPEP', 'CXCR1', 'ECE1', 'ADAM19', 'RP11-196G18.3', 'IL4R',
             'DNAJB11', 'FES', 'MBOAT7', 'SNHG25', 'RP1-55C23.7', 'CPEB4', 'PRR34-AS1',
             'HSPA1B', 'LINC01001', 'C1QC', 'SBNO2', 'GTSE1', 'FOLR3', 'STAB1', 'PLK1',
             'HYI-AS1', 'LINC01281', 'TNNT1', 'AC097495.2', 'CTB-35F21.5', 'C19orf35',
             'AC109826.1', 'RP11-800A3.7', 'LILRA6', 'PDLIM7', 'NPLOC4', 'C15orf48',
             'APOBR', 'CSF2RB', 'CTD-2105E13.14', 'C1QB', 'RP11-123K3.9', 'IQGAP3',
             'GAPLINC', 'CTC-490G23.2', 'JAK3', 'CTC-246B18.10', 'MYO5B']

Check genes#

Subset the genes of interest to ones that are included in the SCimilarity model.

[13]:
sig = SIGnature(gene_order=scim.gene_order, model=scim)
gene_list = sig.check_genes(ms1_genes)
The following genes are not included: C19orf35,AC097495.2,CTD-2105E13.14,CTC-246B18.10,FAM65B,CTC-490G23.2,HYI-AS1,RP11-295G20.2,RP11-800A3.7,RP1-55C23.7,RP11-123K3.9,CTB-35F21.5,RP11-196G18.3
[14]:
len(gene_list)
[14]:
86

Calculate Mean Attribution Scores for each Cell Type#

Iterate through tiledb of each cell type, calculate score, concatenate the metadata.

[15]:
score_name = "MS1"
[16]:
meta_list = []
for ct in tqdm(ctoi):
    ## Get specific tiledb path for location of interest
    ct_path = join(ct_base, ct)
    ## Load attributions tiledb path for given cell type
    att_tdb_path = join(ct_path, "attributions")
    ## Calculate
    results = sig.query_attributions(
        gene_list,
        return_aggregate=True,
        aggregate_type="mean",
        attribution_tiledb_uri=att_tdb_path,
    )

    ## Example including weights
    # np.random.seed(114)
    # weight_list = np.random.choice([0.5, 1, 2], len(gene_list)).tolist()
    # results = sig.query_attributions(gene_list, return_aggregate=True,
    #               aggregate_type='mean', weights = weight_list,
    #               attribution_tiledb_uri=att_tdb_path)

    ## Example of getting individual genes
    # results = sig.query_attributions(gene_list, return_aggregate=False,
    #               attribution_tiledb_uri=att_tdb_path)

    ## Load corresponding cell metadata
    ct_path = join(ct_base, ct)
    meta = pd.read_csv(join(ct_path, "cell_metadata.csv.gz"), index_col=0)

    ## Put score
    meta[score_name] = results
    meta_list.append(meta.copy())
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:26<00:00,  5.32s/it]
[17]:
big_meta = pd.concat(meta_list)
[18]:
big_meta.head(5)
[18]:
study sample prediction fm_signature_score n_genes_by_counts total_counts total_counts_mt pct_counts_mt prediction_nn_dist data_type ... ESC-derived xenograft tissue cell_line_cleaned disease FACS_sort sex_cleaned author_label celltype_index MS1
298997 GSE160794 GSM4879379 alveolar macrophage 0.346716 3197 21991.0 959.0 4.360875 0.007285 unlabeled ... False False lung NaN NaN NaN NaN NaN 0 0.460732
298998 GSE160794 GSM4879379 alveolar macrophage -0.290656 1167 3657.0 182.0 4.976757 0.011017 unlabeled ... False False lung NaN NaN NaN NaN NaN 1 0.703938
298999 GSE160794 GSM4879379 alveolar macrophage 0.695440 3515 21938.0 598.0 2.725864 0.021277 unlabeled ... False False lung NaN NaN NaN NaN NaN 2 0.433652
299001 GSE160794 GSM4879379 alveolar macrophage -0.030102 2906 14594.0 729.0 4.995203 0.011253 unlabeled ... False False lung NaN NaN NaN NaN NaN 3 0.485425
299002 GSE160794 GSM4879379 alveolar macrophage 0.146310 3367 17256.0 536.0 3.106166 0.012170 unlabeled ... False False lung NaN NaN NaN NaN NaN 4 0.561418

5 rows × 26 columns

Analyze Scores#

Load metadata into “Meta” object to perform gene signature analysis.

Load and Filter Metadata#

[19]:
## Create a Meta object to manipulate the dataframe and perform standardized actions
momac_meta = Meta(big_meta)
## Subset to high quality cells (SCimilarity prediction dist <0.02)
momac_meta.subset_hq()
## Subset to only in vivo cells
momac_meta.subset_invivo()
[20]:
ncell = momac_meta.df.shape[0]
nsamp = momac_meta.df["sample"].nunique()
nstudy = momac_meta.df["study"].nunique()
print(
    f"Analyzing {ncell} monocytes and macrophages from {nsamp} samples in {nstudy} studies"
)
Analyzing 2338447 monocytes and macrophages from 3443 samples in 244 studies
[21]:
### Clarify some disease names for downstream figure
figure_disname_dict = {
    "atopic eczema": "Atopic Dermatitis",
    "COVID-19;healthy": "COVID-19 & Healthy (Mixed)",
    "Epstein-Barr virus infection;hemophagocytic syndrome": "EBV-Associated HLH",
    "Immune dysregulation-polyendocrinopathy-enteropathy-X-linked syndrome": "IPEX",
    "mucocutaneous lymph node syndrome": "Kawasaki Disease",
    "systemic scleroderma;interstitial lung disease": "SSC-ILD",
    "thrombocytopenia": "Severe Fever Thrombocytopenia Syndrome",
}

momac_meta.df["disease"] = [
    figure_disname_dict.get(x, x) for x in momac_meta.df["disease"]
]

Define “Hits” Based on Cutoff#

Show multiple options for how to define hits and add to metadata (e.g., 90th percentile, 0.95 quantile, 2 standard deviations above the mean).

[22]:
hit_col = "MS1"
## 90th percentile
momac_meta.add_hits(
    column_name=hit_col,
    mode="percentile",
    cut_val=90,
    hit_type="above",
    string_append="__hit90p",
)

## Above 0.95 quantile
# momac_meta.add_hits(
#     column_name=hit_col,
#     mode="quantile",
#     cut_val=0.95,
#     hit_type="above",
#     string_append="__hit95p",
# )

## Above set value (2 stdev above mean)
# mean = np.mean(momac_meta[hit_col].values)
# std = np.std(momac_meta[hit_col].values)
# hit_val = mean + (2 * std)
# momac_meta.add_hits(
#     column_name=hit_col,
#     mode="value",
#     cut_val=hit_val,
#     hit_type="above",
#     string_append="__2zscore",
# )

Calculate hit percentage per sample#

For all the monocytes/macrophages in each sample, calculate the percentage that are hits. Only consider samples with at least 25 monocytes/macrophages and diseases that have at least 3 samples. Then show the top 10 diseases by mean hit percentage.

[23]:
samphit_df_90 = momac_meta.samphit_df(
    cell_min=25, samp_min=3, hit_col="MS1__hit90p", num_dis=10
)

Generate sample boxplot (like figure 4a)#

[24]:
title = "MS1-like Cells in SCimilarity Database"
[25]:
Meta.samphit_boxplot(
    samphit_df_90,
    hit_label="MS1__hit90p",
    swarm=True,
    figsize=(11.5, 8.5),
    fe=1.7,
    title=title,
    dotsize=9,
)
../_images/notebooks_1-QueryingAttributions_39_0.png
[ ]: