Tutorial 1: Querying Attributions#
SIGnature enables querying gene set activation across multiple cell types.
This tutorial goes through the steps for generating cell-level gene set scores for any gene signature of interest using our pre-defined cell type atlases.
Before running this tutorial, you need to:
Download the SCimilarity model from https://zenodo.org/records/15729925
Download the cell type-level attributions you are interested in from https://zenodo.org/communities/signature/. For this tutorial, you will need two attribution folders from Zenodo (https://zenodo.org/records/15794176):
celltype_attributions_lung
: contains data for the first part of the tutorial showing a simple querying example. This folder can be extracted with the commandtar -xzf celltype_attributions_lung.tar.gz
celltype_attributions
: contains data from myeloid cells to replicate the MS1 signature analysis from the paper. This folder can be extracted with the commandtar -xzf celltype_attributions.tar.gz
[1]:
import matplotlib.pyplot as plt
import numpy as np
import os
from os.path import join
import pandas as pd
import seaborn as sns
from SIGnature import SIGnature, SCimilarity, Meta
import tiledb
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
Lung Example#
Compare type I pneumocyte marker genes in that cell type and compare to ionocytes.
Load SCimilarity Model and Attributions TileDB#
[2]:
## Path to Scimilarity Model Downloaded from Zenodo (https://zenodo.org/records/15729925)
scim_model_path = "/home/scimilarity_model"
## Set use_gpu false when querying attributions
scim = SCimilarity(model_path=scim_model_path, use_gpu=False)
[3]:
## Folder where the cell type attributions are stored
## Downloaded from Zenodo (https://zenodo.org/records/15794176)
ct_base = "/home/celltype_attributions_lung"
all_ct = os.listdir(ct_base)
Check Genes#
[4]:
## Check genes (AT1 associated genes from CellMarker)
goi = ['CAV1', 'CAV2', 'CLIC5']
sig = SIGnature(gene_order=scim.gene_order, model=scim)
gene_list = sig.check_genes(goi)
Calculate Mean Attribution Scores#
[5]:
ctoi = ["type I pneumocyte", "pulmonary ionocyte"]
score_name = "AT1"
[6]:
meta_list = []
for ct in tqdm(ctoi):
## Get specific tiledb path for location of interest
ct_path = join(ct_base, ct)
## Load attributions tiledb path for given cell type
att_tdb_path = join(ct_path, "attributions")
## Calculate
results = sig.query_attributions(
gene_list,
return_aggregate=True,
aggregate_type="mean",
attribution_tiledb_uri=att_tdb_path,
)
## Example including weights
# np.random.seed(114)
# weight_list = np.random.choice([0.5, 1, 2], len(gene_list)).tolist()
# results = sig.query_attributions(gene_list, return_aggregate=True,
# aggregate_type='mean', weights = weight_list,
# attribution_tiledb_uri=att_tdb_path)
## Example of getting individual genes
# results = sig.query_attributions(gene_list, return_aggregate=False)
## Load corresponding cell metadata
ct_path = join(ct_base, ct)
meta = pd.read_csv(join(ct_path, "cell_metadata.csv.gz"), index_col=0)
## Put score
meta[score_name] = results
meta_list.append(meta.copy())
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8.88it/s]
Analyze scores#
Compare mean attribution scores for AT1 epithelial genes between AT1 cells and pulmonary ionocytes.
[7]:
big_meta = pd.concat(meta_list)
[8]:
fig, ax = plt.subplots(figsize=(4, 4))
sns.violinplot(
data=big_meta,
x="prediction",
y="AT1",
palette=["tab:purple", "tab:green"],
density_norm='width'
)
ax.set_ylabel("Mean Attribution Score")
ax.set_xlabel("Cell Type")
[8]:
Text(0.5, 0, 'Cell Type')

MS1 Example#
Replicate analysis in paper by calculating mean attributions for MS1 gene set and comparing across disease.
Load SCimilarity Model and Attributions TileDB#
[9]:
## Path to Scimilarity Model Downloaded from Zenodo (https://zenodo.org/records/15794176)
scim_model_path = "/home/scimilarity_model"
## Set use_gpu false when querying attributions
scim = SCimilarity(model_path=scim_model_path, use_gpu=False)
[10]:
## This should be the folder where the attributions are stored
## Path to cell type attributions downloaded from Zenodo (https://zenodo.org/records/15794176)
ct_base = "/home/celltype_attributions"
all_ct = os.listdir(ct_base)
[11]:
# Analyze all monocyte/macrophage cell populations in database
ctoi = [
"alveolar macrophage",
"classical monocyte",
"intermediate monocyte",
"macrophage",
"non-classical monocyte",
]
[12]:
ms1_genes = ['S100A8', 'S100A12', 'RETN', 'CLU', 'MCEMP1', 'IL1R2', 'CYP1B1', 'SELL',
'ALOX5AP', 'SLC39A8', 'PLAC8', 'ACSL1', 'CD163', 'VCAN', 'HP', 'CTSD',
'LGALS1', 'THBS1', 'CES1', 'S100P', 'ANXA6', 'VNN2', 'NAMPT', 'HAMP',
'DYSF', 'SDF2L1', 'NFE2', 'SLC2A3', 'BASP1', 'ADGRG3', 'SOD2', 'CTSA',
'PADI4', 'CALR', 'SOCS3', 'NKG7', 'FLOT1', 'IL1RN', 'ZDHHC19', 'LILRA5',
'ASGR2', 'FAM65B', 'MNDA', 'STEAP4', 'NCF4', 'LBR', 'RP11-295G20.2',
'UBR4', 'PADI2', 'NCF1', 'LINC00482', 'RUNX1', 'RRP12', 'HSPA1A',
'FLOT2', 'ANPEP', 'CXCR1', 'ECE1', 'ADAM19', 'RP11-196G18.3', 'IL4R',
'DNAJB11', 'FES', 'MBOAT7', 'SNHG25', 'RP1-55C23.7', 'CPEB4', 'PRR34-AS1',
'HSPA1B', 'LINC01001', 'C1QC', 'SBNO2', 'GTSE1', 'FOLR3', 'STAB1', 'PLK1',
'HYI-AS1', 'LINC01281', 'TNNT1', 'AC097495.2', 'CTB-35F21.5', 'C19orf35',
'AC109826.1', 'RP11-800A3.7', 'LILRA6', 'PDLIM7', 'NPLOC4', 'C15orf48',
'APOBR', 'CSF2RB', 'CTD-2105E13.14', 'C1QB', 'RP11-123K3.9', 'IQGAP3',
'GAPLINC', 'CTC-490G23.2', 'JAK3', 'CTC-246B18.10', 'MYO5B']
Check genes#
Subset the genes of interest to ones that are included in the SCimilarity model.
[13]:
sig = SIGnature(gene_order=scim.gene_order, model=scim)
gene_list = sig.check_genes(ms1_genes)
The following genes are not included: C19orf35,AC097495.2,CTD-2105E13.14,CTC-246B18.10,FAM65B,CTC-490G23.2,HYI-AS1,RP11-295G20.2,RP11-800A3.7,RP1-55C23.7,RP11-123K3.9,CTB-35F21.5,RP11-196G18.3
[14]:
len(gene_list)
[14]:
86
Calculate Mean Attribution Scores for each Cell Type#
Iterate through tiledb of each cell type, calculate score, concatenate the metadata.
[15]:
score_name = "MS1"
[16]:
meta_list = []
for ct in tqdm(ctoi):
## Get specific tiledb path for location of interest
ct_path = join(ct_base, ct)
## Load attributions tiledb path for given cell type
att_tdb_path = join(ct_path, "attributions")
## Calculate
results = sig.query_attributions(
gene_list,
return_aggregate=True,
aggregate_type="mean",
attribution_tiledb_uri=att_tdb_path,
)
## Example including weights
# np.random.seed(114)
# weight_list = np.random.choice([0.5, 1, 2], len(gene_list)).tolist()
# results = sig.query_attributions(gene_list, return_aggregate=True,
# aggregate_type='mean', weights = weight_list,
# attribution_tiledb_uri=att_tdb_path)
## Example of getting individual genes
# results = sig.query_attributions(gene_list, return_aggregate=False,
# attribution_tiledb_uri=att_tdb_path)
## Load corresponding cell metadata
ct_path = join(ct_base, ct)
meta = pd.read_csv(join(ct_path, "cell_metadata.csv.gz"), index_col=0)
## Put score
meta[score_name] = results
meta_list.append(meta.copy())
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:26<00:00, 5.32s/it]
[17]:
big_meta = pd.concat(meta_list)
[18]:
big_meta.head(5)
[18]:
study | sample | prediction | fm_signature_score | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | prediction_nn_dist | data_type | ... | ESC-derived | xenograft | tissue | cell_line_cleaned | disease | FACS_sort | sex_cleaned | author_label | celltype_index | MS1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
298997 | GSE160794 | GSM4879379 | alveolar macrophage | 0.346716 | 3197 | 21991.0 | 959.0 | 4.360875 | 0.007285 | unlabeled | ... | False | False | lung | NaN | NaN | NaN | NaN | NaN | 0 | 0.460732 |
298998 | GSE160794 | GSM4879379 | alveolar macrophage | -0.290656 | 1167 | 3657.0 | 182.0 | 4.976757 | 0.011017 | unlabeled | ... | False | False | lung | NaN | NaN | NaN | NaN | NaN | 1 | 0.703938 |
298999 | GSE160794 | GSM4879379 | alveolar macrophage | 0.695440 | 3515 | 21938.0 | 598.0 | 2.725864 | 0.021277 | unlabeled | ... | False | False | lung | NaN | NaN | NaN | NaN | NaN | 2 | 0.433652 |
299001 | GSE160794 | GSM4879379 | alveolar macrophage | -0.030102 | 2906 | 14594.0 | 729.0 | 4.995203 | 0.011253 | unlabeled | ... | False | False | lung | NaN | NaN | NaN | NaN | NaN | 3 | 0.485425 |
299002 | GSE160794 | GSM4879379 | alveolar macrophage | 0.146310 | 3367 | 17256.0 | 536.0 | 3.106166 | 0.012170 | unlabeled | ... | False | False | lung | NaN | NaN | NaN | NaN | NaN | 4 | 0.561418 |
5 rows × 26 columns
Analyze Scores#
Load metadata into “Meta” object to perform gene signature analysis.
Load and Filter Metadata#
[19]:
## Create a Meta object to manipulate the dataframe and perform standardized actions
momac_meta = Meta(big_meta)
## Subset to high quality cells (SCimilarity prediction dist <0.02)
momac_meta.subset_hq()
## Subset to only in vivo cells
momac_meta.subset_invivo()
[20]:
ncell = momac_meta.df.shape[0]
nsamp = momac_meta.df["sample"].nunique()
nstudy = momac_meta.df["study"].nunique()
print(
f"Analyzing {ncell} monocytes and macrophages from {nsamp} samples in {nstudy} studies"
)
Analyzing 2338447 monocytes and macrophages from 3443 samples in 244 studies
[21]:
### Clarify some disease names for downstream figure
figure_disname_dict = {
"atopic eczema": "Atopic Dermatitis",
"COVID-19;healthy": "COVID-19 & Healthy (Mixed)",
"Epstein-Barr virus infection;hemophagocytic syndrome": "EBV-Associated HLH",
"Immune dysregulation-polyendocrinopathy-enteropathy-X-linked syndrome": "IPEX",
"mucocutaneous lymph node syndrome": "Kawasaki Disease",
"systemic scleroderma;interstitial lung disease": "SSC-ILD",
"thrombocytopenia": "Severe Fever Thrombocytopenia Syndrome",
}
momac_meta.df["disease"] = [
figure_disname_dict.get(x, x) for x in momac_meta.df["disease"]
]
Define “Hits” Based on Cutoff#
Show multiple options for how to define hits and add to metadata (e.g., 90th percentile, 0.95 quantile, 2 standard deviations above the mean).
[22]:
hit_col = "MS1"
## 90th percentile
momac_meta.add_hits(
column_name=hit_col,
mode="percentile",
cut_val=90,
hit_type="above",
string_append="__hit90p",
)
## Above 0.95 quantile
# momac_meta.add_hits(
# column_name=hit_col,
# mode="quantile",
# cut_val=0.95,
# hit_type="above",
# string_append="__hit95p",
# )
## Above set value (2 stdev above mean)
# mean = np.mean(momac_meta[hit_col].values)
# std = np.std(momac_meta[hit_col].values)
# hit_val = mean + (2 * std)
# momac_meta.add_hits(
# column_name=hit_col,
# mode="value",
# cut_val=hit_val,
# hit_type="above",
# string_append="__2zscore",
# )
Calculate hit percentage per sample#
For all the monocytes/macrophages in each sample, calculate the percentage that are hits. Only consider samples with at least 25 monocytes/macrophages and diseases that have at least 3 samples. Then show the top 10 diseases by mean hit percentage.
[23]:
samphit_df_90 = momac_meta.samphit_df(
cell_min=25, samp_min=3, hit_col="MS1__hit90p", num_dis=10
)
Generate sample boxplot (like figure 4a)#
[24]:
title = "MS1-like Cells in SCimilarity Database"
[25]:
Meta.samphit_boxplot(
samphit_df_90,
hit_label="MS1__hit90p",
swarm=True,
figsize=(11.5, 8.5),
fe=1.7,
title=title,
dotsize=9,
)

[ ]: