1 Introduction

Isosceles (Isoform Single-Cell and Long-read Expression Suite) is an R package dedicated to transcript detection and quantification from long read RNA-seq data, supporting both bulk and single-cell technologies.

Isosceles can be installed using the following commands:

install.packages(c("BiocManager", "devtools"))
BiocManager::install(c("scran", "scater", "uwot", "dittoSeq", "DEXSeq", 
                       "Nebulosa", "ggbio", "BiocStyle"))
devtools::install_github("Genentech/Isosceles", dependencies = TRUE,
                         upgrade = TRUE, INSTALL_opts = "--install-tests")

We found that some versions of Isosceles’ dependencies don’t work together well, which might cause problems with testing the package or building the vignettes. If you encounter such issues, re-installing certain packages might be helpful:

install.packages("irlba") 
devtools::install_github("powellgenomicslab/Nebulosa", upgrade = FALSE) 

Load the Isosceles package:

library(Isosceles)

2 Input data

Isosceles requires the following input files:

  • BAM file(s) containing aligned long reads
    • We recommend minimap2 for all long-read alignments.
    • As Isosceles doesn’t perform post-alignment splice junction correction, it is necessary to run minimap2 with the ‘--junc-bed’ flag. The intron position BED file required by it can be easily created using the gtf_to_intron_bed function.
    • Isosceles requires the BAM file for single-cell data to contain a tag storing the cell barcode - you can find more information on the topic in the Preparing single-cell data section.
  • FASTA file containing the genomic sequences
  • GTF file containing the genomic annotations
    • Isosceles requires the chromosome names to match between the FASTA file and the GTF file. No automatic conversion between alternative chromosome naming conventions (e.g. ‘1’ and ‘chr1’) is performed.

3 Preparing single-cell data

When analyzing scRNA-Seq data, Isosceles requires its input BAM files to contain a tag storing the cell barcode (BC tag by default, but it can be changed using the barcode_tag argument of the bam_to_tcc function). Several pipelines are suitable for this purpose, and the most popular options are described below. Depending on the pipeline, some additional steps (such as UMI deduplication) might need to be undertaken in order to make the output BAM file compatible with Isosceles.

3.1 Sicelore

Sicelore is suite of tools designed to analyze long read single-cell sequencing data, utilizing Illumina short reads to guide cell barcode and UMI assignment. A tagged molecule BAM file created by following steps 1-8 of the workflow described on Sicelore’s webpage can be used directly by Isosceles. Cell barcodes are stored in the BC tag, which is the default setting of the bam_to_tcc function.

3.2 Sicelore v2.1

Sicelore v2.1 is the next release of the Sicelore suite that doesn’t rely on Illumina short read data for cell barcode and UMI assignment. A BAM file suitable for Isosceles analysis can be created by following steps 1-4b of the workflow described on the project’s webpage. As with the original version of Sicelore, cell barcodes are stored in the BC tag.

3.3 wf-single-cell

wf-single-cell is a Nextflow-based pipeline for cell barcode and UMI identification in Oxford Nanopore long read data. Detailed instructions on running the workflow can be found on its webpage - we recommend running it with the ‘--merge_bam True’ flag in order to obtain one, merged BAM file for the processed data. As the output BAM file is not deduplicated by UMI, some additional steps need to be taken before it can be used with Isosceles.

First, aligned reads tagged with UMI sequences that are shorter than expected (and therefore interfering with UMI deduplication) need to be filtered out. This can be achieved with samtools (‘$alias’ should to be replaced with a prefix appropriate for the given analysis):

samtools view -b \
  --expr 'length([UB]) == 12' \
  $alias/bams/$alias.tagged.sorted.bam \
  > filtered.bam
samtools index filtered.bam

In this case we used UMI length of 12 bp (appropriate for 3’ 10xv3 data), but that value might vary depending on the used protocol.

Read deduplication by UMI and mapping coordinates can be performed using UMI-tools:

umi_tools dedup \
  --extract-umi-method=tag \
  --method=directional \
  --per-gene \
  --per-cell \
  --umi-tag=UB \
  --cell-tag=CB \
  --gene-tag=GN \
  --stdin filtered.bam \
  --stdout dedup.bam
samtools index dedup.bam

The deduplicated BAM file can now be used by Isosceles. As cell barcodes are stored in the CB tag in this case, it needs to be specified using the barcode_tag argument of the bam_to_tcc function.

4 Run modes

Isosceles can used in of the following run modes, specified when the bam_to_tcc function is run:

  • strict - only annotated reference transcripts will be quantified
  • de_novo_strict - de novo transcripts enabled, but all splice sites must be known (i.e. found in reference annotations or provided by the user).
  • de_novo_loose - de novo transcripts enabled, but all splice sites must be known or reproducibly passing filters in the aligned reads

The default settings of de novo transcript detection used by the bam_to_tcc function should work well for a eukaryotic transcriptome, but for the analysis of spike-in data, such as SIRVs, we recommend increasing the read count threshold (the min_read_count argument) to a higher value (e.g. 50).

5 Genome annotations

Isosceles selects the appropriate set of transcripts to quantify, depending on the given run mode, using the genomic annotations that are prepared with the prepare_transcripts function. This function needs a GTF file with reference annotations, and, optionally, transcript structures extracted from BAM files using the bam_to_read_structures function (which is necessary for detecting de novo transcripts). If de novo transcript detection is enabled, the numbers of spliced reads assigned to different splicing compatibility levels (described in detail in the Isosceles paper) can be plotted using the plot_splicing_support_levels function.

There are several important aspects of the prepare_transcripts function to consider:

  • Reference annotations from Ensembl or GENCODE are generally recommended. However, Isosceles can also work with merged GTF files that combine both reference and de novo detected transcripts from an external source. Examples of programs that produce these merged files include StringTie and IsoQuant. Utilizing merged annotations as reference can improve the sensitivity of discovering new transcripts without significantly increasing the false discovery rate. More details about the benchmarks can be found in the Isosceles paper and the Isosceles_Paper repository.
  • Isosceles assigns new IDs to all transcripts (whether they are reference or de novo detected) based on stable hash identifiers. This simplifies the process of matching de novo transcripts across different data sets from the same genome build, irrespective of other experimental variables. The corresponding reference IDs compatible with each transcript can be found in transcript level SummarizedExperiment object’s rowData (rowData(se)$compatible_tx).
  • Genes and transcripts are merged according to specific rules:
    • If annotated genes share introns, they are merged and assigned a new ID and symbol, comprising the original IDs and symbols separated by commas.
    • Annotated transcripts with identical intron structure and identical transcript start and end bins (default: 50 bp) are considered identical by Isosceles and also merged, and given a unique transcript ID.

6 Isosceles transcript IDs

The Isosceles transcript ID (e.g. ‘ISOT-2a9c-c3db-71b4-c3f2:s100628650:e100636850:AP:FL’) consists of several parts, separated by colons:

  • Stable hash identifier, starting with the constant ‘ISOT’ prefix. For spliced transcripts, the identifier contains the first 16 characters of the MD5 hash of transcript’s comma-separated intron positions, while for the unspliced transcripts it consists of 12 zeroes and the first 4 characters of the MD5 hash of transcript’s chromosome and strand. For convenience, the hash identifier is separated into 4 character segments by dashes.
  • The transcript start bin position, prefixed by the ‘s’ character. The default bin size is 50 bp, but it can be changed using the bin_size argument of the prepare_transcripts function. The start of the bin is always used, ensuring that it’s smaller or equal to the transcript start position.
  • The transcript end bin position, prefixed by the ‘e’ character. The default bin size is 50 bp, but it can be changed using the bin_size argument of the prepare_transcripts function. Unlike the start positions, the end of the bin is always used, ensuring that it’s bigger or equal to the transcript end position.
  • Splicing compatibility code of the transcript. The full list of these codes can be found in the Isosceles paper, but the following ones are actually used in at least one run mode:
    • AP (Annotated Paths) - reference transcripts
    • EC (Edge Compatible) - de novo transcripts where all introns are known (i.e. found in reference annotations or provided by the user)
    • NC (Node Compatible) - de novo transcripts where all splice sites are known
    • DN (De-novo Node) - de novo transcripts where all splice sites are known or reproducibly passing filters in the aligned reads
  • Truncation status code of the transcript. The full list of these codes can be found in the Isosceles paper, but only the FL (Full-Length) transcripts are used by Isosceles for quantification.

The comma separated list of compatible reference transcript IDs can be found in transcript level SummarizedExperiment object’s rowData (rowData(se)$compatible_tx).

7 Output

The main output of Isosceles consists of SummarizedExperiment objects, created at various resolution levels:

  • TCC (Transcript Compatibility Counts) - This is an intermediate quantification stage of the data, formed directly from the BAM files with the bam_to_tcc function. The values here are used to derive transcript quantification estimates via the EM algorithm. The pseudobulk_tcc function (for summarizing data at the pseudo-bulk level), the pseudotime_tcc function (for summarizing data at the pseudo-time window level) and the neighborhood_tcc function (for combining counts from neighboring cells) utilize TCC SummarizedExperiment objects as both input and output.
  • transcript - At the transcript level, expression values are computed using the tcc_to_transcript function, applying the EM (Expectation–Maximization) algorithm. By default, normalization is done using effective transcript lengths, but it can be turned off for UMI-based protocol data (commonly in scRNA-Seq datasets) by setting the use_length_normalization argument to FALSE. You can export transcript annotations from the SummarizedExperiment object to a GTF file using the export_gtf function.
  • gene - Gene-level expression values are derived using the tcc_to_gene function. This process is typically very swift as it doesn’t necessitate running the EM algorithm.
  • PSI (Percent Spliced In) - PSI values are computed for unique transcriptomic regions of each gene (such as transcription start sites or core exons) using the transcript_to_psi function. This requires a transcript level SummarizedExperiment object as input.

Each of these levels provides specific insights and functionalities within the Isosceles framework, catering to different analytical requirements.

Data in the SummarizedExperiment objects can be accessed using the following commands:

  • assayNames(se) - assay names available at the given resolution level
  • assay(se, “counts”) - raw count data matrix
  • assay(se, “tpm”) - TPM (Transcripts Per Million) normalized expression value matrix
  • assay(se, “relative_expression”) - relative expression value matrix. All the values are in the 0 to 1 range, and represent the ratios of the TPM values of the analyzed features (e.g. transcripts) and the TPM values of genes they belong to
  • assay(se, “psi”) - PSI (Percent Spliced In) value matrix
  • rowData(se) - additional data regarding the analyzed features (e.g. reference transcript IDs for transcript level data)
  • rowRanges(se) - a GRangesList object (transcript level data) or a GRanges object (PSI level data) containing the genomic positions of the analyzed features

8 Parallelization

Many functions in the Isosceles package can make use of multiple CPU cores, specified using the ncpu argument. The following tips should help you find the optimal ncpu value for your analysis:

  • Functions that take BAM files as an input (bam_to_read_structures and bam_to_tcc) can process each of them on a separate CPU core.
  • The tcc_to_transcript function parallelizes the EM algorithm computations at the SummarizedExperiment column level (samples for bulk RNA-Seq, cells for scRNA-Seq data).
  • In case of the transcript_to_psi function, PSI region detection can be parallelized at the gene level.

9 Bulk RNA-Seq analysis

9.1 Preparing input data

Get the input file paths for a small bulk RNA-Seq dataset:

bam_file <- system.file(
  "extdata", "bulk_rnaseq.bam",
  package = "Isosceles"
)
gtf_file <- system.file(
  "extdata", "bulk_rnaseq.gtf",
  package = "Isosceles"
)
genome_fasta_file <- system.file(
  "extdata", "bulk_rnaseq.fa",
  package = "Isosceles"
)

9.2 Basic Isosceles analysis

Extract read data from the BAM files:

bam_files <- c(Sample = bam_file)
bam_parsed <- bam_to_read_structures(
    bam_files = bam_files
)

Prepare transcript data for the analysis:

transcript_data <- prepare_transcripts(
    gtf_file = gtf_file,
    genome_fasta_file = genome_fasta_file,
    bam_parsed = bam_parsed,
    min_bam_splice_read_count = 2,
    min_bam_splice_fraction = 0.01
)

Create a TCC (Transcript Compatibility Counts) SummarizedExperiment object:

se_tcc <- bam_to_tcc(
    bam_files = bam_files,
    transcript_data = transcript_data,
    run_mode = "de_novo_loose",
    min_read_count = 1,
    min_relative_expression = 0
)
se_tcc
## class: SummarizedExperiment 
## dim: 26 1 
## metadata(4): compatibility_matrix transcript_df
##   transcript_exon_granges_list mean_read_length
## assays(3): counts tpm relative_expression
## rownames(26): 20,22,24,25,26,27 20,23,24,25,26,27,28,30 ... 69 70
## rowData names(3): ec_id gene_id gene_name
## colnames(1): Sample
## colData names(0):

Create a transcript-level SummarizedExperiment object using the EM algorithm:

se_transcript <- tcc_to_transcript(
    se_tcc = se_tcc,
    use_length_normalization = TRUE
)
se_transcript
## class: RangedSummarizedExperiment 
## dim: 112 1 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(112): ISOT-5c13-e73a-5003-3c93:s976950:e991750:AP:FL
##   ISOT-0000-0000-0000-0762:s989800:e990350:AP:FL ...
##   ISOT-fd89-a94a-cc11-ef44:s72650:e75300:AP:FL
##   ISOT-0000-0000-0000-0762:s30100:e30300:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(1): Sample
## colData names(0):

Create a gene-level SummarizedExperiment object:

se_gene <- tcc_to_gene(
    se_tcc = se_tcc
)
se_gene
## class: SummarizedExperiment 
## dim: 23 1 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(23): ENSG00000064218 ENSG00000107099 ... ENSG00000277631
##   ENSG00000283921
## rowData names(2): gene_id gene_name
## colnames(1): Sample
## colData names(0):

10 scRNA-Seq analysis

10.1 Preparing input data

Get the input file paths for a small scRNA-Seq dataset:

bam_file <- system.file(
    "extdata", "scrnaseq.bam",
    package = "Isosceles"
)
gtf_file <- system.file(
    "extdata", "scrnaseq.gtf.gz",
    package = "Isosceles"
)
genome_fasta_file <- system.file(
    "extdata", "scrnaseq.fa.gz",
    package = "Isosceles"
)

10.2 Basic Isosceles analysis

Extract read data from the BAM files:

bam_files <- c(Sample = bam_file)
bam_parsed <- bam_to_read_structures(
    bam_files = bam_files
)

Prepare transcript data for the analysis:

transcript_data <- prepare_transcripts(
    gtf_file = gtf_file,
    genome_fasta_file = genome_fasta_file,
    bam_parsed = bam_parsed,
    min_bam_splice_read_count = 1,
    min_bam_splice_fraction = 0.01
)

Create a TCC (Transcript Compatibility Counts) SummarizedExperiment object:

se_tcc <- bam_to_tcc(
    bam_files = bam_files,
    transcript_data = transcript_data,
    run_mode = "de_novo_loose",
    min_read_count = 1,
    min_relative_expression = 0,
    is_single_cell = TRUE,
    barcode_tag = "BC"
)
se_tcc
## class: SummarizedExperiment 
## dim: 8 57 
## metadata(4): compatibility_matrix transcript_df
##   transcript_exon_granges_list mean_read_length
## assays(3): counts tpm relative_expression
## rownames(8): 470 470,471,472,473,474,475 ... 476 477
## rowData names(3): ec_id gene_id gene_name
## colnames(57): Sample.CGTAGGCAGAAGGCCT Sample.CTAGCCTTCCTATTCA ...
##   Sample.CCGTTCAAGAGTAAGG Sample.CGGCTAGTCCGTTGCT
## colData names(0):

Create a transcript-level SummarizedExperiment object using the EM algorithm:

se_transcript <- tcc_to_transcript(
    se_tcc = se_tcc,
    use_length_normalization = FALSE
)
se_transcript
## class: RangedSummarizedExperiment 
## dim: 4111 57 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(4111): ISOT-7438-e6a0-1dea-329e:s120453200:e120530200:AP:FL
##   ISOT-85c9-224f-fb90-8b49:s120447100:e120530200:AP:FL ...
##   ISOT-0000-0000-0000-3df5:s43834750:e43836550:AP:FL
##   ISOT-f090-cb81-2f39-4fa9:s156254500:e156255400:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(57): Sample.CGTAGGCAGAAGGCCT Sample.CTAGCCTTCCTATTCA ...
##   Sample.CCGTTCAAGAGTAAGG Sample.CGGCTAGTCCGTTGCT
## colData names(0):

Create a gene-level SummarizedExperiment object:

se_gene <- tcc_to_gene(
    se_tcc = se_tcc
)
se_gene
## class: SummarizedExperiment 
## dim: 2855 57 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(2855): ENSMUSG00000000085 ENSMUSG00000000409 ...
##   ENSMUSG00000111611 ENSMUSG00000111895
## rowData names(2): gene_id gene_name
## colnames(57): Sample.CGTAGGCAGAAGGCCT Sample.CTAGCCTTCCTATTCA ...
##   Sample.CCGTTCAAGAGTAAGG Sample.CGGCTAGTCCGTTGCT
## colData names(0):

10.3 Pseudo-bulk analysis

Create a pseudobulk TCC SummarizedExperiment object:

# Randomly choose one of two labels ('1' or '2') for each cell
set.seed(42)
cell_labels <- sample(1:2, ncol(se_tcc), replace = TRUE)
# Prepare pseudobulk TCC data for given labels
se_pseudobulk_tcc <- pseudobulk_tcc(
    se_tcc = se_tcc,
    cell_labels = cell_labels
)
se_pseudobulk_tcc
## class: SummarizedExperiment 
## dim: 8 2 
## metadata(4): compatibility_matrix transcript_df
##   transcript_exon_granges_list mean_read_length
## assays(3): counts tpm relative_expression
## rownames(8): 470 470,471,472,473,474,475 ... 476 477
## rowData names(3): ec_id gene_id gene_name
## colnames(2): 1 2
## colData names(0):

Create a pseudobulk transcript-level SummarizedExperiment object using the EM algorithm:

se_pseudobulk_transcript <- tcc_to_transcript(
    se_tcc = se_pseudobulk_tcc,
    use_length_normalization = FALSE
)
se_pseudobulk_transcript
## class: RangedSummarizedExperiment 
## dim: 4111 2 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(4111): ISOT-7438-e6a0-1dea-329e:s120453200:e120530200:AP:FL
##   ISOT-85c9-224f-fb90-8b49:s120447100:e120530200:AP:FL ...
##   ISOT-0000-0000-0000-3df5:s43834750:e43836550:AP:FL
##   ISOT-f090-cb81-2f39-4fa9:s156254500:e156255400:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(2): 1 2
## colData names(0):

Create a pseudobulk gene-level SummarizedExperiment object:

se_pseudobulk_gene <- tcc_to_gene(
    se_tcc = se_pseudobulk_tcc
)
se_pseudobulk_gene
## class: SummarizedExperiment 
## dim: 2855 2 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(2855): ENSMUSG00000000085 ENSMUSG00000000409 ...
##   ENSMUSG00000111611 ENSMUSG00000111895
## rowData names(2): gene_id gene_name
## colnames(2): 1 2
## colData names(0):

10.4 Merging neighboring cell data

Isosceles also allows adding counts from neighboring cells to each cell, potentially improving the quantification results:

# Get PCA coordinates of each cell from transcript expression data
sce_transcript <- methods::as(se_transcript, "SingleCellExperiment")
sce_transcript <- scuttle::computeLibraryFactors(sce_transcript)
sce_transcript <- scuttle::logNormCounts(sce_transcript)
set.seed(42)
sce_transcript <- scater::runPCA(sce_transcript, ncomponents = 2)
pca_mat <- SingleCellExperiment::reducedDim(sce_transcript, "PCA")
# Add TCC values from two nearest cells to each cell
se_neighbor_tcc <- neighborhood_tcc(
    se_tcc = se_tcc,
    pca_mat = pca_mat,
    k = 2
)
se_neighbor_tcc
## class: SummarizedExperiment 
## dim: 8 57 
## metadata(4): compatibility_matrix transcript_df
##   transcript_exon_granges_list mean_read_length
## assays(3): counts tpm relative_expression
## rownames(8): 470 470,471,472,473,474,475 ... 476 477
## rowData names(3): ec_id gene_id gene_name
## colnames(57): Sample.CGTAGGCAGAAGGCCT Sample.CTAGCCTTCCTATTCA ...
##   Sample.CCGTTCAAGAGTAAGG Sample.CGGCTAGTCCGTTGCT
## colData names(0):

Recalculate transcript expression values from merged data using the EM algorithm:

se_neighbor_transcript <- tcc_to_transcript(
    se_tcc = se_neighbor_tcc,
    use_length_normalization = FALSE
)
se_neighbor_transcript
## class: RangedSummarizedExperiment 
## dim: 4111 57 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(4111): ISOT-7438-e6a0-1dea-329e:s120453200:e120530200:AP:FL
##   ISOT-85c9-224f-fb90-8b49:s120447100:e120530200:AP:FL ...
##   ISOT-0000-0000-0000-3df5:s43834750:e43836550:AP:FL
##   ISOT-f090-cb81-2f39-4fa9:s156254500:e156255400:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(57): Sample.CGTAGGCAGAAGGCCT Sample.CTAGCCTTCCTATTCA ...
##   Sample.CCGTTCAAGAGTAAGG Sample.CGGCTAGTCCGTTGCT
## colData names(0):

11 Downstream analysis

11.1 Preparing input data

Load a transcript-level SummarizedExperiment object containing expression levels of 100 highly variable transcripts across 200 cells from a mix of ovarian cancer cell lines:

se <- readRDS(system.file("extdata", "se_transcript_100_hvts.rds",
                          package = "Isosceles"))
se
## class: RangedSummarizedExperiment 
## dim: 100 200 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(100): ISOT-5434-0815-056e-2ed2:s3020300:e3022400:AP:FL
##   ISOT-85d4-1297-c095-b42b:s17229250:e17237600:AP:FL ...
##   ISOT-79a9-1e21-3e21-7593:s106969800:e106974800:AP:FL
##   ISOT-188a-636e-edb1-889b:s29679000:e29698700:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(200): AAACGAAGTAGCCAGA AAAGAACTCGGTCACG ... TAGGTACGTCGTACAT
##   TAGGTACTCTACTATC
## colData names(0):

11.2 Basic scRNA-Seq analysis

Run a basic scRNA-Seq data analysis using Bioconductor packages:

sce <- as(se, "SingleCellExperiment")
sce <- scuttle::computeLibraryFactors(sce)
sce <- scuttle::logNormCounts(sce)
dec <- scran::modelGeneVar(sce)
top_hvgs <- scran::getTopHVGs(dec, n = 100)
set.seed(42)
sce <- scran::denoisePCA(sce, technical = dec, subset.row = top_hvgs)
set.seed(42)
sce <- scater::runUMAP(sce, dimred = "PCA")
sce
## class: SingleCellExperiment 
## dim: 100 200 
## metadata(0):
## assays(4): counts tpm relative_expression logcounts
## rownames(100): ISOT-5434-0815-056e-2ed2:s3020300:e3022400:AP:FL
##   ISOT-85d4-1297-c095-b42b:s17229250:e17237600:AP:FL ...
##   ISOT-79a9-1e21-3e21-7593:s106969800:e106974800:AP:FL
##   ISOT-188a-636e-edb1-889b:s29679000:e29698700:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(200): AAACGAAGTAGCCAGA AAAGAACTCGGTCACG ... TAGGTACGTCGTACAT
##   TAGGTACTCTACTATC
## colData names(1): sizeFactor
## reducedDimNames(2): PCA UMAP
## mainExpName: NULL
## altExpNames(0):

Cell clustering using k-means:

set.seed(42)
colLabels(sce) <- bluster::clusterRows(reducedDim(sce, "PCA"),
                                       bluster::KmeansParam(3))
dittoSeq::dittoDimPlot(sce, "label", reduction.use = "UMAP",
                       size = 2, main = "",
                       do.label = TRUE, labels.highlight = TRUE,
                       legend.show = FALSE)

11.3 Identifying marker transcripts

Identify marker transcripts for individual clusters:

markers <- scran::findMarkers(sce, groups = sce$label, test.type = "t",
                              pval.type = "all", direction = "up")
markers
## List of length 3
## names(3): 1 2 3
head(markers[[1]])
## DataFrame with 6 rows and 5 columns
##                                                          p.value         FDR
##                                                        <numeric>   <numeric>
## ISOT-9a12-71e0-57a3-e57d:s170864850:e170870500:AP:FL 2.29637e-40 2.29637e-38
## ISOT-0664-96b1-953c-fbcd:s40457250:e40468600:AP:FL   6.80912e-33 3.40456e-31
## ISOT-222e-275a-5543-ff6e:s39312850:e39319800:AP:FL   1.81760e-31 6.05867e-30
## ISOT-76a1-49c3-a1a7-9d07:s8377500:e8383200:AP:FL     5.55753e-27 1.38938e-25
## ISOT-1310-94db-577c-5321:s39406650:e39412550:AP:FL   3.17542e-23 6.35085e-22
## ISOT-d695-7986-8bd7-ca02:s78033850:e78040700:AP:FL   2.62126e-21 4.36877e-20
##                                                      summary.logFC   logFC.2
##                                                          <numeric> <numeric>
## ISOT-9a12-71e0-57a3-e57d:s170864850:e170870500:AP:FL       2.65280   3.06150
## ISOT-0664-96b1-953c-fbcd:s40457250:e40468600:AP:FL         3.18546   3.74561
## ISOT-222e-275a-5543-ff6e:s39312850:e39319800:AP:FL         1.57421   1.57421
## ISOT-76a1-49c3-a1a7-9d07:s8377500:e8383200:AP:FL           1.70687   2.67656
## ISOT-1310-94db-577c-5321:s39406650:e39412550:AP:FL         2.04990   2.15373
## ISOT-d695-7986-8bd7-ca02:s78033850:e78040700:AP:FL         2.29017   2.16768
##                                                        logFC.3
##                                                      <numeric>
## ISOT-9a12-71e0-57a3-e57d:s170864850:e170870500:AP:FL   2.65280
## ISOT-0664-96b1-953c-fbcd:s40457250:e40468600:AP:FL     3.18546
## ISOT-222e-275a-5543-ff6e:s39312850:e39319800:AP:FL     2.25056
## ISOT-76a1-49c3-a1a7-9d07:s8377500:e8383200:AP:FL       1.70687
## ISOT-1310-94db-577c-5321:s39406650:e39412550:AP:FL     2.04990
## ISOT-d695-7986-8bd7-ca02:s78033850:e78040700:AP:FL     2.29017

Get top 3 marker transcripts for each cluster:

top_markers <- markers %>%
  lapply(function(x) rownames(x)[1:3]) %>%
  unlist() %>%
  unname()
top_markers
## [1] "ISOT-9a12-71e0-57a3-e57d:s170864850:e170870500:AP:FL"
## [2] "ISOT-0664-96b1-953c-fbcd:s40457250:e40468600:AP:FL"  
## [3] "ISOT-222e-275a-5543-ff6e:s39312850:e39319800:AP:FL"  
## [4] "ISOT-a049-9476-caf6-f7ba:s114295800:e114312600:AP:FL"
## [5] "ISOT-1d28-0bb1-cc79-de6b:s209675400:e209676400:AP:FL"
## [6] "ISOT-188a-636e-edb1-889b:s29679000:e29698700:AP:FL"  
## [7] "ISOT-3d99-b5d4-4c81-952b:s150338300:e150341700:AP:FL"
## [8] "ISOT-e3aa-f039-f992-8f8f:s105486300:e105488800:AP:FL"
## [9] "ISOT-6639-25b4-48c3-9465:s51166550:e51171750:AP:FL"

Top marker transcript heatmap:

dittoSeq::dittoHeatmap(sce, top_markers, annot.by = "label",
                       cluster_rows = TRUE, fontsize_row = 6)

11.4 Transcript expression UMAP plots

for (transcript_id in top_markers) {
    cat("\n\n### ", transcript_id, "\n")
    umap_plot <- dittoSeq::dittoDimPlot(
        sce, transcript_id, size = 2,
        min.color = "#0072B2", max.color = "#F0E442",
        order = "increasing"
    )
    print(umap_plot)
}

11.4.1 ISOT-9a12-71e0-57a3-e57d:s170864850:e170870500:AP:FL

11.4.2 ISOT-0664-96b1-953c-fbcd:s40457250:e40468600:AP:FL

11.4.3 ISOT-222e-275a-5543-ff6e:s39312850:e39319800:AP:FL

11.4.4 ISOT-a049-9476-caf6-f7ba:s114295800:e114312600:AP:FL

11.4.5 ISOT-1d28-0bb1-cc79-de6b:s209675400:e209676400:AP:FL

11.4.6 ISOT-188a-636e-edb1-889b:s29679000:e29698700:AP:FL

11.4.7 ISOT-3d99-b5d4-4c81-952b:s150338300:e150341700:AP:FL

11.4.8 ISOT-e3aa-f039-f992-8f8f:s105486300:e105488800:AP:FL

11.4.9 ISOT-6639-25b4-48c3-9465:s51166550:e51171750:AP:FL

12 PSI analysis

12.1 Preparing input data

Load a transcript-level SingleCellExperiment object containing expression levels of transcripts of the CAV1 gene across 2,060 cells from a mix of ovarian cancer cell lines:

sce_transcript <- readRDS(system.file("extdata", "sce_transcript_cav1.rds",
                                      package = "Isosceles"))
sce_transcript
## class: SingleCellExperiment 
## dim: 6 2060 
## metadata(0):
## assays(3): counts tpm relative_expression
## rownames(6): ISOT-187e-bc5a-afe9-5320:s116525000:e116559350:AP:FL
##   ISOT-35a7-317f-81c6-345e:s116526250:e116559550:AP:FL ...
##   ISOT-88b1-d6d5-6bc5-0a76:s116526250:e116561200:AP:FL
##   ISOT-bf85-2818-1b10-afd6:s116525700:e116561200:AP:FL
## rowData names(12): transcript_id position ... read_count
##   relative_expression
## colnames(2060): AAACGAAGTAGCCAGA AAAGAACTCGGTCACG ... GGGTGAACAAGGATGC
##   TATCCTACATCAGCGC
## colData names(0):
## reducedDimNames(1): UMAP
## mainExpName: NULL
## altExpNames(0):

12.2 PSI analysis using Isosceles

Create a PSI (Percent Spliced In) SummarizedExperiment object:

se_psi <- transcript_to_psi(sce_transcript)
se_psi
## class: RangedSummarizedExperiment 
## dim: 12 2060 
## metadata(0):
## assays(1): psi
## rownames(12): ENSG00000105974:7:116524750:+:TSS
##   ENSG00000105974:7:116525000:+:TSS ...
##   ENSG00000105974:7:116559550:+:TES ENSG00000105974:7:116561200:+:TES
## rowData names(2): gene_id type
## colnames(2060): AAACGAAGTAGCCAGA AAAGAACTCGGTCACG ... GGGTGAACAAGGATGC
##   TATCCTACATCAGCGC
## colData names(0):

PSI values are calculated for the following types of regions:

  • TSS - transcription start sites
  • TES - transcription end sites
  • CE - core exonic regions
  • RI - retained intronic regions
  • A5 - 5’ alternative exonic regions
  • A3 - 3’ alternative exonic regions

12.3 PSI region visualization

PSI value heatmap for the CAV1 gene:

region_colors <- c(
    TSS = "#FF83FF", CE = "#FF9289", RI = "#82B7FF",
    A5 = "#00D65C", A3 = "#00DAE0", TES = "#D3BA00"
)
plot_psi_heatmap(se_psi, gene_id = "ENSG00000105974",
                 region_colors = region_colors)

Plot the CAV1 transcript structures:

plot_psi_regions(
    se_psi = se_psi,
    se_transcript = sce_transcript,
    gene_id = "ENSG00000105974",
    region_colors = region_colors
)

Plot the CAV1 transcript structures with introns shrinked to max 1000 bp:

plot_psi_regions(
    se_psi = se_psi,
    se_transcript = sce_transcript,
    gene_id = "ENSG00000105974",
    max_intron_length = 1000,
    region_colors = region_colors
)

12.4 PSI UMAP plots

# Calculate UMAP coordinates based on PSI values
sce_psi <- as(se_psi, "SingleCellExperiment")
reducedDim(sce_psi, "UMAP") <- reducedDim(sce_transcript, "UMAP")
sce_psi <- sce_psi[
    , colSums(assay(sce_psi, "psi")) > 0
]
# Create UMAP plots colored by PSI values for each gene region
for (region_id in rownames(sce_psi)) {
    cat("\n\n### ", region_id, "\n")
    umap_plot <- dittoSeq::dittoDimPlot(
        sce_psi, region_id, size = 1,
        min.color = "#0072B2", max.color = "#F0E442",
        order = "increasing"
    )
    print(umap_plot)
}

12.4.1 ENSG00000105974:7:116524750:+:TSS

12.4.2 ENSG00000105974:7:116525000:+:TSS

12.4.3 ENSG00000105974:7:116525093-116525783:+:A5

12.4.4 ENSG00000105974:7:116525700:+:TSS

12.4.5 ENSG00000105974:7:116525784-116525805:+:A5

12.4.6 ENSG00000105974:7:116526250:+:TSS

12.4.7 ENSG00000105974:7:116526389-116526524:+:RI

12.4.8 ENSG00000105974:7:116526525-116526557:+:A3

12.4.9 ENSG00000105974:7:116526558-116526689:+:CE

12.4.10 ENSG00000105974:7:116559350:+:TES

12.4.11 ENSG00000105974:7:116559550:+:TES

12.4.12 ENSG00000105974:7:116561200:+:TES

13 Session Info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/local/lib/R/lib/libRblas.so 
## LAPACK: /usr/local/lib/R/lib/libRlapack.so;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
##  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
##  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
## [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Isosceles_0.2.1             SingleCellExperiment_1.24.0
##  [3] SummarizedExperiment_1.32.0 Biobase_2.62.0             
##  [5] GenomicRanges_1.54.1        GenomeInfoDb_1.38.6        
##  [7] IRanges_2.36.0              S4Vectors_0.40.2           
##  [9] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
## [11] matrixStats_1.2.0           BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.12.0             bitops_1.0-7             
##   [3] filelock_1.0.3            tibble_3.2.1             
##   [5] graph_1.80.0              XML_3.99-0.16.1          
##   [7] rpart_4.1.23              lifecycle_1.0.4          
##   [9] edgeR_4.0.15              OrganismDbi_1.44.0       
##  [11] ensembldb_2.26.0          lattice_0.22-5           
##  [13] backports_1.4.1           magrittr_2.0.3           
##  [15] limma_3.58.1              Hmisc_5.1-1              
##  [17] sass_0.4.8                rmarkdown_2.25           
##  [19] jquerylib_0.1.4           yaml_2.3.8               
##  [21] metapod_1.10.1            ggbio_1.50.0             
##  [23] cowplot_1.1.3             DBI_1.2.2                
##  [25] RColorBrewer_1.1-3        abind_1.4-5              
##  [27] zlibbioc_1.48.0           purrr_1.0.2              
##  [29] AnnotationFilter_1.26.0   biovizBase_1.50.0        
##  [31] RCurl_1.98-1.14           nnet_7.3-19              
##  [33] VariantAnnotation_1.48.1  rappdirs_0.3.3           
##  [35] GenomeInfoDbData_1.2.11   ggrepel_0.9.5            
##  [37] irlba_2.3.5.1             pheatmap_1.0.12          
##  [39] dqrng_0.3.2               DelayedMatrixStats_1.24.0
##  [41] codetools_0.2-19          DelayedArray_0.28.0      
##  [43] scuttle_1.12.0            xml2_1.3.6               
##  [45] tidyselect_1.2.0          farver_2.1.1             
##  [47] ScaledMatrix_1.10.0       viridis_0.6.5            
##  [49] BiocFileCache_2.10.1      base64enc_0.1-3          
##  [51] GenomicAlignments_1.38.2  jsonlite_1.8.8           
##  [53] BiocNeighbors_1.20.2      Formula_1.2-5            
##  [55] ggridges_0.5.6            scater_1.30.1            
##  [57] tools_4.3.1               progress_1.2.3           
##  [59] Rcpp_1.0.12               glue_1.7.0               
##  [61] gridExtra_2.3             SparseArray_1.2.4        
##  [63] xfun_0.42                 dplyr_1.1.4              
##  [65] withr_3.0.0               BiocManager_1.30.22      
##  [67] fastmap_1.1.1             GGally_2.2.1             
##  [69] bluster_1.12.0            fansi_1.0.6              
##  [71] digest_0.6.34             rsvd_1.0.5               
##  [73] R6_2.5.1                  colorspace_2.1-0         
##  [75] dichromat_2.0-0.1         biomaRt_2.58.2           
##  [77] RSQLite_2.3.5             utf8_1.2.4               
##  [79] tidyr_1.3.1               generics_0.1.3           
##  [81] data.table_1.15.0         rtracklayer_1.62.0       
##  [83] FNN_1.1.4                 prettyunits_1.2.0        
##  [85] httr_1.4.7                htmlwidgets_1.6.4        
##  [87] S4Arrays_1.2.0            ggstats_0.5.1            
##  [89] uwot_0.1.16               pkgconfig_2.0.3          
##  [91] gtable_0.3.4              blob_1.2.4               
##  [93] XVector_0.42.0            htmltools_0.5.7          
##  [95] bookdown_0.37             RBGL_1.78.0              
##  [97] ProtGenerics_1.34.0       scales_1.3.0             
##  [99] png_0.1-8                 scran_1.30.2             
## [101] knitr_1.45                rstudioapi_0.15.0        
## [103] reshape2_1.4.4            rjson_0.2.21             
## [105] checkmate_2.3.1           curl_5.2.0               
## [107] cachem_1.0.8              stringr_1.5.1            
## [109] parallel_4.3.1            vipor_0.4.7              
## [111] foreign_0.8-84            AnnotationDbi_1.64.1     
## [113] restfulr_0.0.15           pillar_1.9.0             
## [115] grid_4.3.1                vctrs_0.6.5              
## [117] BiocSingular_1.18.0       dbplyr_2.4.0             
## [119] dittoSeq_1.14.2           beachmat_2.18.1          
## [121] cluster_2.1.4             beeswarm_0.4.0           
## [123] htmlTable_2.4.2           evaluate_0.23            
## [125] GenomicFeatures_1.54.3    magick_2.8.2             
## [127] cli_3.6.2                 locfit_1.5-9.8           
## [129] compiler_4.3.1            Rsamtools_2.18.0         
## [131] rlang_1.1.3               crayon_1.5.2             
## [133] labeling_0.4.3            plyr_1.8.9               
## [135] ggbeeswarm_0.7.2          stringi_1.8.3            
## [137] viridisLite_0.4.2         BiocParallel_1.36.0      
## [139] assertthat_0.2.1          munsell_0.5.0            
## [141] Biostrings_2.70.2         lazyeval_0.2.2           
## [143] Matrix_1.6-5              BSgenome_1.70.2          
## [145] hms_1.1.3                 sparseMatrixStats_1.14.0 
## [147] bit64_4.0.5               ggplot2_3.5.1            
## [149] KEGGREST_1.42.0           statmod_1.5.0            
## [151] highr_0.10                igraph_1.6.0             
## [153] memoise_2.0.1             bslib_0.6.1              
## [155] fastmatch_1.1-4           bit_4.0.5