Skip to contents

Introduction

This vignette details a concise example of running the COO classifier on a new RNASeq dataset. This example was run on the Rescomp3 HPC cluster at Genentech, and uses raw data that was output by the Genentech sequencing pipeline.

Loading the data

This COO classifier requires that you supply the raw count matrix, as supplied Genentech’s RNASeq processing pipeline. The prediction model was fit using “stranded count” data, and for best performance we recommend you use the same dataset. Theoretically this program may work with unstranded counts, or with data normalized in a different way, but we can’t guarantee consistent results.

Before we can classify the samples, we have to convert our data to a DESeqDataSet. For SummarizedExperiments, this is very simple:

library(DESeq2)
cds = DESeqDataSet(strandedCtSummary,design=~1)

Alternatively, if the data is formatted as a matrix, you could use the following command:

## counts is a matrix, with rows of features and columns of samples.
cds = DESeqDataSetFromMatrix(counts,
  colData=data.frame(ID=colnames(counts)),
  rowData=data.frame(ID=rownames(counts)),
  design=~1)

Note that in order for the classifier to work, it is necessary for the row names of our CDS to be in the correct format. The classifier expects the row names to be “Gene IDs”, e.g. “GeneID:9294”, and all genes used in the model must be present in the dataset. For the standard model, there are 21 genes, with the following IDs:

##  [1] "`GeneID:10538`"  "`GeneID:10672`"  "`GeneID:11040`"  "`GeneID:152137`"
##  [5] "`GeneID:23139`"  "`GeneID:23495`"  "`GeneID:26278`"  "`GeneID:3055`"  
##  [9] "`GeneID:327657`" "`GeneID:3662`"   "`GeneID:3707`"   "`GeneID:4311`"  
## [13] "`GeneID:4603`"   "`GeneID:51700`"  "`GeneID:54836`"  "`GeneID:55534`" 
## [17] "`GeneID:64764`"  "`GeneID:6597`"   "`GeneID:79754`"  "`GeneID:8994`"  
## [21] "`GeneID:9294`"

Applying the COO classifier

The COO classifier can be applied to the cds object using the single command, coo_rnaseq(). An example is below:

pred = coo_rnaseq(cds)

The returned object consists of three columns: - Sample IDs, drawn from the column names of the cds object. - LPS - the Linear Predictor Score, used to split samples into GCB and ABC, and - COO - the Cell of Origin classification; either GCB, ABC, or Unclassified

Individual COO classifier functions

The classifier actually consists of two steps, which can be run individually. An example is below:

Step 1: Normalize

norm = coo_normalize(cds)

Step 2: Apply the classifier.

By default, this uses the built-in “coo_GOYA_21gene” classifier, developed from the GOYA dataset. However, you have the option of supplying your own version of the classifier (for example, if not all genes are available, or if our gene model changes).

pred = coo_predict(norm_main)

DZsig calling

The DZsig classifier can be applied to the same cds object:

dzsig_pred = dzsig_rnaseq(cds)

Thresholds for the DZsig classification are as published for the NanoString DLBCL90 assay (Ennishi et al., J Clin Oncol 2019; Data Supplement).

It is also possible obtain both DZsig and COO calls and generate a combined “refined COO” classification:

refined_pred = refined_coo_rnaseq(cds)

The returned object of this last function has six columns:

  • Sample IDs, drawn from the column names of the cds object.
  • LPS_COO - the linear predictor score for the COO classification
  • COO - the Cell of Origin classification; either GCB, ABC, or Unclassified
  • LPS_DZsig - the linear predictor score for the DZsig classification
  • DZsig - the Dark Zone signature classification; either DZsig-POS, DZsig-IND, or DZsig-NEG
  • refined_COO - the refined Cell of Origin classification, which is assigned hierarchically:
    • If ABC, then ABC
    • If DZsig-POS, DZsig-POS
    • Else, COO (GCB or Unclassified)

For more details, please refer to the documentation in the package.