single-cell • grandR

Loading single-cell data

Single-cell data in the form of feature-barcode matrices can be directly read into a grandR object. Two matrices must be prepared, one containing total counts, one containing new RNA counts. To demonstrate this, we will use publicly available data on cortisol response in single-cells of the A549 cell-line, including matrices of total and newly synthesized RNA (https://www.nature.com/articles/s41587-020-0480-9). The data is available on GEO under the accession ID GSM3770930.

First load the necessary packages.

suppressPackageStartupMessages({
    library(grandR)
    library(ggplot2)
    library(patchwork)
    library(Seurat)
})

Since we are downloading large files, the download may take some time. The default time limit in R is 60 seconds, which might be insufficient. We change increase the time limit as follows:

options(timeout=300)
getOption("timeout")

[1] 300

Then we download the data and load them into grandR. The files assigned to the genes,cells and total.matrix is the normal matrix-market format that are generated by most single cell processing pipelines (such as CellRanger or STARsolo). Equivalently, genes,cells andnew.matrix is a matrix-marked formatted matrix containing counts of read with T-to-C mismatches.

Importantly, GRAND-SLAM output of single cell data can be read as usual using the ReadGRAND function. The detection.rate defines how many reads that truly originate from labeled RNA have been recognized as such by having a T-to-C mismatch. For details, see the sci-fate paper. 82% is the value that has been estimated in this paper.

d <- ReadNewTotal(genes = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fgene%5Fannotate%2Etxt%2Egz",
                  cells = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fcell%5Fannotate%2Etxt%2Egz",
                  new.matrix = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fgene%5Fcount%5Fnewly%5Fsynthesised%2Etxt%2Egz",
                  total.matrix = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fgene%5Fcount%2Etxt%2Egz",
                  verbose=TRUE,
                  detection.rate = 0.82)

Downloading file (url: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fgene%5Fannotate%2Etxt%2Egz, destination: /tmp/RtmpTmhPc8/ecbb72c7c593) ...
Downloading file (url: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fcell%5Fannotate%2Etxt%2Egz, destination: /tmp/RtmpTmhPc8/ecbb7ca3217e) ...
Downloading file (url: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fgene%5Fcount%5Fnewly%5Fsynthesised%2Etxt%2Egz, destination: /tmp/RtmpTmhPc8/ecbb3682141b) ...
Downloading file (url: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3770930&format=file&file=GSM3770930%5FA549%5Fgene%5Fcount%2Etxt%2Egz, destination: /tmp/RtmpTmhPc8/ecbb5c5c3905) ...
Reading total count matrix...

Warning: Duplicate gene symbols (n=141, e.g. SNORA31,SNORD70,SNORA51,LTB4R2,RNA5-8S5,U8)
present, making unique!

Reading new count matrix...
Computing NTRs...
Deleting temporary file...
Deleting temporary file...
Deleting temporary file...
Deleting temporary file...

Warning in grandR(prefix = genes$prefix, gene.info = gene.info, slots = re, : No no4sU entry in
coldata, assuming all samples/cells as 4sU treated!

Now, d is a grandR object containing single cell data:

grandR:
Read from https://www.ncbi.nlm.nih.gov/geo/download/
43167 genes, 7404 samples/cells
Available data slots: count,ntr
Available analyses: 
Available plots: 
Default data slot: count

Note that also the cell meta-data has been loaded correctly, which is the Coldata in grandR terms:

head(Coldata(d))

	Name	sample	all_exon	all_intron	all_reads	treatment_time	doublet_score	no4sU
sci-A549-001.ACTCTCTCAA	sci-A549-001.ACTCTCTCAA	sci-A549-001.ACTCTCTCAA	31785	6225	41737	4h	0.1070496	FALSE
sci-A549-001.AAGAAGTTCA	sci-A549-001.AAGAAGTTCA	sci-A549-001.AAGAAGTTCA	31194	2514	36264	6h	0.0369588	FALSE
sci-A549-001.ACGGAGAATA	sci-A549-001.ACGGAGAATA	sci-A549-001.ACGGAGAATA	17207	2270	21127	8h	0.0449735	FALSE
sci-A549-001.CGATCCTGGA	sci-A549-001.CGATCCTGGA	sci-A549-001.CGATCCTGGA	14282	1932	17723	4h	0.0379747	FALSE
sci-A549-001.ACTGAATTCC	sci-A549-001.ACTGAATTCC	sci-A549-001.ACTGAATTCC	32336	8779	44778	6h	0.0779896	FALSE
sci-A549-001.GCCATCAACT	sci-A549-001.GCCATCAACT	sci-A549-001.GCCATCAACT	11058	4494	17677	2h	0.0338983	FALSE

Pre-processing the data

We now could directly hand-over data to Seurat. However, pre-processing and analyses can also be performed in grandR. We filter genes such that only genes remain that have at least one UMI in at least 100 cells.

d <- FilterGenes(d, mode.slot = "count", minval = 1, mincol = 100)
d

grandR:
Read from https://www.ncbi.nlm.nih.gov/geo/download/
18276 genes, 7404 samples/cells
Available data slots: count,ntr
Available analyses: 
Available plots: 
Default data slot: count

Computing new RNA percentage

We can also obtain some statistics, e.g. on the percentage of new RNA per cell, and the percentage of mitochondrial gene expression.

d <- ComputeExpressionPercentage(d,"percent.new",
                                 mode.slot="new.count",mode.slot.total="count")
d <- ComputeExpressionPercentage(d,"percent.mt",
                                 genes=Genes(d,"^MT-",regex=TRUE))  
# results in 0 for all genes, have been filtered on GEO!

The computed percentages are now part of the coldata:

head(Coldata(d))

	Name	sample	all_exon	all_intron	all_reads	treatment_time	doublet_score	no4sU	percent.new
sci-A549-001.ACTCTCTCAA	sci-A549-001.ACTCTCTCAA	sci-A549-001.ACTCTCTCAA	31785	6225	41737	4h	0.1070496	FALSE	22.46687
sci-A549-001.AAGAAGTTCA	sci-A549-001.AAGAAGTTCA	sci-A549-001.AAGAAGTTCA	31194	2514	36264	6h	0.0369588	FALSE	16.48947
sci-A549-001.ACGGAGAATA	sci-A549-001.ACGGAGAATA	sci-A549-001.ACGGAGAATA	17207	2270	21127	8h	0.0449735	FALSE	19.02614
sci-A549-001.CGATCCTGGA	sci-A549-001.CGATCCTGGA	sci-A549-001.CGATCCTGGA	14282	1932	17723	4h	0.0379747	FALSE	18.97797
sci-A549-001.ACTGAATTCC	sci-A549-001.ACTGAATTCC	sci-A549-001.ACTGAATTCC	32336	8779	44778	6h	0.0779896	FALSE	30.89094
sci-A549-001.GCCATCAACT	sci-A549-001.GCCATCAACT	sci-A549-001.GCCATCAACT	11058	4494	17677	2h	0.0338983	FALSE	34.66831

Here, the global percentage of new RNA correlates very well with the percentage of intronic reads in each cell (which is part of the table downloaded from GEO), i.e. there are two independent pieces of evidence that there are cells that are transcriptionally more active than others:

PlotScatter(Coldata(d),percent.new,all_intron/all_reads*100,
            correlation = FormatCorrelation(),
            remove.outlier = FALSE)

Seurat

Data can be handed over to Seurat. For that, modalities and a mode must be defined. Allowed modalities are * total (total counts) * new (new counts) * old (old counts) * prev (estimated counts from the onset of labeling extrapolated using gene specific RNA half-lives)

The mode is one of four options (as outlined in this review): * assay: for each modality, the Seurat object will contain an assay * cells: cells are copied for each modality * genes: genes are copied for each modality * list: a list of Seurat objects is returned

s <- as.Seurat(d, modalities=c(RNA='total',newRNA="new"), mode="assay")

Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

An object of class Seurat 
36552 features across 7404 samples within 2 assays 
Active assay: RNA (18276 features, 0 variable features)
 1 other assay present: newRNA

The percentage of new RNA could now be used as an additional QC filter:

VlnPlot(s,features = "percent.new",group.by = "treatment_time")

E.g. cells with extremely high amounts of new or old RNA could be filtered out. This step should only be done with care. Here we skip it, since the correlation with the percentage of intronic reads provides independent evidence that the high variance in new RNA content is biological.

Note that the new Seurat object has two assays, RNA and newRNA (which are the names we defined for the modalities). Let’s do the normal Seurat workflow for total RNA:

s <- NormalizeData(s)
s <- FindVariableFeatures(s)
s <- ScaleData(s)
s <- RunPCA(s,verbose = FALSE)
s <- RunUMAP(s,dims=1:10)
DimPlot(s,group.by = 'treatment_time')

# treatment_time was transferred from the grandR Coldata table!

And now repeat with new RNA:

DefaultAssay(s) <- 'newRNA'
s <- NormalizeData(s)
s <- FindVariableFeatures(s)
s <- ScaleData(s)
s <- RunPCA(s,verbose = FALSE)
s <- RunUMAP(s,dims=1:10)
DimPlot(s,group.by = 'treatment_time')