Besides Seurat
, we need to import some useful packages.
dplyr
and ggplot2
are part of tidyverse
, a readable and flexible R language for solving data science challenges. I personally prefer the coding style with tidyverse
, but you may use base R
too.patchwork
combines separate ggplots into the same graphic with easy access to control layouts.limma
is a Bioconductor
package to analyze microarray data. It is not essential for our analysis, but may provide a more efficient implementation of the Wilcoxon rank sum test in differential expression analysis.library(Seurat)
library(dplyr) # data manipulation
library(ggplot2)
library(patchwork)
# install.packages('BiocManager')
# BiocManager::install('limma')
library(limma) # optional
We first load one spatial transcriptomics dataset into Seurat, and then explore the Seurat
object a bit for single-cell data storage and manipulation. One 10X Genomics Visium dataset will be analyzed with Seurat
in this tutorial, and you may explore other dataset sources from various sequencing technologies, and other computational toolkits listed in this (non-exhaustive) resource spreadsheet.
Seurat
A spatial gene expression dataset of mouse brain serial section 2 (Sagittal-Posterior) collected by Space Ranger 1.1.0. will be analyzed throughout the tutorial. Both the gene expression matrix and spatial imaging data are necessary for the computational analysis.
For illustration purposes, I have downloaded them from 10x Genomics, and uploaded them to Google Drive. Alternatively, you can download the same files directly from 10x Genomics, or explore other spatial gene expression datasets out there.
The data files we will be using today include:
# specify your working directory
root_dir <- "~/Downloads/"
setwd(root_dir)
# url prefix to download files
url_prefix <- "https://drive.google.com/uc?export=download&id="
# url IDs are extracted from Google Drive
spatial_folder_id <- "13B6ulZwz9XmWBD1RsnTHTMn3ySgIpFCY"
h5_mat_id <- "1nnch2ctksJ4rXYG5FzlPh3bIgnylTicV"
# file names
spatial_folder_name <- "./V1_Mouse_Brain_Sagittal_Posterior_Section_2_spatial.tar.gz"
h5_mat_name <- "./V1_Mouse_Brain_Sagittal_Posterior_Section_2_filtered_feature_bc_matrix.h5"
# download files from the resource URL to a destination path where the file is saved
download.file(url = paste0(url_prefix, spatial_folder_id),
destfile = file.path(spatial_folder_name))
download.file(url = paste0(url_prefix, h5_mat_id),
destfile = file.path(h5_mat_name))
# extract (or list) contents from the tar archives
untar(spatial_folder_name)
untar(spatial_folder_name, list = TRUE) # list contents
## [1] "spatial/" "spatial/tissue_lowres_image.png"
## [3] "spatial/tissue_positions_list.csv" "spatial/aligned_fiducials.jpg"
## [5] "spatial/tissue_hires_image.png" "spatial/scalefactors_json.json"
## [7] "spatial/detected_tissue_image.jpg"
# Load a 10x Genomics Visium Spatial Experiment into a Seurat object
brain_data <- Seurat::Load10X_Spatial(
# The directory contains the read count matrix H5 file and the image data in a subdirectory called `spatial`.
data.dir = root_dir,
filename = h5_mat_name,
assay = "Spatial", # specify name of the initial assay
slice = "slice1", # specify name of the stored image
filter.matrix = TRUE,
to.upper = FALSE
)
brain_data
## An object of class Seurat
## 32285 features across 3289 samples within 1 assay
## Active assay: Spatial (32285 features, 0 variable features)
Seurat
objectA Seurat
object serves as a container that contains both data (like the count matrix) and analysis (like dimension reduction or clustering results) for a single-cell dataset.
Extract the dimension of the active assay using dim
, nrow
, or ncol
.
dim(x = brain_data) # the number of features (genes) by samples (spots)
## [1] 32285 3289
nrow(x = brain_data) # the number of features
ncol(x = brain_data) # the number of samples
Extract the feature and sample names using rownames
or colnames
.
head(x = rownames(brain_data), n = 5)
## [1] "Xkr4" "Gm1992" "Gm19938" "Gm37381" "Rp1"
tail(x = colnames(brain_data), n = 5)
Sample-level metadata is stored as a data.frame
, where each row correspond to one sample (e.g. cell or spot) and each column correspond to one sample-level metadata field. It can be accessed via [[
extract operator, the meta.data
object, or the $
sigil ($
extracts one single column at a time). Row names in the metadata need to match the column names of the counts matrix.
class(brain_data[[]])
## [1] "data.frame"
colnames(brain_data[[]]) # automatically calculated while creating the Seurat object
## [1] "orig.ident" "nCount_Spatial" "nFeature_Spatial"
head(brain_data@meta.data)
## orig.ident nCount_Spatial nFeature_Spatial
## AAACAAGTATCTCCCA-1 SeuratProject 14786 4811
## AAACACCAATAACTGC-1 SeuratProject 9009 2748
## AAACAGCTTTCAGAAG-1 SeuratProject 7853 2665
## AAACAGGGTCTATATT-1 SeuratProject 20932 5334
## AAACAGTGTTCCTGGG-1 SeuratProject 42740 8115
## AAACATTTCCCGGATT-1 SeuratProject 10320 3834
brain_data$nCount_Spatial[1:3]
## AAACAAGTATCTCCCA-1 AAACACCAATAACTGC-1 AAACAGCTTTCAGAAG-1
## 14786 9009 7853
# nFeature_Spatial: the number of unique genes in each sample
sum(brain_data$nFeature_Spatial == colSums(brain_data@assays$Spatial@counts > 0))
## [1] 3289
# nCount_Spatial: the total number of detected molecules in each sample
sum(brain_data$nCount_Spatial == colSums(brain_data@assays$Spatial@counts))
## [1] 3289
Assay
) together with feature-level metadata# A vector of names of associated objects can be had with the names function
# These can be passed to the double [[ extract operator to pull them from the Seurat object
names(x = brain_data)
## [1] "Spatial" "slice1"
brain_data[['Spatial']] # equivalent to: brain_data@assays$Spatial
## Assay data with 32285 features for 3289 cells
## First 10 features:
## Xkr4, Gm1992, Gm19938, Gm37381, Rp1, Sox17, Gm37587, Gm37323, Mrpl15,
## Lypla1
brain_data[['slice1']] # equivalent to: brain_data@images$slice1
## Spatial data from the VisiumV1 technology for 3289 samples
## Associated assay: Spatial
## Image key: slice1_
Each Seurat
object consists of one or more Assay objects (basis unit of Seurat
) as individual representations of single-cell expression data. Examples of the Assay objects include RNA
, ADT
in CITE-seq, or Spatial
. Each Assay stores multiple slots, including raw (counts
), normalized (data
) and scaled data (scaled.data
) as well as a vector of variable features (var.features
) and feature-level metadata (meta.features
).
brain_data@assays$Spatial@counts[5:10, 1:3]
## 6 x 3 sparse Matrix of class "dgCMatrix"
## AAACAAGTATCTCCCA-1 AAACACCAATAACTGC-1 AAACAGCTTTCAGAAG-1
## Rp1 . . .
## Sox17 . . .
## Gm37587 . . .
## Gm37323 . . .
## Mrpl15 1 . 1
## Lypla1 1 . .
Feature-level metadata is associated with each individual assay. Feature-level metadata can be accessed through the double bracket [[
extract operator on the Assay objects, or the meta.features
slot.
brain_data[['Spatial']]@meta.features
head(brain_data[['Spatial']][[]])
Other objects containing the single-cell data analysis results will be discussed later after going through the analysis pipeline.
The general steps to perform preprocessing, dimensiona reduction and clustering for spatial transcriptomics data are quite similar to the Seurat
workflow analyzing single-cell RNA sequencing data.
A standard preprocessing workflow includes the selection and filtration of cells based on quality control (QC) metrics, data normalization and (optional) scaling, and the detection of highly variable features.
A few common QC metrics include
nFeature_Spatial
).nCount_Spatial
).brain_data[["percent.mt"]] <- PercentageFeatureSet(brain_data,
pattern = "^mt-")
VlnPlot(
brain_data, features = c("nFeature_Spatial", "nCount_Spatial", "percent.mt"),
pt.size = 0.1, ncol = 3) &
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
# Jointly (rather than separately) consider the QC metrics when filtering
plot1 <- FeatureScatter(
brain_data, feature1 = "nCount_Spatial", feature2 = "percent.mt") + NoLegend()
plot2 <- FeatureScatter(
brain_data, feature1 = "nCount_Spatial", feature2 = "nFeature_Spatial") +
NoLegend()
plot1 + plot2
brain_subset <- subset(
brain_data,
subset = nFeature_Spatial < 8000 & nFeature_Spatial > 1000 &
nCount_Spatial < 50000 & percent.mt < 30)
print(paste("Filter out", ncol(brain_data) - ncol(brain_subset),
"samples because of the outlier QC metrics, with", ncol(brain_subset),
"samples left."))
## [1] "Filter out 96 samples because of the outlier QC metrics, with 3193 samples left."
The variance of molecular counts expresses spatial heterogeneity which cannot be solely explained by technical noise. Satija Lab and Collaborators recommends normalization using SCTransform
(Hafemeister and Satija, 2019) in order to account for technical bias while preserving true biological differences.
SpatialFeaturePlot(
brain_subset, features = c("nFeature_Spatial", "nCount_Spatial", "percent.mt")) &
theme(legend.position = "bottom")
brain_norm <- SCTransform(brain_subset, assay = "Spatial", verbose = FALSE)
names(brain_norm)
## [1] "Spatial" "SCT" "slice1"
dim(brain_norm@assays$SCT@counts)
## [1] 17980 3193
dim(brain_norm@assays$SCT@data)
## [1] 17980 3193
dim(brain_norm@assays$SCT@scale.data)
## [1] 3000 3193
SCTransform
returns the Seurat
object with a new assay called SCT
, where its counts
slot stores the corrected UMI counts, the data
slot stores the log-normalized version of the corrected UMI counts, and scale.data
slot stores the pearson residuals (normalized values) and is used as PCA input.
The scale.data
slot in output assay are subset to contain only the variable genes with return.only.var.genes = TRUE
by default.
brain_obj <- RunPCA(brain_norm, assay = "SCT", verbose = FALSE)
# compute K nearest neighbors (KNN)
brain_obj <- FindNeighbors(brain_obj, reduction = "pca", dims = 1:30)
# Leiden algorithm for community detection
brain_obj <- FindClusters(brain_obj, verbose = FALSE)
# PCA result is the default UMAP input, use dimensions 1:30 as input features
brain_obj <- RunUMAP(brain_obj, reduction = "pca", dims = 1:30)
plot3 <- DimPlot(brain_obj, reduction = "umap", label = TRUE) + NoLegend()
plot4 <- SpatialDimPlot(brain_obj, label = TRUE, label.size = 3) + NoLegend()
plot3 + plot4
The reductions
object stores a dimensionality reduction taken out in Seurat
; each slot in reductions
consists of a cell embeddings matrix, a feature loadings matrix, and a projected feature loadings matrix.
brain_obj@reductions
## $pca
## A dimensional reduction object with key PC_
## Number of dimensions: 50
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: SCT
##
## $umap
## A dimensional reduction object with key UMAP_
## Number of dimensions: 2
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: SCT
Ordinary differential expression analysis does not take spatial information into account, and thus it is generally applicable for both scRNA-seq data and spatial transcriptomics data to explore cellular heterogeneity of cell types and states.
# identity class of each sample
table(brain_obj@active.ident)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 414 404 301 282 227 210 210 195 191 160 140 129 113 74 70 43 30
# find all markers of cluster 10
cluster10_markers <- FindMarkers(brain_obj, ident.1 = 10, min.pct = 0.25)
head(cluster10_markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Esr1 4.302916e-154 0.645829 0.471 0.021 7.736643e-150
## Trh 1.336079e-140 2.235272 0.557 0.041 2.402270e-136
## Ccn3 8.237450e-118 2.262304 1.000 0.253 1.481094e-113
## Pcdh8 8.330506e-115 1.562598 0.979 0.237 1.497825e-110
## Trhr 1.723452e-114 0.752990 0.543 0.048 3.098767e-110
VlnPlot(brain_obj, features = c("Esr1", "Trh", "Ccn3"))
SpatialFeaturePlot(object = brain_obj,
features = rownames(cluster10_markers)[1:3],
alpha = c(0.1, 1), ncol = 3)
The following code chunk is not evaluated due to computational time constraints, but you are encouraged to give it a try if interested.
# find markers for every cluster compared to all remaining cells,
# report only the positive ones
# this code chunk is not evaluated for now because of time constraints
brain_obj_markers <- FindAllMarkers(brain_obj, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25)
brain_obj_markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
top10 <- brain_obj_markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
DoHeatmap(brain_obj, features = top10$gene) + NoLegend()
Seurat
is able to calculate two metrics in order for the users to identify spatially variable genes, namely Moran’s I statistic and marker-variogram. Spatial coordinates of the samples are incorporated to identify features with spatial heterogeneity in its expression.
Moran’s I is a global metric measuring the correlation of gene expression values between local observed values and the average of neighboring values. We can interpret Moran’s I as a spatial autocorrelation metric similar to the Pearson correlation coefficient in the context of spatial statistics. We can either calculate the z-score under normality assumption or perform permutation test to see whether we can reject the null hypothesis of zero spatial autocorrelation.
\[\begin{align*} \text{Moran's I} = \frac{N}{\sum_{i, j}W_{ij}}\frac{\sum_{i}\sum_{j}W_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2} \end{align*}\] , where \(N\) is the total number of spatial location units indexed by \((i, j)\), and \(W\) is a weight matrix to be discussed below. Recall that the Pearson correlation coefficient is \[\begin{align*} r= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \end{align*}\]
Our assumptions about the spatial neighboring relationship can be encoded with a weight matrix \(W\), where \(W_{i, j}\) can be either contiguity-based, or distance-based (inverse distance). For instance, the following figure illustrates potential cases for higher-order contiguity, when a first-order contiguity is determined based on the existence of a shared boundary or border.
brain_moransi <- FindSpatiallyVariableFeatures(
brain_obj, assay = "SCT",
features = VariableFeatures(brain_obj)[1:10],
selection.method = "moransi")
## Computing Moran's I
## For a more efficient implementation of the Morans I calculation,
## (selection.method = 'moransi') please install the Rfast2 package
## --------------------------------------------
## install.packages('Rfast2')
## --------------------------------------------
## After installation of Rfast2, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
moransi_output_df <- brain_moransi@assays$SCT@meta.features %>%
na.exclude
head(moransi_output_df[order(moransi_output_df$MoransI_observed, decreasing = T), ])
## MoransI_observed MoransI_p.value moransi.spatially.variable
## Pcp2 0.6926961 0 TRUE
## Car8 0.5745839 0 TRUE
## Ttr 0.5635015 0 TRUE
## Mbp 0.4965477 0 TRUE
## Plp1 0.4491421 0 TRUE
## Ptgds 0.2900585 0 TRUE
## moransi.spatially.variable.rank
## Pcp2 1
## Car8 2
## Ttr 3
## Mbp 4
## Plp1 5
## Ptgds 6
top_features_moransi <- head(
SpatiallyVariableFeatures(brain_moransi,
selection.method = "moransi"), 3)
SpatialFeaturePlot(brain_moransi,
features = top_features_moransi, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = "Top 3 genes with the largest Moran's I",
subtitle = "among 10 top variable genes for illustration purposes")
bottom_features_moransi <- tail(
SpatiallyVariableFeatures(brain_moransi,
selection.method = "moransi"), 3)
SpatialFeaturePlot(brain_moransi,
features = bottom_features_moransi, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = "Bottom 3 genes with the smallest Moran's I",
subtitle = "among 10 top variable genes for illustration purposes")
A gene-specific empirical variogram \(v\) for gene \(g\) can be calculated as a function of pair-wise distance \(r\) between any two samples \(i\) and \(j\). A permutation test can be performed to see whether such variogram \(v(r)\) is associated with the spatial distance \(r\) in trendsceek
.
\[\begin{align*} v(r) = \mathbb{E}_{\forall i, j, \text{ s.t. dist}(i, j) = r}\left[(x_i - x_j)^2\right] \end{align*}\] , where \(x_i, x_j\) stand for the gene expression of gene \(g\) in sample \(i, j\) respectively.
brain_variogram <- FindSpatiallyVariableFeatures(
brain_obj, assay = "SCT",
features = VariableFeatures(brain_obj)[1:10],
selection.method = "markvariogram")
variogram_output_df <- brain_variogram@assays$SCT@meta.features %>%
na.exclude # there are NA rows b/c we only calculated the variogram for 10 genes
head(variogram_output_df[order(variogram_output_df$r.metric.5), ])
## r.metric.5 markvariogram.spatially.variable
## Ttr 0.08242632 TRUE
## Pcp2 0.08359531 TRUE
## Mbp 0.15537902 TRUE
## Car8 0.16478140 TRUE
## Plp1 0.25534753 TRUE
## Ptgds 0.35968268 TRUE
## markvariogram.spatially.variable.rank
## Ttr 1
## Pcp2 2
## Mbp 3
## Car8 4
## Plp1 5
## Ptgds 6
The column named r.metric.5
stores the expected values of mark-variogram at a given r
(radius) value (\(r = 5\) by default) when the marks attached to different points are independent (i.e., there is no spatial autocorrelation).
top_features_variogram <- head(
SpatiallyVariableFeatures(brain_variogram,
selection.method = "markvariogram"), 3)
SpatialFeaturePlot(brain_variogram,
features = top_features_variogram, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = "3 genes with the top spatially variable rank (by mark-variogram)",
subtitle = "among 10 top variable genes for illustration purposes")
bottom_features_variogram <- tail(
SpatiallyVariableFeatures(brain_variogram,
selection.method = "markvariogram"), 3)
SpatialFeaturePlot(brain_variogram,
features = bottom_features_variogram, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = "3 genes with the bottom spatially variale rank (by mark-variogram)",
subtitle = "among 10 top variable genes for illustration purposes")
spatialDE
trendsceek
SPARK
together with its updated version SPARK-X
MERINGUE
To convert single-cell data objects between R (e.g. Seurat
or SingleCellExperiment
) and Python (e.g. AnnData
), you can use either SeuratDisk
or sceasy
. Another option is to use reticulate
, which could be a bit more involved and take more memory.
SeuratDisk
vignette to convert between Seurat
and AnnData
: https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.htmlsceasy
to convert single-cell datasets across different formats: https://github.com/cellgeni/sceasyreticulate
vignette to call Python from R and convert data formats: https://github.com/cellgeni/sceasyThe tutorial was inspired by the computational assignments I created together with Dr. Yun S. Song in a graduate course (CMPBIO 290: Algorithms for single-cell genomics) at University of California, Berkeley in Fall 2021, and it was largely based on many open-source materials, especially the Seurat tutorials from the Satija Lab. I would also like to thank Salwan Butrus for helpful feedback and suggestions.
References
Seurat
: https://satijalab.org/seurat/articles/spatial_vignette.htmlSeurat
guided clustering tutorial for scRNA-seq data: https://satijalab.org/seurat/articles/pbmc3k_tutorial.htmlsessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] limma_3.48.3 patchwork_1.1.1 ggplot2_3.3.5 dplyr_1.0.7
## [5] SeuratObject_4.0.2 Seurat_4.0.3
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 colorspace_2.0-2 deldir_0.2-10
## [4] ellipsis_0.3.2 ggridges_0.5.3 spatstat.data_2.1-0
## [7] farver_2.1.0 leiden_0.3.9 listenv_0.8.0
## [10] bit64_4.0.5 ggrepel_0.9.1 RSpectra_0.16-0
## [13] fansi_0.5.0 codetools_0.2-18 splines_4.1.0
## [16] knitr_1.33 polyclip_1.10-0 jsonlite_1.7.2
## [19] ica_1.0-2 cluster_2.1.2 png_0.1-7
## [22] uwot_0.1.10 shiny_1.6.0 sctransform_0.3.2
## [25] spatstat.sparse_2.0-0 compiler_4.1.0 httr_1.4.2
## [28] assertthat_0.2.1 Matrix_1.3-4 fastmap_1.1.0
## [31] lazyeval_0.2.2 later_1.2.0 htmltools_0.5.1.1
## [34] tools_4.1.0 igraph_1.2.6 gtable_0.3.0
## [37] glue_1.4.2 RANN_2.6.1 reshape2_1.4.4
## [40] Rcpp_1.0.7 scattermore_0.7 jquerylib_0.1.4
## [43] vctrs_0.3.8 ape_5.5 nlme_3.1-152
## [46] lmtest_0.9-38 xfun_0.24 stringr_1.4.0
## [49] globals_0.14.0 mime_0.11 miniUI_0.1.1.1
## [52] lifecycle_1.0.1 irlba_2.3.3 goftest_1.2-2
## [55] future_1.21.0 MASS_7.3-54 zoo_1.8-9
## [58] scales_1.1.1 spatstat.core_2.3-0 promises_1.2.0.1
## [61] spatstat.utils_2.2-0 parallel_4.1.0 RColorBrewer_1.1-2
## [64] yaml_2.2.1 reticulate_1.20 pbapply_1.4-3
## [67] gridExtra_2.3 sass_0.4.0 rpart_4.1-15
## [70] stringi_1.7.3 highr_0.9 rlang_0.4.11
## [73] pkgconfig_2.0.3 matrixStats_0.60.0 evaluate_0.14
## [76] lattice_0.20-44 ROCR_1.0-11 purrr_0.3.4
## [79] tensor_1.5 labeling_0.4.2 htmlwidgets_1.5.3
## [82] bit_4.0.4 cowplot_1.1.1 tidyselect_1.1.1
## [85] parallelly_1.27.0 RcppAnnoy_0.0.19 plyr_1.8.6
## [88] magrittr_2.0.1 R6_2.5.1 generics_0.1.0
## [91] DBI_1.1.1 pillar_1.6.3 withr_2.4.2
## [94] mgcv_1.8-36 fitdistrplus_1.1-5 survival_3.2-11
## [97] abind_1.4-5 tibble_3.1.4 future.apply_1.7.0
## [100] crayon_1.4.1 hdf5r_1.3.3 KernSmooth_2.23-20
## [103] utf8_1.2.2 spatstat.geom_2.2-2 plotly_4.9.4.1
## [106] rmarkdown_2.9 grid_4.1.0 data.table_1.14.0
## [109] digest_0.6.28 xtable_1.8-4 tidyr_1.1.3
## [112] httpuv_1.6.1 munsell_0.5.0 viridisLite_0.4.0
## [115] bslib_0.2.5.1