Overview
Now that you’re comfortable with bulk RNA-seq data analysis, we’ll shift our focus to the rapidly developing landscape of single cell RNA-seq (scRNA-seq). In this lecture, you’ll learn about the underlying technology, and we’ll demonstrate how to process raw single cell data directly on your laptop (!) for import into R/Bioconductor. You’ll then use Seurat to analyze scRNA-seq data, including carrying out dimensional reduction and visualization with UMAP, identifying cell clusters and cluster-specific marker genes, and integrating data from multiple samples.
Learning objectives
- Understand droplet-based scRNA-seq technology
- Be able to compare and contrast single cell and bulk RNA-seq methods
- Understand cost and experimental design considerations for scRNA-seq experiments
- Familiarity with multiplexed single cell assays (CITE-seq, ‘multiome’, TEA-seq)
- Be able to define common terms and concepts in single cell genomics
- Use Kallisto-BUStools to preprocess raw scRNA-seq data (via kb-python)
- Be able to import preprocessed data into R and create a Seurat object
- Carry out filtering using DropletUtils
- Generate a Quality Control report of your scRNA-seq data directly within R
- Use standard QC metrics and plots to filter your data
- Generate clusters and visualize via UMAP dimensional reduction
- Find cluster-specific marker genes with Seurat
- Annotate unknown clusters using public databases, CellAssign, and SingleR
- Integrate multiple samples and use sample details to analyze integrated data
What you need to do
Raw fastq files. You will need about 5 GB of storage space on your hard drive to accommodate this download. Please do not uncompress these files (leave them as .gz files). This is data from 1000 peripheral blood mononuclear cells (PBMCs) and is one of the sample datasets provided by 10X Genomics here. I merged the separate lane files to make this simpler to work with for the course. If you are unable to carry out read mapping with kb-python, you can instead download the pre-processed data for 1000 PBMCs. This ensures that everyone can follow along with this lecture, regardless of whether you were able to install or use kb-python.
Human transcriptome reference index file - this is the index you created using Kallisto way back in lecture 2. If you don’t have this, remember it’s easy to create using kallisto index.
t2g.txt - this is a human transcript-to-gene mapping file that we will use with Kallisto-BUStools to preprocess our data. This file is easy to generate with kb ref, but downloading it now will save you some time.
kb-python - You will need to have this software installed in a Conda environment on your laptop. We did this way back in lecture 1. If you are unable to install or use kb-python, just follow along with the lecture so you understand the concepts.
DIY_scRNAseq_basic.R - this is the R script that we’ll use for this lecture.
functions.R - this is the custom R function we’ll use for generating a QC report with our scRNA-seq data (see Reading material below for source).
Seurat objects - this folder contains two Seurat objects from an unpublished mouse experiment (courtesy of Chris Hunter’s lab). One sample is from a naive control mouse, while the second is from a mouse infected with the protozoan parasite Toxoplasma gondii (14 days post-infection). We’ll use these data in the second half of the lecture to practice integration and differential gene testing between conditions.
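To give a rough idea of how the files listed above fit together, here is a minimal Python sketch of the kb-python preprocessing step. It only assembles and prints a `kb count` command and does not run it. The file names, the output folder, the thread count, and the `10XV3` chemistry are all assumptions made for illustration, so swap in whatever matches your own download and index. The small `parse_t2g` helper is also hypothetical: it assumes the first two tab-separated columns of t2g.txt are the transcript ID and the gene ID.

```python
import shlex


def build_kb_count_command(index, t2g, fastqs, out_dir="counts",
                           technology="10XV3", threads=4):
    """Assemble a `kb count` call as an argument list (nothing is executed)."""
    return [
        "kb", "count",
        "-i", str(index),       # kallisto index built from the transcriptome
        "-g", str(t2g),         # transcript-to-gene mapping file
        "-x", technology,       # single cell chemistry (assumed here)
        "-o", str(out_dir),     # output folder (hypothetical name)
        "-t", str(threads),     # number of threads
        *[str(f) for f in fastqs],
    ]


def parse_t2g(lines):
    """Map transcript ID -> gene ID from tab-separated t2g lines.

    Assumes column 1 is the transcript ID and column 2 is the gene ID;
    any further columns are ignored.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


# Hypothetical file names, used only to show the shape of the command.
cmd = build_kb_count_command(
    index="transcriptome.index",
    t2g="t2g.txt",
    fastqs=["pbmc_1k_R1.fastq.gz", "pbmc_1k_R2.fastq.gz"],
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Once the printed command looks right for your files, you can paste it into a terminal with your kb-python Conda environment activated.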
Lecture videos
Part 1 – Intro to single cell RNA-seq (scRNA-seq)
Part 2 – Practical considerations for single cell experiments
Part 3 – Pre-processing scRNA-seq data using Kallisto-bustools
Part 4 – Importing scRNA-seq data into R and carrying out basic QA analysis.
Part 5 – Dimensional reduction with UMAP, and cluster identification
Part 6 – Integration of multiple samples and working with sample metadata
Reading
Modular and efficient pre-processing of single-cell RNA-seq - describes the full Kallisto-Bustools workflow for memory efficient processing of scRNA-seq data.
The barcode, UMI, set format and BUStools, Bioinformatics - Describes the BUS format as an efficient and platform-independent way to store information from scRNA-seq data.
A curated database reveals trends in single-cell transcriptomics - describes the growing collection of scRNA-seq experiments found here, which I used to produce two of the plots in the slides for this lecture.
EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data - This is the paper describing the DropletUtils package that we use in this lecture to identify empty drops.
Sarah Ennis’ GitHub repo for preprocessing scRNA-seq data - This is the source of the custom script we use to generate the CellRanger-esque HTML QC report.
Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling - Describes the CellAssign algorithm and R package that we use to identify clusters.
Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage – describes the SingleR and celldex packages that allow us to leverage bulk RNA-seq data in public repositories to curate clusters in our scRNA-seq.
Comprehensive Integration of Single-Cell Data - This 2019 paper describes the underlying statistical approach for data integration in Seurat.