Accessing public data

Image credit: Brett Ryder, The Economist

Overview

In this class we’ll discuss how you can use R/Bioconductor to tap into the vast amount of RNA-seq data available through the Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO).

Learning objectives

Learn about fasterq-dump
Learn about HDF5 file format
Explore the ARCHS4 database programmatically
Start and finish the Step 4 script

Code

Step 4 script

Downloads

ARCHS4 database in HDF5 format for mouse and human. These HDF5 files contain RNA-seq data already aligned using Kallisto for 1,072,742 and 961,625 samples, respectively. Note that each file takes roughly 40-50 GB of space on your hard drive. For the purposes of this lecture, you only need to download the mouse data.
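Because files of this size cannot sensibly be loaded into memory whole, the usual pattern with HDF5 is to list the file's contents, read the small metadata vectors, and then pull only the samples you need. The sketch below illustrates that pattern with the rhdf5 package on a tiny toy file that it creates itself, so it runs without the large download. The dataset names (`meta/genes`, `meta/samples`, `data/expression`), the sample labels, and the values are made up for illustration and are not claimed to match the layout of the ARCHS4 files; running `h5ls()` on the real file shows its actual dataset names.

```r
library(rhdf5)

# Build a tiny toy HDF5 file. The group and dataset names here are
# hypothetical placeholders, not the real ARCHS4 layout.
toy_file <- tempfile(fileext = ".h5")
h5createFile(toy_file)
h5createGroup(toy_file, "meta")
h5createGroup(toy_file, "data")

genes   <- c("Casp8", "Tnf", "Il1b")
samples <- c("sample_1", "sample_2", "sample_3", "sample_4")
expr    <- matrix(1:12, nrow = length(genes), ncol = length(samples))

h5write(genes,   toy_file, "meta/genes")
h5write(samples, toy_file, "meta/samples")
h5write(expr,    toy_file, "data/expression")

# Step 1: list the contents of the file without loading any data.
contents <- h5ls(toy_file)
print(contents)

# Step 2: read the (small) metadata and locate the samples of interest.
all_samples <- as.character(h5read(toy_file, "meta/samples"))
all_genes   <- as.character(h5read(toy_file, "meta/genes"))
my_samples  <- c("sample_2", "sample_4")
sample_idx  <- which(all_samples %in% my_samples)

# Step 3: read only the matching columns of the expression matrix.
# In `index`, NULL selects every element along that dimension.
sub_expr <- h5read(toy_file, "data/expression",
                   index = list(NULL, sample_idx))
rownames(sub_expr) <- all_genes
colnames(sub_expr) <- all_samples[sample_idx]

h5closeAll()
print(sub_expr)
```

The same three steps should carry over to a large archive: only the metadata and the selected columns are read from disk, so memory use depends on the size of the subset rather than on the size of the file.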

Lecture videos

Part 1 - Lecture covering how to access public RNA-seq data

Part 2 - Working through the Step 4 script to access the ARCHS4 database

Reading

Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications, April 2018. Describes the ARCHS4 resource from Avi Ma’ayan’s lab, which provides convenient access to public RNA-seq datasets that have already been aligned with Kallisto. You can access the data either through the ARCHS4 website or using the rhdf5 package.

Digital Expression Explorer 2: a repository of 4.5 trillion uniformly processed RNA-seq reads and counting. Similar to ARCHS4, the DEE2 project leverages Kallisto and GEO/SRA to make hundreds of thousands of samples readily available to you, either through their website or through R using the DEE2 package.

Activity of Uncleaved Caspase-8 Controls Anti-bacterial Immune Defense and TLR-Induced Cytokine Production Independent of Cell Death, Oct, 2016. This paper contains the data we’ll retrieve from a public gene expression repository. The data is available here