Image credit: Brett Ryder, The Economist

Lecture slides on iCloud


In this class we’ll discuss how you can use R/Bioconductor to tap into the vast amounts of RNA-seq data available through the Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO).

Learning objectives

  • Learn about fasterq-dump
  • Learn about the HDF5 file format
  • Explore the ARCHS4 database programmatically
  • Start and finish the Step 4 script


Step 4 script


The ARCHS4 database is available in HDF5 format for mouse and human. These HDF5 files contain RNA-seq data already aligned using Kallisto for 405,640 mouse samples and 348,184 human samples, respectively. Note that each file will take about 12GB of space on your hard drive. For the purposes of this lecture, you only need to download the mouse data.
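To get a feel for how an HDF5 file is organized before committing to the full download, the sketch below builds a tiny stand-in file with the rhdf5 package and reads back only a subset of its columns. The group and dataset names (`meta/genes`, `meta/samples`, `data/expression`), the gene symbols, and the sample accessions are all made up for illustration and should not be taken as the actual ARCHS4 layout. Run `h5ls()` on the real file to see its true structure.

```r
library(rhdf5)

# Build a small stand-in HDF5 file in a temporary location.
# NOTE: these group/dataset names are hypothetical, not the ARCHS4 layout.
toy_file <- tempfile(fileext = ".h5")
h5createFile(toy_file)
h5createGroup(toy_file, "meta")
h5createGroup(toy_file, "data")

# Made-up gene symbols, sample IDs, and counts (genes in rows, samples in columns).
genes   <- c("Casp8", "Tnf", "Il6")
samples <- c("GSM0000001", "GSM0000002", "GSM0000003", "GSM0000004")
counts  <- matrix(1:12, nrow = length(genes), ncol = length(samples))

h5write(genes,   toy_file, "meta/genes")
h5write(samples, toy_file, "meta/samples")
h5write(counts,  toy_file, "data/expression")

# List the groups and datasets in the file without loading the data itself.
h5ls(toy_file)

# The metadata vectors are small, so read them in full.
all_genes   <- as.character(h5read(toy_file, "meta/genes"))
all_samples <- as.character(h5read(toy_file, "meta/samples"))

# Find the columns for the samples of interest, then read only those columns.
# In the index list, NULL means "take everything along this dimension".
wanted <- c("GSM0000002", "GSM0000004")
cols   <- which(all_samples %in% wanted)
subset_counts <- h5read(toy_file, "data/expression", index = list(NULL, cols))

rownames(subset_counts) <- all_genes
colnames(subset_counts) <- all_samples[cols]
subset_counts
```

The point of the `index` argument is that only the requested slice is read from disk, which is what makes a file of roughly 12GB workable on a laptop: you look up your samples in the metadata first, then pull just those columns of expression data.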

Lecture videos

Part 1 - Lecture covering how to access public RNA-seq data

Part 2 - Working through the Step 4 script to access the ARCHS4 database


Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications, April 2018. Describes the ARCHS4 resource from Avi Ma’ayan’s lab, which provides convenient access to public RNA-seq datasets that have already been aligned with Kallisto. You can access the data either through the ARCHS4 website or from R using the rhdf5 package.

Digital Expression Explorer 2: a repository of 4.5 trillion uniformly processed RNA-seq reads and counting. Similar to ARCHS4, the DEE2 project leverages Kallisto and GEO/SRA to make hundreds of thousands of samples readily available to you, either through their website or through R using the DEE2 package.

Activity of Uncleaved Caspase-8 Controls Anti-bacterial Immune Defense and TLR-Induced Cytokine Production Independent of Cell Death, October 2016. This paper contains the data we’ll retrieve from a public gene expression repository. The data is available here.

Other video

ARCHS4 video describing how HDF5 files were created from gene expression data (13 min)