Image credit: Brett Ryder, The Economist

Lecture slides

Homework: Data visualization with ggplot2 - part 1 (~4hrs) - due April 3rd


In this class we’ll discuss how you can use R/Bioconductor to tap into vast amounts of RNAseq data available through the Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO).


  • Learn about fasterq-dump
  • Learn about the HDF5 file format
  • Explore the ARCHS4 database programmatically


Step 4 script


ARCHS4 data in HDF5 format - Download the human_matrix.h5 v6 and the mouse_matrix.h5 v6. These HDF5 files contain RNAseq data already aligned using Kallisto for 133,776 and 170,010 samples, respectively. Note that each of these files will take about 4 GB of space on your hard drive.


Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications, April 2018. Describes the ARCHS4 resource from Avi Ma’ayan’s lab, which provides convenient access to public RNAseq datasets already prealigned with Kallisto. You can access the data either through the ARCHS4 website or in R using the rhdf5 package.

Digital Expression Explorer 2: a repository of 4.5 trillion uniformly processed RNA-seq reads and counting - Similar to ARCHS4, the DEE2 project leverages Kallisto and GEO/SRA to make hundreds of thousands of samples readily available to you, either through their website or through R using the DEE2 package.

Activity of Uncleaved Caspase-8 Controls Anti-bacterial Immune Defense and TLR-Induced Cytokine Production Independent of Cell Death, October 2016. This paper contains the data we’ll retrieve from a public gene expression repository. The data is available here.


ARCHS4 video describing how HDF5 files were created from gene expression data (13 min)