Image credit: Brett Ryder, The Economist

Lecture slides on iCloud

Homework #3: Data visualization with ggplot2 - part 1 (~4hrs) - due by start of class on Wednesday, May 13th.


In this class we’ll discuss how to use R/Bioconductor to tap into the vast amounts of RNA-seq data available through the Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO).

Learning objectives

  • Learn about fasterq-dump
  • Learn about the HDF5 file format
  • Explore the ARCHS4 database programmatically
  • Start and finish the Step 4 script


Step 4 script


The ARCHS4 database is available in HDF5 format for human and mouse. These HDF5 files contain RNA-seq data already aligned using Kallisto for 238,522 and 284,907 samples, respectively. Note that each of these files will take up about 7GB of space on your hard drive. For the purposes of this lecture, you only need to download the mouse data.

Lecture video

Part 1 - Lecture covering how to access public RNA-seq data

Part 2 - Working through the Step 4 script to access the ARCHS4 database


Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications, April, 2018. Describes the ARCHS4 resource from Avi Ma’ayan’s lab that provides convenient access to public RNAseq datasets, already prealigned with Kallisto. You can access data either through the ARCHS4 website or using the rhdf5 package.
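To give a feel for what reading an HDF5 file with rhdf5 looks like, here is a minimal sketch that builds a tiny stand-in file and then pulls out only the samples of interest. The group and dataset names (meta/genes, meta/samples, data/expression), the gene symbols, and the GSM IDs are all made up for illustration and are not guaranteed to match the layout of the real ARCHS4 files, so run h5ls() on your downloaded file first to see its actual structure.

```r
library(rhdf5)

# Build a tiny stand-in HDF5 file in a temporary location.
# NOTE: the group/dataset names below are hypothetical placeholders,
# not the confirmed layout of the ARCHS4 download.
toy_file <- tempfile(fileext = ".h5")
h5createFile(toy_file)
h5createGroup(toy_file, "meta")
h5createGroup(toy_file, "data")

genes   <- c("Casp8", "Tnf", "Il6")                        # made-up gene list
samples <- c("GSM0001", "GSM0002", "GSM0003", "GSM0004")   # made-up sample IDs
counts  <- matrix(1:12, nrow = length(genes), ncol = length(samples))

h5write(genes,   toy_file, "meta/genes")
h5write(samples, toy_file, "meta/samples")
h5write(counts,  toy_file, "data/expression")

# List the contents of the file without loading any expression data
h5ls(toy_file)

# Read only the (small) metadata, then locate the samples we want
all_samples <- as.character(h5read(toy_file, "meta/samples"))
all_genes   <- as.character(h5read(toy_file, "meta/genes"))
wanted      <- c("GSM0002", "GSM0004")
sample_idx  <- which(all_samples %in% wanted)

# Read just those columns; index = list(rows, columns), NULL means "all"
expr <- h5read(toy_file, "data/expression", index = list(NULL, sample_idx))
rownames(expr) <- all_genes
colnames(expr) <- all_samples[sample_idx]

h5closeAll()
expr
```

The point of the index argument is that only the requested slice is read from disk, which is what makes it practical to work with a multi-gigabyte file on a laptop.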

Digital Expression Explorer 2: a repository of 4.5 trillion uniformly processed RNA-seq reads and counting. Similar to ARCHS4, the DEE2 project leverages Kallisto and GEO/SRA to make hundreds of thousands of samples readily available to you, either through their website or through R using the DEE2 package.

Activity of Uncleaved Caspase-8 Controls Anti-bacterial Immune Defense and TLR-Induced Cytokine Production Independent of Cell Death, Oct, 2016. This paper contains the data we’ll retrieve from a public gene expression repository. The data is available here

Other video

ARCHS4 video describing how HDF5 files were created from gene expression data (13 min)