Image credit: Picture-alliance/AP Photo/M. Schreiber

Corresponding lecture

Lecture 8 – Differential gene expression

Description

You will be given raw counts and a study design file for an extensive RNA-seq exploration of the host response to SARS-CoV-2 and related viruses. The data span different cell lines, tissues and time points, and include human primary samples as well as in vivo studies in ferrets. It is an incredible dataset from Benjamin tenOever’s lab at Mt. Sinai! You should check out their recent Cell paper: Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19.

This challenge certainly falls into the category of ‘easier said than done’.

Your challenge is to look at the study design file and decide on a question YOU would like to ask using this highly multivariate dataset, and then carry out your analysis using code from the course.

Logistics

You have a full week to complete this lab. Use this time to work through the dataset either alone, or together with your classmates.

In lab next week, you will have the first ~45 min of class to put together the pieces of the figure that you spent the week working on. The remainder of class will be devoted to presenting your figure to the class!

Learning objectives

  • Thinking about what you want to ask with a large dataset, learning to prioritize questions when there are many possible things you could ask, and then constructing your analysis around one or a few key questions.
  • Flying solo on a data analysis project, then coming together with collaborators at different points in the project to see how colleagues differ in their approach and perspective, and incorporating these different perspectives into a final product. This is a key part of the research process when large datasets are involved.

What you’ll need to get started

Human expression data - raw counts as a table, obtained directly from the public GEO record.

Ferret expression data - raw counts as a table, obtained directly from the public GEO record.

Study design file - Manually assembled and curated from sample descriptions in GEO.

Rscript - to get you over the hump of importing the three files above into R.

Tips

  • You can ignore the step 1 script for this challenge since we didn’t align the data, but rather got raw counts directly from the authors’ entry in the Gene Expression Omnibus repository (GEO).
  • Since the data is already in the form of a count table (genes as rows and samples as columns), you don’t need to worry about annotations either. Go straight to creating a DGEList object.
  • dplyr is going to be critical in this challenge, as you will need to wrangle the study design file and the raw count tables to get what you need to address your question(s).
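
The last two tips can be sketched roughly as follows. This is only an illustration on toy data, not the course script: the column names (`sample`, `cell_line`, `treatment`), the gene and sample IDs, and the count values are all made up, so you would swap in whatever the real study design file and count tables contain. The idea is to use dplyr to pick the samples relevant to your question, pull the matching columns out of the count table in the same order, and hand the result to `DGEList()` from edgeR.

```r
library(dplyr)
library(edgeR)

# Toy stand-in for a raw count table: genes as rows, samples as columns.
# All gene names, sample names and values are hypothetical.
counts <- data.frame(
  gene = c("GENE1", "GENE2", "GENE3"),
  s1 = c(10, 0, 250),
  s2 = c(12, 1, 240),
  s3 = c(55, 0, 260),
  s4 = c(60, 2, 255),
  s5 = c(5, 3, 100)
)

# Toy stand-in for the study design file (column names are assumptions).
study_design <- tibble(
  sample    = c("s1", "s2", "s3", "s4", "s5"),
  cell_line = c("A549", "A549", "A549", "A549", "NHBE"),
  treatment = c("mock", "mock", "infected", "infected", "mock")
)

# Keep only the samples that address the question being asked.
design_subset <- study_design %>%
  filter(cell_line == "A549")

# Select the matching count columns, in the same order as the design rows,
# and convert to a numeric matrix with gene IDs as row names.
count_matrix <- counts %>%
  select(gene, all_of(design_subset$sample)) %>%
  tibble::column_to_rownames("gene") %>%
  as.matrix()

# Guard against sample order drifting between counts and design.
stopifnot(identical(colnames(count_matrix), design_subset$sample))

# Go straight to a DGEList, as suggested in the tips.
dge <- DGEList(
  counts = count_matrix,
  group  = factor(design_subset$treatment)
)
```

Checking that the count columns and the study design rows line up before building the DGEList is worth the extra line, since a silent mismatch would mislabel samples in every downstream step.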