You'll get the most from this class if you B.Y.O.D. -- Bring Your Own Data. Nothing will keep you more engaged and crystallize the content of the course quite like working on questions you *actually* care about. That said, it is not a requirement that you come with data in hand. Throughout the course, I will demonstrate every step in the analysis using a 'real' dataset, which you can download below and use to follow along with me in class. If you run into serious problems following along with your own data, I strongly recommend that you use the dataset provided below. You can always return to your data once you have a functional pipeline and understand the process.

The response of the intestinal epithelium to infection with the protozoan parasite, Cryptosporidium parvum

  • The featured dataset for this iteration of the course comes courtesy of Boris Striepen’s lab and is unpublished, so please be respectful of this.
  • Download the raw data, which consists of 9 fastq files. You will need about 30 GB of storage space on your hard drive to accommodate these files. Please do not uncompress these files (leave them as .gz files).
  • You will also need this basic study design file that describes the experiment.
  • In the event that you have any problems installing or using Kallisto to map this raw data, I’ve already mapped it to produce transcript-level abundance data, which you can download and start working with immediately. These files are available as a single compressed file here. Unzipping this file will reveal 9 folders, each containing the Kallisto output from mapping one of the 9 fastq files above. You may notice that each folder contains several files. Please leave these in place. During the course, we will discuss what these files actually mean.
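
Leaving the fastq files compressed does not stop you from inspecting them. As a minimal sketch (written in Python with only the standard library; the function name is hypothetical and not part of the course materials), a gzipped fastq file can be streamed and its reads counted without writing an uncompressed copy to disk. It assumes the standard layout of four lines per read.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped fastq file by streaming it line by line.

    Assumes the standard fastq layout of 4 lines per record.
    """
    n_lines = 0
    # "rt" opens the compressed file in text mode; nothing is unpacked on disk
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4
```

Calling `count_fastq_reads` on the path to one of your .gz files returns the number of reads, which is a quick sanity check that a download completed.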

HackDash #1

The challenge: your collaborator is interested in understanding how cells respond to infection with Respiratory Syncytial Virus (RSV). Previous work from their lab has shown that some cells infected with RSV harbor primarily full-length (FL) viral genomes, other cells accumulate short ‘defective’ viral genomes (DVGs), and still others accumulate a mix of the two. They ask for your help in interpreting data from their recent sequencing experiment, in which they profiled the response of human cells to these different viral states by sorting cells into FL-hi, DVGs-hi, intermediate, or neither (not infected) populations. Using only your collaborator’s Kallisto alignments and study design file, your job is to identify the main sources of variance in the data. The winner will be the first team to email me with an explanation of which sources of variation are evident in their experiment. Please include images of PCA plots to help make your case. Download the files to get started!

HackDash #2

The challenge: A colleague has asked for your help to mine data produced from a very large RNAseq study of the parasitic worm, Schistosoma mansoni. In this experiment, male (M), female (F), juvenile (J) and mixed sex (X) worms were recovered from infected mice at various timepoints (control, 3hr, 12hr, and 24hr) following in vivo treatment with a low dose of the frontline anti-parasitic drug, praziquantel. Experiments were carried out with three different strains of worms: NMRI, LE, and LEPZQ. To complete this challenge, you’ll need to download the processed data in a text file, read it into the R environment, and begin using the tools you’ve learned in class for wrangling dataframes (NOTE: you do NOT need to carry out a formal differential gene expression analysis here). The first team to submit the most complete answer to the following questions by the end of class will win the challenge. Good luck!

  • What are the dimensions of this data frame?
  • Select only the columns containing annotation info and the expression data for the female, LE strain worms.
  • Add new columns that show the average expression for the triplicates at each timepoint for these samples.
  • Add new columns that show the Log2 Fold Change for each AVG column, compared to control samples.
  • Arrange genes in descending order based on Log2 FC of 24hr vs control.
  • How many genes have a Log2 FC of 2 or more for 24hr vs control?
  • Produce an html table using the DT package that shows the top 10 most DE genes (based on Log2 FC only) for 24hr vs control, and includes all AVG columns for the female LE worms, along with annotation data.
  • Send me your html table by email, and tell me how many genes met the FC cutoff above.
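
The averaging, fold-change, sorting, and counting steps above follow a simple pattern, sketched here on a made-up table. This is only an illustration of the logic: the challenge itself is meant to be done in R, the gene names, column names, and values below are invented, and the pseudocount of 1 is an assumption added to avoid taking the log of zero. If your expression values are already on a log2 scale, the fold change is instead the difference between the two averages.

```python
import math

# Toy table: one dict per gene. Gene names, column names, and values are
# invented for illustration and do not come from the course dataset.
genes = [
    {"gene": "geneA", "ctrl": [10, 12, 11], "h24": [95, 100, 105]},
    {"gene": "geneB", "ctrl": [50, 50, 50], "h24": [60, 55, 65]},
    {"gene": "geneC", "ctrl": [20, 22, 21], "h24": [5, 4, 6]},
    {"gene": "geneD", "ctrl": [3, 3, 3], "h24": [15, 15, 15]},
]

PSEUDOCOUNT = 1  # assumption: keeps log2 defined when an average is 0


def mean(values):
    """Arithmetic mean of a list of replicate values."""
    return sum(values) / len(values)


for row in genes:
    # Average the triplicates for each condition
    row["ctrl_AVG"] = mean(row["ctrl"])
    row["h24_AVG"] = mean(row["h24"])
    # Log2 fold change of 24hr vs control (linear-scale input assumed)
    row["LogFC_24hr"] = math.log2(
        (row["h24_AVG"] + PSEUDOCOUNT) / (row["ctrl_AVG"] + PSEUDOCOUNT)
    )

# Arrange genes in descending order of Log2 FC
ranked = sorted(genes, key=lambda r: r["LogFC_24hr"], reverse=True)

# Count genes with a Log2 FC of 2 or more
n_up = sum(1 for r in ranked if r["LogFC_24hr"] >= 2)

for r in ranked:
    print(r["gene"], round(r["LogFC_24hr"], 2))
print("genes with Log2 FC >= 2:", n_up)
```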

** Solution: see this script

HackDash #3

Download the mapped data and study design from a basic experiment in which T cells were left unstimulated or stimulated with anti-CD3/28 in duplicate. You would like to import these files into R and use your Transcriptomics skills to answer the following questions.

  1. What is the gene name of the 1015th row of the output of tximport? (2 points)
  2. What is the total number of counts for each sample within Txi_gene$counts? (4 points)
  3. Create a dendrogram and include an image here. Describe how your samples are clustering. (4 points)
  4. Complete a PCA and include an image here. What do PC1 and PC2 correspond to? (5 points)
  5. How many genes are differentially expressed (FDR < 0.05, abs LogFC > 1) when you compare unstimulated and stimulated T cells? (5 points)
  6. Create a heatmap of all differentially expressed genes. (5 points)
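
The cutoff in question 5 combines two conditions that must both hold: a gene needs a small FDR and a large enough absolute fold change. As a hedged sketch of that logic only (in the course this is done in R; the table, gene names, and the `is_de` helper below are hypothetical), note how a gene can pass one test and still fail the other:

```python
# Toy results table; gene names, logFC, and FDR values are invented.
results = [
    {"gene": "g1", "logFC": 2.5, "FDR": 0.001},
    {"gene": "g2", "logFC": -1.8, "FDR": 0.02},
    {"gene": "g3", "logFC": 0.4, "FDR": 0.0001},  # significant, but small change
    {"gene": "g4", "logFC": 3.1, "FDR": 0.20},    # large change, not significant
    {"gene": "g5", "logFC": -1.0, "FDR": 0.01},   # |logFC| == 1 fails the strict ">"
]


def is_de(row, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """True when a gene passes both the FDR and absolute logFC cutoffs."""
    return row["FDR"] < fdr_cutoff and abs(row["logFC"]) > lfc_cutoff


de_genes = [r for r in results if is_de(r)]
up = [r["gene"] for r in de_genes if r["logFC"] > 0]
down = [r["gene"] for r in de_genes if r["logFC"] < 0]

print("DE genes:", len(de_genes), "up:", up, "down:", down)
```

Taking the absolute value is what lets the same cutoff capture both up- and down-regulated genes.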

Download the .gct and .cls files to run GSEA in order to answer the following questions.

  1. Using the ‘Hallmarks collection’ from MSigDB and your data, identify the pathway most highly enriched in stimulated cells. (5 points)
  2. How many gene sets from the Hallmark collection are significantly enriched in the stimulated group relative to unstimulated (FDR < 0.05)? (2 points)
  3. Within Hallmarks, what is the most enriched pathway in phenotype “unstimulated”? Is it a significant enrichment? (5 points)
  4. How many Hallmark gene sets are significantly enriched in the unstimulated group relative to stimulated (FDR < 0.05)? (2 points)