Image credit: 'Maps' by Cara Barer

Lecture slides on iCloud

Homework #3: Data visualization with ggplot2 - part 1 (~4hrs) - due by start of class next Wednesday, May 13th.

Overview

In this class you’ll learn about a variety of approaches exploring your data. You’ll use multivariate statistical approaches such as Principal Component Analysis (PCA) to understand sources of variance in our data, while continuing to build your plotting skills by using ggplot2 to graph the results of PCA analyses. You’ll also learn how to use the dplyr package to take control over our gene expression dataframes, allowing us to change, sort, filter, arrange and summarize large data sets quickly and easily using simple commands in R. We’ll discuss common missteps and how to identify sources of bias in transcriptional data sets.

Learning objectives

  • Start and finish Step 3 script
  • Discuss basics of multivariate statistical analysis
  • Carry out hierarchical clustering of samples
  • Discuss and perform principal component analyses (PCA)
  • Produce ‘small multiples’ plot
  • Use standard dplyr ‘verbs’ to quickly query our data
  • Produce interactive graphics using the plotly package
  • Produce interactive tables with DT package

Code

Step 3 script


Lecture video

Part 1 - Discussion of multivariate data, dimensional reduction via PCA, and starting our Step 3 script

Part 2 - Plotting PCA results and small multiples

Part 3 - Producing interactive tables and plots


Reading

Ten quick tips for effective dimensionality reduction - a absolute must-read for understanding data exploration methods.

Blog post describing T-SNE - I mentioned various unsupervised linear methods for dimensional reduction of your data (PCA, MDS). T-SNE and UMAP are non-linear unsupervised methods that have become popular for representing single-cell RNAseq data and flow cytometry data.

Original T-SNE paper.

UMAP paper - A new algorithm, called uniform manifold approximation and projection (UMAP) has been recently published and is gaining popularity in single cell RNAseq and flow cytometry analysis. UMAP is proposed to preserve as much of the local and more of the global data structure than t-SNE, with a shorter run time.

Other videos

Josh Starmer does a fantastic job breakding down the math behind PCA in his StatQuest series