Data exploration

Image credit: 'Maps' by Cara Barer

Overview

In this class you’ll learn about a variety of approaches to exploring your data. You’ll use multivariate statistical approaches such as Principal Component Analysis (PCA) to understand sources of variance in your data, while continuing to build your plotting skills by using ggplot2 to graph the PCA results. You’ll also learn how to use the dplyr package to take control of your gene expression dataframes, allowing you to change, sort, filter, arrange and summarize large data sets quickly and easily using simple commands in R. We’ll discuss common missteps and how to identify sources of bias in transcriptional data sets.
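To make the PCA idea concrete, here is a minimal sketch using only base R. It is not the Step 3 script: the expression matrix is simulated, and every object name (expr, pca_res, pc_per, pca_df) is an assumption made for illustration. Samples are placed in rows before calling prcomp(), and the percent of variance captured by each component is computed from the component standard deviations.

```r
# Simulated log2 expression matrix: 100 genes (rows) x 6 samples (columns).
# All names and values here are invented for illustration.
set.seed(42)
expr <- matrix(rnorm(600, mean = 5, sd = 1), nrow = 100, ncol = 6,
               dimnames = list(paste0("gene", 1:100),
                               c("ctrl_1", "ctrl_2", "ctrl_3",
                                 "treat_1", "treat_2", "treat_3")))

# Shift the first 20 genes upward in the treated samples to create a group effect.
expr[1:20, 4:6] <- expr[1:20, 4:6] + 3

# prcomp() expects observations in rows, so transpose to samples x genes.
pca_res <- prcomp(t(expr), scale. = FALSE, retx = TRUE)

# Percent of total variance captured by each principal component.
pc_var <- pca_res$sdev^2
pc_per <- round(pc_var / sum(pc_var) * 100, 1)

# Sample coordinates on the first two PCs, ready for plotting with ggplot2.
pca_df <- data.frame(sample = rownames(pca_res$x),
                     group = rep(c("ctrl", "treat"), each = 3),
                     PC1 = pca_res$x[, "PC1"],
                     PC2 = pca_res$x[, "PC2"])

print(pc_per)
print(pca_df)
```

A data frame like pca_df can be passed to ggplot2 (for example with PC1 and PC2 as the x and y aesthetics and group as the color) to see whether samples cluster by experimental group or by some unexpected source of variance.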

Learning objectives

Start and finish the Step 3 script
Discuss basics of multivariate statistical analysis
Carry out hierarchical clustering of samples
Discuss and perform principal component analyses (PCA)
Produce a ‘small multiples’ plot
Use standard dplyr ‘verbs’ to quickly query our data
Produce interactive graphics using the plotly package
Produce interactive tables with the DT package
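As a taste of the dplyr ‘verbs’ listed above, the sketch below chains mutate(), filter(), arrange() and select() on a toy table. It assumes the dplyr package is installed; the data frame, its column names (geneID, healthy_AVG, disease_AVG, LogFC) and its values are all made up for illustration and are not taken from the course data.

```r
library(dplyr)

# Toy table of average log2 expression per gene; all values are invented.
my_data <- data.frame(
  geneID = c("geneA", "geneB", "geneC", "geneD"),
  healthy_AVG = c(5.0, 8.2, 3.1, 6.4),
  disease_AVG = c(7.5, 8.0, 0.6, 9.9)
)

top_genes <- my_data %>%
  mutate(LogFC = disease_AVG - healthy_AVG) %>%  # add a log2 fold-change column
  filter(abs(LogFC) >= 2) %>%                    # keep genes changing at least 4-fold
  arrange(desc(LogFC)) %>%                       # largest increase first
  select(geneID, LogFC)                          # keep only the columns of interest

print(top_genes)
```

Each verb takes a data frame and returns a data frame, which is what lets the steps be chained with the pipe. A result like top_genes is also the kind of object that could be handed to DT::datatable() to produce an interactive table.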

Code

Step 3 script

Lecture video

Part 1 - Discussion of multivariate data, dimensional reduction via PCA, and starting our Step 3 script

Part 2 - Plotting PCA results and small multiples

Part 3 - Producing interactive tables and plots

Reading

Ten quick tips for effective dimensionality reduction - an absolute must-read for understanding data exploration methods.

Lab post describing t-SNE - In lecture I mentioned various unsupervised linear methods for dimensional reduction of your data (PCA, MDS). t-SNE and UMAP are non-linear unsupervised methods that have become popular for representing single-cell RNAseq data and flow cytometry data.

Original t-SNE paper.

UMAP paper - A new algorithm, called uniform manifold approximation and projection (UMAP), was recently published and is gaining popularity in single-cell RNAseq and flow cytometry analysis. UMAP is proposed to preserve as much of the local data structure as t-SNE and more of the global structure, with a shorter run time.