```{r setup, include=FALSE} library(rmarkdown) library(knitr) knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE) ``` ## Introduction During the Spring 2019 offering of **[DIYtranscriptomics](http://diytranscriptomics.com/)**, we analyzed a dataset from Boris Striepen's lab, in which HCT-8 human intestinal epithelial cells were infected with the protozoan parasite, _Cryptosporidium parvum_. This reproducible and dynamic report was created using Rmarkdown and the [Knitr package](https://yihui.name/knitr/), and summarizes the basic code and outputs (plots, tables, etc) produced during the course. *** ## R packages used A variety of R packages was used for this analysis. All graphics and data wrangling were handled using the [tidyverse suite of packages](https://www.tidyverse.org/). Typographic style for graphs was set using [hrbrthemes](https://github.com/hrbrmstr/hrbrthemes). All packages used are available from the Comprehensive R Archive Network (CRAN), Bioconductor.org, or Github. ```{r packages} library(tidyverse) library(reshape2) library(tximport) library(biomaRt) library(RColorBrewer) library(genefilter) library(edgeR) library(matrixStats) library(hrbrthemes) library(gplots) library(limma) library(d3heatmap) library(DT) library(gt) library(plotly) ``` *** ## Read mapping ### Aligning raw reads with Kallisto Raw reads were mapped to the human reference transcriptome using [Kallisto](https://pachterlab.github.io/kallisto/), version 0.45. The quality of raw reads, as well as the results of Kallisto mapping are summarized in [this summary report](http://DIYtranscriptomics.github.io/Data/files/multiqc_report.html) generated using [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [multiqc](https://multiqc.info/). ```{bash, eval=FALSE} # build index from reference fasta kallisto index -i Homo_sapiens.GRCh38.cdna.all.index Homo_sapiens.GRCh38.cdna.all.fa # map reads kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o uninf_rep1 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/Control1_mergedLanes_mergedRuns.fastq.gz &> uninf_rep1.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o uninf_rep2 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/Control2_mergedLanes_mergedRuns.fastq.gz &> uninf_rep2.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o uninf_rep3 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/Control3_mergedLanes_mergedRuns.fastq.gz &> uninf_rep3.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o crypto.wt_rep1 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/WT1_mergedLanes_mergedRuns.fastq.gz &> crypto.wt_rep1.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o crypto.wt_rep2 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/WT2_mergedLanes_mergedRuns.fastq.gz &> crypto.wt_rep2.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o crypto.wt_rep3 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/WT3_mergedLanes_mergedRuns.fastq.gz &> crypto.wt_rep3.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o crypto.mut_rep1 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/Trans1_mergedLanes_mergedRuns.fastq.gz &> crypto.mut_rep1.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o crypto.mut_rep2 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/Trans2_mergedLanes_mergedRuns.fastq.gz &> crypto.mut_rep2.log kallisto quant -i Homo_sapiens.GRCh38.cdna.all.index -o crypto.mut_rep3 -t 4 -b 30 --single -l 250 -s 30 ../../DATA/raw/Trans3_mergedLanes_mergedRuns.fastq.gz &> crypto.mut_rep3.log ``` *** ### Importing count data into R After read mapping with Kallisto, [TxImport](https://bioconductor.org/packages/release/bioc/html/tximport.html) was used to read kallisto outputs into the R environment. Annotation data from Biomart was used to 'collapse' data from transcript-level to gene-level. ```{r step 1} # copy and paste in the essential code from the step 1 script targets <- read_tsv("../../studyDesign.txt")# read in your study design path <- file.path("../readMapping", targets$sample, "abundance.h5") # set file paths to your mapped data targets <- mutate(targets, path) # add paths to your study design (only necessary for Sleuth) Hs.anno <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") # select 'mart' from biomaRt for annotations Tx <- getBM(attributes=c('ensembl_transcript_id_version', # get gene symbols for each transcript ID 'external_gene_name'), mart = Hs.anno) Tx <- as_tibble(Tx) # convert this annotation mapping file to a tibble (the tidyverse version of a dataframe) Txi_gene <- tximport(path, #reading kallisto data into R type = "kallisto", tx2gene = Tx, txOut = FALSE, countsFromAbundance = "lengthScaledTPM") myCPM <- as_tibble(Txi_gene$abundance, rownames = "geneSymbol") # these are you counts after adjusting for transcript length myCounts <- as_tibble(Txi_gene$counts, rownames = "geneSymbol") # these are your transcript per million (TPM) values, or counts per million (CPM) if you collapsed data to gene level ``` *** ## Preprocessing ### Impact of filtering and normalization ```{r step2, echo=FALSE} # copy and paste in your essential code from the step 2 script # because each code chunk in an Rmarkdown should only have one graphical output, we need make a few changes to this code # also get rid of title, subtitle, etc from these graphs groups1 <- targets$treatment groups1 <- factor(groups1) sampleLabels <- targets$sample myDGEList <- DGEList(Txi_gene$counts) log2.cpm <- cpm(myDGEList, log=TRUE) nsamples <- ncol(log2.cpm) myColors <- brewer.pal(nsamples, "Paired") log2.cpm.df <- as_tibble(log2.cpm) colnames(log2.cpm.df) <- sampleLabels log2.cpm.df.melt <- melt(log2.cpm.df) p1 <- ggplot(log2.cpm.df.melt, aes(x=variable, y=value, fill=variable)) + geom_violin(trim = FALSE, show.legend = FALSE) + stat_summary(fun.y = "median", geom = "point", shape = 124, size = 6, color = "black", show.legend = FALSE) + labs(y="log2 expression", x = "sample") + coord_flip() + theme_ipsum_rc() cpm <- cpm(myDGEList) keepers <- rowSums(cpm>1)>=3 #user defined myDGEList.filtered <- myDGEList[keepers,] myDGEList.filtered.norm <- calcNormFactors(myDGEList.filtered, method = "TMM") log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE) log2.cpm.filtered.norm.df <- as_tibble(log2.cpm.filtered.norm) colnames(log2.cpm.filtered.norm.df) <- sampleLabels log2.cpm.filtered.norm.df.melt <- melt(log2.cpm.filtered.norm.df) p2 <- ggplot(log2.cpm.filtered.norm.df.melt, aes(x=variable, y=value, fill=variable)) + geom_violin(trim = FALSE, show.legend = FALSE) + stat_summary(fun.y = "median", geom = "point", shape = 124, size = 6, color = "black", show.legend = FALSE) + labs(y="log2 expression", x = "sample") + coord_flip() + theme_ipsum_rc() library(cowplot) plot_grid(p1, p2, labels = c("A", "B")) ``` Filtering was carried out to remove lowly expressed genes. Genes with less than 1 count per million (CPM) in at least 3 or more samples filtered out. This reduced the number of genes from `r nrow(myDGEList)` to `r nrow(myDGEList.filtered)`. *** ### table of filtered and normalized data The table shown below includes expression data for `r nrow(myDGEList.filtered)` genes. You can sort and search the data directly from the table. ```{r step 3 - part 1} mydata.df <- as_tibble(log2.cpm.filtered.norm, rownames = "geneSymbol") colnames(mydata.df) <- c("geneSymbol", sampleLabels) write_tsv(mydata.df, "normData.txt") #user defined mydata.df <- mutate(mydata.df, uninfected.AVG = (uninf_rep1 + uninf_rep2 + uninf_rep1)/3, crypto.wt.AVG = (crypto.wt_rep1 + crypto.wt_rep2 + crypto.wt_rep3)/3, crypto.mut.AVG = (crypto.mut_rep1 + crypto.mut_rep2 + crypto.mut_rep3)/3, #now make columns comparing each of the averages above that you're interested in LogFC.crypto.wt_vs_uninfected = (crypto.wt.AVG - uninfected.AVG), LogFC.crypto.mut_vs_uninfected = (crypto.mut.AVG - uninfected.AVG)) %>% mutate_if(is.numeric, round, 2) datatable(mydata.df, extensions = c('KeyTable', "FixedHeader"), options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>% formatRound(columns=c(1:6), digits=2) ``` *** ## PCA plot ```{r step 3 - part 2} pca.res <- prcomp(t(log2.cpm.filtered.norm), scale.=F, retx=T) pc.var<-pca.res$sdev^2 pc.per<-round(pc.var/sum(pc.var)*100, 1) pca.res.df <- as_tibble(pca.res$x) ggplot(pca.res.df, aes(x=PC1, y=PC3, color=groups1)) + geom_point(size=5) + theme(legend.position="right") + theme_ipsum_rc() ``` *** ## Volcano plot ```{r step 5 - part 1} design <- model.matrix(~0 + groups1) colnames(design) <- levels(groups1) v.DEGList.filtered.norm <- voom(myDGEList.filtered.norm, design, plot = FALSE) fit <- lmFit(v.DEGList.filtered.norm, design) contrast.matrix <- makeContrasts(infection_with_WT = crypto.wt - uninfected, infection_with_Mut = crypto.mut - uninfected, levels=design) fits <- contrasts.fit(fit, contrast.matrix) ebFit <- eBayes(fits) myTopHits <- topTable(ebFit, adjust ="BH", coef=1, number=40000, sort.by="logFC") myTopHits <- as_tibble(myTopHits, rownames = "geneSymbol") ggplotly(ggplot(myTopHits, aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", geneSymbol))) + geom_point(size=2) + ylim(-0.5,12) + geom_hline(yintercept = -log10(0.01), linetype="longdash", colour="grey", size=1) + geom_vline(xintercept = 1, linetype="longdash", colour="#BE684D", size=1) + geom_vline(xintercept = -1, linetype="longdash", colour="#2C467A", size=1) + theme_ipsum_rc()) ``` *** ## table of DEGs To identify differentially expressed genes, precision weights were first applied to each gene based on its mean-variance relationship using [VOOM](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29), then data was normalized using the [TMM method](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25) in [EdgeR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/). Linear modeling and bayesian stats were employed via [Limma](https://academic.oup.com/nar/article/43/7/e47/2414268) to find genes that were up- or down-regulated in _C. parvum_ infected cells, compared to naive cells, by 2-fold or more, with a false-discovery rate (FDR) of 0.01. ```{r step 5 - part 2} results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=1) colnames(v.DEGList.filtered.norm$E) <- sampleLabels diffGenes <- v.DEGList.filtered.norm$E[results[,1] !=0 | results[,2] !=0,] diffGenes.df <- as_tibble(diffGenes, rownames = "geneSymbol") datatable(diffGenes.df, extensions = c('KeyTable', "FixedHeader"), caption = 'Table 1: DEGs for infected (wt Crypto) vs control', options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>% formatRound(columns=c(1:10), digits=2) ``` *** ## heatmap Pearson correlation was used to cluster **`r nrow(diffGenes)`** differentially expressed genes, which were then represented as heatmap with the data scaled by Zscore for each row. ```{r step 6} myheatcolors2 <- colorRampPalette(colors=c("yellow","white","blue"))(100) clustRows <- hclust(as.dist(1-cor(t(diffGenes), method="pearson")), method="complete") #cluster rows by pearson correlation clustColumns <- hclust(as.dist(1-cor(diffGenes, method="spearman")), method="complete") module.assign <- cutree(clustRows, k=2) module.color <- rainbow(length(unique(module.assign)), start=0.1, end=0.9) module.color <- module.color[as.vector(module.assign)] d3heatmap(diffGenes, colors = myheatcolors2, Rowv=as.dendrogram(clustRows), row_side_colors = module.color, scale='row') ``` *** ## Conclusions Describe the results in your own words. Some things to think about: * What are the key takeaways from the analysis? * What types of analyses would you want to do next? * Based on your analysis, what wet-lab experiments would you pursue? * How could you expand on or otherwise enhance this Rmarkdown report? ## Session info The output from running 'sessionInfo' is shown below and details all packages and version necessary to reproduce the results in this report. ```{r session info} sessionInfo() ```