Aedes aegypti mosquitoes cause high morbidity and mortality by transmitting pathogens such as dengue virus, chikungunya virus, and yellow fever virus. The midgut of the mosquito is the primary barrier that inhibits dissemination of pathogens acquired in a bloodmeal into the mosquito circulatory system. This inhibition is crucial because once a pathogen, such as a virus or parasite, has successfully entered the circulatory system they can be transmitted to another host upon a subsequent bite. Therefore, understanding regulation of the midgut environment is critical. Specifically, this work focuses on understanding the regulation of the midgut by the steroid hormone, 20-hydroxyecdysone, which is produced in direct response to bloodfeeding.
The analysis below is based on RNAseq data from 10 samples:
-Dissected female midguts (control)
-Dissected female carcasses (control)
-Dissected 20E-injected female midguts at 5hr
-Dissected 20E-injected female carcasses at 5hr
-Dissected PBS-injected female midguts at 5hr
-Dissected PBS-injected female carcasses at 5hr
-Dissected 20E-injected female midguts at 18hr
-Dissected 20E-injected female carcasses at 18hr
-Dissected PBS-injected female midguts at 18hr
-Dissected PBS-injected female carcasses at 18hr
*each sample is pooled RNA from 30 mosquitoes
library(Rsubread)
library(limma)
library(edgeR)
library(ShortRead)
library(biomaRt)
library(ggplot2)
library(reshape2)
library(dplyr)
library(ggvis)
library(gplots)
library(RColorBrewer)
This summary report was compiled using RPackages:
library(rmarkdown)
library(knitr)
Prepare groups that represent the samples and incorporate information about their different components (tissue, timepoint, hormone-treatment).
#Read in study design file
targets <- readTargets("Aedes_studyDesign.txt", row.names=NULL)
targets
groups <- factor(paste(targets$treatment, targets$timepoint, targets$tissue, sep="."))
#Creates more understandable labels for samples using info in the study design file
sampleLabels <- paste(targets$treatment, targets$timepoint, targets$tissue, sep=".")
#Set-up of experimental design
design <- model.matrix(~0+groups)
colnames(design) <- levels(groups)
design
Reading in Primary Data using raw fastq files, these steps completed on CHMI desktop workstation
#Build index from reference genome; must have already downloaded the fasta file for genome of interest and have it in the working directory
buildindex(basename="Aedes",reference="Aedes_aegypti.AaegL3.26.dna.genome.fa")
#Aligns reads (in the fastq files) to indexed reference genome that above
reads <- targets$fastq
align(index="Aedes", readfile1=reads, input_format="FASTQ",output_format="BAM",
output_file=targets$OutputFile,tieBreakHamming=TRUE,unique=TRUE,indels=5, nthreads=8)
#Read in text file with bam file names
MyBAM <- targets$OutputFile
#Summarize aligned reads to genomic features
fc <- featureCounts(files=MyBAM, annot.ext="Aedes-aegypti-Liverpool_BASEFEATURES_AaegL3.3.gtf",
isGTFAnnotationFile=TRUE, useMetaFeatures=TRUE, strandSpecific=2, nthreads=8)
#Use the 'DGEList' function from EdgeR to make a 'digital gene expression list' object
DGEList <- DGEList(counts=fc$counts, genes=fc$annotation)
save(DGEList, file="DGEList")
load("DGEList")
#Retrieve all gene/transcript identifiers from this DGEList object
myGeneIDs <- DGEList$genes$GeneID
Filtered data so that only genes that have greater than 10 reads per million mapped reads in at least two samples are kept in the dataset for analysis. There are no replicates in this dataset, so the only standardization that is useful is converting both rpkm-filtered and rpkm-unfiltered datasets into binary logarithm (log2).
#RPKM (Reads Per Kilobase per Mapped reads) for unfiltered data
rpkm.unfiltered <- rpkm(DGEList, DGEList$genes$Length)
rpkm.unfiltered <- log2(rpkm.unfiltered + 1)
#Filtering the data to only keep in analysis genes that have >10 reads per million mapped reads in at least two libraries.
cpm.matrix.filtered <- rowSums(cpm(DGEList) > 10) >= 2
DGEList.filtered <- DGEList[cpm.matrix.filtered,]
dim(DGEList.filtered)
rpkm.filtered <- rpkm(DGEList.filtered, DGEList.filtered$genes$Length)
rpkm.filtered <- log2(rpkm.filtered + 1)
#Note that data is now in log2, so numbers smaller than 1 will be negative
Data can be visualized by histogram and boxplot
#Choose color scheme for graphs
cols.ALL <- topo.colors (n=10, alpha=1)
hist(rpkm.filtered, xlab = "log2 expression", main = "normalized data - histograms", col=cols.ALL)
boxplot(rpkm.filtered, ylab = "normalized log2 expression", main = "non-normalized data - boxplots", col=cols.ALL)
Annotating normalized data using mosquito-specific database
library(biomaRt)
#List all the available 'marts' to choose from
listMarts()
#Select mart of interest
vectorBase <- useMart("vb_gene_mart_1506")
#List the databases available in that mart
listDatasets(vectorBase)
#Select database to work with
aedesDB <- useDataset("aaegypti_eg_gene", mart=vectorBase)
#List the various filters, or keys, that can used to access this database
listFilters(aedesDB)
#List all the attributes that can be retrieved from the database
listAttributes(aedesDB)
#Pull out rownames of your dataset to use as one of the filters
myEnsemblIDs.filtered <- rownames(rpkm.filtered)
myEnsemblIDs.unfiltered <- rownames(rpkm.unfiltered)
#Attach annotation information to file
myAnnot.filtered <- getBM(attributes = c("external_gene_id", "ensembl_gene_id", "description"),
filters = "ensembl_gene_id",
values = myEnsemblIDs.filtered,
mart=aedesDB)
myAnnot.unfiltered <- getBM(attributes = c("external_gene_id", "ensembl_gene_id", "description"),
filters = "ensembl_gene_id",
values = myEnsemblIDs.unfiltered,
mart=aedesDB)
#Transform your identifiers to entrezIDs
resultTable.filtered <- merge(myAnnot.filtered, rpkm.filtered, by.x="ensembl_gene_id", by.y=0)
resultTable.unfiltered <- merge(myAnnot.unfiltered, rpkm.unfiltered, by.x="ensembl_gene_id", by.y=0)
This dendrogram shows the relationship between the samples. Tissue-specific differences are responsible for the greatest difference (eg. between midgut and carcass), followed by treatement (20E vs. PBS), and then time point (5hr v. 18hr).
resultTable.filtered.alt <- resultTable.filtered[,-2:-3]
rownames(resultTable.filtered.alt) <- resultTable.filtered.alt[,1]
resultTable.filtered.alt <- resultTable.filtered.alt[,-1]
resultTable.filtered.alt.matrix <- as.matrix(resultTable.filtered.alt)
rownames(resultTable.filtered.alt.matrix)
distance <- dist(t(resultTable.filtered.alt.matrix),method="euclidean")
clusters <- hclust(distance, method = "average")
plot(clusters, label = sampleLabels, hang = -1)
pca.res <- prcomp(t(resultTable.filtered.alt.matrix), scale.=F, retx=T)
ls(pca.res)
summary(pca.res) #Prints variance summary for all principal components (PCs).
head(pca.res$rotation) # $rotation shows you how much each GENE influenced each PC (callled 'eigenvalues', or loadings)
head(pca.res$x) # $x shows you how much each SAMPLE influenced each PC (called 'scores')
plot(pca.res, las=1)