Early human development requires precise regulation of gene expression in order to facilitate normal embryonic growth. Long noncoding RNAs (lncRNAs) are RNA transcripts lacking coding potential that regulate gene expression during development. We have identified a novel human-specific X-linked lncRNA called lncRHOXF2B, located in the RHOX cluster of the X chromosome. LncRHOXF2B is expressed from a 4.8kb gene to produce a 780 bp transcript, found in both the cytoplasm and the nucleus, and is subject to X-chromosome inactivation. To analyze the function of this novel lncRNA we targeted an inducible lncRHOXF2B transgene to chromosome 9 in human embryonic stem cells (hESCs).Overexpession of lncRHOXF2B in undifferentiated hESCs reduced cellular growth rates and increased differentiation. In this experiment, we examined the transcriptional profile of undifferentiated hESCs overexpressing lncRHOXF2B by illumina microarray. The expression profiling anaylsis along with highlighted results are reviewed below.
The following R/bioconductor packages are used in this analysis:
library(lumi)
library(lumiHumanIDMapping)
library(lumiHumanAll.db)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(genefilter)
library(limma)
library(annotate)
library(reshape2)
library(Biobase)
library(dplyr)
This lncRHOXF2B microarray analysis summary report was compiled in Rmarkdown using the following packages:
library(rmarkdown)
library(knitr)
First, I will read in the raw illumina microarray data and will add in controls
I will remove clone23 and just focus on clone28, and will get rid of the outlier replicate 2
rawData <- lumiR("FinalReport_noNorm_noBkrnd_samples.txt", convertNuID = TRUE, sep = NULL, detectionTh = 0.01, na.rm = TRUE, lib = "lumiHumanIDMapping")
rawData <- addControlData2lumi("FinalReport_noNorm_noBkrnd_controls.txt", rawData)
rawData <- rawData[,-1:-5] #gets rid of clone23 data
rawData <- rawData[,-2] #gets rid of an outlier replicate for clone28
Next, I will read in a text file that describes the design of this study, and will use this file to set up sample labels, treatment groups for comparison, and replicates
targets <- read.delim("Anguera_studyDesign_exper2.txt", sep="\t", stringsAsFactors = FALSE)
myGroups <- factor(paste(targets$Ident, targets$Treat, sep="."))
I begin by analyzing the array quality and consistency by looking at how a set of housekeeping genes behaves on the array
Result: embryonic stem cells appear to have variable housekeeping gene signatures
plotHousekeepingGene(rawData)
This is another QC check before normalization or data filtering occurs
cols <- topo.colors(n=6, alpha=1)
hist(rawData, xlab = "log2 expression", main = "non-normalized data - histograms", col=cols)
I will continue with the data preprocessing by using Lumi to perform robust spline normalization. This will control for the variation between arrays, and I will repeat the graph in Figure 2 with this normalization
normData <- lumiExpresso(rawData, QC.evaluation=TRUE, normalize.param=list(method='rsn'))
normData@annotation <- "lumiHumanAll.db"
hist(normData, xlab = "log2 expression", main = "Normalized data - histograms", col=cols)
I can also view the normalized data in boxplot form
boxplot(normData, ylab = "log2 expression", main = "Normalized data - boxplots", col=cols.ALL)
Finally, I will filter the data to remove genes with (1) a detection rate below the background, (2) a low variation across the 6 samples, (3) duplicates, and (4) those without an EntrezID
filtered_geneList <- nsFilter(normData, require.entrez=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.filter=TRUE, var.cutoff=0.5, filterByQuantile=TRUE)
filtered.eset <- filtered_geneList$eset # extract the ExpressionSet from this filtered list
# convert to a datamatrix that will contain only the probes after filtering
filtered.matrix <- as.matrix(filtered.eset)
probeList <- rownames(filtered.matrix)
First I will perform hierarchical clustering on the filtered data to group samples based on their similarity in a clustered dendogram
Result: biological replicates cluster together
sampleLabels <- paste(targets$Ident, targets$Treat, targets$Rep, sep=".")
distance <- dist(t(filtered.matrix),method="maximum")
clusters <- hclust(distance, method = "complete")
plot(clusters, label = sampleLabels, hang = -1, main="LncRHOXF2B analysis")
Next, I will explore the filtered data using a PCA
Result: the first two principle components account for around 75% of the variation in the data
pca.res <- prcomp(t(filtered.matrix), scale.=F, retx=T) #makes R object
ls(pca.res) #lists components of PCA
summary(pca.res) #prints variance summary for all principal components.
head(pca.res$rotation) #rotation shows you how much each GENE influenced each PC
head(pca.res$x) #$x shows you how much each SAMPLE influenced each PC (called 'scores')
plot(pca.res, las=1) #las gives you fine control of the data: turns the vertices
pc.var<-pca.res$sdev^2 #sdev^2 gives you the eigenvalues
pc.per<-round(pc.var/sum(pc.var)*100, 1)
pc.per #makes new object (numerical vector) to show percentage of contribution for each PCA
I will next make a graph to visualize the PCA result by plotting PC1 and PC2 against each other
data.frame <- as.data.frame(pca.res$x)
ggplot(data.frame, aes(x=PC1, y=PC2, colour=factor(myGroups))) +
geom_point(size=5) +
theme(legend.position="right")
Next, I will create a ‘small multiples’ chart to look at impact of each variable on each pricipal component
melted <- cbind(myGroups, melt(pca.res$x[,1:4])) #look at your 'melted' data
ggplot(melted) +
geom_bar(aes(x=Var1, y=value, fill=myGroups), stat="identity") +
facet_wrap(~Var2)
First, I will set up the experimental design. At this stage I will construct a contrast matrix so that I can begin to compare treatment groups (plus dox and untreated) to determine the list of differentially expressed genes. I will make groups so that all of the clone 28 plus dox are the same, and clone 28 minus dox are the same group, etc.
design <- model.matrix(~0+myGroups)
colnames(design) <- levels(myGroups) #makes column names exactly what I want
design
To create the DEG list, I will first read in the normalized filtered data, change the table headings to reflect their content, and will use the dply package to calculate averages and fold changes for the treatment groups (plus dox, untreated, and huES9 WT)
#read in your data from a text file that contains genes symbols as rows, and samples as columns.
myData <- read.delim("normalizedFilteredData.txt", header=TRUE)
#subsetting operations to remove/add columns: we will make ours smaller: [rows,columns] -2 gets rid of column 2
myData <- myData[,-2]
row.names(myData) <- myData[,1]
myData <- myData[,-1]
head(myData)
## X3998787029_F X3998787029_H X3998787029_I X3998787029_J
## ADA 9.578877 10.196835 9.811445 9.668738
## CDH2 12.727465 12.466396 13.151107 12.966272
## AKT3 7.688918 7.784887 7.912100 7.988348
## RNA18S5 9.424180 8.520160 8.572957 8.825126
## RNA28S5 10.080981 8.646188 8.971384 9.186701
## MED6 10.936741 10.930679 11.010000 10.989799
## X3998787029_K X3998787029_L
## ADA 9.175357 9.260412
## CDH2 12.185256 12.364068
## AKT3 7.628926 7.682692
## RNA18S5 8.922661 9.011450
## RNA28S5 9.316808 10.720771
## MED6 11.356416 11.122473
Next, I will change the table headings on the columns to more accurately reflect their content
targets <- read.delim("Anguera_studyDesign_exper2.txt", sep="\t", stringsAsFactors = FALSE)
sampleLabels <- as.character(paste(targets$Ident, targets$Treat, targets$Rep, sep="."))
colnames(myData) <- sampleLabels
geneSymbols <- row.names(myData)
head(myData)
## clone28.plusdox.1 clone28.plusdox.3 clone28.untxt.1
## ADA 9.578877 10.196835 9.811445
## CDH2 12.727465 12.466396 13.151107
## AKT3 7.688918 7.784887 7.912100
## RNA18S5 9.424180 8.520160 8.572957
## RNA28S5 10.080981 8.646188 8.971384
## MED6 10.936741 10.930679 11.010000
## clone28.untxt.2 p37.dox.day5a p37.dox.day5b
## ADA 9.668738 9.175357 9.260412
## CDH2 12.966272 12.185256 12.364068
## AKT3 7.988348 7.628926 7.682692
## RNA18S5 8.825126 8.922661 9.011450
## RNA28S5 9.186701 9.316808 10.720771
## MED6 10.989799 11.356416 11.122473
Finally, I will use the dplyr ‘mutate’ command to get averages and fold changes for my two replicates in each of the three treatment groups
myData <- mutate(myData,
clone28.plusdox.AVG = (clone28.plusdox.1 + clone28.plusdox.3)/2,
clone28.untxt.AVG = (clone28.untxt.1 + clone28.untxt.2)/2,
p37.dox.AVG = (p37.dox.day5a + p37.dox.day5b)/2,
LogFC.clone28 = (clone28.plusdox.AVG - clone28.untxt.AVG),
geneSymbols = geneSymbols)
row.names(myData) <- geneSymbols
head(myData)
## clone28.plusdox.1 clone28.plusdox.3 clone28.untxt.1
## ADA 9.578877 10.196835 9.811445
## CDH2 12.727465 12.466396 13.151107
## AKT3 7.688918 7.784887 7.912100
## RNA18S5 9.424180 8.520160 8.572957
## RNA28S5 10.080981 8.646188 8.971384
## MED6 10.936741 10.930679 11.010000
## clone28.untxt.2 p37.dox.day5a p37.dox.day5b clone28.plusdox.AVG
## ADA 9.668738 9.175357 9.260412 9.887856
## CDH2 12.966272 12.185256 12.364068 12.596931
## AKT3 7.988348 7.628926 7.682692 7.736903
## RNA18S5 8.825126 8.922661 9.011450 8.972170
## RNA28S5 9.186701 9.316808 10.720771 9.363584
## MED6 10.989799 11.356416 11.122473 10.933710
## clone28.untxt.AVG p37.dox.AVG LogFC.clone28 geneSymbols
## ADA 9.740091 9.217884 0.14776492 ADA
## CDH2 13.058689 12.274662 -0.46175842 CDH2
## AKT3 7.950224 7.655809 -0.21332120 AKT3
## RNA18S5 8.699042 8.967056 0.27312824 RNA18S5
## RNA28S5 9.079042 10.018789 0.28454188 RNA28S5
## MED6 10.999899 11.239444 -0.06618952 MED6
Additionally, using dplyr I can sort by the column of interest (i.e. arrage by LogFC) so that I can display the most differentially expressed genes
myData.sort <- myData %>%
arrange(desc(LogFC.clone28)) %>%
select(geneSymbols, LogFC.clone28)
head(myData.sort)
## geneSymbols LogFC.clone28
## 1 CAMKV 0.9702477
## 2 LEFTY1 0.8784327
## 3 SPHK2 0.8116943
## 4 GAL 0.8107635
## 5 SNAR-A1 0.8029395
## 6 CENPV 0.7739661
Using ggplot I will graph these data by creating a basic scatterplot
ggplot(myData, aes(x=clone28.plusdox.AVG, y=clone28.untxt.AVG)) +
geom_point(shape=1) +
geom_point(size=4)
To generate the DEG list, I will rely purely on fold changes. This is typically done when there are no biological replicates. Although I do have 2 biological replicates, leveraging statistical tools unfortunately did not produce any differentially expressed genes that were significant
Results: this produces a list of 162 differentially expressed genes
myData.filter <- myData %>%
filter((abs(LogFC.clone28) >= 0.58)) %>%
select(geneSymbols, clone28.plusdox.AVG, clone28.untxt.AVG, p37.dox.AVG)
myData.filter
## geneSymbols clone28.plusdox.AVG clone28.untxt.AVG p37.dox.AVG
## 1 SNAR-A1 14.695097 13.892157 14.950183
## 2 CDH13 8.390152 9.130336 7.573793
## 3 CEBPZ 11.442156 12.131565 11.680061
## 4 KATNB1 9.800605 9.220116 9.937070
## 5 MCRS1 11.312419 10.615289 11.071367
## 6 ARL6IP5 10.722700 11.460211 10.388641
## 7 ARFGEF1 10.419899 11.005101 10.816669
## 8 IFITM2 13.347513 12.613013 12.986428
## 9 DNPH1 10.039639 9.458199 9.613348
## 10 TXNIP 12.023804 12.822756 10.077300
## 11 LEFTY1 9.308792 8.430360 10.802877
## 12 GNA13 9.729950 10.360537 9.904536
## 13 STAG2 10.729757 11.431305 11.154941
## 14 BTG3 11.236507 11.910575 10.285177
## 15 ASCC3 10.221296 11.009534 10.035547
## 16 RAB31 10.797186 11.421857 9.976030
## 17 STMN2 8.099180 8.688847 8.653536
## 18 PRSS23 9.163694 10.254331 8.326022
## 19 MFSD3 10.891507 10.139355 10.855421
## 20 OSBPL9 11.286309 11.873600 11.415539
## 21 CTHRC1 8.359018 8.954354 8.069718
## 22 COL1A1 11.387114 12.231503 9.061571
## 23 COL1A2 11.966657 12.999932 9.885078
## 24 COL3A1 9.523737 10.565244 8.991050
## 25 COL4A1 12.694582 13.280041 11.378987
## 26 COL5A1 11.037734 12.035872 9.780391
## 27 COL5A2 10.513903 11.415516 9.415805
## 28 COL6A3 8.838278 9.643537 8.923414
## 29 COL8A1 9.185538 9.903319 7.608936
## 30 COL11A1 9.559151 10.623020 8.397106
## 31 COPB1 11.471960 12.100558 11.057587
## 32 KLF6 9.673163 10.315741 9.338646
## 33 PRRC1 9.323041 10.120188 9.393063
## 34 CPE 9.855100 10.595440 8.641014
## 35 ZNF296 10.132832 9.499394 10.293075
## 36 DCP2 11.445249 12.082172 11.761183
## 37 CENPV 12.641464 11.867498 12.707414
## 38 FASN 13.462221 12.864194 13.157779
## 39 SEC31A 11.196987 11.792516 10.952640
## 40 PALLD 10.285312 10.940639 9.049747
## 41 SMG1 9.789337 10.374133 10.387167
## 42 EXOC7 10.190874 9.569208 10.813251
## 43 AGTPBP1 10.491511 11.085301 10.684497
## 44 SF3B1 11.032046 11.616161 11.521411
## 45 PHF3 11.339506 12.124500 11.647315
## 46 MACF1 10.555123 11.203117 10.356559
## 47 CADM1 9.709893 10.343583 9.020915
## 48 SLC37A4 11.298791 10.679428 10.974843
## 49 ZNF549 11.565416 12.304362 12.538183
## 50 TMEM158 12.300351 11.662141 12.218029
## 51 TIPARP 8.968953 9.563717 8.911786
## 52 UPF2 11.093819 11.710569 11.906333
## 53 GBE1 9.139667 9.734606 9.062179
## 54 MYOF 9.485149 10.465208 8.157901
## 55 RRP7A 9.845009 9.236733 10.058339
## 56 GOLGA4 8.733106 9.574325 8.776022
## 57 H19 9.490266 10.282092 9.738735
## 58 RCOR2 11.260893 10.650315 11.202601
## 59 DBNL 10.110472 9.398279 10.651788
## 60 ANXA1 11.356607 12.008246 8.793268
## 61 HAS3 10.483645 9.894171 10.542639
## 62 ANXA3 9.734161 10.362593 8.589749
## 63 HMGCR 11.896414 12.550334 11.406712
## 64 HMGCS1 11.765344 12.741421 11.142189
## 65 HSP90AA1 12.240088 12.893376 13.171506
## 66 IGFBP7 9.915365 10.992000 8.107186
## 67 TICAM2 8.984772 9.598616 8.205934
## 68 IL6ST 8.581893 9.361684 8.583256
## 69 IMPA2 11.119012 10.537315 11.497298
## 70 INSIG1 11.270095 11.971405 9.429481
## 71 ITGAV 9.944363 11.199717 9.370054
## 72 ITGB1 13.348600 14.076317 12.650985
## 73 NDUFS7 11.464367 10.762855 10.928883
## 74 PTAR1 10.218028 10.837489 9.956722
## 75 TNPO1 10.208128 10.871144 10.365063
## 76 PSMG4 9.896790 9.296486 9.824007
## 77 KTN1 9.993141 10.912297 10.387161
## 78 LPP 11.021340 11.805877 10.322901
## 79 LRP3 10.510101 9.896006 10.439405
## 80 CTXN1 11.818729 11.210628 11.839267
## 81 SMAD5 9.760980 10.360465 10.026623
## 82 MAN2A1 9.424411 10.004711 9.650731
## 83 MCM2 10.341480 9.663234 10.749221
## 84 MCM5 9.643682 8.932836 10.428244
## 85 METTL1 10.224291 9.631439 10.459508
## 86 MKLN1 10.526478 11.156889 10.780742
## 87 LOC441087 12.058015 12.879558 13.188069
## 88 NAP1L1 12.037091 12.619938 12.432122
## 89 NCK1 10.013503 10.604112 9.854146
## 90 NFE2L2 9.531815 10.168854 9.649574
## 91 NKTR 10.232566 10.884156 11.253284
## 92 NOS2 9.416821 8.767466 7.759247
## 93 GPX8 10.260958 11.023043 10.308420
## 94 P4HA1 10.204943 11.002883 9.790140
## 95 PAM 11.906599 12.598867 11.549278
## 96 TMED7 10.106636 10.703725 10.056696
## 97 GAL 12.295306 11.484542 12.351752
## 98 WBP5 9.611354 10.256046 9.695783
## 99 ARMCX3 9.509903 10.186141 9.132957
## 100 UFM1 9.814824 10.513276 9.534603
## 101 PHB 10.951252 10.345962 11.101667
## 102 PLAU 9.851478 10.491233 8.323184
## 103 POLD2 10.360963 9.704188 10.295434
## 104 KLHL24 8.697399 9.358073 8.243459
## 105 PHIP 10.155596 10.825387 10.804860
## 106 PPP2R4 11.456482 10.872229 11.073226
## 107 THUMPD1 9.214503 9.807545 9.941391
## 108 SCYL2 9.164398 9.841067 9.301674
## 109 MAPK6 11.693443 12.279480 11.972437
## 110 SPHK2 10.303156 9.491462 11.158261
## 111 NCLN 10.675709 10.059540 10.670849
## 112 TWSG1 9.410519 10.096720 9.777996
## 113 VARS2 10.358039 9.713149 9.908066
## 114 SERINC1 8.765525 9.390514 8.812684
## 115 RAB5A 10.244803 10.949627 10.293676
## 116 MRPL23 9.674266 9.040742 9.770924
## 117 SAT1 12.148998 12.801047 11.177716
## 118 MSMO1 11.389443 12.153273 10.152571
## 119 BGN 11.194309 12.284404 9.305487
## 120 XPNPEP3 13.783027 14.525255 14.550030
## 121 ZMAT3 12.287540 12.969488 13.051277
## 122 SHMT2 12.351022 11.756119 12.169035
## 123 ACBD3 9.960411 10.781161 9.711403
## 124 MRPL11 11.323897 10.674152 11.386883
## 125 SNAI2 8.878604 9.728315 8.207603
## 126 CAPRIN2 10.219665 10.849427 10.239603
## 127 SPARC 11.139942 11.987438 9.604505
## 128 SPP1 10.541842 11.954566 8.777345
## 129 SRF 11.719476 11.134178 11.638871
## 130 SSR1 10.164792 10.749737 10.050035
## 131 TGFBI 9.527704 10.402313 8.898427
## 132 TOP2A 11.932314 12.645700 12.286802
## 133 TTC3 11.352421 12.372398 11.065888
## 134 VPS51 10.753383 10.147679 10.553051
## 135 VIM 13.807754 14.435023 12.766557
## 136 XPO1 12.186363 12.841422 12.519908
## 137 CAMKV 10.574524 9.604277 11.513397
## 138 AHNAK 9.875644 10.485999 9.215877
## 139 HPS6 10.076818 9.447772 9.813908
## 140 EPHX3 9.115044 8.522897 8.953618
## 141 CALD1 12.500803 13.186643 11.619032
## 142 C17orf70 10.572024 9.922758 10.788535
## 143 FXR1 12.068253 12.675335 12.125486
## 144 COLEC12 9.472619 10.107578 9.427428
## 145 CALU 11.019209 11.740601 10.653792
## 146 RND2 10.671496 10.080094 10.669215
## 147 DGCR6 10.644127 9.933650 10.374848
## 148 GTPBP6 11.607746 10.850136 11.254595
## 149 HIST1H4C 14.116647 13.529951 14.272942
## 150 SRPX 9.974701 10.586539 8.963908
## 151 TEAD2 12.635841 11.938131 13.026657
## 152 TSPAN18 9.941368 9.290086 9.319833
## 153 BTAF1 10.385743 11.027212 10.777242
## 154 SMC3 10.981285 11.715962 11.821154
## 155 FIBP 10.903416 10.310872 11.007432
## 156 MTDH 8.945789 9.591128 9.427497
## 157 UBE2Q2 9.926839 10.574280 9.648303
## 158 ZRANB2 10.634189 11.262521 11.462129
## 159 GDF3 8.606831 7.948487 9.804614
## 160 PJA2 12.152250 12.830609 11.775421
## 161 GOLGA5 9.925967 10.623248 9.593634
## 162 PTBP3 11.839425 12.502819 11.717612
To generate the heatmap, I need to convert the filtered data object from a dataframe into a matrix
class(myData.filter) #shows if it is a matrix or data frame
## [1] "data.frame"
myData.filter.matrix <- (myData.filter) #convert data frame to matrix
rownames(myData.filter.matrix) <- myData.filter.matrix[,1]
myData.filter.matrix <- myData.filter.matrix[,-1]
myData.filter.matrix <- as.matrix(myData.filter.matrix)
class(myData.filter.matrix)
## [1] "matrix"
is.matrix(myData.filter.matrix) #tests if argument is a strict matrix
## [1] TRUE
head(myData.filter.matrix)
## clone28.plusdox.AVG clone28.untxt.AVG p37.dox.AVG
## SNAR-A1 14.695097 13.892157 14.950183
## CDH13 8.390152 9.130336 7.573793
## CEBPZ 11.442156 12.131565 11.680061
## KATNB1 9.800605 9.220116 9.937070
## MCRS1 11.312419 10.615289 11.071367
## ARL6IP5 10.722700 11.460211 10.388641
I will create a tree and color the clusters. I can vary the number of clusters as well as the cut height. Additionally, I will pick a color scheme for the heatmap
hr <- hclust(as.dist(1-cor(t(myData.filter.matrix), method="pearson")), method="complete") #cluster rows by pearson correlation
hc <- hclust(as.dist(1-cor(myData.filter.matrix, method="spearman")), method="complete") #cluster columns by spearman correlation
mycl <- cutree(hr, k=2)
mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9)
mycolhc <- mycolhc[as.vector(mycl)]
myheatcol <- colorpanel(40, "darkblue", "yellow", "white")
#plot the hclust results as a heatmap
heatmap.2(myData.filter.matrix, Rowv=as.dendrogram(hr), Colv=NA,
col=myheatcol, scale="row", labRow=NA,
density.info="none", trace="none", RowSideColors=mycolhc,
cexRow=1, cexCol=.75, margins=c(10,30), key=T)