Abstract

Early human development requires precise regulation of gene expression in order to facilitate normal embryonic growth. Long noncoding RNAs (lncRNAs) are RNA transcripts lacking coding potential that regulate gene expression during development. We have identified a novel human-specific X-linked lncRNA called lncRHOXF2B, located in the RHOX cluster of the X chromosome. LncRHOXF2B is expressed from a 4.8kb gene to produce a 780 bp transcript, found in both the cytoplasm and the nucleus, and is subject to X-chromosome inactivation. To analyze the function of this novel lncRNA we targeted an inducible lncRHOXF2B transgene to chromosome 9 in human embryonic stem cells (hESCs).Overexpession of lncRHOXF2B in undifferentiated hESCs reduced cellular growth rates and increased differentiation. In this experiment, we examined the transcriptional profile of undifferentiated hESCs overexpressing lncRHOXF2B by illumina microarray. The expression profiling anaylsis along with highlighted results are reviewed below.

R packages

The following R/bioconductor packages are used in this analysis:

library(lumi)
library(lumiHumanIDMapping) 
library(lumiHumanAll.db)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(genefilter)
library(limma)
library(annotate)
library(reshape2)
library(Biobase)
library(dplyr)

This lncRHOXF2B microarray analysis summary report was compiled in Rmarkdown using the following packages:

library(rmarkdown)
library(knitr) 

Set-up & QC

First, I will read in the raw illumina microarray data and will add in controls
I will remove clone23 and just focus on clone28, and will get rid of the outlier replicate 2

rawData <- lumiR("FinalReport_noNorm_noBkrnd_samples.txt", convertNuID = TRUE, sep = NULL, detectionTh = 0.01, na.rm = TRUE, lib = "lumiHumanIDMapping")
rawData <- addControlData2lumi("FinalReport_noNorm_noBkrnd_controls.txt", rawData)
rawData <- rawData[,-1:-5] #gets rid of clone23 data
rawData <- rawData[,-2] #gets rid of an outlier replicate for clone28

Next, I will read in a text file that describes the design of this study, and will use this file to set up sample labels, treatment groups for comparison, and replicates

targets <- read.delim("Anguera_studyDesign_exper2.txt", sep="\t", stringsAsFactors = FALSE)
myGroups <- factor(paste(targets$Ident, targets$Treat, sep="."))

Figure 1: Housekeeping genes

I begin by analyzing the array quality and consistency by looking at how a set of housekeeping genes behaves on the array
Result: embryonic stem cells appear to have variable housekeeping gene signatures

plotHousekeepingGene(rawData)

Figure 2: Distribution of signal intensity

This is another QC check before normalization or data filtering occurs

cols <- topo.colors(n=6, alpha=1)
hist(rawData, xlab = "log2 expression", main = "non-normalized data - histograms", col=cols)

Figure 3: Data normalization histogram

I will continue with the data preprocessing by using Lumi to perform robust spline normalization. This will control for the variation between arrays, and I will repeat the graph in Figure 2 with this normalization

normData <- lumiExpresso(rawData, QC.evaluation=TRUE, normalize.param=list(method='rsn'))
normData@annotation <- "lumiHumanAll.db"
hist(normData, xlab = "log2 expression", main = "Normalized data - histograms", col=cols)

Figure 4: Data normalization boxplot

I can also view the normalized data in boxplot form

boxplot(normData, ylab = "log2 expression", main = "Normalized data - boxplots", col=cols.ALL)

Finally, I will filter the data to remove genes with (1) a detection rate below the background, (2) a low variation across the 6 samples, (3) duplicates, and (4) those without an EntrezID

filtered_geneList <- nsFilter(normData, require.entrez=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.filter=TRUE, var.cutoff=0.5, filterByQuantile=TRUE)
filtered.eset <- filtered_geneList$eset # extract the ExpressionSet from this filtered list
# convert to a datamatrix that will contain only the probes after filtering
filtered.matrix <- as.matrix(filtered.eset)
probeList <- rownames(filtered.matrix)

Exploratory analysis

Figure 5: Hierarchical clustering

First I will perform hierarchical clustering on the filtered data to group samples based on their similarity in a clustered dendogram
Result: biological replicates cluster together

sampleLabels <- paste(targets$Ident, targets$Treat, targets$Rep, sep=".")
distance <- dist(t(filtered.matrix),method="maximum")
clusters <- hclust(distance, method = "complete")
plot(clusters, label = sampleLabels, hang = -1, main="LncRHOXF2B analysis")

Figure 6: Principal component analysis I

Next, I will explore the filtered data using a PCA
Result: the first two principle components account for around 75% of the variation in the data

pca.res <- prcomp(t(filtered.matrix), scale.=F, retx=T) #makes R object
ls(pca.res) #lists components of PCA
summary(pca.res) #prints variance summary for all principal components.
head(pca.res$rotation) #rotation shows you how much each GENE influenced each PC
head(pca.res$x) #$x shows you how much each SAMPLE influenced each PC (called 'scores')
plot(pca.res, las=1) #las gives you fine control of the data: turns the vertices

pc.var<-pca.res$sdev^2 #sdev^2 gives you the eigenvalues
pc.per<-round(pc.var/sum(pc.var)*100, 1)
pc.per #makes new object (numerical vector) to show percentage of contribution for each PCA

Figure 7: Principal component analysis II

I will next make a graph to visualize the PCA result by plotting PC1 and PC2 against each other

data.frame <- as.data.frame(pca.res$x)
ggplot(data.frame, aes(x=PC1, y=PC2, colour=factor(myGroups))) +
  geom_point(size=5) +
  theme(legend.position="right")

Figure 8: PCA small multiples chart

Next, I will create a ‘small multiples’ chart to look at impact of each variable on each pricipal component

melted <- cbind(myGroups, melt(pca.res$x[,1:4])) #look at your 'melted' data
ggplot(melted) +
  geom_bar(aes(x=Var1, y=value, fill=myGroups), stat="identity") +
  facet_wrap(~Var2)

Identification of Differentially Expressed Genes

First, I will set up the experimental design. At this stage I will construct a contrast matrix so that I can begin to compare treatment groups (plus dox and untreated) to determine the list of differentially expressed genes. I will make groups so that all of the clone 28 plus dox are the same, and clone 28 minus dox are the same group, etc.

design <- model.matrix(~0+myGroups)
colnames(design) <- levels(myGroups) #makes column names exactly what I want
design

To create the DEG list, I will first read in the normalized filtered data, change the table headings to reflect their content, and will use the dply package to calculate averages and fold changes for the treatment groups (plus dox, untreated, and huES9 WT)

#read in your data from a text file that contains genes symbols as rows, and samples as columns. 
myData <- read.delim("normalizedFilteredData.txt", header=TRUE)
#subsetting operations to remove/add columns: we will make ours smaller: [rows,columns] -2 gets rid of column 2
myData <- myData[,-2]
row.names(myData) <- myData[,1]
myData <- myData[,-1]
head(myData)
##         X3998787029_F X3998787029_H X3998787029_I X3998787029_J
## ADA          9.578877     10.196835      9.811445      9.668738
## CDH2        12.727465     12.466396     13.151107     12.966272
## AKT3         7.688918      7.784887      7.912100      7.988348
## RNA18S5      9.424180      8.520160      8.572957      8.825126
## RNA28S5     10.080981      8.646188      8.971384      9.186701
## MED6        10.936741     10.930679     11.010000     10.989799
##         X3998787029_K X3998787029_L
## ADA          9.175357      9.260412
## CDH2        12.185256     12.364068
## AKT3         7.628926      7.682692
## RNA18S5      8.922661      9.011450
## RNA28S5      9.316808     10.720771
## MED6        11.356416     11.122473

Next, I will change the table headings on the columns to more accurately reflect their content

targets <- read.delim("Anguera_studyDesign_exper2.txt", sep="\t", stringsAsFactors = FALSE)
sampleLabels <- as.character(paste(targets$Ident, targets$Treat, targets$Rep, sep="."))
colnames(myData) <- sampleLabels
geneSymbols <- row.names(myData)
head(myData)
##         clone28.plusdox.1 clone28.plusdox.3 clone28.untxt.1
## ADA              9.578877         10.196835        9.811445
## CDH2            12.727465         12.466396       13.151107
## AKT3             7.688918          7.784887        7.912100
## RNA18S5          9.424180          8.520160        8.572957
## RNA28S5         10.080981          8.646188        8.971384
## MED6            10.936741         10.930679       11.010000
##         clone28.untxt.2 p37.dox.day5a p37.dox.day5b
## ADA            9.668738      9.175357      9.260412
## CDH2          12.966272     12.185256     12.364068
## AKT3           7.988348      7.628926      7.682692
## RNA18S5        8.825126      8.922661      9.011450
## RNA28S5        9.186701      9.316808     10.720771
## MED6          10.989799     11.356416     11.122473

Finally, I will use the dplyr ‘mutate’ command to get averages and fold changes for my two replicates in each of the three treatment groups

myData <- mutate(myData,
                 clone28.plusdox.AVG = (clone28.plusdox.1  + clone28.plusdox.3)/2,
                 clone28.untxt.AVG = (clone28.untxt.1 + clone28.untxt.2)/2,
                 p37.dox.AVG = (p37.dox.day5a + p37.dox.day5b)/2,
                 LogFC.clone28 = (clone28.plusdox.AVG - clone28.untxt.AVG),
                 geneSymbols = geneSymbols)
row.names(myData) <- geneSymbols
head(myData)
##         clone28.plusdox.1 clone28.plusdox.3 clone28.untxt.1
## ADA              9.578877         10.196835        9.811445
## CDH2            12.727465         12.466396       13.151107
## AKT3             7.688918          7.784887        7.912100
## RNA18S5          9.424180          8.520160        8.572957
## RNA28S5         10.080981          8.646188        8.971384
## MED6            10.936741         10.930679       11.010000
##         clone28.untxt.2 p37.dox.day5a p37.dox.day5b clone28.plusdox.AVG
## ADA            9.668738      9.175357      9.260412            9.887856
## CDH2          12.966272     12.185256     12.364068           12.596931
## AKT3           7.988348      7.628926      7.682692            7.736903
## RNA18S5        8.825126      8.922661      9.011450            8.972170
## RNA28S5        9.186701      9.316808     10.720771            9.363584
## MED6          10.989799     11.356416     11.122473           10.933710
##         clone28.untxt.AVG p37.dox.AVG LogFC.clone28 geneSymbols
## ADA              9.740091    9.217884    0.14776492         ADA
## CDH2            13.058689   12.274662   -0.46175842        CDH2
## AKT3             7.950224    7.655809   -0.21332120        AKT3
## RNA18S5          8.699042    8.967056    0.27312824     RNA18S5
## RNA28S5          9.079042   10.018789    0.28454188     RNA28S5
## MED6            10.999899   11.239444   -0.06618952        MED6

Additionally, using dplyr I can sort by the column of interest (i.e. arrage by LogFC) so that I can display the most differentially expressed genes

myData.sort <- myData %>%
  arrange(desc(LogFC.clone28)) %>%
  select(geneSymbols, LogFC.clone28)
head(myData.sort)
##   geneSymbols LogFC.clone28
## 1       CAMKV     0.9702477
## 2      LEFTY1     0.8784327
## 3       SPHK2     0.8116943
## 4         GAL     0.8107635
## 5     SNAR-A1     0.8029395
## 6       CENPV     0.7739661

Figure 9: Basic scatterplot of gene expression

Using ggplot I will graph these data by creating a basic scatterplot

ggplot(myData, aes(x=clone28.plusdox.AVG, y=clone28.untxt.AVG)) +
  geom_point(shape=1) +
  geom_point(size=4)

To generate the DEG list, I will rely purely on fold changes. This is typically done when there are no biological replicates. Although I do have 2 biological replicates, leveraging statistical tools unfortunately did not produce any differentially expressed genes that were significant
Results: this produces a list of 162 differentially expressed genes

myData.filter <- myData %>%
  filter((abs(LogFC.clone28) >= 0.58)) %>%
  select(geneSymbols, clone28.plusdox.AVG, clone28.untxt.AVG, p37.dox.AVG)
myData.filter
##     geneSymbols clone28.plusdox.AVG clone28.untxt.AVG p37.dox.AVG
## 1       SNAR-A1           14.695097         13.892157   14.950183
## 2         CDH13            8.390152          9.130336    7.573793
## 3         CEBPZ           11.442156         12.131565   11.680061
## 4        KATNB1            9.800605          9.220116    9.937070
## 5         MCRS1           11.312419         10.615289   11.071367
## 6       ARL6IP5           10.722700         11.460211   10.388641
## 7       ARFGEF1           10.419899         11.005101   10.816669
## 8        IFITM2           13.347513         12.613013   12.986428
## 9         DNPH1           10.039639          9.458199    9.613348
## 10        TXNIP           12.023804         12.822756   10.077300
## 11       LEFTY1            9.308792          8.430360   10.802877
## 12        GNA13            9.729950         10.360537    9.904536
## 13        STAG2           10.729757         11.431305   11.154941
## 14         BTG3           11.236507         11.910575   10.285177
## 15        ASCC3           10.221296         11.009534   10.035547
## 16        RAB31           10.797186         11.421857    9.976030
## 17        STMN2            8.099180          8.688847    8.653536
## 18       PRSS23            9.163694         10.254331    8.326022
## 19        MFSD3           10.891507         10.139355   10.855421
## 20       OSBPL9           11.286309         11.873600   11.415539
## 21       CTHRC1            8.359018          8.954354    8.069718
## 22       COL1A1           11.387114         12.231503    9.061571
## 23       COL1A2           11.966657         12.999932    9.885078
## 24       COL3A1            9.523737         10.565244    8.991050
## 25       COL4A1           12.694582         13.280041   11.378987
## 26       COL5A1           11.037734         12.035872    9.780391
## 27       COL5A2           10.513903         11.415516    9.415805
## 28       COL6A3            8.838278          9.643537    8.923414
## 29       COL8A1            9.185538          9.903319    7.608936
## 30      COL11A1            9.559151         10.623020    8.397106
## 31        COPB1           11.471960         12.100558   11.057587
## 32         KLF6            9.673163         10.315741    9.338646
## 33        PRRC1            9.323041         10.120188    9.393063
## 34          CPE            9.855100         10.595440    8.641014
## 35       ZNF296           10.132832          9.499394   10.293075
## 36         DCP2           11.445249         12.082172   11.761183
## 37        CENPV           12.641464         11.867498   12.707414
## 38         FASN           13.462221         12.864194   13.157779
## 39       SEC31A           11.196987         11.792516   10.952640
## 40        PALLD           10.285312         10.940639    9.049747
## 41         SMG1            9.789337         10.374133   10.387167
## 42        EXOC7           10.190874          9.569208   10.813251
## 43      AGTPBP1           10.491511         11.085301   10.684497
## 44        SF3B1           11.032046         11.616161   11.521411
## 45         PHF3           11.339506         12.124500   11.647315
## 46        MACF1           10.555123         11.203117   10.356559
## 47        CADM1            9.709893         10.343583    9.020915
## 48      SLC37A4           11.298791         10.679428   10.974843
## 49       ZNF549           11.565416         12.304362   12.538183
## 50      TMEM158           12.300351         11.662141   12.218029
## 51       TIPARP            8.968953          9.563717    8.911786
## 52         UPF2           11.093819         11.710569   11.906333
## 53         GBE1            9.139667          9.734606    9.062179
## 54         MYOF            9.485149         10.465208    8.157901
## 55        RRP7A            9.845009          9.236733   10.058339
## 56       GOLGA4            8.733106          9.574325    8.776022
## 57          H19            9.490266         10.282092    9.738735
## 58        RCOR2           11.260893         10.650315   11.202601
## 59         DBNL           10.110472          9.398279   10.651788
## 60        ANXA1           11.356607         12.008246    8.793268
## 61         HAS3           10.483645          9.894171   10.542639
## 62        ANXA3            9.734161         10.362593    8.589749
## 63        HMGCR           11.896414         12.550334   11.406712
## 64       HMGCS1           11.765344         12.741421   11.142189
## 65     HSP90AA1           12.240088         12.893376   13.171506
## 66       IGFBP7            9.915365         10.992000    8.107186
## 67       TICAM2            8.984772          9.598616    8.205934
## 68        IL6ST            8.581893          9.361684    8.583256
## 69        IMPA2           11.119012         10.537315   11.497298
## 70       INSIG1           11.270095         11.971405    9.429481
## 71        ITGAV            9.944363         11.199717    9.370054
## 72        ITGB1           13.348600         14.076317   12.650985
## 73       NDUFS7           11.464367         10.762855   10.928883
## 74        PTAR1           10.218028         10.837489    9.956722
## 75        TNPO1           10.208128         10.871144   10.365063
## 76        PSMG4            9.896790          9.296486    9.824007
## 77         KTN1            9.993141         10.912297   10.387161
## 78          LPP           11.021340         11.805877   10.322901
## 79         LRP3           10.510101          9.896006   10.439405
## 80        CTXN1           11.818729         11.210628   11.839267
## 81        SMAD5            9.760980         10.360465   10.026623
## 82       MAN2A1            9.424411         10.004711    9.650731
## 83         MCM2           10.341480          9.663234   10.749221
## 84         MCM5            9.643682          8.932836   10.428244
## 85       METTL1           10.224291          9.631439   10.459508
## 86        MKLN1           10.526478         11.156889   10.780742
## 87    LOC441087           12.058015         12.879558   13.188069
## 88       NAP1L1           12.037091         12.619938   12.432122
## 89         NCK1           10.013503         10.604112    9.854146
## 90       NFE2L2            9.531815         10.168854    9.649574
## 91         NKTR           10.232566         10.884156   11.253284
## 92         NOS2            9.416821          8.767466    7.759247
## 93         GPX8           10.260958         11.023043   10.308420
## 94        P4HA1           10.204943         11.002883    9.790140
## 95          PAM           11.906599         12.598867   11.549278
## 96        TMED7           10.106636         10.703725   10.056696
## 97          GAL           12.295306         11.484542   12.351752
## 98         WBP5            9.611354         10.256046    9.695783
## 99       ARMCX3            9.509903         10.186141    9.132957
## 100        UFM1            9.814824         10.513276    9.534603
## 101         PHB           10.951252         10.345962   11.101667
## 102        PLAU            9.851478         10.491233    8.323184
## 103       POLD2           10.360963          9.704188   10.295434
## 104      KLHL24            8.697399          9.358073    8.243459
## 105        PHIP           10.155596         10.825387   10.804860
## 106      PPP2R4           11.456482         10.872229   11.073226
## 107     THUMPD1            9.214503          9.807545    9.941391
## 108       SCYL2            9.164398          9.841067    9.301674
## 109       MAPK6           11.693443         12.279480   11.972437
## 110       SPHK2           10.303156          9.491462   11.158261
## 111        NCLN           10.675709         10.059540   10.670849
## 112       TWSG1            9.410519         10.096720    9.777996
## 113       VARS2           10.358039          9.713149    9.908066
## 114     SERINC1            8.765525          9.390514    8.812684
## 115       RAB5A           10.244803         10.949627   10.293676
## 116      MRPL23            9.674266          9.040742    9.770924
## 117        SAT1           12.148998         12.801047   11.177716
## 118       MSMO1           11.389443         12.153273   10.152571
## 119         BGN           11.194309         12.284404    9.305487
## 120     XPNPEP3           13.783027         14.525255   14.550030
## 121       ZMAT3           12.287540         12.969488   13.051277
## 122       SHMT2           12.351022         11.756119   12.169035
## 123       ACBD3            9.960411         10.781161    9.711403
## 124      MRPL11           11.323897         10.674152   11.386883
## 125       SNAI2            8.878604          9.728315    8.207603
## 126     CAPRIN2           10.219665         10.849427   10.239603
## 127       SPARC           11.139942         11.987438    9.604505
## 128        SPP1           10.541842         11.954566    8.777345
## 129         SRF           11.719476         11.134178   11.638871
## 130        SSR1           10.164792         10.749737   10.050035
## 131       TGFBI            9.527704         10.402313    8.898427
## 132       TOP2A           11.932314         12.645700   12.286802
## 133        TTC3           11.352421         12.372398   11.065888
## 134       VPS51           10.753383         10.147679   10.553051
## 135         VIM           13.807754         14.435023   12.766557
## 136        XPO1           12.186363         12.841422   12.519908
## 137       CAMKV           10.574524          9.604277   11.513397
## 138       AHNAK            9.875644         10.485999    9.215877
## 139        HPS6           10.076818          9.447772    9.813908
## 140       EPHX3            9.115044          8.522897    8.953618
## 141       CALD1           12.500803         13.186643   11.619032
## 142    C17orf70           10.572024          9.922758   10.788535
## 143        FXR1           12.068253         12.675335   12.125486
## 144     COLEC12            9.472619         10.107578    9.427428
## 145        CALU           11.019209         11.740601   10.653792
## 146        RND2           10.671496         10.080094   10.669215
## 147       DGCR6           10.644127          9.933650   10.374848
## 148      GTPBP6           11.607746         10.850136   11.254595
## 149    HIST1H4C           14.116647         13.529951   14.272942
## 150        SRPX            9.974701         10.586539    8.963908
## 151       TEAD2           12.635841         11.938131   13.026657
## 152     TSPAN18            9.941368          9.290086    9.319833
## 153       BTAF1           10.385743         11.027212   10.777242
## 154        SMC3           10.981285         11.715962   11.821154
## 155        FIBP           10.903416         10.310872   11.007432
## 156        MTDH            8.945789          9.591128    9.427497
## 157      UBE2Q2            9.926839         10.574280    9.648303
## 158      ZRANB2           10.634189         11.262521   11.462129
## 159        GDF3            8.606831          7.948487    9.804614
## 160        PJA2           12.152250         12.830609   11.775421
## 161      GOLGA5            9.925967         10.623248    9.593634
## 162       PTBP3           11.839425         12.502819   11.717612

To generate the heatmap, I need to convert the filtered data object from a dataframe into a matrix

class(myData.filter) #shows if it is a matrix or data frame
## [1] "data.frame"
myData.filter.matrix <- (myData.filter) #convert data frame to matrix
rownames(myData.filter.matrix) <- myData.filter.matrix[,1]
myData.filter.matrix <- myData.filter.matrix[,-1]
myData.filter.matrix <- as.matrix(myData.filter.matrix)
class(myData.filter.matrix)
## [1] "matrix"
is.matrix(myData.filter.matrix) #tests if argument is a strict matrix 
## [1] TRUE
head(myData.filter.matrix)
##         clone28.plusdox.AVG clone28.untxt.AVG p37.dox.AVG
## SNAR-A1           14.695097         13.892157   14.950183
## CDH13              8.390152          9.130336    7.573793
## CEBPZ             11.442156         12.131565   11.680061
## KATNB1             9.800605          9.220116    9.937070
## MCRS1             11.312419         10.615289   11.071367
## ARL6IP5           10.722700         11.460211   10.388641

Figure 10: Heatmap of differential gene expression

I will create a tree and color the clusters. I can vary the number of clusters as well as the cut height. Additionally, I will pick a color scheme for the heatmap

hr <- hclust(as.dist(1-cor(t(myData.filter.matrix), method="pearson")), method="complete") #cluster rows by pearson correlation
hc <- hclust(as.dist(1-cor(myData.filter.matrix, method="spearman")), method="complete") #cluster columns by spearman correlation
mycl <- cutree(hr, k=2)
mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9) 
mycolhc <- mycolhc[as.vector(mycl)] 
myheatcol <- colorpanel(40, "darkblue", "yellow", "white")
#plot the hclust results as a heatmap
heatmap.2(myData.filter.matrix, Rowv=as.dendrogram(hr), Colv=NA, 
          col=myheatcol, scale="row", labRow=NA,
          density.info="none", trace="none", RowSideColors=mycolhc, 
          cexRow=1, cexCol=.75, margins=c(10,30), key=T)