Principal Components Analysis

Principal components analysis (PCA) is a multivariate technique used to reduce the dimensionality of a dataset. In particular, PCA produces a smaller set of uncorrelated variables, called principal components (PCs), that capture most of the variance within a larger set of correlated variables. PCA is useful for revealing hidden patterns in highdimensional data, decreasing redundancy, filtering noise, and preparing data for further analysis using other techniques. PCA is commonly used in a variety of biological applications, from genome-wide expression studies to neurobiology, morphometrics, landscape genetics, and community ecology.

Today, we will apply PCA to a very famous dataset known as Fisher’s Iris dataset (because it was introduced by Sir Ronald Fisher in 1936 as an example of discriminant analysis), or sometimes as Anderson’s Iris data (because Edgar Anderson collected the data). The dataset includes four measurements from each of 50 iris flowers: length and width of the sepals and petals. The flowers were collected from three different species of Iris (Iris setosa, Iris virginica, and Iris versicolor), and this information is noted as well.

Start by reading the data (iris_data.txt) into R and taking a look at it. Then calculate the means of each of the variables:

> colMeans(data[,1:4])

The function you will use to perform a PCA automatically starts by centering the data, or subtracting the mean of each variable from each observation of that variable. Just to see what this looks like, try centering the data yourself and then confirming that the new mean of each variable is 0 (or extremely close to 0, due to rounding error), using the following code:

> centered.data<-(scale(data[,1:4],center=T, scale=F) )

> centered.data

> colMeans(centered.data)

In the first line of code, the “center=T” option says to go ahead and center. The “scale = F” option keeps the original scale. If your variables differ much in their variances, you will also want to re-scale your data to finish “normalizing” it, which will make the variances roughly equal, prior to performing a PCA. Otherwise, the first few components of your PCA will be dominated by whatever variables have the greatest variance. Again, the function you will use to perform PCA does this automatically, if you tell it to perform a PCA on the correlation matrix instead of the covariance matrix. However, you should see what this normalization looks like. Use the following code to calculate the covariance matrix of the variables before and after normalization, to confirm that the variances have been equalized:

> var(data[,1:4])

> centered.data<-(scale(data[,1:4],center=T,scale=T))

> var(centered.data[,1:4])

Note that, before normalization, some variables (such as petal length) were more variable than others, although the differences in variance were not overwhelming. The covariance

matrix after normalization is equivalent to the correlation matrix; all variables are now bounded between -1 and 1, and all variables have a correlation of 1 with themselves.

Given these particular data, you could justify running your PCA on either the covariance matrix (because that covariances are not grossly unequal) or the correlation matrix. Comparing the two outcomes would be instructive, and I will leave this exercise to you. For now, we will forge ahead and perform a PCA using the correlation matrix. PCA is based on finding the eigenvectors of the correlation matrix and their corresponding eigenvalues. The first eigenvector is a linear combination of the original variables that describes the dimension in which the data have the greatest variance. The second eigenvector is a linear combination of the original variables that describes the dimension in which the data have the second greatest variance, with the constraint that it must be orthogonal to the first eigenvector. The following eigenvectors must each be orthogonal to previous eigenvectors, but otherwise capture the maximum remaining variance in the dataset. The eigenvalues correspond to the eigenvectors, and describe the amount of variation in each eigenvector direction. The total number of eigenvectors and eigenvalues equals the total number of variables in the original dataset. However, the first few eigenvectors are likely to capture most of the variation, and can therefore represent a reduced dataset that contains most of the information within the original dataset.

Use the following code to perform a PCA on the correlation matrix of the iris data. The first line of code stores the results of the PCA in the object pca_iris. The names command shows the elements of this output. The loadings and summary commands extract the most relevant bits of this output:

> pca_iris <- princomp(data[,1:4], cor=T)

> names(pca_iris)

> loadings(pca_iris)

> summary(pca_iris)

The loadings define the principal components (PCs). Again, these are new uncorrelated variables, composed of linear combinations of the original variables, that capture maximum variation in the data. For example, to calculate PC 1 (Comp.1) for observation i, you would need to use the equation: Comp.1i = 0.521*Sepal.Lengthi -

0.269*Sepal.Widthi + 0.58*Petal.Length + 0.565*Petal.Width. This PC thus primarily separates flowers with long, narrow sepals (see the positive loading on sepal length and the negative loading on sepal width) and large petals (see the positive loadings on both petal length and width) from flowers with short, wide sepals and small petals. Note that the loadings function does not print very small loadings, in order to minimize distraction; this is why two loadings appear to be missing from Comp.2. They exist, but are small. If you wish to see them, you can simply use a command to extract all the eigenvectors from the correlation matrix:

> R <-cor(data[,1:4])

> eigen(R)

The summary command shows that the first PC explains approximately 73% of the variance in the data, and the first and second PC together explain approximately 96% of

the variance in the data—a substantial proportion! Apparently, there is quite a bit of redundancy (i.e., collinearity) in the original variables, and we can thus use PCA to identify a smaller subset of PCs that capture most of the variance in the data. There are several approaches to deciding how many PCs to retain, given that the main point of PCA is to reduce dimensionality. According to the cumulative proportion of explained variance criterion, you should retain as many PCs as needed to capture ~80-90% of the total variance. This indicates that the first two PCs are sufficient. According to the Kaiser’s rule for PCs derived from a correlation matrix, only PCs with an eigenvalue greater than 1 should be retained. A look at the eigenvalues (given by the command above) shows that only the first PC should be retained, at least according to that criterion.

Finally, it is common to create a “scree plot”, or plot of the eigenvalues against the PCs. Generally, this plot will show a steep drop, after which the curve flattens out because most of the variability after that point is noise and cannot be easily captured by the remaining PCs. The eigenvalues that correspond to the flat section of the graph should be dropped. Create a scree plot with the code below:

> plot(pca_iris,type="lines")

According to this plot, you should retain the first three PCs.

Two additional types of plots can help you interpret your PCA. First, it is often helpful to plot the major PC scores against each other. The scores object within the pca_iris output contains all of the PC scores from each variable. Take a look at this so that you see what your new variables look like. The original measurements on flower morphology have been replaced by scores for each PC:

> pca_iris$scores

Now plot the first 3 PCs against each other. Just for fun, go ahead and color the points on the plot according to the species of iris, so that you can get a feeling for whether the species differ from each other in particular aspects of their morphology:

> data$speciescode <-as.numeric(data$Species)

> plot(as.data.frame(pca_iris$scores[,1:3]), col=data$speciescode,main="PC Score Plot of the Iris Data")

It looks as though PC1 captures most of the variation between species. PC2 and 3 capture remaining variation that is not associated with differences between species.

A biplot, which is a two-dimensional approximation of the results of your PCA, can also help you interpret your results:

> biplot(pca_iris, scale=1, cex=c(0.4, 1), arrow.len=0, main="Correlation biplot")

> abline(h=0, lty=2)

> abline(v=0, lty=2)

Here, the scale = 1 calls for a correlation biplot. (scale = 0 would indicate a distance biplot). Biplots are described in section 12.5 of your textbook, and the rules for interpreting biplots can be found on page 203. For a correlation biplot on normalized

data, such as the one you just created, the length of a red line indicates how well a variable is represented by your two-dimensional approximation. Variables with long lines are well represented, but variables with short lines should be interpreted with care. Here, all four variables are well represented. The angles between the lines approximate the correlation between variables. Here, petal length and width are highly correlated (i.e. petals vary a lot in overall size, but not shape), and both are somewhat positively correlated with sepal length (flowers with large petals also tend to have long sepals). Neither one is correlated with sepal width, however; flowers thus vary a lot in overall sepal shape. Each observation can be projected perpendicular on the lines, giving information of the observed flower shape values for that particular observation. Distances between observations are two-dimensional approximations of Mahalanobis distances, although these can be difficult to interpret.

Assignment

Upload the possum dataset. This contains morphometric data on possums in Australia. Perform a PCA on the morphometric variables (hdlgth through belly) to see how many PCs are needed to capture most of the morphological variation in Australian possums.

Note whether you decide to perform your PCA on the covariance or correlation matrix. Produce a scree plot, a pairplot of the first few PC scores color-coded by Pop, and a biplot. Interpret each plot in your figure captions. Upload your assignment as a word document besides R commands in a separate file.

**Subject Mathematics Statistics-R Programming**