Principal Component Analysis (PCA) is a dimension reduction technique that was first formulated by Karl Pearson as early as 1901. In this introduction, we will walk through the underlying theory of PCA, show how to run it in R, and interpret the results. We will work with small data sets throughout, where we can interpret the variables and make individual plots. However, PCA is used extensively in a wide range of fields, from finance to neuroscience, sometimes with very large data sets, for applications such as visualization, denoising, preprocessing and feature extraction.

In biology, we often see PCA plots used to visualize the observations in a data set in two or three dimensions. Since the first principal components represent (in a sense that will be explained in detail below) the strongest signals in the data set, such plots are often used to get a ‘first look’ at the data, to identify and characterize possible outliers, spot mislabelled samples, and gauge the strength and impact of potential batch effects.

Preparation

We start by loading some R packages that we will use later on for plotting and some data wrangling. If you are not familiar with these packages, that’s not a problem; all necessary code will be provided.

suppressPackageStartupMessages({
  library(ggplot2)
  library(ggrepel)
  library(ggcorrplot)
  library(dplyr)
  library(tibble)
})

We also take a first look at the data set we will be using (mtcars). This is a built-in data set in R, which contains data on fuel consumption and aspects of design and performance for 32 types of cars (1973-1974 models).

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

We can get more information about the variables by typing ?mtcars, which tells us what the different columns mean:

?mtcars
A data frame with 32 observations on 11 (numeric) variables.

   [, 1]  mpg   Miles/(US) gallon
   [, 2]  cyl   Number of cylinders
   [, 3]  disp  Displacement (cu.in.)
   [, 4]  hp    Gross horsepower
   [, 5]  drat  Rear axle ratio
   [, 6]  wt    Weight (1000 lbs)
   [, 7]  qsec  1/4 mile time
   [, 8]  vs    Engine (0 = V-shaped, 1 = straight)
   [, 9]  am    Transmission (0 = automatic, 1 = manual)
   [,10]  gear  Number of forward gears
   [,11]  carb  Number of carburetors

For reasons that will become clear later in this session, let us assume that these measurements were obtained on two different occasions - in the first round, we collect information about 25 cars, and in the second, we collect information about the remaining 7. We split the mtcars data frame accordingly, and for now consider only the measurements from the first round (for simplicity, we’ll call this part mtcars, and the second part mtcars_round2).

set.seed(42)
(idx <- sample(x = seq_len(nrow(mtcars)), size = 7))
## [1] 17  5  1 25 10  4 18
mtcars_round2 <- mtcars[idx, ]
mtcars <- mtcars[-idx, ]

For later reference, we’ll also plot the correlations among the variables.

ggcorrplot(round(cor(mtcars), 3), 
           hc.order = TRUE, 
           lab = TRUE, 
           lab_size = 3, 
           method = "circle", 
           colors = c("tomato2", "white", "blue"), 
           title = "Correlations among variables", 
           ggtheme = theme_bw)

We note that there appear to be two groups of variables, which show positive correlations within each group and negative correlations with variables from the other group. One group consists of the number of cylinders, the number of carburetors, the weight, the horsepower and the displacement. The other group consists of the number of gears, the transmission type, the rear axle ratio, the miles per gallon, the engine type and the time to go 1/4 mile.

Why dimension reduction?

The dimensionality of a data set refers to the number of measured features, or variables, in the data set. In other words, if you measure the expression levels of five genes across 100,000 cells, the dimensionality of your data set is 5, and if you perform an RNA-seq experiment in a couple of samples, the dimensionality of the data set is the number of genes that you quantify (on the order of tens of thousands). Dimension reduction (or dimensionality reduction) is the process of, well, reducing the dimensionality of the data set. This can be useful for several reasons, two of the more common ones being plotting (we’re limited in the number of dimensions we can represent graphically) and preprocessing for other tasks (e.g., to reduce both the number of variables and the correlation among them, which can be an issue e.g. for linear regression). Of course, when reducing the dimensionality we typically also lose some ‘information’ in the data; after all, the full data set does live in a high-dimensional space. However, by constructing the dimension reduction in a clever way, we can minimize this information loss. Moreover, in many cases it’s actually desirable to reduce the data set to the strongest signals only - consider it a kind of noise reduction.

Question: What is the dimensionality of the mtcars data set?

There are many ways of doing dimensionality reduction - for example, we can simply select the first two variables from our data set, which immediately reduces the dimensionality to 2! However, we’ll likely miss a lot of information (in particular, all information encoded in the other variables), and moreover we cannot see effects that depend on interactions between more than two variables. Another option is to somehow combine the information from all the variables into a few ‘summary features’. These features can be created in different ways - either explicitly as, e.g., linear combinations of the original variables, or implicitly by just defining their value for each of the samples in the data set. PCA is an example of the first approach, where the summary features (principal components) are linear combinations (weighted sums) of the original variables. As we will see, one advantage of this is that it allows us to directly measure the impact (or contribution) of each original variable on each of the summary features. In addition, we can easily calculate the value of the summary features for any new sample. The disadvantage is that we are somewhat limited in the scope of summary features that we can generate.

Linear combinations and projections

We’ll make a small detour here to gain a better understanding of linear combinations, and how they relate to projections. In order to do this, we create a small, two-dimensional data set:

set.seed(2)
df <- data.frame(x = rnorm(50)) %>% dplyr::mutate(y = x + rnorm(50, sd = 0.5))
ggplot(df, aes(x = x, y = y)) + coord_fixed() + geom_point(size = 5) + theme_bw()

Here we have two variables: \(x\) and \(y\). A linear combination of these two variables is a feature of the form \[z = ax + by\] for any numbers \(a\) and \(b\). For example, we can set \(a=1, b=0\) to get \(z=x\), or \(a=b=1/2\) to get the average of \(x\) and \(y\).
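
For concreteness, here is a minimal illustration of these two linear combinations, computed directly from the df data frame defined above:

z1 <- 1 * df$x + 0 * df$y       ## a = 1, b = 0 recovers x itself
z2 <- 0.5 * df$x + 0.5 * df$y   ## a = b = 1/2 gives the average of x and y
head(cbind(z1, z2))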

Calculating such a linear combination can be related to projecting the points above onto a line through the origin in the two-dimensional space. For example, let’s choose the line to be the x-axis. We define a vector of unit length, pointing in the direction of the x-axis, as \((1,0)\).

ggplot(df) + coord_fixed() + 
  geom_point(aes(x = x, y = y), size = 5, alpha = 0.25) + theme_bw() + 
  geom_segment(aes(x = 0, y = 0, xend = 1, yend = 0), 
               arrow = arrow(length = unit(0.03, "npc")), size = 2, color = "red")

Projecting any point \((x_i,y_i)\) onto this axis now amounts to calculating the dot product between \((1,0)\) and \((x_i,y_i)\), that is, \[(1,0)\cdot(x_i,y_i) = 1\cdot x_i + 0\cdot y_i = x_i.\]

The interpretation of this is that along the axis defined by the direction \((1,0)\), the coordinate of the point \((x_i,y_i)\) (that is, the distance from the origin) is equal to \(x_i\). This makes sense, since here we project onto the x-axis, and the coordinate of each point is indeed the x-coordinate of that point. We visualize the projection of one point below:

ggplot(df) + coord_fixed() + geom_point(aes(x = x, y = y), size = 5) + theme_bw() + 
  geom_segment(aes(x = 0, y = 0, xend = 1, yend = 0), 
               arrow = arrow(length = unit(0.03, "npc")), size = 2, color = "red") + 
  geom_point(data = df[7, ], aes(x = x, y = y), size = 5, color = "blue") + 
  geom_segment(data = df[7, ], aes(x = x, y = y, xend = x, yend = 0),
               arrow = arrow(length = unit(0.03, "npc")), color = "blue") + 
  geom_point(data = df[7, ], aes(x = x, y = 0), shape = 1, 
             color = "blue", size = 4.75, stroke = 1.25)

Before moving on, let’s clarify one aspect that is often confusing. The dot product calculated above gave us the coordinate of a point along a given axis in the two-dimensional space. This is comparable to saying that for the point \((x_i,y_i)\) (in two dimensions), \(x_i\) (a scalar value) is the coordinate along the \(x\)-axis, and \(y_i\) (a scalar value) is the coordinate along the \(y\)-axis. Now, the actual projected point (the hollow circle in the plot above) still lives in the two-dimensional space! The projected point in this two-dimensional space is given by the coordinate (\(x_i\)) times the projection axis \((1, 0)\), that is, \((x_i, 0)\).

The fact that projecting a collection of data points onto an axis (or more generally, onto a plane, or another linear subspace) keeps it in the same ambient space is important, if we want to be able to compare the point cloud before and after the projection. However, we can now choose to represent our points in a new, 1-dimensional space, where the \(x\)-axis is given by the chosen projection axis. We can no longer compare the coordinates directly to the original ones (since the bases of the two spaces are different), but now we have truly achieved dimension reduction, since the points are now represented in a space with fewer coordinate axes. In the case we just considered (projection onto the x-axis), we would get the following representation:

ggplot(df %>% dplyr::mutate(y = 0)) + 
  geom_segment(aes(x = -3, y = 0, xend = 3, yend = 0), 
               arrow = arrow(length = unit(0.03, "npc")), size = 2, color = "red") + 
  geom_point(aes(x = x, y = y), size = 5) + theme_minimal() + 
  labs(y = "", x = "x") + 
  theme(axis.text.y = element_blank(), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())

Of course, we are not restricted to projecting onto coordinate axes. Let’s take another example, and define the direction that we want to project onto as \[v=\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right).\] Now, the projected coordinate (along the defined axis) for a point \((x_i, y_i)\) is obtained as \[\frac{1}{\sqrt{2}}\cdot x_i + \frac{1}{\sqrt{2}}\cdot y_i.\] Denoting the \(N\times 2\) matrix with the original \(x\)- and \(y\)-coordinates by \(M\) (the number of rows \(N\) is the number of points), and considering \(v\) as a \(2\times 1\) column vector, we can compute the coordinates along the new axis for all points at once by matrix multiplication: \[z=\begin{bmatrix}x_1 & y_1\\x_2 & y_2\\\vdots & \vdots\\x_N & y_N\end{bmatrix}\cdot\begin{bmatrix}\frac{1}{\sqrt{2}}\\\frac{1}{\sqrt{2}}\end{bmatrix}=Mv\]

Note that the components \((v_x, v_y)\) of \(v\) were chosen such that the length (or norm) of \(v\), calculated as \[\lvert v \rvert = \sqrt{{v_x}^2 + {v_y}^2},\] is equal to 1. In other words, \(v\) is a unit vector. Working with unit vectors is common practice when defining the basis vectors of a space - for example, the typical basis vectors of a two-dimensional Euclidean space are (1,0) and (0,1), both unit vectors. This implies that the coordinate of a point along each of these axes can be represented by the distance from the origin to the point along that axis. In this tutorial, we will consistently work with unit vectors to represent projection axes. If \(v\) is not a unit vector, we need to divide the value obtained by the formula above by the norm of \(v\) in order to get the actual projected coordinate. We will see later that with PCA, our projection axes (principal components) are always unit vectors.
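
As a minimal illustration of this rescaling, using an arbitrary example vector:

w <- c(3, 4)
sqrt(sum(w^2))              ## the norm of w is 5, so w is not a unit vector
w_unit <- w / sqrt(sum(w^2))
sqrt(sum(w_unit^2))         ## after dividing by the norm, the norm is 1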

Let’s see what the projection onto \[v=\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)\] looks like for the data above.

## Define projection axis
a <- b <- 1/sqrt(2)
v <- rbind(a, b)

## Project the measured values
df0 <- c(as.matrix(df) %*% v)

## Plot the original and projected values
ggplot(df) + coord_fixed() + theme_bw() + 
  geom_point(aes(x = x, y = y), size = 5, alpha = 0.25) + 
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + 
  geom_point(data = data.frame(x = df0 * a, y = df0 * b),
             aes(x = x, y = y), shape = 1, color = "green3", size = 3.75, stroke = 1.25) + 
  geom_segment(aes(x = 0, y = 0, xend = a, yend = a), 
               arrow = arrow(length = unit(0.03, "npc")), size = 2, color = "red") + 
  geom_segment(data = cbind(df, xend = df0 * a, yend = df0 * b)[c(35, 42), ],
               aes(x = x, y = y, xend = xend, yend = yend), 
               arrow = arrow(length = unit(0.03, "npc")), size = 1, color = "green3")

As we can see, each original (grey) point has been projected onto the new axis (the dashed line, also represented by the red arrow of unit length), and the new coordinate along that axis (the distance from the origin) is given by a linear combination of the original coordinates (the projection of two example points is indicated by green arrows). By doing this projection, we have essentially reduced our original 2-dimensional data set to a 1-dimensional one, while incorporating information from both the original \(x\) and \(y\) in the process. We can see that we are still losing some information (the difference between the grey and green points). The two example points with the green projection arrows are well separated in the original 2-dimensional space, but they are much closer in the 1-dimensional projection space. However, if we select the projection axis in a clever way, this loss is overall smaller than if we simply selected, e.g., the x-coordinate as the one-dimensional representation of our points.
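
We can quantify this for the two highlighted points (rows 35 and 42 of df, the ones with green projection arrows) by comparing their distance before and after the projection:

dist(df[c(35, 42), ])    ## Euclidean distance in the original 2-dimensional space
abs(df0[35] - df0[42])   ## distance along the 1-dimensional projection axis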

We will see below that PCA performs precisely this type of projection (typically from a very high-dimensional space, but the principle is the same as in our two-dimensional example above), and the projection axis is chosen in such a way as to retain as much ‘information’ as possible from the original data set.

Question: What are the obtained coordinates when projecting the points (1,2), (2,1), and (0,1) onto the direction (0.8,0.6)?

The Frobenius norm

Before going further, we will define a way to calculate the ‘size’ of a matrix. The Frobenius norm is a generalization of the Euclidean distance, and for an \(N\times p\) matrix \(M\) it’s defined as \[\|M\|_F^2=\sum_{j=1}^p\sum_{i=1}^Nm_{ij}^2.\] In other words, it is the sum of the squared values of all matrix elements. We can use this to define a distance measure between two matrices of the same dimensions, as \[d(M,W)=\|M-W\|_F.\] Let’s define the two matrices as representing the original (grey) and projected (green) points in the plot above. Then, the squared Frobenius norm of the difference corresponds to the sum of the squared Euclidean distances from the original points to the projected points (the lengths of the green lines below).

ggplot(df) + coord_fixed() + theme_bw() + 
  geom_point(aes(x = x, y = y), size = 5, alpha = 0.25) + 
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + 
  geom_point(data = data.frame(x = df0 * a, y = df0 * b),
             aes(x = x, y = y), shape = 1, color = "green3", size = 3.75, stroke = 1.25) + 
  geom_segment(data = cbind(df, xend = df0 * a, yend = df0 * b),
               aes(x = x, y = y, xend = xend, yend = yend), 
               size = 1, color = "green3")

Let’s calculate the Frobenius norm of the difference between the original data matrix and the projected values above.

M <- as.matrix(df)

## Projection onto diagonal
a <- b <- 1/sqrt(2)
v <- rbind(a, b)
df0 <- c(M %*% v)
W <- cbind(df0 * a, df0 * b)
(frob <- sqrt(sum((M - W) ^ 2)))
## [1] 2.970274
## Projection onto x-axis
a <- 1
b <- 0
v <- rbind(a, b)
df0 <- c(M %*% v)
W <- cbind(df0 * a, df0 * b)
(frob <- sqrt(sum((M - W) ^ 2)))
## [1] 8.79197

The singular value decomposition (SVD)

Before showing how to perform a PCA on the mtcars data, we will look into the theoretical underpinnings a bit. First, we’ll talk about the singular value decomposition (SVD, sometimes also referred to as compact SVD, e.g. on Wikipedia). This is an important mathematical result, which states that every real-valued \(N\times p\) matrix \(M\) can be decomposed as \[M=UDV^T,\] where \(U\) is an \(N\times r\) orthogonal matrix (a matrix where each pair of columns are orthogonal, and the sum of the squared values in each column is equal to 1), \(V\) is a \(p\times r\) orthogonal matrix (\(^T\) represents the transpose of a matrix) and \(D\) is a non-negative \(r\times r\) diagonal matrix. Here, \(r\) denotes the rank of the matrix, which corresponds to the number of non-zero values in the diagonal of \(D\). In fact, the decomposition exists also for complex-valued matrices, but we’ll stay with the real-valued ones for this introduction. In this decomposition, the columns of \(U\) and \(V\) are called the left and right singular vectors, respectively, and the diagonal elements of \(D\) are called the singular values. It is also worth noting that if all the singular values are distinct, and we follow the convention of ordering them in decreasing order, the decomposition is uniquely determined (up to the sign of the singular vectors).

To see how this works in practice, let’s compute the SVD for the mtcars matrix above. In R, the singular value decomposition can be calculated using the svd() function.

mtcars_svd <- svd(as.matrix(mtcars))
str(mtcars_svd)
## List of 3
##  $ d: num [1:11] 1489.63 204.82 80.33 9.98 4.87 ...
##  $ u: num [1:25, 1:11] -0.131 -0.096 -0.166 -0.293 -0.107 ...
##  $ v: num [1:11, 1:11] -0.0516 -0.0208 -0.8378 -0.5402 -0.0104 ...

We indeed get the expected three components (\(U\), \(V\), \(D\)) in the output. Let’s first check that the \(U\) and \(V\) matrices are orthogonal. This can be done by calculating the dot product (scalar product) between each pair of columns. If the matrix is orthogonal, this product should be 1 if a column is multiplied with itself, and 0 otherwise. Practically, we can achieve this by performing a matrix multiplication between the transpose of the matrix and the matrix itself. The diagonal elements in the resulting matrix correspond to multiplication of a column by itself, and the off-diagonal elements correspond to multiplication of different columns.

round(t(mtcars_svd$u) %*% mtcars_svd$u, 3)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
##  [1,]    1    0    0    0    0    0    0    0    0     0     0
##  [2,]    0    1    0    0    0    0    0    0    0     0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0     0
##  [4,]    0    0    0    1    0    0    0    0    0     0     0
##  [5,]    0    0    0    0    1    0    0    0    0     0     0
##  [6,]    0    0    0    0    0    1    0    0    0     0     0
##  [7,]    0    0    0    0    0    0    1    0    0     0     0
##  [8,]    0    0    0    0    0    0    0    1    0     0     0
##  [9,]    0    0    0    0    0    0    0    0    1     0     0
## [10,]    0    0    0    0    0    0    0    0    0     1     0
## [11,]    0    0    0    0    0    0    0    0    0     0     1
round(t(mtcars_svd$v) %*% mtcars_svd$v, 3)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
##  [1,]    1    0    0    0    0    0    0    0    0     0     0
##  [2,]    0    1    0    0    0    0    0    0    0     0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0     0
##  [4,]    0    0    0    1    0    0    0    0    0     0     0
##  [5,]    0    0    0    0    1    0    0    0    0     0     0
##  [6,]    0    0    0    0    0    1    0    0    0     0     0
##  [7,]    0    0    0    0    0    0    1    0    0     0     0
##  [8,]    0    0    0    0    0    0    0    1    0     0     0
##  [9,]    0    0    0    0    0    0    0    0    1     0     0
## [10,]    0    0    0    0    0    0    0    0    0     1     0
## [11,]    0    0    0    0    0    0    0    0    0     0     1

Just as we expected! We also confirm that the values in \(D\) are all non-negative and ordered in decreasing order of magnitude (note that since we know that \(D\) is a diagonal matrix, only the diagonal values are given here - to turn it into a matrix we can use the diag() function):

round(mtcars_svd$d, 1)
##  [1] 1489.6  204.8   80.3   10.0    4.9    4.0    2.3    1.6    1.1    1.1    0.9

Finally, we have to confirm that we have actually decomposed our original matrix! In other words, performing the matrix multiplication \(UDV^T\) should return our original matrix.

mtcars_rec <- mtcars_svd$u %*% diag(mtcars_svd$d) %*% t(mtcars_svd$v)
max(abs(mtcars_rec - mtcars))
## [1] 3.979039e-13

Success! Note that due to rounding issues the number might be slightly different in your case - but still very very small.

Relation to PCA

Now that we have seen that we can decompose our matrix using the SVD (I promise you, it will work for any other matrix as well), let’s see how this is related to PCA. In fact, it is very straightforward: given a matrix \(M\) with samples in the rows and variables in the columns, we define the principal components by the columns of \(V\). Relating back to what we saw in the beginning of this session, we can interpret each of these columns as a vector of weights, one for each variable, and we can create a linear combination of the original variables with precisely these weights. Recall that this would correspond to projecting the data onto a corresponding line in the original space. Also note that, as promised, each projection axis is a unit vector, since the \(V\) matrix is orthogonal.

We’ll now see what is so special about the decomposition obtained by the SVD, which motivates its widespread use for dimensionality reduction. To do that, let’s see what happens if we perform the projection of our original data matrix \(M\) onto these components. As indicated above, this would be achieved by matrix multiplication of \(M\) and \(V\). Since \(V\) is an orthogonal matrix, we get the projections as \[MV=UDV^TV=UD.\] In other words, the projections of the original samples onto the new orthogonal basis defined by the principal components are given by the rows of \(UD\). Importantly, once we have identified the basis for the projection (\(V\)), we can project also new samples onto the same basis. Hence, once we have extracted the principal components from one data set, we can project any new samples (for which we have measured the same set of variables) onto the same space.
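
We can verify this relationship numerically with the SVD of the mtcars matrix computed above; the difference between \(MV\) and \(UD\) should be zero up to numerical precision:

M_cars <- as.matrix(mtcars)
max(abs(M_cars %*% mtcars_svd$v - mtcars_svd$u %*% diag(mtcars_svd$d)))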

So, we have decomposed \(M\), using the SVD, into \[M=(UD)V^T,\] where \(V\) contains the principal components and \(UD\) contains the projections of each original point onto these principal components. But what is this all good for? Remember that what we are after is to reduce the dimensionality of the data, while in some sense keeping as much information as possible. The answer lies in the Eckart-Young theorem:

Eckart-Young theorem: Let M be a real \(N\times p\) matrix with rank \(r\), and let \(s\leq r\). Then the optimal rank-\(s\) approximation to \(M\) (minimizing the Frobenius norm of the difference) is given by \[M_s=U_sD_sV_s^T,\] where \(U_s\) and \(V_s\) are matrices containing the first \(s\) left and right singular vectors of \(M\), respectively, and \(D_s\) is a diagonal matrix containing the first \(s\) singular values. The error in the approximation is given by \[\|M-M_s\|_F^2=\sum_{k=s+1}^rd_k^2.\]

In other words, performing the SVD and keeping only the first \(s\) columns of the respective singular matrices gives us the best \(s\)-dimensional approximation of our original matrix, which was exactly what we were after! Note also that by choosing \(s=r\), we get back our original matrix, as we saw previously. Furthermore, the singular values (elements of \(D\)) are a measure of how important each new dimension is for the reconstruction of the original matrix \(M\).

Let’s try what we just saw in practice. Given our mtcars matrix \(M\), we approximate it with the first 6 singular vectors, and calculate the error.

## Get U, D, V via SVD, as above
mtcars_svd <- svd(as.matrix(mtcars))

## Calculate M_s, with s=6
mtcars_s6 <- mtcars_svd$u[, 1:6] %*% diag(mtcars_svd$d[1:6]) %*% t(mtcars_svd$v[, 1:6])
dim(mtcars_s6) ## same as mtcars
## [1] 25 11
## Calculate the squared Frobenius norm of the difference between the two matrices (M and M_s)
sum((mtcars - mtcars_s6)^2)
## [1] 11.01997
## Compare it to the sum of the squared singular values for the components that are not included
sum(mtcars_svd$d[7:11]^2)
## [1] 11.01997

As expected, the error in the approximation is exactly given by the sum of the squares of the non-included singular values!

To recap, the SVD gives us a decomposition of the original matrix, which is such that if we select the first \(s\) components from it, we get the best approximation of the original matrix of rank \(s\), and where we can interpret \(UD\) as the projection of the original matrix onto the orthogonal columns of \(V\). Sounds pretty good! But it doesn’t stop there. Let’s make one more assumption, namely that the columns of our original matrix are centered (in other words, that the mean of each variable is 0). In that case, we can prove that the projection onto the first \(s\) right singular vectors is also the rank-\(s\) projection with maximal variance. In other words, if we consider variance a good measure of “information”, the first \(s\) principal components provide the most informative rank-\(s\) projection of our data! We can easily get the variance explained by each component as the square of the corresponding singular value (up to a scaling factor).
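
As a quick illustration (using a column-centered version of the mtcars data), the fraction of the total variance captured by each component can be computed directly from the singular values:

mtcars_ctr <- scale(mtcars, center = TRUE, scale = FALSE)
sv <- svd(mtcars_ctr)
round(sv$d^2 / sum(sv$d^2), 4)   ## fraction of total variance per component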

In fact, we can also prove that projecting the data onto the first principal components is optimal in the sense of preserving distances, that is, minimizing \[\sum_{j,k=1}^n\left(\delta_{jk}^2-\hat{\delta}_{jk}^2\right),\] where \(\delta_{jk}\) and \(\hat{\delta}_{jk}\) are the Euclidean distances between points \(j\) and \(k\) in the original and low-dimensional space, respectively.

Note that, just as for the two-dimensional example above, the rank-\(s\) approximation of \(M\) obtained from the first \(s\) components of the SVD still lies in the same space as the original data set. In this form, the SVD can be used for denoising a data set, by representing it with an approximate version that has the same size and fewer ‘degrees of freedom’, but still provides a close approximation of the original data. To achieve the dimensionality reduction that we are after here, we will instead represent our approximate values in a new space, defined by the principal components (the new basis vectors). This space is of lower dimensionality than the original space (in fact, its dimensionality is equal to the number of retained singular vectors), and we can use it e.g. for plotting the coordinates in the new space.
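
To make this distinction concrete, here is a small sketch using the centered mtcars data and its SVD: the rank-2 approximation has the same dimensions as the original data, while the reduced representation has only two columns.

mtcars_ctr <- scale(mtcars, center = TRUE, scale = FALSE)
sv <- svd(mtcars_ctr)
## Rank-2 approximation: a 'denoised' version that still lives in the original space
denoised <- sv$u[, 1:2] %*% diag(sv$d[1:2]) %*% t(sv$v[, 1:2])
dim(denoised)
## Coordinates in the new two-dimensional space
scores_2d <- sv$u[, 1:2] %*% diag(sv$d[1:2])
dim(scores_2d)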

Running PCA in R

Hopefully the derivations above have convinced you that projecting data onto the first principal components can be a sensible way of reducing the dimensionality while retaining as much as possible of the ‘information’ (=variance) in the original data, or, alternatively, of approximating the original data as well as possible in a lower-dimensional space. Now, we will see how to run a PCA in R. We can of course do it with the svd() function as described above, but R also has other functions that perform a bit more of the legwork for us. The one we are going to use here is prcomp(), which is provided in base R. Internally, it makes use of the SVD, which can be computed efficiently and robustly. This makes it numerically preferable to the princomp() function (also in base R), which uses an eigendecomposition to find the principal components.

Important! When applying prcomp() to your data, make sure that you have the samples in the rows and the variables in the columns! This is the opposite of how most omics data sets are stored, but keep in mind that prcomp() was written before the omics era, when data sets typically had more samples than variables and this was the natural way of representing the data matrix. Other PCA implementations may, however, require the data matrix to be given with samples in the columns, so read the documentation carefully!
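
As a minimal sketch of what this means in practice, here is how one could run prcomp() on a small simulated gene expression-like matrix (hypothetical data, used only to illustrate the orientation) that has variables in the rows and samples in the columns:

set.seed(1)
expr <- matrix(rnorm(600), nrow = 100, ncol = 6,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))
## Transpose so that the samples end up in the rows before calling prcomp()
pca_expr <- prcomp(t(expr), center = TRUE, scale. = FALSE)
dim(pca_expr$x)   ## one row per sample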

Question: What do you think would happen if you run prcomp() on a matrix with samples in the columns instead of samples in the rows? Would R give you a warning, or will you get results back? If so, how would you interpret them?

The most important argument of prcomp() is the data matrix itself. In addition, we can specify whether the variables should be centered (center argument) and/or scaled (scale. argument) before the SVD is performed. We almost always want to set center=TRUE, which is the default in prcomp(); note that if we don’t center the variables we can no longer interpret the square of the singular values as variances (they will merely be sums of squares). We will discuss the scale. argument a bit later - whether you want to scale your variables or not can depend strongly on your data and can have a non-negligible impact on the results.

Applying prcomp()

Let’s start by performing PCA on the mtcars data, without scaling the variables:

mtcars_pca <- prcomp(mtcars, center = TRUE, scale. = FALSE)

The mtcars_pca object contains all the output we need:

  • x - the sample coordinates on each of the principal components
  • rotation - the variable contributions (weights, loadings) to each of the principal components
  • sdev - the singular values, that is, the square roots of the variances of the respective components
  • center, scale - the values used to center and/or scale the variables before the SVD
str(mtcars_pca)
## List of 5
##  $ sdev    : num [1:11] 135.711 39.798 2.704 1.325 0.952 ...
##  $ rotation: num [1:11, 1:11] -0.04167 0.01267 0.88125 0.47044 -0.00264 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:11] 19.85 6.08 220.75 148.2 3.63 ...
##   ..- attr(*, "names")= chr [1:11] "mpg" "cyl" "disp" "hp" ...
##  $ scale   : logi FALSE
##  $ x       : num [1:25, 1:11] -71.5 -125.5 -16.5 168.5 -106.1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:25] "Mazda RX4 Wag" "Datsun 710" "Valiant" "Duster 360" ...
##   .. ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Comparing to the output of svd()

Before interpreting the results further, let’s make sure that we get the same results as if we had applied the SVD directly.

## Center the data set
mtcars_centered <- scale(mtcars, center = TRUE, scale = FALSE)

## Apply SVD
mtcars_svd <- svd(mtcars_centered)

## Compare V to rotation
max(abs(mtcars_svd$v) - abs(mtcars_pca$rotation))
## [1] 0
mtcars_svd$v[1:3, 1:3]
##             [,1]         [,2]        [,3]
## [1,] -0.04167011 -0.008865461 -0.97391881
## [2,]  0.01266662 -0.002447099  0.06775838
## [3,]  0.88125479  0.470508935 -0.04296672
mtcars_pca$rotation[1:3, 1:3]
##              PC1          PC2         PC3
## mpg  -0.04167011 -0.008865461 -0.97391881
## cyl   0.01266662 -0.002447099  0.06775838
## disp  0.88125479  0.470508935 -0.04296672
## Compare UD to x
max(abs(mtcars_svd$u %*% diag(mtcars_svd$d)) - abs(mtcars_pca$x))
## [1] 6.448175e-13
(mtcars_svd$u %*% diag(mtcars_svd$d))[1:3, 1:3]
##            [,1]      [,2]     [,3]
## [1,]  -71.54613  5.017345 1.685920
## [2,] -125.50959 -4.359330 2.145094
## [3,]  -16.53449 40.221454 2.259488
mtcars_pca$x[1:3, 1:3]
##                      PC1       PC2      PC3
## Mazda RX4 Wag  -71.54613  5.017345 1.685920
## Datsun 710    -125.50959 -4.359330 2.145094
## Valiant        -16.53449 40.221454 2.259488
## Compare D to sdev
mtcars_svd$d  ## these are not actual standard deviation, but a constant multiple of them
##  [1] 664.8470889 194.9675108  13.2467765   6.4931246   4.6625765   3.5348529
##  [7]   1.6625802   1.5496142   1.1116214   1.0561373   0.8645973
mtcars_svd$d/sqrt(nrow(mtcars_svd$u) - 1)
##  [1] 135.7113437  39.7975765   2.7039869   1.3254035   0.9517444   0.7215488
##  [7]   0.3393728   0.3163137   0.2269088   0.2155831   0.1764852
mtcars_pca$sdev
##  [1] 135.7113437  39.7975765   2.7039869   1.3254035   0.9517444   0.7215488
##  [7]   0.3393728   0.3163137   0.2269088   0.2155831   0.1764852

Hopefully it is clear from these comparisons how SVD is used to power PCA, and how using the prcomp() function rather than the svd() function directly can help us take care of some of the book-keeping.

Summarizing the prcomp() output

The prcomp() output also has a helpful summary() method, which tabulates the standard deviation of each component and the corresponding fraction of the total variance in the data set.

summary(mtcars_pca)
## Importance of components:
##                             PC1      PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     135.7113 39.79758 2.70399 1.32540 0.95174 0.72155 0.33937
## Proportion of Variance   0.9203  0.07914 0.00037 0.00009 0.00005 0.00003 0.00001
## Cumulative Proportion    0.9203  0.99946 0.99982 0.99991 0.99996 0.99998 0.99999
##                           PC8    PC9   PC10   PC11
## Standard deviation     0.3163 0.2269 0.2156 0.1765
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000 1.0000 1.0000

We note that the first principal component represents almost all the variance in the data (92.0%), and the second one adds a bit more. Components 3-11 contain almost no ‘information’, which indicates that, seen like this, the data set is essentially two-dimensional. It is common to visualize the relative variance encoded by each component with a so-called ‘scree plot’:

plot(mtcars_pca)
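
The same information can be computed directly from the sdev slot, which can be handy e.g. for making custom scree plots:

## Proportion of variance per component
round(mtcars_pca$sdev^2 / sum(mtcars_pca$sdev^2), 4)
## Cumulative proportion of variance
round(cumsum(mtcars_pca$sdev^2) / sum(mtcars_pca$sdev^2), 4)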

Again, we see that the first two components represent more or less all the variance in this data set. Depending on why the PCA was performed, scree plots can be used to

  • determine how many principal components to retain (useful e.g. if PCA is used to preprocess the data in order to generate a smaller number of uncorrelated features to use as input for other analysis methods such as linear regression or classification models)
  • determine what fraction of the variance is captured by the first few components (useful if PCA is used for visualization, in which case the number of components is limited by practical constraints, but we still want to know how well we are approximating the original data)

Interpreting the sample representation

To see how these components can be interpreted, let’s first look at the sample representation. Here, we display the sample coordinates (\(UD\), which we saw corresponds to mtcars_pca$x) for the first two components, and color the points by the value of the disp variable:

ggplot(cbind(data.frame(mtcars_pca$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = disp, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

Question: Is it a coincidence that the samples seem to be centered around the origin in the PCA plot? Does the same have to be true for the variable plot?

Interpreting the variable contributions

Let’s also look at the contributions from each of the variables to each of the principal components (i.e., \(V\)). We can do this either by printing out the rotation matrix, or by plotting the contributions to the first two components. Note that by construction (remember that \(V\) is an orthogonal matrix), the sum of the squares of the weights in each component is always equal to 1.
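
We can check the latter claim directly from the rotation matrix:

## The squared weights within each principal component sum to 1
round(colSums(mtcars_pca$rotation^2), 3)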

round(mtcars_pca$rotation, 3)
##         PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10   PC11
## mpg  -0.042 -0.009 -0.974 -0.123 -0.040 -0.165  0.043 -0.035  0.016  0.015 -0.046
## cyl   0.013 -0.002  0.068  0.225  0.252 -0.786 -0.426  0.209 -0.108 -0.134 -0.094
## disp  0.881  0.471 -0.043  0.001 -0.010  0.006  0.001  0.000 -0.003 -0.001  0.005
## hp    0.470 -0.881 -0.009 -0.035  0.019  0.004  0.001 -0.001  0.004  0.002 -0.002
## drat -0.003 -0.004 -0.062  0.042 -0.121  0.227 -0.019  0.932 -0.006  0.139 -0.199
## wt    0.006  0.007  0.099 -0.110 -0.196 -0.124  0.052 -0.169  0.414  0.054 -0.848
## qsec -0.008  0.026  0.129 -0.907 -0.157 -0.176 -0.227  0.111  0.066 -0.014  0.191
## vs   -0.003  0.002  0.000 -0.166  0.001  0.131  0.106 -0.010 -0.662 -0.618 -0.352
## am   -0.002 -0.006 -0.068  0.127 -0.082  0.189 -0.255  0.075  0.557 -0.728  0.162
## gear -0.002 -0.012 -0.078  0.117 -0.320  0.312 -0.797 -0.199 -0.229  0.205 -0.104
## carb  0.007 -0.026  0.063  0.199 -0.865 -0.326  0.226  0.016 -0.106 -0.076  0.183
## Visualize as a heatmap
dn <- dimnames(mtcars_pca$rotation)
df1 <- cbind(expand.grid(variable = factor(dn[[1]], levels = rev(dn[[1]])),
                         PC = factor(dn[[2]], levels = dn[[2]])),
             rotation = as.vector(mtcars_pca$rotation))
mx <- max(abs(mtcars_pca$rotation))
ggplot(df1, aes(x = PC, y = variable, fill = rotation)) +
  scale_fill_distiller(type = "div", palette = "RdBu", direction = 1, 
                       limits = c(-mx, mx)) +
  geom_tile() + theme_bw(16)

ggplot(data.frame(mtcars_pca$rotation) %>% tibble::rownames_to_column("feature")) + 
  geom_segment(aes(xend = PC1, yend = PC2, x = 0, y = 0),
               arrow = arrow(length = unit(0.03, "npc"))) + 
  theme_bw() + 
  geom_text_repel(aes(x = PC1, y = PC2, label = feature))

From these plots we note that two variables (disp and hp) have much higher contributions to the first and second principal components than the other variables do. Essentially, the values of these two variables alone determine the coordinates of each sample on these components. Why is that? Let’s look at the variances of the original variables:

round(apply(mtcars, 2, var), 1)
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear 
##    39.1     3.5 14653.9  5306.8     0.3     1.0     3.8     0.3     0.3     0.6 
##    carb 
##     2.9

The disp and hp variables have way higher variances than the other variables. Why is this important? Recall that we are essentially trying to distribute the weights between the variables (while keeping the sum of squared weights equal to 1) in such a way that the resulting weighted sum has maximal variance. Thus, if one variable has much higher variance than all others, the maximal variance of the linear combination is obtained by giving all the weight to that variable! In some situations, this may be desirable (e.g., when the high variance of a variable actually means that it is more important for us). One example where this is the case would be if the variables are all similar (e.g., blood pressure measurements at different time points) and thus directly numerically comparable.

However, in the current case, the reason for the high variance of disp and hp is rather that they are measured in different units than the other variables, not that they are more important (as a comparison, height doesn’t become more important just because you measure it in centimeters instead of meters, but the variance increases by a factor of 10,000). This is where the scaling comes in (the scale. argument of prcomp()). If we set this to TRUE, all variables will be scaled to unit variance before the SVD is applied, and thus the original variances will no longer have any influence. Instead, the correlations among the variables become important: once all variables have the same variance, we can increase the variance of a linear combination by assigning high weights to groups of mutually correlated variables.
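
A quick check of what scaling does to the variances (scale() centers and scales the columns):

## After scaling, every variable has variance 1, so no single variable can
## dominate the principal components simply by virtue of its units
round(apply(scale(mtcars), 2, var), 1)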

Applying PCA to scaled variables

Let’s see what happens if we instead run the PCA on scaled data:

mtcars_pcasc <- prcomp(mtcars, center = TRUE, scale. = TRUE)
summary(mtcars_pcasc)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6    PC7    PC8
## Standard deviation     2.5748 1.6509 0.74019 0.54263 0.4714 0.45820 0.3663 0.3182
## Proportion of Variance 0.6027 0.2478 0.04981 0.02677 0.0202 0.01909 0.0122 0.0092
## Cumulative Proportion  0.6027 0.8505 0.90026 0.92703 0.9472 0.96632 0.9785 0.9877
##                            PC9   PC10    PC11
## Standard deviation     0.25496 0.2298 0.13155
## Proportion of Variance 0.00591 0.0048 0.00157
## Cumulative Proportion  0.99363 0.9984 1.00000
plot(mtcars_pcasc)

Now, while a large fraction of the variance is still explained by the first two components, they are nowhere near as dominant as before.

ggplot(cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = disp, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

Now, many more variables contribute to the first principal components:

dn <- dimnames(mtcars_pcasc$rotation)
df2 <- cbind(expand.grid(variable = factor(dn[[1]], levels = rev(dn[[1]])),
                         PC = factor(dn[[2]], levels = dn[[2]])),
             rotation = as.vector(mtcars_pcasc$rotation))
mx <- max(abs(mtcars_pcasc$rotation))
ggplot(df2, aes(x = PC, y = variable, fill = rotation)) +
  scale_fill_distiller(type = "div", palette = "RdBu", direction = 1, 
                       limits = c(-mx, mx)) +
  geom_tile() + theme_bw(16)

ggplot(data.frame(mtcars_pcasc$rotation) %>% tibble::rownames_to_column("feature")) + 
  geom_segment(aes(xend = PC1, yend = PC2, x = 0, y = 0),
               arrow = arrow(length = unit(0.03, "npc"))) + 
  theme_bw() + labs(x = "PC1", y = "PC2") + 
  geom_text_repel(aes(x = PC1, y = PC2, label = feature))

Interestingly, we see two ‘groups’ of variables, defined by whether they have a positive or a negative contribution to the first principal component. Let’s compare this division to the correlogram we made earlier:

ggcorrplot(round(cor(mtcars), 3), 
           hc.order = TRUE, 
           lab = TRUE, 
           lab_size = 3, 
           method = "circle", 
           colors = c("tomato2", "white", "blue"), 
           title = "Correlations among variables", 
           ggtheme = theme_bw)

Pretty similar patterns emerge! The five variables with a positive contribution to PC1 (disp, cyl, hp, wt, carb) are also highly positively correlated with each other, and the same is true for the variables with a negative contribution to PC1. This is typical, and in fact one reason why PCA works so well for dimensionality reduction - highly correlated variables, which don’t really add much independent information, are ‘clumped together’ in the sense that they give similar contributions to the principal components. Thus, instead of considering each of the highly correlated variables independently, we are effectively reducing them to their average.

What else can we see from these plots? Let’s consider the variable with the highest positive weight on PC1 (cyl). The fact that it has a high weight in the linear combination that is PC1 means that the sample coordinates on PC1 will be strongly impacted by their value of cyl. As a consequence, we would expect the value of cyl to be strongly associated with the position of the samples along PC1. Let’s color the points by cyl to see whether this is correct:

ggplot(cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = cyl, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

Indeed it is! Moreover, the value of cyl is high in the samples with a positive coordinate on PC1. Let’s similarly look at the variable with the largest negative weight on PC1 (mpg). This weight similarly indicates a strong impact on the sample coordinates along this component, but the effect is in the opposite direction compared to cyl. Let’s color the samples by the value of mpg:

ggplot(cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = mpg, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

This association between the weights of the variables in the principal components and their values in the samples with large (positive or negative) coordinates in the corresponding components is important, and can be very helpful for interpreting PCA plots. The take-home message from this is: don’t just look at the sample plot!
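
As an aside, base R also provides the biplot() function, which overlays the sample coordinates and the variable contributions in a single display and can serve as a quick combined view:

biplot(mtcars_pcasc)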

Interpreting variable contributions in arbitrary directions

Above we looked at the contributions to the first principal component, but we can of course also look at any other component. For example, let’s color the samples by the value of the variable with the largest negative contribution to PC2 (qsec):

ggplot(cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = qsec, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

Or let’s consider a direction that is not parallel to one of the actual components - e.g., pointing towards the top-left corner of the plot (the am or gear variables). Samples located far from the origin in that direction tend to have a high value of the corresponding variable, and samples located in the opposite direction from the origin tend to have a low value.

ggplot(cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = am, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

ggplot(cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars), 
       aes(x = PC1, y = PC2, color = gear, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

Projecting new points onto principal components

Let’s finally come back to the ‘second round’ of measurements that we split out at the very beginning of this document. In this type of situation, we often have two choices: either we create a new, merged data set and redo the PCA (in which case the new samples are also allowed to influence the principal components), or we project the new samples onto the basis defined by the original data set. To achieve the latter, we can use the predict() function (which is a generic R function, with an implementation for prcomp objects). We need to provide the PCA object as well as the new data.

mtcars_round2
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
mtcars_round2_proj <- predict(mtcars_pcasc, newdata = mtcars_round2)
class(mtcars_round2_proj)
## [1] "matrix" "array"

Let’s plot these new observations in the principal component space.

mtcars_pcasc_all <- rbind(
  cbind(data.frame(mtcars_pcasc$x) %>% tibble::rownames_to_column("model"), mtcars) %>%
    dplyr::mutate(type = "original"),
  cbind(data.frame(mtcars_round2_proj) %>% tibble::rownames_to_column("model"), mtcars_round2) %>%
    dplyr::mutate(type = "new"))
ggplot(mtcars_pcasc_all,
       aes(x = PC1, y = PC2, color = type, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

We can also color the observations by a couple of the observed variables, to make sure that the new observations indeed are placed close to similar observations from the original set.

ggplot(mtcars_pcasc_all,
       aes(x = PC1, y = PC2, color = cyl, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

ggplot(mtcars_pcasc_all,
       aes(x = PC1, y = PC2, color = qsec, label = model)) + 
  geom_point(size = 5) + 
  theme_bw() + geom_text_repel() + coord_fixed()

Let’s see if we can break down what happened here. As we saw earlier, what we are attempting to do is to project the new data onto the principal components defined by the original observations. Remember that before the prcomp() function performed the SVD, we asked it to mean-center and standardize the variables. Thus, we must apply the same standardization to the new samples. Importantly, we must standardize using the mean and standard deviation from the original data set (not from the new round). These are stored in the mtcars_pcasc object:

mtcars_pcasc$center
##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
##  19.85200   6.08000 220.75200 148.20000   3.62600   3.15388  17.84000   0.44000 
##        am      gear      carb 
##   0.44000   3.76000   2.88000
mtcars_pcasc$scale
##         mpg         cyl        disp          hp        drat          wt        qsec 
##   6.2516078   1.8690461 121.0533426  72.8474433   0.5606098   0.9824760   1.9384788 
##          vs          am        gear        carb 
##   0.5066228   0.5066228   0.7788881   1.6911535
mtcars_round2_scaled <- scale(mtcars_round2, center = mtcars_pcasc$center,
                              scale = mtcars_pcasc$scale)

Next, we can project onto the extracted principal components (which are stored in the rotation slot of mtcars_pcasc):

mtcars_round2_proj_manual <- mtcars_round2_scaled  %*% mtcars_pcasc$rotation

This gives us exactly the same result as we got from the predict() function above.

mtcars_round2_proj_manual[, 1:3]
##                          PC1        PC2        PC3
## Chrysler Imperial  3.5468644 -0.6172103  0.5034982
## Hornet Sportabout  1.9526136 -0.8486058 -1.1221407
## Mazda RX4         -0.5001228  1.5108836 -0.6358186
## Pontiac Firebird   2.2211372 -0.9615374 -0.9979105
## Merc 280          -0.4100529 -0.3189786  1.2692028
## Hornet 4 Drive    -0.2608649 -2.2981384 -0.1506983
## Fiat 128          -3.6062518 -0.2397594 -0.3383704
mtcars_round2_proj[, 1:3]
##                          PC1        PC2        PC3
## Chrysler Imperial  3.5468644 -0.6172103  0.5034982
## Hornet Sportabout  1.9526136 -0.8486058 -1.1221407
## Mazda RX4         -0.5001228  1.5108836 -0.6358186
## Pontiac Firebird   2.2211372 -0.9615374 -0.9979105
## Merc 280          -0.4100529 -0.3189786  1.2692028
## Hornet 4 Drive    -0.2608649 -2.2981384 -0.1506983
## Fiat 128          -3.6062518 -0.2397594 -0.3383704
max(abs(mtcars_round2_proj_manual - mtcars_round2_proj))
## [1] 0

Question: Why do we center and scale using the mean and standard deviation from the original data when we project new points on the PCs? What could be potential problems with this?

Other bits and pieces

Here we summarize some additional properties of PCA, which may be useful to know:

  • The sign of the principal components is arbitrary - the variance of \(X\) is the same as the variance of \(-X\). Thus, equivalent plots may look different just because one or the other of the axes has been flipped.
  • You may have heard that the principal components are eigenvectors of the covariance matrix - how does that go along with the SVD that we have shown here? Recall that \[M=UDV^T.\] We’re assuming that the columns of \(M\) are centered, which means that, up to a constant scaling factor, the covariance matrix of \(M\) is equal to \[M^TM=(VDU^T)(UDV^T)=VD^2V^T\] (since \(U\) is orthogonal). Multiplying from the right with e.g. the first column of \(V\) gives \[M^TMv_1=VD^2V^Tv_1=d_1^2v_1,\] which shows that the columns of \(V\) are eigenvectors of the covariance matrix of \(M\), and that the squared singular values are the corresponding eigenvalues. Similarly, the columns of \(U\) are eigenvectors of \(MM^T\). A small numerical check of this relationship is shown after this list.
  • When performing PCA, it’s good practice to record the fraction of the variance explained by each principal component. Also keep in mind that the ‘expected’ fraction depends on the size of the data set! With only two variables, the first two principal components will always contain 100% of the variance. Similarly, if you have only three samples, in any dimension, and each variable is mean-centered, the three points will always lie in at most a 2-dimensional plane. The more variables and samples the data set contains, the less variance is expected to be captured by the first few principal components. See e.g. this paper for a principled approach to estimating whether the obtained subspace contains more variance than ‘expected’ in random data.
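
As referenced in the list above, here is a small numerical check of the eigenvector relationship, using the unscaled PCA (mtcars_pca) and the covariance matrix of the data; the absolute values are used because the signs of the eigenvectors are arbitrary:

eig <- eigen(cov(mtcars))
## The eigenvectors match the principal components (columns of rotation) up to sign ...
max(abs(abs(eig$vectors) - abs(mtcars_pca$rotation)))
## ... and the eigenvalues match the component variances (sdev^2)
max(abs(eig$values - mtcars_pca$sdev^2))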

Shortcomings of PCA

While PCA is extremely widely used, it doesn’t automatically make it the best tool for every occasion. Here, we list some of the potential shortcomings of PCA (or, rather, properties that make it a suboptimal choice in certain situations):

  • PCA relies on the assumption that it makes sense to attempt to reproduce Euclidean distances, and that variance is a meaningful measure of ‘information content’. If this is not the case, another method may be more suitable.
  • Similarly, PCA only allows the mapping from the high-dimensional to the low-dimensional space to be a linear projection; nonlinear mappings are not considered. In other words, it assumes that the data lie on (or close to) a linear subspace of some dimension. There are nonlinear generalizations of PCA (kernel PCA), and of course many other dimensionality reduction techniques that attempt to find nonlinear structure.
  • PCA assigns a weight to each feature, which can sometimes be difficult to interpret if the number of features is very large. Sparse variants have been developed using the L1 (lasso) penalty, where typically only one among a set of correlated features will have a non-zero weight in a principal component.
  • PCA is unsupervised, which is often a strength, but sometimes a shortcoming. If the goal is to find a projection that represents some external variable (e.g. a categorization of samples), there are other more suitable techniques, e.g., linear discriminant analysis (LDA) or partial least squares-discriminant analysis (PLS-DA).
  • PCA considers a single data set, and attempts to extract linear combinations with maximal variance. Other approaches, such as partial least squares (PLS) or canonical correlation analysis (CCA) instead consider a pair of data sets (with the same observations, but possibly different sets of variables) and extract pairs of linear combinations with maximal covariance (PLS) or correlation (CCA).
  • For very large data sets (with many rows and columns), where only the first few principal components are required, approximate algorithms such as randomized SVD can be helpful (see e.g. the R packages rsvd, irlba or BiocSingular).

Session info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tibble_3.0.4     dplyr_1.0.2      ggcorrplot_0.1.3 ggrepel_0.8.2   
## [5] ggplot2_3.3.2   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         pillar_1.4.6       compiler_4.0.3     RColorBrewer_1.1-2
##  [5] plyr_1.8.6         tools_4.0.3        digest_0.6.27      evaluate_0.14     
##  [9] lifecycle_0.2.0    gtable_0.3.0       pkgconfig_2.0.3    rlang_0.4.8       
## [13] yaml_2.2.1         xfun_0.18          withr_2.3.0        stringr_1.4.0     
## [17] knitr_1.30         generics_0.0.2     vctrs_0.3.4        grid_4.0.3        
## [21] tidyselect_1.1.0   glue_1.4.2         R6_2.4.1           rmarkdown_2.5     
## [25] purrr_0.3.4        reshape2_1.4.4     farver_2.0.3       magrittr_1.5      
## [29] scales_1.1.1       ellipsis_0.3.1     htmltools_0.5.0    colorspace_1.4-1  
## [33] labeling_0.4.2     stringi_1.5.3      munsell_0.5.0      crayon_1.3.4