- Motivation I: Data Compression
- Motivation II: Visualization
- Principal Component Analysis Problem Formulation
- Principal Component Analysis Algorithm
- Choosing the Number of Principal Components
- Reconstruction from Compressed Representation
- Advice for Applying PCA
Motivation I: Data Compression
We may want to reduce the dimension of our features if we have a lot of redundant data.
To do this, we find two highly correlated features, plot them, and draw a new line that seems to describe both features well. We then project all of our data onto this single line, replacing the two old features with one new feature.
Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.
Note: in dimensionality reduction, we are reducing our features rather than our number of examples. Our variable \(m\) will stay the same size; \(n\), the number of features each example from \(x^{(1)}\) to \(x^{(m)}\) carries, will be reduced.
Motivation II: Visualization
It is not easy to visualize data with more than three dimensions. We can reduce the dimension of our data to 3 or fewer in order to plot it.
We need to find new features, \(z_1, z_2\) (and perhaps \(z_3\)) that can effectively summarize all the other features.
Example: hundreds of features related to a country's economic system may all be combined into one feature that you call "Economic Activity."
Principal Component Analysis Problem Formulation
The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA).
Problem formulation
Given two features, \(x_1\) and \(x_2\), we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature.
The same can be done with three features, where we map them to a plane.
The goal of PCA is to minimize the average of the squared distances from every data point to the projection line. This distance is the projection error.
- Reduce from 2d to 1d: find a direction (a vector \(u^{(1)} \in \mathbb{R}^n\)) onto which to project the data so as to minimize the projection error.
The more general case is as follows:
- Reduce from n-dimension to k-dimension: Find \(k\) vectors \(u^{(1)}, u^{(2)}, \dots, u^{(k)}\) onto which to project the data so as to minimize the projection error.
If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so \(k\) will be 2.
PCA is not linear regression:
- In linear regression, we are minimizing the squared error from every point to our predictor line. These are vertical distances.
- In PCA, we are minimizing the orthogonal (shortest) distances from every point to the line.
More generally, in linear regression we are taking all our examples in \(x\) and applying the parameters in \(\Theta\) to predict \(y\).
In PCA, we are taking a number of features \(x_1, x_2, \dots, x_n\) and finding a lower-dimensional surface that lies as close as possible to all of them. We aren't trying to predict any result, and we aren't applying any \(\theta\) weights to the features.
Principal Component Analysis Algorithm
Before we can apply PCA, there is a data pre-processing step we must perform:
Data preprocessing
- Let \(μ=\frac{1}{m}\sum^{m}_{i=1}x^{(i)}\).
- Replace each \(x^{(i)}\) with \(x^{(i)}−μ\).
- Let \(σ^2_j=\frac{1}{m}\sum^{m}_{i=1}(x^{(i)}_j)^2\).
- Replace each \(x^{(i)}_j\) with \(\frac{x^{(i)}_j}{σ_j}\).
Above, we first subtract the mean of each feature from the original feature, then scale each feature (\(x_j^{(i)} := \dfrac{x_j^{(i)} - \mu_j}{\sigma_j}\)) so that all features have comparable ranges.

We can define specifically what it means to reduce from 2d to 1d data as follows: \[x^{(i)} \in \mathbb{R}^2 \rightarrow z^{(i)} \in \mathbb{R}\] The \(z\) values are all real numbers and are the projections of our features onto \(u^{(1)}\).
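As a minimal Octave sketch of this preprocessing step (assuming, as in the code later in these notes, that the examples are stored row-wise in an \(m \times n\) matrix X; all variable names here are illustrative):
```
% Mean normalization and feature scaling before PCA.
% X is assumed to be an m x n matrix with one training example per row.
mu = mean(X);               % 1 x n vector of feature means
X_norm = X - mu;            % subtract each feature's mean
sigma = std(X_norm, 1);     % 1 x n standard deviations (normalized by m)
X_norm = X_norm ./ sigma;   % scale each feature
```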
So, PCA has two tasks: figure out \(u^{(1)},\dots,u^{(k)}\) and also to find \(z_1, z_2, \dots, z_m\).
Mathematical proof
We would like to automatically select the direction \(u\) along which the variance of the projected data is as large as possible.
To formalize this, note that given a unit vector \(u\) and a point \(x\), the length of the projection of \(x\) onto \(u\) is given by \(x^Tu\). That is, if \(x^{(i)}\) is a point in our dataset, then its projection onto \(u\) is at distance \((x^{(i)})^Tu\) from the origin. Hence, to maximize the variance of the projections, we would like to choose a unit-length \(u\) so as to maximize:
\[\frac{1}{m}\sum^m_{i=1}\left((x^{(i)})^Tu\right)^2 = \frac{1}{m}\sum^m_{i=1}u^Tx^{(i)}(x^{(i)})^Tu = u^T\left(\frac{1}{m}\sum^m_{i=1}x^{(i)}(x^{(i)})^T\right)u\]
We easily recognize that maximizing this subject to \(\|u\|^2 = 1\) gives the principal eigenvector of \(Σ = \frac{1}{m}\sum^m_{i=1}x^{(i)}(x^{(i)})^T\), which is just the empirical covariance matrix of the data (assuming it has zero mean).
If you haven't seen this before, try using the method of Lagrange multipliers to maximize \(u^TΣu\) subject to \(u^Tu = 1\). You should be able to show that \(Σu = λu\) for some \(λ\), which implies \(u\) is an eigenvector of \(Σ\) with eigenvalue \(λ\). Because \(Σ\) is symmetric, the \(u_i\)'s will (or always can be chosen to be) orthogonal to each other.
To summarize, we have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose \(u\) to be the principal eigenvector of \(Σ\). More generally, if we wish to project our data into a \(k\)-dimensional subspace (\(k < n\)), we should choose \(u_1, \dots, u_k\) to be the top \(k\) eigenvectors of \(Σ\). The \(u_i\)'s now form a new, orthogonal basis for the data.
Then, to represent \(x^{(i)}\) in this basis, we need only compute the corresponding vector \[y^{(i)}=\begin{bmatrix} u^T_1x^{(i)} \\ u^T_2x^{(i)} \\ \vdots \\ u^T_kx^{(i)} \end{bmatrix}∈\mathbb{R}^k\]
Thus, whereas \(x^{(i)}∈\mathbb{R}^n\), the vector \(y^{(i)}\) now gives a lower, \(k\)-dimensional approximation/representation for \(x^{(i)}\). PCA is therefore also referred to as a dimensionality reduction algorithm. The vectors \(u_1, \dots, u_k\) are called the first \(k\) principal components of the data.
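As a hedged illustration of this derivation (not part of the original notes), the Octave sketch below builds \(Σ\) from zero-mean data and takes its top-\(k\) eigenvectors with eig; the variables X and k are assumed inputs:
```
% Assumes X is an m x n matrix of zero-mean examples (one per row) and k <= n.
m = size(X, 1);
Sigma = (1/m) * (X' * X);               % empirical covariance matrix, n x n
[V, D] = eig(Sigma);                    % columns of V are eigenvectors of Sigma
[~, order] = sort(diag(D), 'descend');  % sort eigenvalues from largest to smallest
Uk = V(:, order(1:k));                  % top-k principal components u_1, ..., u_k
Y = X * Uk;                             % row i of Y is (y^{(i)})'
```
The notes below use svd(Sigma) instead; for a symmetric positive semi-definite \(Σ\), svd returns the same directions and is numerically well behaved.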
The PCA Algorithm
1. Compute the "covariance matrix": \[ \large \Sigma = \dfrac{1}{m}\sum^m_{i=1}(x^{(i)})(x^{(i)})^T \] This can be vectorized in Octave as:
```
Sigma = (1/m) * X' * X;
```
We denote the covariance matrix with a capital sigma (which happens to be the same symbol for summation, confusingly---they represent entirely different things).
Note that \(x^{(i)}\) is an \(n \times 1\) vector and \((x^{(i)})^T\) is a \(1 \times n\) vector, while \(X\) is an \(m \times n\) matrix (row-wise stored examples). The product \((x^{(i)})(x^{(i)})^T\) is an \(n \times n\) matrix, which matches the dimensions of \(\Sigma\).
2. Compute the "eigenvectors" of the covariance matrix \(\Sigma\):
```
[U,S,V] = svd(Sigma);
```
svd() is the singular value decomposition, a built-in Octave function. What we actually want out of svd() is the 'U' matrix of the Sigma covariance matrix: \(U \in \mathbb{R}^{n \times n}\). Its columns are \(u^{(1)},\dots,u^{(n)}\), which is exactly what we want.
3. Take the first \(k\) columns of the \(U\) matrix and compute \(z\).

We'll assign the first \(k\) columns of \(U\) to a variable called 'Ureduce'. This will be an \(n \times k\) matrix. We compute \(z\) with: \[ \large z^{(i)} = U_{reduce}^T \cdot x^{(i)} \]
\(U_{reduce}^T\) has dimensions \(k \times n\) while \(x^{(i)}\) has dimensions \(n \times 1\). The product \(U_{reduce}^T \cdot x^{(i)}\) has dimensions \(k \times 1\).
To summarize, the whole algorithm in Octave is roughly:
```
Sigma = (1/m) * X' * X;   % compute the covariance matrix
[U,S,V] = svd(Sigma);     % compute our projected directions
Ureduce = U(:,1:k);       % take the first k directions
z = Ureduce' * x;         % compute the projected data points
```
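The last line above projects a single column vector x. As a small sketch (reusing the same variable names), the whole row-wise dataset X can be projected at once:
```
Z = X * Ureduce;   % m x k matrix; row i is (z^{(i)})' for example x^{(i)}
```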
Choosing the Number of Principal Components
How do we choose \(k\), also called the number of principal components? Recall that \(k\) is the dimension we are reducing to.
One way to choose \(k\) is by using the following formula:
- Given the average squared projection error: \(\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2\)
- Also given the total variation in the data: \(\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2\)
- Choose \(k\) to be the smallest value such that: \[\large \dfrac{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2}{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2} \leq 0.01\]
In other words, the squared projection error divided by the total variation should be less than one percent, so that 99% of the variance is retained.
Algorithm for choosing \(k\)
- Try PCA with \(k = 1, 2, \dots\)
- Compute \(U_{reduce}\), \(z^{(1)},\dots,z^{(m)}\), and \(x_{approx}^{(1)},\dots,x_{approx}^{(m)}\)
- Check the formula given above that 99% of the variance is retained. If not, go to step one and increase \(k\).
This procedure would be horribly inefficient, since it reruns PCA for every candidate \(k\). In Octave, we can instead call svd once:
```
[U,S,V] = svd(Sigma)
```
This gives us the diagonal matrix S. We can check for 99% retained variance using S as follows: \[ \dfrac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \geq 0.99 \]
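A minimal Octave sketch of this check, assuming Sigma has already been computed as above and that we want to retain at least 99% of the variance:
```
[U, S, V] = svd(Sigma);          % S is a diagonal n x n matrix
s = diag(S);                     % the values S11, ..., Snn
retained = cumsum(s) / sum(s);   % fraction of variance retained for k = 1, ..., n
k = find(retained >= 0.99, 1);   % smallest k retaining at least 99% of the variance
```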
Reconstruction from Compressed Representation
If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features?
To go from 1-dimension back to 2d we do: \(z \in \mathbb{R} \rightarrow x \in \mathbb{R}^2\).
We can do this with the equation: \(x_{approx}^{(i)} = U_{reduce} \cdot z^{(i)}\).
Note that we can only get approximations of our original data.
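A hedged Octave sketch of this reconstruction, for one example and for the whole compressed dataset (variable names follow the earlier blocks):
```
x_approx = Ureduce * z;     % n x 1 approximation of a single original example
X_approx = Z * Ureduce';    % m x n approximations of all examples, stored row-wise
```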
Advice for Applying PCA
The most common use of PCA is to speed up supervised learning.
Given a training set with a large number of features (e.g. \(x^{(1)},\dots,x^{(m)} \in \mathbb{R}^{10000}\)) we can use PCA to reduce the number of features in each example of the training set (e.g. \(z^{(1)},\dots,z^{(m)} \in \mathbb{R}^{1000}\)).
Note that the PCA mapping from \(x^{(i)}\) to \(z^{(i)}\) should be defined only on the training set, not on the cross-validation or test sets. Once it is defined on the training set, the same mapping can then be applied to the cross-validation and test sets.
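A sketch of this workflow, assuming Xtrain and Xtest are row-wise example matrices and that \(k\) has already been chosen (all names here are illustrative):
```
% Fit the preprocessing and the PCA mapping on the training set only.
mu = mean(Xtrain);
sigma = std(Xtrain - mu, 1);
Xtrain_norm = (Xtrain - mu) ./ sigma;
Sigma = (1 / size(Xtrain_norm, 1)) * (Xtrain_norm' * Xtrain_norm);
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = Xtrain_norm * Ureduce;

% Apply the same mu, sigma, and Ureduce to the test set; do not refit them.
Xtest_norm = (Xtest - mu) ./ sigma;
Ztest = Xtest_norm * Ureduce;
```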
Applications
- Compression
  - Reduce the space needed to store data
  - Speed up a learning algorithm
- Visualization of data
  - Choose \(k = 2\) or \(k = 3\)
Bad use of PCA: trying to prevent overfitting. We might think that reducing the features with PCA would be an effective way to address overfitting. It might work, but is not recommended because it does not consider the values of our results \(y\). Using just regularization will be at least as effective.
Don't assume you need to do PCA. Try your full machine learning algorithm without PCA first. Then use PCA if you find that you need it.