Recently, I had the opportunity to review Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), so I am writing this article as a memo.

Overview of Singular Value Decomposition and Principal Component Analysis

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are important methods not only in recommendation systems but also in data analysis.

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a method to decompose a matrix $ A $ as follows:

$$ A = U \Sigma V^T $$

Here, $ U $ is an orthogonal matrix, $ \Sigma $ is a diagonal matrix of singular values, and $ V $ is an orthogonal matrix. SVD is used to reduce the rank of a matrix and in recommendation systems, it performs rating prediction by low-rank approximation of the rating matrix.

$$ \displaystyle R_k = U_k \Sigma_k V_k^T $$

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction method. It performs eigenvalue decomposition of the covariance matrix $ C $ and uses eigenvectors to reduce dimensions. PCA can also be implemented using SVD.

Source Code

The source code used in this article is as follows.

GitHub

  • The Jupyter notebook file can be found here

Google Colaboratory

  • To run on Google Colaboratory, click here

Execution Environment

The OS is macOS. Note that the options may differ from Linux or Unix commands.

!sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90
!python -V
Python 3.9.17

CSS settings are applied to make pandas tables more readable in HTML format.

from IPython.core.display import HTML

style = """
<style>
    .dataframe thead tr:only-child th {
        text-align: right;
    }

    .dataframe thead th {
        text-align: left;
        padding: 5px;
    }

    .dataframe tbody tr th {
        vertical-align: top;
        padding: 5px;
    }

    .dataframe tbody tr:hover {
        background-color: #ffff99;
    }

    .dataframe {
        background-color: white;
        color: black;
        font-size: 16px;
    }

</style>
"""
HTML(style)

Import basic libraries and use the watermark to check their versions. Additionally, set the seed for random numbers.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import random
import numpy as np

from pprint import pprint
from watermark import watermark

seed = 123
random_state = 123

random.seed(seed)
np.random.seed(seed)

# Function to display 0.0 instead of -0.0 when rounding
def ppprint(A):
    """
    A: np.array
        Matrix to display
    """
    pprint(np.where(A == -0.0, 0, A))

print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.17.2

numpy: 1.25.2

Watermark: 2.4.3

Implementation and Detailed Explanation with Python

Singular Value Decomposition (SVD)

Singular Value Decomposition decomposes an arbitrary $ m \times n $ matrix $ A $ as follows:

$$ A = U \Sigma V^T $$

Here, $ U $ is an $ m \times m $ orthogonal matrix, $ \Sigma $ is an $ m \times n $ diagonal matrix, and $ V $ is an $ n \times n $ orthogonal matrix. The diagonal elements of $ \Sigma $ are the singular values of $ A $, which are non-negative real numbers.

Example of SVD Calculation

For example, let $ A $ be a $ 5 \times 2 $ matrix:

$$ A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 7 & 6 \\ 2 & 0 \\ 3 & 1 \\ \end{pmatrix} $$

We will find the SVD of this matrix. The SVD itself can be calculated using numpy’s linalg module. Below is the implementation of SVD on a suitable matrix $ A $.

import numpy as np

A = np.array([[1, 2], [3, 4], [7, 6], [2, 0], [3, 1]])

U, S, V_T = np.linalg.svd(A, full_matrices=True)

print("=" * 40)
print("U : ")
pprint(U.round(2))

print("=" * 40)
print("S : ")
pprint(S.round(2))

# Creating a diagonal matrix from the singular value vector S
Sigma = np.zeros(A.shape)
for i in range(len(S)):
    Sigma[i, i] = S[i]

print("=" * 40)
print("Sigma : ")
pprint(Sigma.round(2))

print("=" * 40)
print("V_T : ")
pprint(V_T.round(2))
========================================
U :
array([[-0.19, -0.37, -0.82, -0.21, -0.33],
       [-0.44, -0.45, -0.05,  0.57,  0.52],
       [-0.83,  0.06,  0.37, -0.25, -0.34],
       [-0.13,  0.59, -0.27,  0.66, -0.36],
       [-0.26,  0.55, -0.35, -0.35,  0.62]])
========================================
S :
array([11.13,  2.24])
========================================
Sigma :
array([[11.13,  0.  ],
       [ 0.  ,  2.24],
       [ 0.  ,  0.  ],
       [ 0.  ,  0.  ],
       [ 0.  ,  0.  ]])
========================================
V_T :
array([[-0.75, -0.66],
       [ 0.66, -0.75]])

From this result, $ U $, $ S $, and $ V^T $ are obtained. Note that $ S $ is output as a vector of the diagonal elements of the diagonal matrix, so it needs to be converted into a diagonal matrix.

Example of Using SVD: Recommendation Systems

In recommendation systems, methods using SVD are very effective. For example, by decomposing the user-item rating matrix $ R $ using SVD, it is possible to reduce dimensions and extract latent features. This allows us to extract the user’s rating tendencies and item features, achieving highly accurate recommendations.

For instance, based on the size of the singular values, we arrange the singular values in descending order and select the top $ k $ singular values. $ k $ is a parameter that represents the new dimensionality. Using the column vectors or row vectors of $ U $, $ \Sigma $, and $ V^T $ corresponding to the top $ k $ singular values, we approximate the original matrix $ A $. Specifically,

$$ A_k = U[:, :k] \Sigma [:k, :k] V^T[:k, :] $$

This means extracting the first $ k $ columns from each matrix.

By this procedure, it is possible to approximate the original matrix $ A $ using only the information from the larger singular values. This enables more efficient data representation for purposes such as dimensionality reduction or noise reduction.

Briefly, here are the advantages and disadvantages:

Advantages

  1. It enables dimensionality reduction of high-dimensional data.
  2. It allows for the extraction of latent features, helping to understand the essential structure of the data.

Disadvantages

  1. High computational cost, especially for large datasets.
  2. Low tolerance for missing data.

Principal Component Analysis (PCA)

Principal Component Analysis transforms data into an orthogonal coordinate system to maximize the variance of the data.

PCA Using Eigenvalue Decomposition of Covariance Matrix

PCA is executed on a data matrix $ \mathbf{X} $ through the following steps:

  1. Transform the data matrix $ \mathbf{X} $ so that the mean is 0 for each column, and calculate the covariance matrix $ \mathbf{C} $.

    $ \mathbf{X}_i $ denotes the $ i $-th column of $ \mathbf{X} $.

$$ C_{i j} = \mathrm{E}\left[\left(\mathbf{X}_i - \mu_i \cdot \mathbf{1} \right)\left(\mathbf{X}_j - \mu_j \cdot \mathbf{1} \right)\right] = \mathrm{E}\left(\mathbf{X}_i \mathbf{X}_j\right) - \mathrm{E}\left(\mathbf{X}_i\right) \mathrm{E}\left(\mathbf{X}_j\right) $$

Here, $ \mathbf{1} $ is a vector with elements of 1.

$$ \mu_i = \mathrm{E}\left(\mathbf{X}_i\right) $$

$ \mu_i $ is the mean of $ \mathbf{X}_i $.

Generally, the covariance matrix is expressed as follows:

$$ \mathbf{C} = \mathrm{E}\left[(\mathbf{X} - \mathrm{E}[\mathbf{X}])^T (\mathbf{X} - \mathrm{E}[\mathbf{X}])\right] $$

  1. Calculate the eigenvalues $ \lambda_i $ and eigenvectors $ \mathbf{v}_i $ of the covariance matrix.

  2. Select eigenvectors in descending order of eigenvalues to form the new orthogonal basis.

PCA Using Rayleigh Quotient

To connect with the concept of transforming the coordinate system to maximize variance, we can consider using the Rayleigh quotient. Specifically, we find the vector $ \mathbf{v} $ that maximizes the Rayleigh quotient with respect to $ \mathbf{X}^T \mathbf{X} $:

$$ \mathbf{v}_1 = \underset{\mathbf{v} \neq \mathbf{0}}{\arg \max} \frac{||\mathbf{X} \mathbf{v}||^2}{||\mathbf{v}||^2} = \underset{\mathbf{v} \neq \mathbf{0}}{\arg \max} \frac{\mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v}}{\mathbf{v}^T \mathbf{v}} $$

The vector $ \mathbf{v}_1 $ calculated in this way corresponds to the eigenvector of the largest eigenvalue of $ \mathbf{X}^T \mathbf{X} $ due to the properties of the Rayleigh quotient. This can be confirmed as follows:

$$ \left(\mathbf{X} \mathbf{v}_1\right)^{T} \cdot \left(\mathbf{X} \mathbf{v}_1\right) = \mathbf{v}_1^{T} \mathbf{X}^{T} \mathbf{X} \mathbf{v}_1 = \mathbf{v}_1^{T} \lambda_1 \mathbf{v}_1 = \lambda_1 ||\mathbf{v}_1||^2 $$

The above equation represents the variance after projecting $ \mathbf{X} $ onto $ \mathbf{v}_1 $. This value corresponds to finding $ \mathbf{v}_1 $ that maximizes the variance of $ \mathbf{X} $.

Next, we calculate $ \mathbf{X}_2 $ by subtracting the data projected onto $ \mathbf{v}_1 $ from $ \mathbf{X} $:

$$ \mathbf{X}_2 = \mathbf{X} - \mathbf{X} \mathbf{v}_1 \mathbf{v}_1^{T} $$

Then, to find $ \mathbf{v}_2 $, we calculate a vector orthogonal to $ \mathbf{v}_1 $ as follows:

$$ \mathbf{v}_2 = \underset{\mathbf{v} \neq \mathbf{0}}{\arg \max} \frac{||\mathbf{X}_2 \mathbf{v}||^2}{||\mathbf{v}||^2} = \underset{\mathbf{v} \neq \mathbf{0}}{\arg \max} \frac{\mathbf{v}^T \mathbf{X}_2^T \mathbf{X}_2 \mathbf{v}}{\mathbf{v}^T \mathbf{v}} $$

The vector $ \mathbf{v}_2 $ calculated in this way corresponds to the eigenvector of the second largest eigenvalue of $ \mathbf{X}^T \mathbf{X} $.

By finding the top $ k $ eigenvectors in this manner, it is possible to transform the data into an orthogonal coordinate system that maximizes the variance.

Example of PCA Calculation

Next, we implement PCA in Python. We will prepare a suitable dataset and perform PCA, setting the number of principal components to 2.

First, we show an example of reducing dimensions of a handwritten digits dataset using PCA with scikit-learn.

from sklearn.decomposition import PCA

# Sample data matrix
X = np.array([[1, 2, 1, 4, 2], [2, 5, 4, 2, 1], [2, 1, -1, -2, 3], [4, 8, 1, 2, -1]])

# Number of principal components
k = 2

pca = PCA(n_components=k)
result = pca.fit_transform(X)

print("Data after PCA:")
pprint(result.round(2))
Data after PCA:
array([[ 1.26,  3.03],
       [-1.85,  1.19],
       [ 4.98, -2.15],
       [-4.38, -2.08]])

Next, we implement PCA using the above method without scikit-learn’s PCA.

# Sample data matrix
X = np.array([[1, 2, 1, 4, 2], [2, 5, 4, 2, 1], [2, 1, -1, -2, 3], [4, 8, 1, 2, -1]])

# Create data with mean 0 by subtracting the mean from observations
C = X - np.mean(X, axis=0)

# Calculate the covariance matrix
cov = np.cov(C.T)

# Calculate eigenvalues and eigenvectors of the covariance matrix
vals, vecs = np.linalg.eig(cov)

# Sort eigenvalues in descending order and get their indices
idx = np.argsort(vals)[::-1]
sorted_vals = vals[idx]

# Get eigenvectors corresponding to the top 2 eigenvalues
top_vecs = vecs[:, idx[:2]]

# Project data onto principal components
PC = C.dot(top_vecs)

print("Data after PCA:")
pprint(PC.real.round(2))
Data after PCA:
array([[ 1.26, -3.03],
       [-1.85, -1.19],
       [ 4.98,  2.15],
       [-4.38,  2.08]])

The above two results match except for the sign of the second column.

Example of Using PCA: Recommendation Systems

PCA is used for dimensionality reduction of data and is also effective in recommendation systems. For example, by reducing the dimensions of the user rating matrix, we can reduce the computational cost while still achieving high accuracy recommendations.

Advantages

  1. Reduces computational cost by dimensionality reduction.
  2. Maximizes data variance, minimizing information loss.

Disadvantages

  1. Difficult to capture nonlinear relationships as it is based on linear transformation.
  2. Important information may be lost during the dimensionality reduction process.

Conclusion

In this article, we explained Singular Value Decomposition and Principal Component Analysis with Python implementation.

Singular Value Decomposition and Principal Component Analysis are very effective techniques for dimensionality reduction and feature extraction. SVD is used in recommendation systems to extract latent features of users and items, achieving highly accurate recommendations. On the other hand, PCA reduces the dimensions of data by maximizing variance, thereby extracting important features and reducing computational cost. Both methods have advantages and disadvantages, so it is important to choose the appropriate method according to the purpose of use.