Singular value decomposition and low-rank approximation

In this article I summarize the linear algebra that is mainly needed to understand recommendation systems. The user-item matrices (preference matrices) used in recommendation systems are often discussed under the assumption that a low-rank approximation holds.

This is based on the implicit assumption that users can often be classified into a small number of clusters. For example, even if a book e-commerce site has one million users, they can be categorized to some extent, such as users who often buy programming books, users who buy math books, users who buy medical books, and users who buy weekly magazines.

The singular value decomposition is what makes such a low-rank approximation possible. It extracts the singular vectors belonging to the $k$ largest singular values, compressing the data while still allowing us to predict a user's preference for each item.
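
As a rough sketch of this idea, a truncated SVD compresses a user-item matrix into $k$ latent factors and produces a score for every user-item pair. The matrix below is a made-up toy example, not the notation used in the rest of this article.

import numpy as np

# Hypothetical toy user-item matrix (rows: users, columns: items, 0 = not rated).
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

# Keep only the k largest singular values and the corresponding singular vectors.
U, S, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# R_k is a rank-k "smoothed" version of R; its entries can be read as predicted scores.
print(R_k.round(2))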

github

  • The file in jupyter notebook format is here.

google colaboratory

  • To run it in google colaboratory, click here.

Singular value decomposition

Singular value decomposition is defined for a general $m \times n$ matrix $A$, which need not be square, as follows.

$$ A=U\Sigma V^{\mathrm{T}} $$

$$ \mathbf{A}=\mathbf{U} \cdot \operatorname{diag}\left(\sigma_{1}, \ldots, \sigma_{r}, \mathbf{0}\right) \cdot \mathbf{V}^{*} $$

The $\sigma_1, \sigma_2, \cdots$ are the singular values and are usually numbered in order from largest to smallest:

$$ \sigma_{1}\geqq \cdots \geqq \sigma_{r}>0 $$

$r$ is the rank of the matrix $A$:

$$ r=\operatorname{rank}(\mathbf{A}) $$

$U$ and $V$ are $m \times m$ and $n \times n$ unitary matrices, respectively:

$$ \mathbf{U}^{-1}=\mathbf{U}^{*} $$

$$ \mathbf{V}^{-1}=\mathbf{V}^{*} $$

If $A$ is a symmetric positive semidefinite matrix, the singular values and eigenvalues of $A$ coincide; for a general symmetric matrix, the singular values are the absolute values of the eigenvalues.
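
A quick numerical check of this relationship, with an arbitrarily chosen symmetric matrix, might look like this:

import numpy as np

# An arbitrary symmetric matrix (not necessarily positive semidefinite).
B = np.array([[2., -1., 0.],
              [-1., 0., 3.],
              [0., 3., 1.]])

eig = np.linalg.eigvalsh(B)              # real eigenvalues of the symmetric matrix
sv = np.linalg.svd(B, compute_uv=False)  # singular values, in descending order

# The singular values equal the absolute values of the eigenvalues.
print(np.allclose(np.sort(np.abs(eig))[::-1], sv))
True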

Image of the matrix

The visual representation of the matrix after decomposition is as follows.

$$ A=U\Sigma V^{\mathrm{T}}=U\left(\begin{array}{ccc|c} \sigma_{1} & & & \\ & \ddots & & 0 \\ & & \sigma_{r} & \\ \hline & 0 & & 0 \end{array}\right) V^{\mathrm{T}} = \left( \mathbf{u}_{1}\cdots \mathbf{u}_{r}\right) \left(\begin{array}{ccc} \sigma_{1} & & \\ & \ddots & \\ & & \sigma_{r} \end{array}\right) \left(\begin{array}{c} \mathbf{v}_{1}^{\mathrm{T}} \\ \vdots \\ \mathbf{v}_{r}^{\mathrm{T}} \end{array}\right) $$

Eigenvalue decomposition and singular value decomposition

Since $U$ and $V$ in $A=U\Sigma V^{T}$ are unitary matrices, the column vectors $\mathbf{u}$ of $U$ and $\mathbf{v}$ of $V$ satisfy the following relations.

$$ \begin{aligned} \mathbf{A} \mathbf{v} &=\sigma \mathbf{u} \\ \mathbf{A}^{T} \mathbf{u} &=\sigma \mathbf{v} \end{aligned} $$

Transforming these further,

$$ \begin{aligned} &\mathbf{A}^{T} \mathbf{A} \mathbf{v}=\sigma \mathbf{A}^{T} \mathbf{u}=\sigma^{2} \mathbf{v} \\ &\mathbf{A} \mathbf{A}^{T} \mathbf{u}=\sigma \mathbf{A} \mathbf{v}=\sigma^{2} \mathbf{u} \end{aligned} $$

$\mathbf{u}$ and $\mathbf{v}$ are called the left and right singular vectors, respectively, and they are eigenvectors of $\mathbf{A}\mathbf{A}^{T}$ and $\mathbf{A}^{T}\mathbf{A}$, respectively. Also, the square of a singular value of $\mathbf{A}$ is an eigenvalue of both $\mathbf{A}\mathbf{A}^{T}$ and $\mathbf{A}^{T}\mathbf{A}$.

From this property, we can link the singular value decomposition to the eigenvalue decomposition. First, perform the eigenvalue decomposition of $\mathbf{A}^{T}\mathbf{A}$ to find its eigenvectors $\mathbf{v}_i$ and eigenvalues $\lambda_i$. Since $\lambda_i = \sigma_i^2$,

$$ \mathbf{u}_i = \frac{1}{\sqrt{\lambda_i}}\mathbf{A} \mathbf{v_i} $$

gives $\mathbf{u}_i$. Starting instead from $\mathbf{A}\mathbf{A}^{T}$, $\mathbf{v}_i$ can be obtained in the same way, so the singular value decomposition can be computed using an eigenvalue decomposition algorithm.
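
A minimal sketch of this procedure in NumPy, assuming for simplicity that $A$ has full column rank so that none of the $\lambda_i$ are zero, might look like the following:

import numpy as np

A = np.random.rand(4, 3)                 # example matrix (full column rank almost surely)

# Eigenvalue decomposition of A^T A; eigh returns eigenvalues in ascending order.
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]            # reorder from largest to smallest
lam, V = lam[order], V[:, order]

sigma = np.sqrt(lam)                     # sigma_i = sqrt(lambda_i)
U = A @ V / sigma                        # u_i = (1 / sigma_i) A v_i, column by column

# Compare with NumPy's built-in SVD.
print(np.allclose(U @ np.diag(sigma) @ V.T, A))
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))
True
True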

Operator norm

The norm defined for a matrix $A$ by

$$ ||\mathbf{A}||=\max _{x \in \mathbb{C}^{n},||x||=1}||\mathbf{A} \mathbf{x}|| $$

is called the operator norm. Using singular value decomposition, we can easily obtain the operator norm.

$$ \begin{aligned} ||A x||^{2} &=||U \Sigma V^{*} x||^{2} \\ &=x^{*}V \Sigma^{*} U^{*}U \Sigma V^{*} x=||\Sigma y||^{2} \\ &=\sigma_{1}^{2}\left|y_{1}\right|^{2}+\cdots+\sigma_{r}^{2}\left|y_{r}\right|^{2} \quad (y = V^{*} x) \end{aligned} $$

Therefore,

$$ \begin{aligned} ||\mathbf{A}||&=\max_{x \in \mathbb{C}^{n},||x||=1}||A x|| \\ &=\max_{||y||=1} \sqrt{\sigma_{1}^{2}\left|y_{1}\right|^{2}+\sigma_{2}^{2}\left|y_{2}\right|^{2}+\cdots+\sigma_{r}^{2}\left|y_{r}\right|^{2}} \\ &=\sigma_{1} \quad\left(\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{r} \right) \end{aligned} $$

and the operator norm is the largest singular value.

Operator norm of unitary matrix

The operator norm of a unitary matrix is 1. This is clear from the definition: if $U$ is a unitary matrix ($U^{*}U=I$), then

$$ ||U x||^{2}=x^{*} U^{*} U x=||x||^{2} = 1 $$

for any $x$ with $||x||=1$.
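
As a quick numerical check, we can build an orthogonal (real unitary) matrix via a QR decomposition of a random matrix and confirm that all of its singular values, and hence its operator norm, equal 1:

import numpy as np

Q, _ = np.linalg.qr(np.random.rand(4, 4))   # Q is orthogonal: Q^T Q = I

print(np.linalg.svd(Q, compute_uv=False).round(6))
print(np.linalg.norm(Q, 2).round(6))
[1. 1. 1. 1.]
1.0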

Triangle inequality

$$ ||A+B|| \leq ||A||+||B|| $$

Inequality for products

$$ ||A B|| \leq||A|| \cdot||B|| $$
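
Both inequalities are easy to check numerically for arbitrary matrices, using np.linalg.norm with ord=2, which computes the operator norm as the largest singular value:

import numpy as np

X = np.random.rand(4, 4)
Y = np.random.rand(4, 4)

def op(M):
    return np.linalg.norm(M, 2)          # operator norm = largest singular value

# Triangle inequality and the inequality for products.
print(op(X + Y) <= op(X) + op(Y))
print(op(X @ Y) <= op(X) * op(Y))
True
True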

Eckart-Young’s theorem

Low rank approximation

Approximating $A$ by keeping only the $k$ largest singular values (and the corresponding singular vectors) is called low-rank approximation.

$$ A_{k}=U \Sigma_{k} V^{\mathrm{T}} \equiv U \operatorname{diag}\left(\sigma_{1}, \ldots, \sigma_{k}, 0\right) V^{\mathrm{T}}=\sum_{i=1}^{k} \sigma_{i} u_{i} v_{i}^{\mathrm{T}}(0 \leq k<r) $$

In other words, we set $\sigma_{k+1}, \cdots, \sigma_{r}$ to 0.

Details of the theorem

I would like to note that Eckart-Young’s theorem is often cited in recommendation system papers when the discussion is based on low-rank approximations. You can find the details in reference [1].

$$ \min_{\operatorname{rank}(X) \leq k}||X-A||=||A_{k}-A||=\sigma_{k+1}(A)=\min _{\operatorname{rank}(X)=k}||X-A|| $$

This theorem states that the low-rank approximation $A_k$ lies on a sphere of radius $\sigma_{k+1}$ centered at $A$, and that no matrix of rank at most $k$ is closer to $A$; this holds for every $k$. I think that measuring the distance with the operator norm makes the result very clear and intuitive. The proof is given in [1]; please refer to it.

Notes

$$ \begin{aligned} ||A_{k}-A|| &=||U\left(\Sigma_{k}-\Sigma\right) V^{T}||=||\Sigma_{k}-\Sigma|| \\ &=||\operatorname{diag}\left(0, \ldots, 0, \sigma_{k+1}, \ldots, \sigma_{r}, 0\right)||=\sigma_{k+1} \end{aligned} $$
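
A minimal numerical check of this statement, constructing $A_k$ from the truncated SVD of a random matrix, might look like the following (the two printed values should agree):

import numpy as np

A = np.random.rand(5, 4)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation

# The operator norm of the error equals the (k+1)-th singular value.
print(np.linalg.norm(A_k - A, 2).round(6))
print(S[k].round(6))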

Frobenius norm

The Frobenius norm is a measure of the size of a matrix, defined as the square root of the sum of the squares of all its entries. It is written as $||A||_{\mathrm{F}}$.

$$ ||A||_{\mathrm{F}}=\sqrt{\sum_{i, j} a_{i j}^{2}} $$

Relationship to Singular Values

The squared Frobenius norm is equal to the sum of the squares of the singular values.

$$ ||A||_{\mathrm{F}}^{2}=\sum_{i=1}^{r} \sigma_{i}^{2} $$

Relationship with traces

The trace of the product of $A$ with its transpose equals the squared Frobenius norm, as can be verified by direct calculation.

$$ ||A||_{\mathrm{F}}^{2}=\operatorname{tr}\left(A A^{\top}\right)=\operatorname{tr}\left(A^{\top} A\right) $$

Implementation with python

Let’s try to implement the singular value decomposition using Python.

Author’s environment

The author’s OS is macOS, so some command options differ from those of Linux and Unix.

!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G103
!python -V
Python 3.8.5
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import scipy
import numpy as np
import pandas as pd

print('matplotlib version :', matplotlib.__version__)
print('scipy version :', scipy.__version__)
print('numpy version :', np.__version__)
matplotlib version : 3.3.2
scipy version : 1.5.2
numpy version : 1.19.2

Prepare an arbitrary matrix A

Create a 4x5 matrix of integers.

A = np.array([])
for i in range(4):
  for j in range(5):
    A = np.append(A, np.random.randint(1, 10))
A = A.reshape((4,5))
A
array([[2., 2., 4., 4., 3.],
       [4., 7., 8., 4., 6.],
       [6., 7., 5., 6., 4.],
       [5., 7., 4., 1., 2.]])

Check $\text{rank}(A)$.

np.linalg.matrix_rank(A)
4

Singular Value Decomposition

We will now perform the singular value decomposition.

u, S, v_T = np.linalg.svd(A)

Check $U$, $\Sigma$, and $V$.

u.round(2)
array([[-0.3 , 0.57, 0.14, 0.75],
       [-0.62, 0.27, -0.66, -0.33],
       [-0.59, -0.01, 0.73, -0.36],
       [-0.43, -0.78, -0.14, 0.44]])
S.round(2)
array([21.37, 4.48, 3.14, 0.55])
v_T.T.round(2)
array([[-0.41, -0.39, 0.41, 0.42, 0.58],
       [-0.56, -0.56, -0.07, -0.44, -0.41],
       [-0.51, 0.28, -0.52, 0.59, -0.22],
       [-0.36, 0.55, 0.68, -0.08, -0.31],
       [-0.37, 0.38, -0.29, -0.53, 0.59]])
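
We can also verify the relation $\mathbf{A} \mathbf{v}_i=\sigma_i \mathbf{u}_i$ from the section above, for example for the first singular value and its singular vectors.

np.allclose(A @ v_T[0], S[0] * u[:, 0])
True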

Reproduction check

Verify that the original matrix is reproduced by multiplying the three matrices together.

(u @ np.append(np.diag(S), [[0], [0], [0], [0]], axis=1) @ v_T).round(2)
array([[2., 2., 4., 4., 3.],
       [4., 7., 8., 4., 6.],
       [6., 7., 5., 6., 4.],
       [5., 7., 4., 1., 2.]])

Subtract from the original $A$.

A.round(2) - (u @ np.append(np.diag(S), [[0], [0], [0], [0]], axis=1) @ v_T).round(2)
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

The difference is the zero matrix, so the product reproduces the original $A$.

We also check that $U$ and $V$ are unitary matrices.

(u @ u.T).round(2)
array([[ 1., 0., -0., 0.],
       [ 0., 1., 0., 0.],
       [-0., 0., 1., -0.],
       [ 0., 0., -0., 1.]])
(v_T @ v_T.T).round(2)
array([[ 1., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., -0.],
       [ 0., 0., 1., -0., -0.],
       [ 0., 0., -0., 1., -0.],
       [ 0., -0., -0., -0., 1.]])

Both are identity matrices, so we have confirmed that $U$ and $V$ are unitary.

Calculating the operator norm

To calculate the operator norm in python, we just need to calculate the largest singular value. Let $A$ be the matrix we defined earlier.

S[0].round(2)
21.37
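
numpy can also compute this directly: passing 2 as the ord argument of np.linalg.norm returns the largest singular value (the operator norm), which matches the value above.

np.linalg.norm(A, 2).round(2)
21.37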

Calculating the Frobenius norm

The Frobenius norm can be easily calculated with numpy’s linalg.

np.linalg.norm(A, 'fro').round(2)
22.07

It can also be computed using the following property. You can see that the two results are consistent.

$$ ||A||_{\mathrm{F}}^{2}=\operatorname{tr}\left(A A^{\top}\right)=\operatorname{tr}\left(A^{\top} A\right) $$

np.sqrt(np.trace(A.T @ A)).round(2)
22.07
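
It can also be computed from the singular values obtained earlier, using the relation $||A||_{\mathrm{F}}^{2}=\sum_{i} \sigma_{i}^{2}$; the result agrees again.

np.sqrt((S ** 2).sum()).round(2)
22.07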

Spectral radius

The largest absolute value among the eigenvalues of a square matrix is called the spectral radius, which affects, for example, the convergence of PageRank and of iterative methods for solving linear equations.

np.abs(np.linalg.eig(np.array([[1,2,1], [2,1,3], [3,1,2]]))[0]).max().round(2)
5.24

Summary

Linear algebra is really deep. I have had the opportunity to study it twice now, once as a student and once as an office worker, but there are still many things I do not understand. Papers on recommendation systems also build their theory on various properties like these, so I want to understand them one by one.

References.