PageRank and Google matrix

I recently had to look up PageRank and Google Matrices, so I’ll try to summarize them for your notes.

The textbook is as follows, and the expressions of the formulas are also adapted from this.

I actually bought this in 2013 (8 years ago), and it had been lying on my bookshelf for a long time. I got a good chance to read it this time, so I’ll summarize the main points.

I also referred to the following PDF.

github

  • The file in jupyter notebook format is here

google colaboratory

  • If you want to run it in google colaboratory here

Author’s environment

! sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G103
Python -V
Python 3.8.5

Import the basic libraries and check their versions.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import scipy
import numpy as np
import pandas as pd
import pandas as pd
import networkx as nx

from IPython.display import SVG, display

print('matplotlib version :', matplotlib.__version__)
print('scipy version :', scipy.__version__)
print('numpy version :', np.__version__)
print('pandas version :', pd.__version__)
print('nx version :', nx.__version__)
matplotlib version : 3.3.2
scipy version : 1.5.2
numpy version : 1.19.2
pandas version : 1.1.3
nx version : 2.5

PageRank

PageRank is a method of ranking web pages created by Larry Page and Sergey Brin, two of the founders of Google. The basic idea is to evaluate the value of a website based on the relationship between the links it has and the links it receives. Sites that are linked to good websites will have higher value. You can find explanations of this in a search, so I will skip it.

The PageRank of a site is expressed in the following formula.

$$ r\left(P_{i}\right)=\sum_{P_{j}\in B_{P_{i}}}\frac{r\left(P_{j}\right)}{\left|P_{j}\right|} $$

The figure is very easy to understand.

Site A has a PageRank of 10, Site B has a PageRank of 20, and Site C has a PageRank of 30. Site A has two links out, Site B has one, and Site C has three. Now consider calculating the PageRank of site D. Site D has links from Site A, Site B, and Site C. Site A has a PageRank of 10 and links to two sites, so the PageRank contribution to one link is $\displaystyle \frac{10}{2}=5$. Similarly, from site B, $\displaystyle \frac{20}{1}=20$ and from site C, $\displaystyle \frac{30}{3}=10$, for a total of 35 points.


The problem here is that in the above, site A has 10 points, but how do we determine this in the first place? The problem is how to determine this in the first place.

This is where the random surfer model comes into play. In the random surfer model, the hyperlinks of a site are randomly followed, and when the process is repeated infinitely, the ranking of the site is determined in the order of the percentage of visits. The underlying theory is a Markov chain, where the surfer transitions to the next state according to a probability matrix. The transition probability depends only on the previous state, which is a Markov chain.

The fact that the next state depends only on the current state is called a Markov decision process, and can be expressed in general terms as follows

$$ P\left(X_{t+1}=S_{j}\mid X_{t}=S_{i_{t}}, X_{t-1}=S_{i_{t-1}}, \ldots, X_{0}=S_{i_{0}}\right)=P\left(X_{t+1}=S_{j}\mid X_{t}=S_{i_{t}}\right) $$

Also, the transition probability matrix represents the probability of transitioning from $i$ to $j$ by $S_{ij}$ and

$$ \mathbf{S} = \left(\begin{array}{cccccc} 0 & 1 / 2 & 1 / 2 & 0 & 0 & 0 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 1 / 3 & 1 / 3 & 0 & 0 & 1 / 3 & 0 \\ 0 & 0 & 0 & 0 & 1 / 2 & 1 / 2 \\ 0 & 0 & 0 & 1 / 2 & 0 & 1 / 2 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{array}\right) $$

which can be expressed as The sum of each row is always 1.

If the state vector at some time $k$ is $\pi$, then the state at $k+1$ is

$$ \boldsymbol{\pi}^{(k+1)T}=\boldsymbol{\pi}^{(k)T} \mathbf{S} $$

which becomes where $T$ denotes the transpose matrix. The $\pi$ represents the probability of each component being in a state. In other words, if the above calculation is repeated infinitely many times and $\pi$ converges to a single vector, then each component of the vector normalized so that the sum of the vectors is 1 corresponds to a PageRank. Then, if $H$ is a stochastic, irreducible matrix called the Google matrix, such that

$$ \begin{aligned} \mathbf{G} &=\alpha \mathbf{S}+(1-\alpha) \frac{\mathbf{E}}{n} \\ &=\alpha \mathbf{S}+(1-\alpha) 1 / n \mathbf{e}^{T} \\ &=\alpha\left(\mathbf{H}+1 / n \mathbf{a e}^{T}\right)+(1-\alpha) 1 / n \mathbf{e e}^{T} \\ &=\alpha \mathbf{H}+(\alpha \mathbf{a}+(1-\alpha) \mathbf{e}) 1 / n \mathbf{e}^{T} \end{aligned} $$

The Perron-Frobenius theorem proves that a vector converges to a vector if it can be expressed as $\mathbf{E}$ is a matrix with all components equal to 1, and $\alpha$ is the fraction of random jumps from one site to another, even if there are no links.

When $\pi$ converges to some vector.

$$ \boldsymbol{\pi}^{T}=\boldsymbol{\pi}^{T}(\alpha \mathbf{S}+(1-\alpha) \mathbf{E}) $$

which corresponds to the eigenvector with eigenvalue 1 of the Google matrix. In the end, PageRank comes down to finding the eigenvector of eigenvalue 1 of the Google matrix. And in the case of the Google matrix, it has been proven that such eigenvectors exist.

Power law

The power method is an algorithm to find the eigenvalue with the largest absolute value. If the largest eigenvalue and the eigenvalue with the next largest magnitude have the same magnitude, the convergence will be poor. However, according to the textbook, the convergence is not bad for a general Google matrix. See wiki .

Specifically, the product of the Google matrix and

$$ \boldsymbol{\pi}^{(k)T}= \boldsymbol{\pi}^{(k-1)T} \mathbf{G} $$

We use the fact that $\mathbf{\pi}$ converges to the eigenvector belonging to the largest eigenvalue of the Google matrix by iterating.

The initial value of $\mathbf{\pi}$ is given in the Larry Page paper as

$$ \displaystyle \boldsymbol{\pi}^{(0) T}=\frac{1}{n} \mathbf{e}^{T}. $$

The Google matrix is a stochastic matrix, so the largest eigenvalue will always be 1.

Actual calculations

Let’s try to calculate PageRank in practice, not just in theory. First, since the relationship between links and links to links is represented by a directed graph, we consider a simple network structure as follows.

G = nx.DiGraph()

G.add_nodes_from([0, 1, 2, 3, 4])
G.add_edges_from([(1, 3), (3, 5), (3, 4), (0, 3), (5, 3), (4, 4), (0, 1), (0, 5)])

svg = SVG(nx.nx_agraph.to_agraph(G).draw(prog='fdp', format='svg'))
display(svg)

Somehow, it seems that node 3 is linked to many sites, and node 4 is linked to node 3, so these two are likely to rank high. Node 2 is not linked to any sites, so it is likely to be strongly affected by the random jump weight $\alpha$.

Adjacency matrix, probability matrix, Google matrix

The adjacency matrix $P_{ij}$ is as follows. It is a matrix with a component of 1 if there is an edge from node $i$ to node $j$ and 0 otherwise.

$$ \mathbf{P} = \left(\begin{array}{cccccc} 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{array}\right) $$

Consider this as a probability matrix $H_{ij}$ so that the sum of all rows is 1.

$$ \mathbf{H} = \left(\begin{array}{cccccc} 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 / 2 & 1 / 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{array}\right) $$

However, this means that the components in the second row are all zero, and the convergence by power law will be poor. Therefore, we modify the row where all components are 0 to have a component of $\displaystyle \frac{1}{n}$.

$$ \mathbf{S} = \mathbf{H} + \frac{\mathbf{a}\mathbf{e^T}}{n} = \left(\begin{array}{cccccc} 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 0 & 1 / 2 & 1 / 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{array}\right) $$

The $\displaystyle \mathbf{a}$ will be a vector with a component of 1 if all the elements in the $i$ line are 0, and 0 otherwise.

In the form,

$$ \mathbf{G}=\alpha \mathbf{S}+(1-\alpha) \mathbf{e e}^{T} / n $$

this G is the concrete Google matrix.

Next, let’s actually find the eigenvectors of the Google matrix and calculate the PageRank with $\alpha = 0.7$.

# Compute S
S = np.array([.
  [0, 1/3, 0, 1/3, 0, 1/3],
  [0, 0, 0, 1, 0, 0],
  [1/6, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6],
  [0, 0, 0, 0, 1/2, 1/2],
  [0, 0, 0, 0, 0, 1, 0],
  [0, 0, 0, 1, 0, 0],
]
)
S
array([[0., 0.333333333, 0., 0.3333333, 0., 0.3333333, 0,
        0.333333333],
       [0. , 0. , 0. , 1. , 0. ,
        0. ],
       [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
        0.16666667],
       [0. , 0. , 0. , 0. , 0.5 ,
        0.5 ],
       [0. , 0. , 0. , 0. , 0. , 1,
        0. ],
       [0. , 0. , 0. , 0. , 1. , 0,
        0. ]])
alpha = 0.7
G = alpha * S + (1 - alpha) * np.ones((6,6)) / 6
G
array([[0.05 , 0.28333333, 0.05 , 0.28333333, 0.05 ,
        0.28333333],
       [0.05 , 0.05 , 0.05 , 0.05 , 0.75 , 0.05 ,
        0.05 ],
       [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
        0.16666667],
       [0.05 , 0.05 , 0.05 , 0.05 , 0.05 , 0.4 ,
        0.4 ],
       [0.05 , 0.05 , 0.05 , 0.05 , 0.05 , 0.75 ,
        0.05 ],
       [0.05 , 0.05 , 0.05 , 0.05 , 0.75 , 0.05 ,
        0.05 ])

Using this Google matrix, calculate the eigenvectors with eigenvalue 1.

Compute eigenvectors by power method

Usually, Google matrix is too large to be calculated analytically. Therefore, we will use the power method.

piT = np.array([1 for i in range(6)]) / 6
piT
array([0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
       0.16666667])
# For now, let's run it five times
for i in range(5):
  piT = np.dot(piT, G)
piT
array([0.05660377, 0.06981132, 0.05660377, 0.22191693, 0.44758184,
       0.14748236])
# Next, let's do it 10 times
for i in range(10):
  piT = np.dot(piT, G)
piT
array([0.05660377, 0.06981132, 0.05660377, 0.22191678, 0.44758216,
       0.14748219])

The values are almost unchanged, and we can see that five times is enough for this level of convergence.

Calculating eigenvectors with eigenvalue 1 using numpy

To verify the result of the power law, let’s calculate the eigenvectors in numpy.

ret = np.linalg.eig(G.T)
ret = np.linalg.eig(G.T)
(array([ 1.000000e+00+0.000000e+00j, -4.94974747e-01+0.000000e+00j,
         4.94974747e-01+0.00000000e-00j, -2.92237077e-16+8.50412879e-09j,
        -2.92237077e-16-8.50412879e-09j, 1.16666667e-01+0.00000000e+00j]),
 array([[ 1.06476080e-01+0.00000000e+00j, 5.11651005e-17+0.00000000e+00j,
          5.09626608e-17+0.00000000e-00j, 3.17996523e-16-2.57714020e-08j,
          3.17996523e-16+2.57714020e-08j, -1.74821640e-01+0.00000000e+00j],
        [ 1.31320499e-01+0.00000000e+00j, 3.46882037e-18+0.00000000e+00j,
         -3.72987590e-17+0.00000000e+00j, -7.07106781e-01+0.00000000e+00j,
         -7.07106781e-01-0.00000000e+00j, -5.24464919e-01+0.00000000e+00j],
        [ 1.06476080e-01+0.00000000e+00j, -9.53925603e-17+0.00000000e+00j,
          6.92427457e-18+0.00000000e-00j, 1.10198775e-15+7.88171169e-23j,
          1.10198775e-15-7.88171169e-23j, -1.74821640e-01+0.00000000e+00j],
        [ 4.17442646e-01+0.00000000e+00j, 7.94104488e-01+0.00000000e+00j,
          4.75963149e-01+0.00000000e-00j, -1.15889325e-15+3.43618693e-08j,
         -1.15889325e-15-3.43618693e-08j, 4.01061409e-01+0.00000000e+00j],
        [ 8.41936688e-01+0.00000000e+00j, -2.32587819e-01+0.00000000e+00j,
         -8.12519920e-01+0.00000000e+00j, 7.02009397e-16-1.71809347e-08j,
          7.02009397e-16+1.71809347e-08j, -2.05672517e-01+0.00000000e+00j],
        [ 2.77425425e-01+0.00000000e+00j, -5.61516668e-01+0.00000000e+00j,
          3.36556771e-01+0.00000000e+00j, 7.07106781e-01+8.59046734e-09j,
          7.07106781e-01-8.59046734e-09j, 6.78719308e-01+0.00000000e+00j]))

The first component of ret is the eigenvalue and the second component is the eigenvector corresponding to the eigenvalue. Since eigenvalue 1 is the first column, we can normalize the eigenvector to eigenvalue 1 as follows

pr = ret[1][:, 0] / np.sum(ret[1][:, 0])
np.abs(pr)
array([0.05660377, 0.06981132, 0.05660377, 0.22191678, 0.44758216,
       0.14748219])

We can see that the result is almost identical to the power law result.

A little analysis of the results shows that node 4 has the highest PageRank, followed by node 3. This is as we initially expected. Node 4 is only getting a link from one node, node 3, but node 3 is getting links from three nodes, from which it is getting only one link, so it is thought that it is getting a high PageRank.

Summary

It is said that everything started with a single Linux server, a crawler made with python, and PageRank, and Google, too, can only have respect for the two founders. The theory of PageRank itself is probably not that difficult, but I am simply impressed by the skills of the engineers and managers who actually implemented it on servers, designed the system for a group of servers including the crawler, and created a company with a market capitalization of over 100 trillion yen.

References.

Original work on PageRank published in 1998.

  • [1] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pp. 107 Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pp. 107-117, 1998.