Numpy personal tips

numpy is another indispensable tool for data analysis. I'm leaving these notes here as a personal reminder.


github

  • The file in jupyter notebook format on github is here.

Author’s environment

The author’s environment and import method are as follows.

!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
!python -V
Python 3.7.3
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
import matplotlib.pyplot as plt

import numpy as np

print(np.__version__)
print(matplotlib.__version__)
1.16.2
3.0.3

Sampling from a uniform distribution

np.random.seed(N)

Sets the seed for the random number generator. If you set this, subsequent calls to np.random functions will reproduce the same sequence of values.

for i in range(5):
  np.random.seed(1)
  print(np.random.rand())
0.417022004702574
0.417022004702574
0.417022004702574
0.417022004702574
0.417022004702574
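As a side note (not part of the original environment above): recent NumPy versions also offer the Generator API via np.random.default_rng, which keeps reproducibility local to a generator object instead of the global np.random state. A minimal sketch:

```python
import numpy as np

# Two generators built from the same seed produce identical streams,
# without touching the global np.random state.
rng1 = np.random.default_rng(1)
rng2 = np.random.default_rng(1)

x = rng1.random(5)
y = rng2.random(5)
print(np.allclose(x, y))  # True
```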

np.random.rand(N,M,…)

np.random.rand(2,3)
array([[7.20324493e-01, 1.14374817e-04, 3.02332573e-01],
       [1.46755891e-01, 9.23385948e-02, 1.86260211e-01]])

np.random.random(shape)

np.random.random((2,3))
array([[0.34556073, 0.39676747, 0.53881673],
       [0.41919451, 0.6852195 , 0.20445225]])
np.random.random(10)
array([0.87811744, 0.02738759, 0.67046751, 0.4173048 , 0.55868983,
       0.14038694, 0.19810149, 0.80074457, 0.96826158, 0.31342418])

np.random.randint(N[,size])

np.random.randint(100,size=100)
array([22, 57, 1, 0, 60, 81, 8, 88, 13, 47, 72, 30, 71, 3, 70, 21, 49,
       57, 3, 68, 24, 43, 76, 26, 52, 80, 41, 82, 15, 64, 68, 25, 98, 87,
        7, 26, 25, 22, 9, 67, 23, 27, 37, 57, 83, 38, 8, 32, 34, 10, 23,
       15, 87, 25, 71, 92, 74, 62, 46, 32, 88, 23, 55, 65, 77, 3, 0, 77,
        6, 52, 85, 70, 2, 76, 91, 21, 75, 7, 77, 72, 75, 76, 43, 20, 30,
       36, 7, 45, 68, 57, 82, 96, 13, 10, 23, 81, 7, 24, 74, 92])

Random permutation and extraction

np.random.choice(a[, size])

Randomly extracts size elements from the array a.

a = np.arange(10)
print('a : ',a)
print('Extract one : ',np.random.choice(a))
print('Extract two : ',np.random.choice(a,2))
print('Extract three : ',np.random.choice(a,3))
a : [0 1 2 3 4 5 6 7 8 9]
Extract one : 4
Extract two : [0 1]
Extract three : [9 8 2]
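np.random.choice also accepts replace and p keyword arguments, for sampling without replacement and weighted sampling respectively. A small sketch:

```python
import numpy as np

np.random.seed(0)
a = np.arange(10)

# Sampling without replacement: all drawn elements are distinct
x = np.random.choice(a, 5, replace=False)
print(len(set(x)) == 5)  # True

# Weighted sampling: all probability mass on element 9,
# so 9 is always the result
p = np.zeros(10)
p[9] = 1.0
print(np.random.choice(a, p=p))  # 9
```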

np.random.shuffle(a)

Shuffles the array randomly in place. This is a destructive method: it modifies the original array and returns None.

a = np.arange(9)
b = np.random.shuffle(a)

print(a)
print(b)
[7 0 6 8 5 4 2 1 3]
None

For multidimensional arrays, only the first axis is shuffled.

a = np.arange(9).reshape(-1,3)
np.random.shuffle(a)
print(a)
[[3 4 5]
 [0 1 2]
 [6 7 8]]

np.random.permutation(a)

Shuffles the array randomly. This is a non-destructive method: it returns a shuffled copy and leaves the original array unchanged.

a = np.arange(9)
b = np.random.permutation(a)

print(a)
print(b)
[0 1 2 3 4 5 6 7 8]
[4 3 1 5 0 8 7 6 2]

For multidimensional arrays, shuffling is again done only along the first axis.

a = np.arange(9).reshape(-1,3)
b = np.random.permutation(a)

print('before')
print(a)
print('after')
print(b)
before
[[0 1 2]
 [3 4 5]
 [6 7 8]]
after
[[3 4 5]
 [0 1 2]
 [6 7 8]]
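Since both functions only shuffle the first axis, shuffling along another axis takes a little extra work. One option (my own workaround, not a dedicated numpy function) is to index with a permutation of that axis:

```python
import numpy as np

np.random.seed(0)
a = np.arange(12).reshape(3, 4)

# Permute columns: generate a random ordering of the column indices
cols = np.random.permutation(a.shape[1])
b = a[:, cols]

# Each row keeps its contents; only the column order changes
print(sorted(b[0].tolist()))  # [0, 1, 2, 3]
```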

Sampling from a probability distribution

Sampling from a particular probability distribution and computing probability values is usually done with scipy, but here are some examples of sampling with numpy.

np.random.normal([loc, scale, size])

This samples from a normal distribution. Specify the mean and standard deviation. The formula for the normal distribution is as follows. $\mu$ is the mean and $\sigma$ is the standard deviation.

$$ P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)} $$

x = np.random.normal(1,2,10000)

# Find the mean and standard deviation from the sampled values.
# Roughly match the set values
print('Mean : ',np.mean(x))
print('Standard deviation : ',np.std(x))
Mean : 1.025193167329272
Standard deviation : 2.0025521905807424
# Display the histogram and check the probability density distribution
plt.hist(x, bins=20)
(array([1.000e+00, 4.000e+00, 1.000e+01, 4.300e+01, 9.900e+01, 2.520e+02,
        4.990e+02, 8.380e+02, 1.165e+03, 1.511e+03, 1.646e+03, 1.404e+03,
        1.085e+03, 7.080e+02, 4.090e+02, 2.140e+02, 7.600e+01, 2.600e+01,
        5.000e+00, 5.000e+00]),
 array([-7.27230178, -6.47052873, -5.66875567, -4.86698262, -4.06520957,
        -3.26343652, -2.46166347, -1.65989042, -0.85811737, -0.05634432,
         0.74542873, 1.54720178, 2.34897483, 3.15074788, 3.95252093,
         4.75429398, 5.55606703, 6.35784008, 7.15961313, 7.96138618,
         8.76315923]),
 <a list of 20 Patch objects>)

While we’re at it, let’s check that this histogram agrees with the values obtained analytically from the formula.

# Normalize the histogram
plt.hist(x, bins=20, density=1,range=(-10,10))

# Calculate and plot the normal distribution
mu = 1
sigma = 2

def get_norm(x):
  import math
  return math.exp(-1 * (x - mu)**2 / 2 / sigma**2) / math.sqrt(2 * np.pi * sigma**2)

def get_norm_list(x):
  return [get_norm(i) for i in x]

x1 = np.linspace(-10,10,1000)
y1 = get_norm_list(x1)

plt.plot(x1,y1)
[<matplotlib.lines.Line2D at 0x11264ea20>]

You can see that they are roughly in agreement.
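The get_norm helper above computes the density one scalar at a time through a Python loop; NumPy can do the same in a single vectorized expression. A sketch with the same mu and sigma:

```python
import numpy as np

mu, sigma = 1, 2

def norm_pdf(x):
    # Vectorized normal density: works on scalars and arrays alike
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x1 = np.linspace(-10, 10, 1000)
y1 = norm_pdf(x1)   # no Python-level loop needed
print(y1.shape)     # (1000,)
```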

np.random.binomial(n,p[,size])

Binomial distribution. It takes a positive integer $n$ and a probability $p$ as parameters.

$$ P(k)=\frac{n!}{k!(n-k)!} p^k(1-p)^{n-k} $$

x = np.random.binomial(30,0.2,10000)

plt.hist(x, bins=12)
(array([ 100.,  337.,  815., 1331., 1721., 1807., 2635.,  662.,  341.,
         153.,   63.,   35.]),
 array([ 0.        ,  1.16666667,  2.33333333,  3.5       ,  4.66666667,
         5.83333333,  7.        ,  8.16666667,  9.33333333, 10.5       ,
        11.66666667, 12.83333333, 14.        ]),
 <a list of 12 Patch objects>)
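A quick sanity check: for a binomial distribution the mean is $np$ and the variance $np(1-p)$, so the sample moments should land near those values:

```python
import numpy as np

np.random.seed(0)
n, p = 30, 0.2
x = np.random.binomial(n, p, 100000)

print(np.mean(x))  # should be close to n*p = 6.0
print(np.var(x))   # should be close to n*p*(1-p) = 4.8
```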

np.random.poisson(lam[,size])

Poisson distribution. It takes $\lambda$ as a parameter and is expressed as follows. $$ P(k) = \frac{e^{-\lambda}\lambda^k}{k!} $$

x = np.random.poisson(5,10000)

plt.hist(x, bins=15)
(array([ 389.,  842., 1419., 1839., 1780., 1416., 1028.,  648.,  356.,
         160.,   74.,   28.,   15.,    4.,    2.]),
 array([ 0.        ,  1.06666667,  2.13333333,  3.2       ,  4.26666667,
         5.33333333,  6.4       ,  7.46666667,  8.53333333,  9.6       ,
        10.66666667, 11.73333333, 12.8       , 13.86666667, 14.93333333,
        16.        ]),
 <a list of 15 Patch objects>)
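For a Poisson distribution both the mean and the variance equal $\lambda$, which gives another easy check on the samples:

```python
import numpy as np

np.random.seed(0)
lam = 5
x = np.random.poisson(lam, 100000)

# For a Poisson distribution, mean and variance both equal lambda
print(np.mean(x))  # should be close to 5
print(np.var(x))   # should be close to 5
```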

np.random.beta(a,b[,size])

Beta distribution. It takes two parameters, $\alpha$ and $\beta$.

$$ P(x)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} $$

where $B(\alpha,\beta)$ is the beta function and $$ B(\alpha,\beta) = \int_0^1x^{\alpha-1}(1-x)^{\beta-1}dx $$

x = np.random.beta(2,4,10000)

plt.hist(x,bins=20)
(array([197., 570., 775., 916., 880., 988., 928., 934., 787., 685., 615.,
        490., 426., 288., 206., 154.,  82.,  55.,  18.,   6.]),
 array([0.00344225, 0.04987531, 0.09630837, 0.14274143, 0.18917449,
        0.23560755, 0.28204061, 0.32847367, 0.37490673, 0.42133979,
        0.46777285, 0.51420591, 0.56063897, 0.60707203, 0.65350509,
        0.69993815, 0.74637121, 0.79280427, 0.83923733, 0.88567039,
        0.93210345]),
 <a list of 20 Patch objects>)
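As a quick check (not in the original notebook), the mean of a $\mathrm{Beta}(\alpha,\beta)$ distribution is $\alpha/(\alpha+\beta)$, so with $\alpha=2,\beta=4$ the sample mean should land near $2/6 \approx 0.333$:

```python
import numpy as np

np.random.seed(0)
a, b = 2, 4
x = np.random.beta(a, b, 100000)

# Mean of Beta(a, b) is a / (a + b)
print(np.mean(x))  # should be close to 2/6 = 0.333...
```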

np.random.gamma(a,b[,size])

Gamma distribution. It has two parameters, the shape $\alpha$ and the scale $\beta$. The formula is as follows.

$$ P(x) = \frac{x^{\alpha - 1}\exp(-\frac{x}{\beta})}{\Gamma(\alpha)\beta^\alpha}\quad (x \geq 0) $$

where $\Gamma$ is the gamma function:

$$ \Gamma(x) = \int_0^\infty t^{x-1}e^{-t} dt \quad (x \geq 0) $$

x = np.random.gamma(3,2,10000)

plt.hist(x, bins=20)
(array([7.280e+02, 1.951e+03, 2.248e+03, 1.848e+03, 1.290e+03, 7.900e+02,
        5.320e+02, 2.760e+02, 1.720e+02, 7.500e+01, 3.900e+01, 2.100e+01,
        1.600e+01, 6.000e+00, 2.000e+00, 3.000e+00, 2.000e+00, 0.000e+00,
        0.000e+00, 1.000e+00]),
 array([ 0.214581 , 1.90909253, 3.60360406, 5.29811559, 6.99262712,
         8.68713865, 10.38165018, 12.07616171, 13.77067324, 15.46518477,
        17.1596963 , 18.85420783, 20.54871936, 22.24323089, 23.93774242,
        25.63225395, 27.32676548, 29.02127701, 30.71578854, 32.41030007,
        34.10481161]),
 <a list of 20 Patch objects>)
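Similarly, for a gamma distribution with shape $\alpha$ and scale $\beta$, the mean is $\alpha\beta$ and the variance is $\alpha\beta^2$. A quick check:

```python
import numpy as np

np.random.seed(0)
shape, scale = 3, 2
x = np.random.gamma(shape, scale, 100000)

# Mean of Gamma(shape, scale) is shape * scale
print(np.mean(x))  # should be close to 3 * 2 = 6
print(np.var(x))   # should be close to shape * scale**2 = 12
```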

np.random.multivariate_normal(mean, cov[, size, check_valid, tol])

Specify the mean and covariance matrix and sample from a correlated multidimensional normal distribution.

If the covariance matrix is

$$ \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) $$

then we are sampling from two uncorrelated normal distributions.

$$ \left( \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right) $$

then the two variables are perfectly correlated. With

$$ \left( \begin{array}{cc} 1 & 0.5 \\ 0.5 & 1 \end{array} \right) $$

we sample from two normal distributions with a covariance of 0.5.

$$ \left( \begin{array}{cc} 1 & -0.7 \\ -0.7 & 1 \end{array} \right) $$

then the correlation is negative.

a = np.array([0,0])
b = np.array([[1,0],[0,1]])

sample = np.random.multivariate_normal(a,b,size=1000)

import seaborn as sns

sns.set(style='darkgrid')
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11cb76978>
a = np.array([0,0])
b = np.array([[1,0.5],[0.5,1]])

sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11ef2e518>
a = np.array([0,0])
b = np.array([[1,0.8],[0.8,1]])

sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x102082908>
a = np.array([0,0])
b = np.array([[1,0.95],[0.95,1]])

sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11f1860f0>
a = np.array([0,0])
b = np.array([[1,-0.7],[-0.7,1]])

sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11f085320>
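As a numeric check to go with the plots above, np.cov recovers the covariance matrix we specified, up to sampling noise:

```python
import numpy as np

np.random.seed(0)
mean = np.array([0, 0])
cov = np.array([[1, 0.5], [0.5, 1]])

sample = np.random.multivariate_normal(mean, cov, size=100000)

# Empirical covariance matrix; np.cov expects variables in rows,
# so we transpose the (n_samples, 2) array
print(np.cov(sample.T))  # off-diagonal entries should be close to 0.5
```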

There are many more probability distributions needed for machine learning, but this concludes sampling in numpy. More details will be covered with scipy.