Numpy personal tips
numpy is another indispensable tool for data analysis. These notes serve as a personal reminder.
- See the official page for more details.
Contents
- 1. basic operations
- 2. trigonometric functions
- 3. exponential and logarithmic
- 4. statistical functions
- 5. linear algebra
- 6. sampling <= this article
- 7. misc
github
- The file in jupyter notebook format on github is here .
Author’s environment
The author’s environment and import method are as follows.
!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
Python -V
Python 3.7.3
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
print(np.__version__)
print(matplotlib.__version__)
1.16.2
3.0.3
Sampling from a uniform distribution.
np.random.seed(N)
Sets the seed of the random number generator. After setting the same seed, np.random produces the same sequence of random numbers.
for i in range(5):
    np.random.seed(1)
    print(np.random.rand())
0.417022004702574
0.417022004702574
0.417022004702574
0.417022004702574
0.417022004702574
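The loop above re-seeds before every draw, which is why it prints the same number five times. Seeding once and then drawing several values gives a reproducible *sequence*, not the same single value. A small sketch (not in the original notebook) to illustrate:

```python
import numpy as np

np.random.seed(1)
first = [np.random.rand() for _ in range(3)]

# Re-seeding with the same value replays the identical sequence
np.random.seed(1)
second = [np.random.rand() for _ in range(3)]

print(first == second)
```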
np.random.rand(N,M,…)
np.random.rand(2,3)
array([[7.20324493e-01, 1.14374817e-04, 3.02332573e-01],
[1.46755891e-01, 9.23385948e-02, 1.86260211e-01]])
np.random.random(shape)
np.random.random((2,3))
array([[0.34556073, 0.39676747, 0.53881673],
       [0.41919451, 0.6852195 , 0.20445225]])
np.random.random(10)
array([0.87811744, 0.02738759, 0.67046751, 0.4173048 , 0.55868983,
0.14038694, 0.19810149, 0.80074457, 0.96826158, 0.31342418])
np.random.randint(N[,size])
np.random.randint(100,size=100)
array([22, 57, 1, 0, 60, 81, 8, 88, 13, 47, 72, 30, 71, 3, 70, 21, 49,
57, 3, 68, 24, 43, 76, 26, 52, 80, 41, 82, 15, 64, 68, 25, 98, 87,
7, 26, 25, 22, 9, 67, 23, 27, 37, 57, 83, 38, 8, 32, 34, 10, 23,
15, 87, 25, 71, 92, 74, 62, 46, 32, 88, 23, 55, 65, 77, 3, 0, 77,
6, 52, 85, 70, 2, 76, 91, 21, 75, 7, 77, 72, 75, 76, 43, 20, 30,
36, 7, 45, 68, 57, 82, 96, 13, 10, 23, 81, 7, 24, 74, 92])
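With two positional arguments, randint samples from the half-open interval [low, high); size can also be a tuple to produce a multidimensional array. A quick sketch (not from the original notebook):

```python
import numpy as np

np.random.seed(0)
# Samples integers in [10, 20) into a 2x3 array
x = np.random.randint(10, 20, size=(2, 3))
print(x)
print(x.shape)
```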
Random permutation and extraction.
np.random.choice(a,[ size])
Randomly extracts size elements from array a.
a = np.arange(10)
print('a : ',a)
print('Extract one : ',np.random.choice(a))
print('Extract two : ',np.random.choice(a,2))
print('Extract three : ',np.random.choice(a,3))
a :  [0 1 2 3 4 5 6 7 8 9]
Extract one :  4
Extract two :  [0 1]
Extract three :  [9 8 2]
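choice also accepts a replace flag and per-element probabilities p. A sketch of both (not in the original notebook):

```python
import numpy as np

np.random.seed(0)
a = np.arange(10)
# replace=False draws without replacement, so all values are distinct
no_rep = np.random.choice(a, 5, replace=False)
print(no_rep)

# p assigns a selection probability to each element (must sum to 1)
weighted = np.random.choice([0, 1], 10, p=[0.9, 0.1])
print(weighted)
```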
np.random.shuffle(a)
Shuffles the array randomly. This is a destructive method: it modifies the original array in place and returns None.
a = np.arange(9)
b = np.random.shuffle(a)
print(a)
print(b)
[7 0 6 8 5 4 2 1 3]
None
For multidimensional arrays, only the first axis (the rows) is shuffled.
a = np.arange(9).reshape(-1,3)
np.random.shuffle(a)
print(a)
[[3 4 5]
 [0 1 2]
 [6 7 8]]
np.random.permutation(a)
Shuffles the array randomly. This is a non-destructive method: it returns a shuffled copy and leaves the original array unchanged.
a = np.arange(9)
b = np.random.permutation(a)
print(a)
print(b)
[0 1 2 3 4 5 6 7 8]
[4 3 1 5 0 8 7 6 2]
For multidimensional arrays, shuffling is again done only along the first axis.
a = np.arange(9).reshape(-1,3)
b = np.random.permutation(a)
print('before')
print(a)
print('after')
print(b)
before
[[0 1 2]
 [3 4 5]
 [6 7 8]]
after
[[3 4 5]
 [0 1 2]
 [6 7 8]]
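If you want to shuffle every element of a multidimensional array rather than just its rows, a common workaround (a sketch, not from the original notebook) is to permute the flattened array and reshape it back:

```python
import numpy as np

np.random.seed(0)
a = np.arange(9).reshape(3, 3)
# Permute the flattened array, then reshape, so that every element moves,
# not just the rows
b = np.random.permutation(a.ravel()).reshape(a.shape)
print(b)
```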
Sampling from a probability distribution
Sampling from a particular probability distribution and evaluating probability values are usually done with scipy, but here I describe examples of sampling with numpy.
np.random.normal([loc, scale, size])
This samples from a normal distribution. Specify the mean and standard deviation. The formula for the normal distribution is as follows. $\mu$ is the mean and $\sigma$ is the standard deviation.
$$ P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)} $$
x = np.random.normal(1,2,10000)
# Find the mean and standard deviation from the sampled values.
# Roughly match the set values
print('Mean : ',np.mean(x))
print('Standard deviation : ',np.std(x))
Mean :  1.025193167329272
Standard deviation : 2.0025521905807424
# Display the histogram and check the probability density distribution
plt.hist(x, bins=20)
(array([1.000e+00, 4.000e+00, 1.000e+01, 4.300e+01, 9.900e+01, 2.520e+02,
4.990e+02, 8.380e+02, 1.165e+03, 1.511e+03, 1.646e+03, 1.404e+03,
1.085e+03, 7.080e+02, 4.090e+02, 2.140e+02, 7.600e+01, 2.600e+01,
5.000e+00, 5.000e+00]),
array([-7.27230178, -6.47052873, -5.66875567, -4.86698262, -4.06520957,
-3.26343652, -2.46166347, -1.65989042, -0.85811737, -0.05634432,
0.74542873, 1.54720178, 2.34897483, 3.15074788, 3.95252093,
4.75429398, 5.55606703, 6.35784008, 7.15961313, 7.96138618,
8.76315923]),
<a list of 20 Patch objects>)
While we’re at it, let’s check that this histogram agrees with the values obtained analytically from the formula.
# Normalize the histogram
plt.hist(x, bins=20, density=1,range=(-10,10))
# Calculate and plot the normal distribution
mu = 1
sigma = 2
def get_norm(x):
    import math
    return math.exp(-1 * (x - mu)**2 / 2 / sigma**2) / math.sqrt(2 * np.pi * sigma**2)

def get_norm_list(x):
    return [get_norm(i) for i in x]
x1 = np.linspace(-10,10,1000)
y1 = get_norm_list(x1)
plt.plot(x1,y1)
[<matplotlib.lines.Line2D at 0x11264ea20>]
You can see that they are roughly in agreement.
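Incidentally, the same density can be computed without a Python loop by using numpy's vectorized operations. A sketch equivalent to the loop-based version above:

```python
import numpy as np

mu, sigma = 1, 2
x1 = np.linspace(-10, 10, 1000)
# np.exp and the arithmetic operators broadcast over the whole array at once
y1 = np.exp(-(x1 - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(y1.shape)
```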
np.random.binomial(n,p[,size])
Binomial distribution. It takes a positive integer $n$ and a probability $p$ as parameters.
$$ P(k)=\frac{n!}{k!(n-k)!} p^k(1-p)^{n-k} $$
x = np.random.binomial(30,0.2,10000)
plt.hist(x, bins=12)
(array([ 100.,  337.,  815., 1331., 1721., 1807., 2635.,  662.,  341.,
         153.,   63.,   35.]),
 array([ 0.        ,  1.16666667,  2.33333333,  3.5       ,  4.66666667,
         5.83333333,  7.        ,  8.16666667,  9.33333333, 10.5       ,
        11.66666667, 12.83333333, 14.        ]),
 <a list of 12 Patch objects>)
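As a quick sanity check (not in the original notebook), the sample mean and variance should approach the theoretical $np$ and $np(1-p)$:

```python
import numpy as np

np.random.seed(0)
n, p = 30, 0.2
x = np.random.binomial(n, p, 100000)
# Theory: mean = n*p = 6.0, variance = n*p*(1-p) = 4.8
print(x.mean())
print(x.var())
```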
np.random.poisson(lam[,size])
Poisson distribution. It takes $\lambda$ as a parameter and is expressed as follows. $$ P(k) = \frac{e^{-\lambda}\lambda^k}{k!} $$
x = np.random.poisson(5,10000)
plt.hist(x, bins=15)
(array([ 389.,  842., 1419., 1839., 1780., 1416., 1028.,  648.,  356.,
         160.,   74.,   28.,   15.,    4.,    2.]),
 array([ 0.        ,  1.06666667,  2.13333333,  3.2       ,  4.26666667,
         5.33333333,  6.4       ,  7.46666667,  8.53333333,  9.6       ,
        10.66666667, 11.73333333, 12.8       , 13.86666667, 14.93333333,
        16.        ]),
 <a list of 15 Patch objects>)
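A defining property of the Poisson distribution is that its mean and variance both equal $\lambda$; a quick check (not in the original notebook):

```python
import numpy as np

np.random.seed(0)
lam = 5
x = np.random.poisson(lam, 100000)
# Theory: mean and variance both equal lambda = 5
print(x.mean())
print(x.var())
```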
np.random.beta(a,b[,size])
Beta distribution. It takes two parameters, $\alpha$ and $\beta$.
$$ P(x)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} $$
where $B(\alpha,\beta)$ is the beta function and $$ B(\alpha,\beta) = \int_0^1x^{\alpha-1}(1-x)^{\beta-1}dx $$
x = np.random.beta(2,4,10000)
plt.hist(x,bins=20)
(array([197., 570., 775., 916., 880., 988., 928., 934., 787., 685., 615.,
        490., 426., 288., 206., 154.,  82.,  55.,  18.,   6.]),
 array([0.00344225, 0.04987531, 0.09630837, 0.14274143, 0.18917449,
        0.23560755, 0.28204061, 0.32847367, 0.37490673, 0.42133979,
        0.46777285, 0.51420591, 0.56063897, 0.60707203, 0.65350509,
        0.69993815, 0.74637121, 0.79280427, 0.83923733, 0.88567039,
        0.93210345]),
 <a list of 20 Patch objects>)
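The beta distribution has mean $\alpha/(\alpha+\beta)$, which for $\alpha=2,\ \beta=4$ is $1/3$; a quick check of the sample (not in the original notebook):

```python
import numpy as np

np.random.seed(0)
a, b = 2, 4
x = np.random.beta(a, b, 100000)
# Theory: the beta mean is a / (a + b) = 1/3
print(x.mean())
```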
np.random.gamma(a,b[,size])
Gamma distribution. It has two parameters, $\alpha$ and $\beta$. The formula is as follows.
$$ P(x) = \frac{x^{\alpha - 1}\exp(-\frac{x}{\beta})}{\Gamma(\alpha)\beta^\alpha}\quad (x \geq 0) $$
$\Gamma$ is the gamma function, defined as
$$ \Gamma(x) = \int_0^\infty t^{x-1}e^{-t} dt \quad (x \geq 0) $$
x = np.random.gamma(3,2,10000)
plt.hist(x, bins=20)
(array([7.280e+02, 1.951e+03, 2.248e+03, 1.848e+03, 1.290e+03, 7.900e+02,
5.320e+02, 2.760e+02, 1.720e+02, 7.500e+01, 3.900e+01, 2.100e+01,
1.600e+01, 6.000e+00, 2.000e+00, 3.000e+00, 2.000e+00, 0.000e+00,
0.000e+00, 1.000e+00]),
array([ 0.214581 , 1.90909253, 3.60360406, 5.29811559, 6.99262712,
8.68713865, 10.38165018, 12.07616171, 13.77067324, 15.46518477,
17.1596963 , 18.85420783, 20.54871936, 22.24323089, 23.93774242,
25.63225395, 27.32676548, 29.02127701, 30.71578854, 32.41030007,
34.10481161]),
<a list of 20 Patch objects>)
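In numpy's shape-scale parameterization used here, the gamma distribution has mean $\alpha\beta$ and variance $\alpha\beta^2$; a quick check (not in the original notebook):

```python
import numpy as np

np.random.seed(0)
shape, scale = 3, 2
x = np.random.gamma(shape, scale, 100000)
# Theory: mean = shape * scale = 6, variance = shape * scale**2 = 12
print(x.mean())
print(x.var())
```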
np.random.multivariate_normal(mean, cov[, size, check_valid, tol])
Specify the mean and covariance matrix and sample from a correlated multidimensional normal distribution.
If the covariance matrix is
$$ \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) $$
then we are sampling from two uncorrelated normal distributions.
With
$$ \left( \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right) $$
the two variables are perfectly correlated, and with
$$ \left( \begin{array}{cc} 1 & 0.5 \\ 0.5 & 1 \end{array} \right) $$
we sample from two normal distributions with a covariance of 0.5. With
$$ \left( \begin{array}{cc} 1 & -0.7 \\ -0.7 & 1 \end{array} \right) $$
the result is a negative correlation.
a = np.array([0,0])
b = np.array([[1,0],[0,1]])
sample = np.random.multivariate_normal(a,b,size=1000)
import seaborn as sns
sns.set(style='darkgrid')
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11cb76978>
a = np.array([0,0])
b = np.array([[1,0.5],[0.5,1]])
sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11ef2e518>
a = np.array([0,0])
b = np.array([[1,0.8],[0.8,1]])
sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x102082908>
a = np.array([0,0])
b = np.array([[1,0.95],[0.95,1]])
sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11f1860f0>
a = np.array([0,0])
b = np.array([[1,-0.7],[-0.7,1]])
sample = np.random.multivariate_normal(a,b,size=1000)
sns.jointplot(x=sample[:,0], y=sample[:,1], kind='kde')
<seaborn.axisgrid.JointGrid at 0x11f085320>
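As a final sanity check (not in the original notebook), the empirical covariance matrix of the sample should be close to the matrix we specified:

```python
import numpy as np

np.random.seed(0)
mean = np.array([0, 0])
cov = np.array([[1, 0.5], [0.5, 1]])
sample = np.random.multivariate_normal(mean, cov, size=100000)
# np.cov expects variables in rows, hence the transpose
print(np.cov(sample.T))
```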
There are many more probability distributions needed for machine learning, but this concludes sampling with numpy. More details will be explained in the scipy article.