NumPy personal tips

NumPy is another indispensable tool for data analysis. I'll leave these notes as a personal reminder. For more information, see the official NumPy documentation.

github

  • The file in jupyter notebook format on github here

numpy import

The author’s environment and import method are as follows.

!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G6020
!python -V
Python 3.7.3
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import numpy as np

np.__version__
'1.16.2'

Useful functions.

np.sqrt(x)

Calculates the square root.

np.sqrt(4)
2.0

np.cbrt(x)

Compute the cube root.

np.cbrt(8)
2.0

np.square(x)

Compute the square.

np.square(2)
4

np.absolute(x)

Calculates absolute values. Complex numbers are supported.

print(np.absolute(-4))
print(np.absolute([1,-2,-4]))
print(np.absolute(complex(1,1))) # => sqrt(2)
4
[1 2 4]
1.4142135623730951

np.convolve(x,y)

Compute the convolution.

a = np.array([1,2,3])
b = np.array([0,1.5,2])

print(np.convolve(a,b, mode='full')) # default mode = full
print(np.convolve(a,b, mode='same'))
print(np.convolve(a,b, mode='valid'))
[0. 1.5 5. 8.5 6. ]
[1.5 5. 8.5]
[5.]
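
As a rough check of what mode='full' computes, the sketch below rebuilds the same result from the definition $(a*b)[n] = \sum_k a[k]\,b[n-k]$ with plain loops. The helper name manual_convolve is made up for this illustration and is not part of NumPy.

```python
import numpy as np

def manual_convolve(a, b):
    """Full discrete convolution from the definition (hypothetical helper)."""
    out = np.zeros(len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj  # a[i] * b[j] contributes to index i + j
    return out

a = np.array([1, 2, 3])
b = np.array([0, 1.5, 2])

manual = manual_convolve(a, b)
print(manual)
print(np.allclose(manual, np.convolve(a, b)))  # compare with NumPy's result
```

The 'same' and 'valid' modes return slices of this full result, which is why their outputs above are sub-ranges of it.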

np.diff(a,N)

Takes the difference between adjacent elements. It works like the derivative of a discrete sequence.

$$ d1_a[x] = a[x+1] - a[x] $$

By substituting an integer for N, we can find the value of the Nth derivative.

$$ d2_a[x] = d1_a[x+1] - d1_a[x] $$

a = np.random.randint(10,size=10)
print('a : ', a)
print('First derivative : ', np.diff(a))
print('Second derivative : ', np.diff(a, 2))
a : [0 1 5 8 0 8 4 1 7 0]
First derivative : [ 1 4 3 -8 8 -4 -3 6 -7]
Second derivative : [ 3 -1 -11 16 -12 1 9 -13]
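
The two formulas above can be checked directly. A minimal sketch, reusing the array printed in the output above as fixed input (the variable names d1 and d2 are chosen for this example only):

```python
import numpy as np

a = np.array([0, 1, 5, 8, 0, 8, 4, 1, 7, 0])

d1 = a[1:] - a[:-1]        # d1_a[x] = a[x+1] - a[x]
d2 = d1[1:] - d1[:-1]      # d2_a[x] = d1_a[x+1] - d1_a[x]

print(np.array_equal(d1, np.diff(a)))
print(np.array_equal(d2, np.diff(a, 2)))
```

Each differencing step shortens the array by one element, so np.diff(a, N) has len(a) - N elements.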

np.cumsum(a)

Computes the cumulative sum of the elements. The concept is similar to the integration of a discrete sequence.

$$ s_a[x] = \sum_{i=0}^{x} a[i] $$

a = np.random.randint(10,size=10)
print('a : ', a)
print('integral : ', np.cumsum(a))
a : [9 3 9 4 5 1 4 1 1 4]
integral : [ 9 12 21 25 30 31 35 36 37 41]
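
Since np.cumsum behaves like integration and np.diff like differentiation, applying one after the other should give back the original values. A small sketch of that round trip, using the array from the output above as fixed input:

```python
import numpy as np

a = np.array([9, 3, 9, 4, 5, 1, 4, 1, 1, 4])
s = np.cumsum(a)

# Differencing the running total recovers every element except the first;
# the first element is simply s[0].
recovered = np.concatenate(([s[0]], np.diff(s)))
print(recovered)
```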

np.heaviside(x,c)

Heaviside step function.

$$ H_c(x) = \begin{cases} 1\ (x \gt 0)\\ c\ (x = 0)\\ 0\ (x \lt 0) \end{cases} $$

It is rarely needed in data analysis, but I mention it in passing. For example,

np.heaviside(a, 10)

corresponds to $c=10$.

a = [i for i in range(-2,3)]
print(a)
print(np.heaviside(a, 10))
[-2, -1, 0, 1, 2]
[ 0. 0. 10. 1. 1.]

np.interp(x’,x,y)

Returns the linearly interpolated value.

x = [0,1,2]
y = [2,100,50]

x1 = [0.5, 1.8]
y1 = np.interp(x1, x,y)

Defined this way, y1 holds the value at $x=0.5$ on the line connecting $(x,y) = (0,2)$ and $(1,100)$, and the value at $x=1.8$ on the line connecting $(x,y) = (1,100)$ and $(2,50)$.

import matplotlib.pyplot as plt

x = [0,1,2]
y = [2,100,50]

plt.grid()
plt.plot(x,y,marker='o')

x1 = [0.5, 1.8]
y1 = np.interp(x1, x,y)

print('x1 : ', x1)
print('y1 : ', y1)
plt.scatter(x1,y1,marker='^',c='red')
x1 : [0.5, 1.8]
y1 : [51. 60.]





<matplotlib.collections.PathCollection at 0x11d4edf98>
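
The interpolated value can also be reproduced by hand with the two-point formula $y = y_0 + (y_1 - y_0)\frac{x - x_0}{x_1 - x_0}$. A minimal sketch for the point $x=1.8$; the names x0, y0, xq and by_hand are chosen for this example only:

```python
import numpy as np

x = [0, 1, 2]
y = [2, 100, 50]

# Interpolate at xq = 1.8 on the segment between (1, 100) and (2, 50)
x0, y0 = 1, 100
x1, y1 = 2, 50
xq = 1.8
by_hand = y0 + (y1 - y0) * (xq - x0) / (x1 - x0)

print(by_hand, np.interp(xq, x, y))
```

Note that np.interp expects the x coordinates to be increasing; it does not sort them for you.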

Array operations.

ndarray.reshape(N,M,…)

Changes the shape of the array. The total size of the array must match before and after the change.

The example below converts a one-dimensional array of size 12 to a two-dimensional array of size 3x4.

a = np.arange(12)
b = a.reshape(3,4)

print('before shape : ',a.shape)
print('after shape : ',b.shape)
before shape : (12,)
after shape : (3, 4)
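
Because the total size must match, one dimension can be left for NumPy to infer by passing -1. A short sketch of that, together with what happens when the sizes cannot match:

```python
import numpy as np

a = np.arange(12)

b = a.reshape(3, -1)   # NumPy infers the second dimension: 12 / 3 = 4
print(b.shape)

try:
    a.reshape(5, -1)   # 12 is not divisible by 5, so this raises ValueError
except ValueError as e:
    print(e)
```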

np.tile(a,(N,M,…))

Repeats a like tiles, N times vertically and M times horizontally. It's easier to understand if you look at the example.

a = np.arange(5)

np.tile(a,(2,1))
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
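
To see both repetition counts at work, here is a small sketch that repeats a shorter array twice in each direction (the array and counts are chosen for illustration):

```python
import numpy as np

a = np.arange(3)
t = np.tile(a, (2, 2))   # 2 repeats along rows, 2 repeats along columns

print(t)
print(t.shape)
```

The result has shape (2, 6): two rows, each containing a twice.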

ndarray.flatten()

Converts an array with two or more dimensions to a one-dimensional array. It always creates a copy, flattens it to one dimension, and returns the new object.

a = np.arange(12).reshape(3,4)
b = a.flatten()

print('a : ',a)
print('b : ',b)
a : [[ 0 1 2 3]
 [ 4 5 6 7]
 [ 8 9 10 11]]
b : [ 0 1 2 3 4 5 6 7 8 9 10 11]

ndarray.ravel()

Converts an array with two or more dimensions to a one-dimensional array. Unlike flatten, it returns a view and avoids copying whenever possible. If you do not need an independent copy, ravel is generally preferable because it uses less memory than flatten.

a = np.arange(12).reshape(3,4)
b = a.ravel()

print('a : ',a)
print('b : ',b)
a : [[ 0 1 2 3]
 [ 4 5 6 7]
 [ 8 9 10 11]]
b : [ 0 1 2 3 4 5 6 7 8 9 10 11]
# For comparison: slicing a plain Python list with [:] makes a shallow copy
a = [1,2,3]
b = a[:]

b[0] = 100

print(id(a))
print(id(b))
print(id(a[0]))
print(id(b[0]))
print(a)
print(b)

print(type(a))
print(type(a[:]))
print(id(a))
print(id(a[:]))
4778197384
4786913096
4487054752
4487057920
[1, 2, 3]
[100, 2, 3]
<class 'list'>
<class 'list'>
4778197384
4786912840

Difference between ndarray.flatten and ndarray.ravel

flatten returns a copy, while ravel returns a view whenever possible. With a view, changing the original array also changes the array that refers to it.

view and copy

Plain assignment of an ndarray (b = a) copies only the reference, not the data. Both names then point to the same object: changing one changes the other, id() returns the same address, and the reported memory size is identical.

The ndarray type also has an attribute called base, which is None when the array owns its own data, as it does here.

import sys

a = np.arange(10000)
b = a

a[1] = 100

print('a = ', a)
print('b = ', b)
print('id(a) = ', id(a))
print('id(b) = ', id(b))
print('a mem = ', sys.getsizeof(a))
print('b mem = ', sys.getsizeof(b))
print('a base = ', a.base)
print('b base = ', b.base)
a = [ 0 100 2 ... 9997 9998 9999]
b = [ 0 100 2 ... 9997 9998 9999]
id(a) = 4786990496
id(b) = 4786990496
a mem = 80096
b mem = 80096
a base = None
b base = None
import sys

a = np.arange(10000)
b = a.copy()

a[1] = 100

print('a = ', a)
print('b = ', b)
print('id(a) = ', id(a))
print('id(b) = ', id(b))
print('a mem = ', sys.getsizeof(a))
print('b mem = ', sys.getsizeof(b))
print('a base = ', a.base)
print('b base = ', b.base)
a = [ 0 100 2 ... 9997 9998 9999]
b = [ 0 1 2 ... 9997 9998 9999]
id(a) = 4787275936
id(b) = 4787275616
a mem = 80096
b mem = 80096
a base = None
b base = None

Next, let's create b by slicing the first 20 elements of a. This creates another object with a different memory address. However, if we set a[1]=100, b also changes. So b is a different object, but one that refers to the data of a.

Its size is also quite small: 96 bytes. The dtype is int64, so storing the sliced values themselves would take at least 8 bytes x 20 = 160 bytes. Since b is smaller than that, we can infer that it stores only a reference to the memory of a.

If you use base, you can check which array b refers to.

import sys

a = np.arange(10000)
b = a[:20]

a[1] = 100

print('a = ', a)
print('b = ', b)
print('id(a) = ', id(a))
print('id(b) = ', id(b))
print('a mem = ', sys.getsizeof(a))
print('b mem = ', sys.getsizeof(b))
print('a base = ', a.base)
print('b base = ', b.base)
a = [ 0 100 2 ... 9997 9998 9999]
b = [ 0 100 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
  18 19]
id(a) = 4787359952
id(b) = 4787275936
a mem = 80096
b mem = 96
a base = None
b base = [ 0 100 2 ... 9997 9998 9999]

Let’s also try reshape.

import sys

a = np.arange(10000)
b = a.reshape(100,100)

a[1] = 100

print('a = ', a)
print('b = ', b)
print('id(a) = ', id(a))
print('id(b) = ', id(b))
print('a mem = ', sys.getsizeof(a))
print('b mem = ', sys.getsizeof(b))
print('a base = ', a.base)
print('b base = ', b.base)
a = [ 0 100 2 ... 9997 9998 9999]
b = [[ 0 100 2 ...   97 98 99]
 [ 100 101 102 ...  197 198 199]
 [ 200 201 202 ...  297 298 299]
 ...
 [9700 9701 9702 ... 9797 9798 9799]
 [9800 9801 9802 ... 9897 9898 9899]
 [9900 9901 9902 ... 9997 9998 9999]]
id(a) = 4787121328
id(b) = 4787120768
a mem = 80096
b mem = 112
a base = None
b base = [ 0 100 2 ... 9997 9998 9999]

The following compares flatten and ravel in terms of base and memory usage. As described above, the result of flatten has no base array because it is a copy, while the base of the ravel result is the original array. Because flatten copies the data, its result occupies as much memory as the original, whereas the ravel result occupies only a small, fixed amount.

This is fairly low-level, but whether or not you understand it properly makes a big difference as an engineer.

import sys

a = np.arange(120000,dtype='int64')
b1 = a.reshape(300,400)

for i in [0,1]:

  if i == 0:
    print('######## flatten ########')
    b2 = b1.flatten()
  elif i == 1:
    print('######## ravel ########')
    b2 = b1.ravel()

  print('id')
  print('a : ', id(a))
  print('b1 : ', id(b1))
  print('b2 : ', id(b2))
  print('')

  print('base')
  print('b1 : ', b1.base)
  print('b2 : ', b2.base)
  print('')

  print('Memory used by object')
  print('a : ',sys.getsizeof(a))
  print('b1 : ',sys.getsizeof(b1))
  print('b2 : ',sys.getsizeof(b2))
  print('')
######## flatten ########
id
a : 4787350864
b1 : 4787434960
b2 : 4787435120

base
b1 : [ 0 1 2 ... 119997 119998 119999]
b2 : None

Memory used by object
a : 960096
b1 : 112
b2 : 960096

######## ravel ########
id
a : 4787350864
b1 : 4787434960
b2 : 4787275616

base
b1 : [ 0 1 2 ... 119997 119998 119999]
b2 : [ 0 1 2 ... 119997 119998 119999]

Memory used by object
a : 960096
b1 : 112
b2 : 96

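The biggest difference is the memory reported for the result: about 960 kB for flatten versus 96 bytes for ravel, because the ravel result only refers to the original data. When an independent copy is not needed, ravel can save a considerable amount of memory.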
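
Besides base and sys.getsizeof, np.shares_memory gives a direct yes/no answer to whether two arrays use the same buffer. The sketch below is a small illustration with a made-up 3x4 array; it also shows that ravel falls back to a copy when a flat view is not possible, for example on a transposed array.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

flat_copy = a.flatten()
flat_view = a.ravel()

print(np.shares_memory(a, flat_copy))   # flatten: separate buffer
print(np.shares_memory(a, flat_view))   # ravel: same buffer as a

# a.T is not C-contiguous, so ravel cannot return a view and copies instead
t_flat = a.T.ravel()
print(np.shares_memory(a, t_flat))
```

So "ravel never copies" is too strong; it avoids the copy only when the memory layout allows it.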

np.hstack

Concatenates ndarrays in the horizontal direction (axis=1 for arrays with two or more dimensions; one-dimensional arrays are simply joined end to end). I use it fairly often.

a = np.array([1,2,3])
b = np.array([4,5,6])

print('a : ',a)
print('b : ',b)
print('hstack : ',np.hstack((a,b)))
a : [1 2 3]
b : [4 5 6]
hstack : [1 2 3 4 5 6]

An error occurs if the sizes along the other axes do not match. For example, shape=(1,2) and shape=(2,1) cannot be stacked horizontally.

a = np.array([[1,2]])
b = np.array([[1],[2]])

try:
  print('[error occurrence]')
  print('hstack : ',np.hstack((a,b)))
except Exception as e:
  print(e)
[error occurrence]
all the input array dimensions except for the concatenation axis must match exactly
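
One way to make these two arrays stackable is to bring them to the same shape first. A minimal sketch, assuming that transposing b is what you actually want for your data:

```python
import numpy as np

a = np.array([[1, 2]])        # shape (1, 2)
b = np.array([[1], [2]])      # shape (2, 1)

# hstack needs the same number of rows; b.T has shape (1, 2)
h = np.hstack((a, b.T))
print(h, h.shape)

# vstack needs the same number of columns; b.T works here as well
v = np.vstack((a, b.T))
print(v, v.shape)
```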

np.vstack

Concatenates ndarrays in the vertical direction (axis=0). This is also used quite often, and it raises an error if the sizes along the other axes do not match.

a = np.array([1,2,3])
b = np.array([4,5,6])

print('a : ',a)
print('b : ',b)
print('vstack : ',np.vstack((a,b)))
a : [1 2 3]
b : [4 5 6]
vstack : [[1 2 3]
 [4 5 6]]

np.r_, np.c_

These also concatenate arrays. I use them more often than vstack and hstack because they are easier to write. c_ is especially useful because it builds a two-dimensional array whose rows pair up the elements of two one-dimensional arrays. Sometimes I forget…

x = [i for i in range(5)]
y = [i for i in range(5,10)]

print('np.c_ :',np.c_[x,y])
print()
print('np.r_ :',np.r_[x,y])
np.c_ : [[0 5]
 [1 6]
 [2 7]
 [3 8]
 [4 9]]

np.r_ : [0 1 2 3 4 5 6 7 8 9]
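
For one-dimensional inputs like these, np.c_ and np.r_ give the same results as the function-style np.column_stack and np.concatenate. A short sketch comparing them (the variable name pairs is chosen for this example only):

```python
import numpy as np

x = np.arange(5)
y = np.arange(5, 10)

pairs = np.c_[x, y]      # one row per (x[i], y[i]) pair
print(pairs.shape)

print(np.array_equal(pairs, np.column_stack((x, y))))
print(np.array_equal(np.r_[x, y], np.concatenate((x, y))))
```

Note that np.r_ and np.c_ are indexed with square brackets, not called with parentheses.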

Creating sequences of numbers.

np.arange([start, ]stop, [step, ]dtype=None)

Creates a sequence of integers or another evenly spaced sequence. The arguments work the same way as in Python's range(). See numpy.arange for details on the arguments.

np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(4,12)
array([ 4, 5, 6, 7, 8, 9, 10, 11])
np.arange(3,12,2)
array([ 3, 5, 7, 9, 11])
np.arange(1.5,4.5,0.5)
array([1.5, 2. , 2.5, 3. , 3.5, 4. ])

np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)

This function generates an arithmetic (evenly spaced) sequence from the range to divide and the number of points. It is very useful, and I use it everywhere. With endpoint=True, stop is included, and the points are

$$ start, start + \frac{stop - start}{num - 1}, start + \frac{stop - start}{num - 1} \times 2, start + \frac{stop - start}{num - 1} \times 3, \cdots, stop $$

With endpoint=False, stop is not included, and the points become

$$ start, start + \frac{stop - start}{num}, start + \frac{stop - start}{num} \times 2, start + \frac{stop - start}{num} \times 3, \cdots $$

See numpy.linspace for details.

np.linspace(0,1,3)
array([0. , 0.5, 1. ])
np.linspace(0,1,3,endpoint=False)
array([0.        , 0.33333333, 0.66666667])
np.linspace(0,-11,20)
array([ 0. , -0.57894737, -1.15789474, -1.73684211,
        -2.31578947, -2.89473684, -3.47368421, -4.05263158,
        -4.63157895, -5.21052632, -5.78947368, -6.36842105,
        -6.94736842, -7.52631579, -8.10526316, -8.68421053,
        -9.26315789, -9.84210526, -10.42105263, -11.])
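
The step sizes $(stop - start)/(num - 1)$ and $(stop - start)/num$ can be verified numerically. A minimal sketch that rebuilds both sequences by hand for the last example; the names manual and manual_open are chosen for this illustration:

```python
import numpy as np

start, stop, num = 0, -11, 20

# endpoint=True: the step is (stop - start) / (num - 1)
manual = start + np.arange(num) * (stop - start) / (num - 1)
print(np.allclose(manual, np.linspace(start, stop, num)))

# endpoint=False: the step is (stop - start) / num
manual_open = start + np.arange(num) * (stop - start) / num
print(np.allclose(manual_open, np.linspace(start, stop, num, endpoint=False)))

# retstep=True also returns the step that was used
_, step = np.linspace(start, stop, num, retstep=True)
print(step)
```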

This function can be used to easily draw graphs and other objects.

x = np.linspace(-np.pi, np.pi, 100)
y = np.sin(x)

plt.grid()
plt.title(r'$y=\sin x$')  # raw string so that \s is not treated as an escape
plt.plot(x,y)
[<matplotlib.lines.Line2D at 0x11d5db940>]