Python Tips

A personal note on some useful idioms for working with Python.

GitHub

  • The Jupyter notebook file is available on GitHub here.

Google Colaboratory

  • To run it in Google Colaboratory, open it here (010/010_nb.ipynb).

Author’s environment

sw_vers
    ProductName: Mac OS X
    ProductVersion: 10.14.6
    BuildVersion: 18G95
python -V
Python 3.5.5 :: Anaconda, Inc.

Fast aggregate retrieval after groupby in pandas

A pandas specialist once showed me a fast way to retrieve each group's rows (as a DataFrame) after a groupby.

Basic library loading.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import time
import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Build one DataFrame per size: 10**0, 10**1, ..., 10**(count - 1) rows.
count = 8
df_list = []

for k in range(count):
  num = 10 ** k
  a1 = [{'a': i, 'b':i ** 2, 'c': i ** 3 % 9973} for i in range(num)]
  df_list.append(pd.DataFrame(a1))

Here is the fast approach I was taught: iterate over the groupby object directly with a for loop.

time_list = []
for i, df in enumerate(df_list):
  start_time = time.time()
  # Each iteration yields the group key and the sub-DataFrame for that key.
  for c, _ in df.groupby('c'):
    pass
  time_list.append(time.time() - start_time)
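
To make clear what the loop above receives on each iteration, here is a minimal sketch on a tiny frame. The names `small_df` and `groups` and the modulus 5 are assumptions chosen for illustration and are not part of the original benchmark.

```python
import pandas as pd

# Tiny illustrative frame (10 rows); the modulus 5 is an assumption
# chosen so that the groups are easy to inspect by eye.
rows = [{'a': i, 'b': i ** 2, 'c': i ** 3 % 5} for i in range(10)]
small_df = pd.DataFrame(rows)

# Iterating over a groupby object yields (key, sub-DataFrame) pairs.
groups = {key: sub_df for key, sub_df in small_df.groupby('c')}

for key, sub_df in groups.items():
    print(key, len(sub_df))
```

Each `sub_df` is an ordinary DataFrame holding only the rows for that key, so any aggregation can be applied to it inside the loop.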

I’m ashamed to say that the following is the method I had been using until then.

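time_list_02 = []
# Unique values of 'c', taken from the largest DataFrame.
c_list = df_list[count - 1]['c'].unique().tolist()

for i, df in enumerate(df_list):
  start_time = time.time()
  # Filter the whole DataFrame once per value of 'c'.
  for c in c_list:
    df[df['c'] == c].count()
  time_list_02.append(time.time() - start_time)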
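
As a sanity check, the two approaches should select the same rows for each value of `c`. The sketch below compares them on a tiny frame; `small_df`, `by_groupby`, `by_mask` and the modulus 5 are assumptions introduced only for this illustration.

```python
import pandas as pd

# Tiny illustrative frame; the modulus 5 is an assumption for readability.
rows = [{'a': i, 'b': i ** 2, 'c': i ** 3 % 5} for i in range(10)]
small_df = pd.DataFrame(rows)

# Approach 1: a single pass over the groupby object.
by_groupby = {key: sub_df for key, sub_df in small_df.groupby('c')}

# Approach 2: one boolean mask per unique value, which rescans
# the whole frame for every value of 'c'.
by_mask = {c: small_df[small_df['c'] == c]
           for c in small_df['c'].unique().tolist()}

# Both approaches should produce identical sub-DataFrames per key.
same = all(by_groupby[c].equals(by_mask[c]) for c in by_mask)
print(same)
```

The difference is in the amount of work: the mask version compares every row against the target value once per unique key, while groupby partitions the rows once.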

Compare results

Let’s plot and compare the results. Note that the $x$ axis is the base-10 exponent of the number of rows in the DataFrame (a value of $k$ means $10^k$ rows), and the $y$ axis is the elapsed time in seconds on a logarithmic scale.

import matplotlib.pyplot as plt

plt.yscale('log')
plt.plot(range(count), time_list, label='modified')
plt.plot(range(count), time_list_02, label='previous')
plt.grid()
plt.legend()
plt.show()

Even with a large amount of data, I confirmed that the groupby approach is about an order of magnitude faster. pandas, which we tend to use without much thought, has a lot of depth. I have to thank the person who told me about this; it really helped me!
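
To re-check the comparison on a single small frame, a sketch along the following lines may be useful. It uses `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`. The helper `best_time`, the wrappers `with_groupby` and `with_mask`, and the row count of 1,000 are assumptions for illustration, and the timings will vary by machine.

```python
import time

import pandas as pd


def best_time(fn, repeats=3):
    """Return the shortest wall-clock time of fn() over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# A small frame built the same way as in the article (1,000 rows is an assumption).
rows = [{'a': i, 'b': i ** 2, 'c': i ** 3 % 9973} for i in range(10 ** 3)]
df = pd.DataFrame(rows)


def with_groupby():
    # One pass over the groupby object.
    for c, _ in df.groupby('c'):
        pass


def with_mask():
    # One boolean mask per unique value of 'c'.
    for c in df['c'].unique().tolist():
        df[df['c'] == c].count()


t_groupby = best_time(with_groupby)
t_mask = best_time(with_mask)
print(t_groupby, t_mask)
```

Taking the minimum over a few repeats reduces the influence of one-off delays such as caching or background processes.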