Python Tips
A personal note on some useful idioms for working with Python.
github
- The Jupyter notebook file for this post is on GitHub here.
google colaboratory
- If you want to run it in Google Colaboratory, it is here 010/010_nb.ipynb)
Author’s environment
sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G95
python -V
Python 3.5.5 :: Anaconda, Inc.
Fast aggregate retrieval after groupby in pandas
A pandas specialist once told me about a fast way to retrieve each group's rows (as a DataFrame) after a groupby.
Basic library loading.
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import time
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
count = 8
df_list = []
for k in range(count):
    # Build a DataFrame with 10 ** k rows
    num = 10 ** k
    a1 = [{'a': i, 'b': i ** 2, 'c': i ** 3 % 9973} for i in range(num)]
    df_list.append(pd.DataFrame(a1))
Here’s the fast way I was taught to extract each group, combining groupby with a for loop.
time_list = []
for i, df in enumerate(df_list):
    start_time = time.time()
    # Iterating over the GroupBy yields (key, sub-DataFrame) pairs
    for c, _ in df.groupby('c'):
        pass
    time_list.append(time.time() - start_time)
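To make the loop above concrete, here is a minimal sketch of what iterating over a groupby gives you. The small DataFrame and the name `group_sizes` are made up for illustration and are not part of the benchmark; the point is only that each iteration hands back a key together with the matching rows as a DataFrame.

```python
import pandas as pd

# Small hypothetical frame with the same column names as in this post.
df = pd.DataFrame({
    'a': [0, 1, 2, 3, 4, 5],
    'b': [0, 1, 4, 9, 16, 25],
    'c': [1, 2, 1, 2, 1, 3],
})

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs,
# so each group's rows are available without building a mask.
group_sizes = {}
for key, sub_df in df.groupby('c'):
    group_sizes[key] = len(sub_df)

print(group_sizes)
```

In the benchmark the loop body is just `pass`, so what is measured is only the cost of splitting the DataFrame into groups.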
I’m ashamed to say that the following is the method I had been using.
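time_list_02 = []
c_list = df_list[count - 1]['c'].unique().tolist()
for i, df in enumerate(df_list):
    # Start the timer for this DataFrame
    start_time = time.time()
    for c in c_list:
        # One boolean mask per key: every mask scans the whole column
        df[df['c'] == c].count()
    time_list_02.append(time.time() - start_time)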
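As a sanity check, the two approaches should select exactly the same rows for every key. The following is a rough sketch on a small, made-up DataFrame (the names `by_groupby`, `by_mask`, `t_groupby`, and `t_mask` are my own). It uses `time.perf_counter` instead of `time.time` simply because it is better suited to short measurements. The likely reason for the speed difference is that the mask approach compares the whole column once per key, while groupby only needs to split the data once.

```python
import time
import pandas as pd

# Hypothetical small frame; i ** 3 % 7 only takes a few distinct values.
df = pd.DataFrame({'a': range(1000), 'c': [i ** 3 % 7 for i in range(1000)]})

# Approach 1: split the frame into groups with a single groupby.
t0 = time.perf_counter()
by_groupby = {c: sub for c, sub in df.groupby('c')}
t_groupby = time.perf_counter() - t0

# Approach 2: build one boolean mask per key; each mask scans the whole column.
t0 = time.perf_counter()
by_mask = {c: df[df['c'] == c] for c in df['c'].unique().tolist()}
t_mask = time.perf_counter() - t0

# Both approaches should return identical sub-DataFrames for every key.
same = all(by_groupby[c].equals(by_mask[c]) for c in by_mask)
print(same)
print(t_groupby, t_mask)
```

On a frame this small the timings are dominated by overhead, so they say little by themselves; the gap only becomes clear with many rows and many distinct keys.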
Compare results
Let’s plot and compare the results. Note that the $x$ axis is the exponent $k$ of the number of rows ($10^k$) in the DataFrame, and the $y$ axis is the elapsed time in seconds on a logarithmic scale.
import matplotlib.pyplot as plt
plt.yscale('log')
plt.plot(range(count), time_list, label='modified')
plt.plot(range(count), time_list_02, label='previous')
plt.grid()
plt.legend()
plt.show()
Even with a large amount of data, I confirmed that the groupby approach is about an order of magnitude faster. There is a lot of depth to pandas, which we usually use without thinking much about it. I have to thank the person who told me about this! It really helped me!