LightGBM template

LightGBM is widely used for analyzing tabular data, and since I've been using it a lot recently, I've put it together as a template.

(Update 2022/2/5: a skilled data scientist taught me how to analyze features with SHAP, so I've added it to the template.)

  • Getting the data
  • Creating a Model
  • Cross Validation
  • Evaluation of TEST
  • Analysis by Shap

github

  • The Jupyter Notebook version of this file is available on GitHub here

google colaboratory

  • To run it in Google Colaboratory, see here

Author’s environment

!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G9323
!python -V
Python 3.8.5
import sys
sys.executable
'/Users/hiroshi.wayama/anaconda3/envs/lgbm2/bin/python'

Load the library

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import time
import json
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import japanize_matplotlib
import shap

import lightgbm as lgb

from sklearn import datasets
from sklearn import metrics

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score

from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn.datasets import fetch_california_housing

import seaborn as sns
sns.set_style('darkgrid')
sns.set(font='IPAexGothic')

import warnings
warnings.filterwarnings('ignore')

Loading data

We used to use the Boston house-price dataset here, but since it was recently deprecated, we will use the newly added California house-price dataset instead.

california_housing = fetch_california_housing(as_frame=True)
df = pd.DataFrame(california_housing['data'], columns=california_housing.feature_names)
df_target = pd.Series(california_housing['target'])

# template
# Set the explanatory variables
X_data = df
# set the target variable
Y_data = df_target
X_data.shape, Y_data.shape
((20640, 8), (20640,))
X_data.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
Y_data.head()
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
Name: MedHouseVal, dtype: float64

Constants

Set the seed and other constants so that the random values are reproducible.

# const
seed = 123
random_state = 123
n_splits = 5
test_size = 0.2

Function to fix the seed.

A function that sets all the relevant seeds in one place (it also covers PyTorch, for templates that use it).

import os
import torch

def set_seed(seed=seed):
  os.environ["PYTHONHASHSEED"] = str(seed)
  np.random.seed(seed)
  random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  # Comment out if it degrades GPU performance
  torch.backends.cudnn.deterministic = True
  torch.use_deterministic_algorithms(True)
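
To use it, call it once near the top of the notebook:

set_seed(seed)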

Split the data.

Split the data into Train and Test sets.

X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=test_size, random_state=random_state)
X_train.head()
      MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
9950  4.5694      28.0  6.219512   1.030488       504.0  3.073171     38.38    -122.33
3547  5.6392      18.0  5.951644   1.034816      3010.0  2.911025     34.26    -118.60
4448  1.7292      47.0  3.628032   1.032345      1452.0  3.913747     34.07    -118.21
6984  4.6226      36.0  5.126238   0.985149       988.0  2.445545     33.96    -118.02
4432  2.4375      49.0  4.024390   0.942073      1405.0  4.283537     34.08    -118.20
X_test.head()
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
19121  3.7917      40.0  4.959799   1.030151      1039.0  2.610553     38.24    -122.64
20019  4.0217       9.0  5.804577   1.000000      1749.0  3.079225     36.09    -119.05
15104  4.0882      12.0  5.360360   1.070571      3321.0  4.986486     32.85    -116.98
3720   2.2377      27.0  3.376582   1.023207      3403.0  3.589662     34.20    -118.42
8938   4.4211      41.0  5.656904   1.165272      1047.0  2.190377     34.01    -118.47
Y_train.head()
9950 2.875
3547 2.715
4448 1.917
6984 2.197
4432 1.140
Name: MedHouseVal, dtype: float64
Y_test.head()
19121 1.516
20019 0.992
15104 1.345
3720 2.317
8938 4.629
Name: MedHouseVal, dtype: float64

Creating a model using the Train data.

Cross Validation

Perform cross validation on the Train data. Cross validation is a method that splits the data, builds a model on one part, and validates it on the remaining part. In K-fold cross validation, the data is divided into K parts; K-1 of them are used for training and the remaining one for validation.

For each fold, we also plot the transition of the loss function and a scatter plot of the model's predicted values against the true values.
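
Before the full loop, here is a minimal sketch (my addition, not part of the original template) of how KFold hands out train/validation indices, assuming 10 samples and 5 folds:

# Minimal illustration: KFold yields (train_indices, val_indices) per fold
demo_kf = KFold(n_splits=5, shuffle=True, random_state=123)
for fold, (train_idx, val_idx) in enumerate(demo_kf.split(np.arange(10))):
    print(f'fold {fold}: train={train_idx}, val={val_idx}')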

%%time

params = {
  'random_state': random_state,
  'objective': 'regression',
  'boosting_type': 'gbdt',
  'metric': {'rmse'},
  'verbosity': -1,
  'bagging_freq': 1,
  'feature_fraction': 0.8,
  'max_depth': 8,
  'min_data_in_leaf': 25,
  'num_leaves': 256,
  'learning_rate': 0.07,
  'lambda_l1': 0.2,
  'lambda_l2': 0.5,
}

model_list = []
kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

for _index, (_train_index, _val_index) in enumerate(kf.split(X_train, Y_train)):

   _X_train = X_train.iloc[_train_index]
   _Y_train = Y_train.iloc[_train_index]

   _X_val = X_train.iloc[_val_index]
   _Y_val = Y_train.iloc[_val_index]

   lgb_train = lgb.Dataset(_X_train, _Y_train)
   lgb_val = lgb.Dataset(_X_val, _Y_val, reference=lgb_train)

   lgb_results = {}

   model = lgb.train(
     params,
     train_set=lgb_train,
     valid_sets=[lgb_train, lgb_val],
     verbose_eval=-1,
     num_boost_round=1000,
     early_stopping_rounds=100,
     valid_names=['Train', 'Val'],
     evals_result=lgb_results
   )

   # Save each model in CV
   model_list.append(model)

   # Transition of the loss function
   loss_train = lgb_results['Train']['rmse']
   loss_val = lgb_results['Val']['rmse']
   best_iteration = model.best_iteration

   plt.figure()
   plt.xlabel('Iteration')
   plt.ylabel('rmse')
   plt.plot(loss_train, label='train loss')
   plt.plot(loss_val, label='valid loss')
   plt.title('kFold : {} \n RMSE'.format(_index))
   plt.legend()
   plt.show()

   # scatter plot
   plt.figure(figsize=(4,4))
   y_val = model.predict(_X_val, num_iteration=model.best_iteration)
   plt.plot(y_val, y_val, color = 'red', label = '$y=x$')
   plt.scatter(y_val,_Y_val, s=1)
   plt.xlabel('Predicted value')
   plt.ylabel('True value')
   plt.title('kFold : {} \n predicted value vs true value'.format(_index))
   plt.legend()
   plt.show()
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[828] Train's rmse: 0.173173 Val's rmse: 0.448933
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[494] Train's rmse: 0.219891 Val's rmse: 0.453579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[793] Train's rmse: 0.172299 Val's rmse: 0.467271
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[388] Train's rmse: 0.246327 Val's rmse: 0.447831
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[462] Train's rmse: 0.233192 Val's rmse: 0.423386
CPU times: user 1min 18s, sys: 4.58 s, total: 1min 22s
Wall time: 16.8 s
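
A small optional addition (not in the original): each trained Booster records its best scores, so the per-fold validation RMSE can be summarized as follows.

# Summarize the validation RMSE across folds via Booster.best_score
val_rmse = [m.best_score['Val']['rmse'] for m in model_list]
print('per-fold val rmse :', [round(v, 4) for v in val_rmse])
print('mean val rmse     :', np.mean(val_rmse))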

Visualize the importance of features.

Visualize the importance of the features stored in the model. Refer to this site.

feature_importances = pd.DataFrame()

for fold, model in enumerate(model_list):

    tmp = pd.DataFrame()
    tmp['feature'] = model.feature_name()
    tmp['importance'] = model.feature_importance(importance_type='gain')
    tmp['fold'] = fold

    feature_importances = feature_importances.append(tmp)

order = list(feature_importances.groupby("feature")["importance"].mean().sort_values(ascending=False).index)

plt.figure(figsize=(4, 4))
sns.barplot(x='importance', y='feature', data=feature_importances, order=order)
plt.title('LGBM importance')
plt.tight_layout()
plt.show()

Inference with TEST data

We will use TEST data to make predictions for the model generated by each Fold. The predictions are averaged over each Fold.

_test_score_array = np.zeros(len(X_test))

for model in model_list:
  y_pred = model.predict(X_test, num_iteration=model.best_iteration)

  _test_score_array += y_pred / n_splits

_test_score_array
array([2.08291787, 0.88297742, 1.44321429, ... , 0.82898221, 2.08941514,
       3.45456004])
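
To put a single number on the averaged predictions (my addition, not in the original), we can score them against the true TEST targets:

from sklearn.metrics import mean_squared_error

# RMSE of the fold-averaged predictions on the held-out TEST set
rmse = mean_squared_error(Y_test, _test_score_array, squared=False)
print(f'TEST RMSE: {rmse:.4f}')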

Visualize the importance of features in TEST results.

# Prepare a dataframe to store the feature importance.
# Adapted from https://www.sairablog.com/article/lightgbm-sklearn-kaggle-classification.html
top_feature_num = 10
feature_importances = pd.DataFrame({
  'feature' : model.feature_name(),
  'importance': model.feature_importance(importance_type='gain'),
})

order = list(feature_importances.groupby("feature")["importance"].mean().sort_values(ascending=False).index)[:top_feature_num]

plt.figure(figsize=(4, 6))
sns.barplot(x='importance', y='feature', data=feature_importances, order=order)
plt.title('LGBM importance')
plt.tight_layout()
plt.show()

Analysis with SHAP

I didn't know about SHAP until recently, but someone told me about this analysis method, so I'm adding it to the template right away.

With SHAP, we can quantify the impact of each feature on the model's output. It is often used to explain why a model makes the predictions it does.

import shap

shap.initjs()
explainer = shap.Explainer(model, _X_train)
shap_values = explainer(_X_train, check_additivity=False)
100%|===================| 13179/13210 [05:39<00:00]
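
As a quick sanity check (my addition, not in the original), SHAP values are additive: a row's base value plus the sum of its per-feature SHAP values should roughly reconstruct the model's prediction. Because check_additivity=False was passed above, small discrepancies may remain.

# base value + sum of per-feature SHAP values ≈ prediction for row 0
pred = model.predict(_X_train.iloc[[0]], num_iteration=model.best_iteration)[0]
recon = shap_values.base_values[0] + shap_values.values[0].sum()
print(f'prediction: {pred:.4f}, base + shap sum: {recon:.4f}')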

summary plot

With the summary plot, we can visualize the impact of each feature on the model output.

Each point's position along the x-axis is its SHAP value (positive values push the prediction up, negative values push it down), and its color encodes the feature's value (red for high, blue for low).

shap.summary_plot(shap_values, _X_train)

We can also see each feature's overall impact, as the mean absolute SHAP value, with the bar version:

shap.summary_plot(shap_values, _X_train, plot_type="bar")

Analysis by waterfall

For each individual sample, we can visualize how much each feature contributed to its prediction.

shap.plots.waterfall(shap_values[0], max_display=20)
shap_values.shape
(13210, 8)
Y_val = model.predict(_X_train, num_iteration=model.best_iteration)
df_result = pd.DataFrame({
  'index': _X_train.index.tolist(),
  'train_score': Y_val,
})

Compare the samples with the highest predicted scores against those with the lowest to see which features drive each group.

top = df_result.sort_values(by='train_score', ascending=False).index.tolist()
bottom = df_result.sort_values(by='train_score').index.tolist()

Display the top 5 results.

for i in top[:5]:
  shap.plots.waterfall(shap_values[i], max_display=20)

Display the bottom 5 results.

for i in bottom[:5]:
  shap.plots.waterfall(shap_values[i], max_display=20)

Feature dependency check

We can visualize the relationship between each feature's value and its SHAP value.

feature_importances = pd.DataFrame({
  'feature': model.feature_name(),
  'importance': model.feature_importance(importance_type='gain'),
})
feature_list = list(feature_importances.groupby("feature")["importance"].mean().sort_values(ascending=False).index)
for feature in feature_list:
  shap.dependence_plot(feature, shap_values.values, _X_train)

This completes a template for using LightGBM on regression problems. I will update it whenever there are more items worth adding.
