LightGBM template

LightGBM is widely used for analyzing tabular data, and since I've been using it a lot recently, I've put it together as a template.

(Update 2022/2/5: a skilled data scientist taught me how to analyze features with SHAP, so I've added it to the template.)

  • Getting the data
  • Creating a Model
  • Cross Validation
  • Evaluation of TEST
  • Analysis by Shap

github

  • The Jupyter Notebook version of this file is available on GitHub here

google colaboratory

  • To run it in Google Colaboratory, see here

Author’s environment

!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G9323
!python -V
Python 3.8.5
import sys
sys.executable
'/Users/hiroshi.wayama/anaconda3/envs/lgbm2/bin/python'

Load the library

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import time
import json
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import japanize_matplotlib
import shap

import lightgbm as lgb

from sklearn import datasets
from sklearn import metrics

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score

from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn.datasets import fetch_california_housing

import seaborn as sns
sns.set_style('darkgrid')
sns.set(font='IPAexGothic')

import warnings
warnings.filterwarnings('ignore')

Loading data

We used to use the Boston house-price dataset here, but since it was recently deprecated, we will use the newly added California house-price dataset instead.

california_housing = fetch_california_housing(as_frame=True)
df = pd.DataFrame(california_housing['data'], columns=california_housing.feature_names)
df_target = pd.Series(california_housing['target'])

# template
# Set the explanatory variables
X_data = df
# set the target variable
Y_data = df_target
X_data.shape, Y_data.shape
((20640, 8), (20640,))
X_data.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
Y_data.head()
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
Name: MedHouseVal, dtype: float64

Constants

Set the seed and other constants so that the random values are reproducible.

# const
seed = 123
random_state = 123
n_splits = 5
test_size = 0.2

Function to fix the seed.

A function that sets all the relevant seeds in one place (it also covers PyTorch, for templates that use it).

import os
import torch

def set_seed(seed=seed):
  os.environ["PYTHONHASHSEED"] = str(seed)
  np.random.seed(seed)
  random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  # Comment out if it degrades GPU performance
  torch.backends.cudnn.deterministic = True
  torch.use_deterministic_algorithms(True)
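
To use it, call it once near the top of the notebook:

set_seed(seed)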

Split the data.

Split the data into Train and Test sets.

X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=test_size, random_state=random_state)
X_train.head()
      MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
9950  4.5694      28.0  6.219512   1.030488       504.0  3.073171     38.38    -122.33
3547  5.6392      18.0  5.951644   1.034816      3010.0  2.911025     34.26    -118.60
4448  1.7292      47.0  3.628032   1.032345      1452.0  3.913747     34.07    -118.21
6984  4.6226      36.0  5.126238   0.985149       988.0  2.445545     33.96    -118.02
4432  2.4375      49.0  4.024390   0.942073      1405.0  4.283537     34.08    -118.20
X_test.head()
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
19121  3.7917      40.0  4.959799   1.030151      1039.0  2.610553     38.24    -122.64
20019  4.0217       9.0  5.804577   1.000000      1749.0  3.079225     36.09    -119.05
15104  4.0882      12.0  5.360360   1.070571      3321.0  4.986486     32.85    -116.98
3720   2.2377      27.0  3.376582   1.023207      3403.0  3.589662     34.20    -118.42
8938   4.4211      41.0  5.656904   1.165272      1047.0  2.190377     34.01    -118.47
Y_train.head()
9950 2.875
3547 2.715
4448 1.917
6984 2.197
4432 1.140
Name: MedHouseVal, dtype: float64
Y_test.head()
19121 1.516
20019 0.992
15104 1.345
3720 2.317
8938 4.629
Name: MedHouseVal, dtype: float64

Creating a model using the Train data.

Cross Validation

Perform cross validation on the Train data. Cross validation is a method that splits the data, builds a model on one part, and validates it on the remaining part. In K-fold cross validation, the data is divided into K parts; K-1 of them are used for training and the remaining one for validation.

For each fold, we also plot the transition of the loss function and a scatter plot of the model's predicted values against the true values.
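
Before the full loop, here is a minimal sketch (my addition, not part of the original template) of how KFold hands out train/validation indices, assuming 10 samples and 5 folds:

# Minimal illustration: KFold yields (train_indices, val_indices) per fold
demo_kf = KFold(n_splits=5, shuffle=True, random_state=123)
for fold, (train_idx, val_idx) in enumerate(demo_kf.split(np.arange(10))):
    print(f'fold {fold}: train={train_idx}, val={val_idx}')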

%%time

params = {
  'random_state': random_state,
  'objective': 'regression',
  'boosting_type': 'gbdt',
  'metric': {'rmse'},
  'verbosity': -1,
  'bagging_freq': 1,
  'feature_fraction': 0.8,
  'max_depth': 8,
  'min_data_in_leaf': 25,
  'num_leaves': 256,
  'learning_rate': 0.07,
  'lambda_l1': 0.2,
  'lambda_l2': 0.5,
}

model_list = []
kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

for _index, (_train_index, _val_index) in enumerate(kf.split(X_train, Y_train)):

   _X_train = X_train.iloc[_train_index]
   _Y_train = Y_train.iloc[_train_index]

   _X_val = X_train.iloc[_val_index]
   _Y_val = Y_train.iloc[_val_index]

   lgb_train = lgb.Dataset(_X_train, _Y_train)
   lgb_val = lgb.Dataset(_X_val, _Y_val, reference=lgb_train)

   lgb_results = {}

   model = lgb.train(
     params,
     train_set=lgb_train,
     valid_sets=[lgb_train, lgb_val],
     verbose_eval=-1,
     num_boost_round=1000,
     early_stopping_rounds=100,
     valid_names=['Train', 'Val'],
     evals_result=lgb_results
   )

   # Save each model in CV
   model_list.append(model)

   # Transition of the loss function
   loss_train = lgb_results['Train']['rmse']
   loss_val = lgb_results['Val']['rmse']
   best_iteration = model.best_iteration

   plt.figure()
   plt.xlabel('Iteration')
   plt.ylabel('rmse')
   plt.plot(loss_train, label='train loss')
   plt.plot(loss_val, label='valid loss')
   plt.title('kFold : {} \n RMSE'.format(_index))
   plt.legend()
   plt.show()

   # scatter plot
   plt.figure(figsize=(4,4))
   y_val = model.predict(_X_val, num_iteration=model.best_iteration)
   plt.plot(y_val, y_val, color = 'red', label = '$y=x$')
   plt.scatter(y_val,_Y_val, s=1)
   plt.xlabel('Predicted value')
   plt.ylabel('True value')
   plt.title('kFold : {} \n predicted value vs true value'.format(_index))
   plt.legend()
   plt.show()
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[828] Train's rmse: 0.173173 Val's rmse: 0.448933
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[494] Train's rmse: 0.219891 Val's rmse: 0.453579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[793] Train's rmse: 0.172299 Val's rmse: 0.467271
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[388] Train's rmse: 0.246327 Val's rmse: 0.447831
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[462] Train's rmse: 0.233192 Val's rmse: 0.423386
CPU times: user 1min 18s, sys: 4.58 s, total: 1min 22s
Wall time: 16.8 s
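
A small optional addition (not in the original): each trained Booster records its best scores, so the per-fold validation RMSE can be summarized as follows.

# Summarize the validation RMSE across folds via Booster.best_score
val_rmse = [m.best_score['Val']['rmse'] for m in model_list]
print('per-fold val rmse :', [round(v, 4) for v in val_rmse])
print('mean val rmse     :', np.mean(val_rmse))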

Visualize the importance of features.

Visualize the importance of the features stored in the model. Refer to this site.

feature_importances = pd.DataFrame()

for fold, model in enumerate(model_list):

    tmp = pd.DataFrame()
    tmp['feature'] = model.feature_name()
    tmp['importance'] = model.feature_importance(importance_type='gain')
    tmp['fold'] = fold

    feature_importances = feature_importances.append(tmp)

order = list(feature_importances.groupby("feature")["importance"].mean().sort_values(ascending=False).index)

plt.figure(figsize=(4, 4))
sns.barplot(x='importance', y='feature', data=feature_importances, order=order)
plt.title('LGBM importance')
plt.tight_layout()
plt.show()

Inference with TEST data

We will use TEST data to make predictions for the model generated by each Fold. The predictions are averaged over each Fold.

_test_score_array = np.zeros(len(X_test))

for model in model_list:
  y_pred = model.predict(X_test, num_iteration=model.best_iteration)

  _test_score_array += y_pred / n_splits

_test_score_array
array([2.08291787, 0.88297742, 1.44321429, ... , 0.82898221, 2.08941514,
       3.45456004])
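
To put a single number on the averaged predictions (my addition, not in the original), we can score them against the true TEST targets:

from sklearn.metrics import mean_squared_error

# RMSE of the fold-averaged predictions on the held-out TEST set
rmse = mean_squared_error(Y_test, _test_score_array, squared=False)
print(f'TEST RMSE: {rmse:.4f}')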

Visualize the importance of features in TEST results.

# Prepare a dataframe to store the feature importance.
# Adapted from https://www.sairablog.com/article/lightgbm-sklearn-kaggle-classification.html
top_feature_num = 10
feature_importances = pd.DataFrame({
  'feature' : model.feature_name(),
  'importance': model.feature_importance(importance_type='gain'),
})

order = list(feature_importances.groupby("feature")["importance"].mean().sort_values(ascending=False).index)[:top_feature_num]

plt.figure(figsize=(4, 6))
sns.barplot(x='importance', y='feature', data=feature_importances, order=order)
plt.title('LGBM importance')
plt.tight_layout()
plt.show()

Analysis with SHAP

I didn't know about SHAP until recently, but someone told me about this analysis method, so I'm adding it to the template right away.

With SHAP, we can quantify the impact of each feature on the model's output. It is often used to explain why a model makes the predictions it does.

import shap

shap.initjs()
explainer = shap.Explainer(model, _X_train)
shap_values = explainer(_X_train, check_additivity=False)
100%|===================| 13179/13210 [05:39<00:00]
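
As a quick sanity check (my addition, not in the original), SHAP values are additive: a row's base value plus the sum of its per-feature SHAP values should roughly reconstruct the model's prediction. Because check_additivity=False was passed above, small discrepancies may remain.

# base value + sum of per-feature SHAP values ≈ prediction for row 0
pred = model.predict(_X_train.iloc[[0]], num_iteration=model.best_iteration)[0]
recon = shap_values.base_values[0] + shap_values.values[0].sum()
print(f'prediction: {pred:.4f}, base + shap sum: {recon:.4f}')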

summary plot

With the summary plot, we can visualize the impact of each feature on the model output.

Each point's position along the x-axis is its SHAP value (positive values push the prediction up, negative values push it down), and its color encodes the feature's value (red for high, blue for low).

shap.summary_plot(shap_values, _X_train)

We can also see each feature's overall impact, as the mean absolute SHAP value, with the bar version:

shap.summary_plot(shap_values, _X_train, plot_type="bar")

Analysis by waterfall

For each individual sample, we can visualize how much each feature contributed to its prediction.

shap.plots.waterfall(shap_values[0], max_display=20)
shap_values.shape
(13210, 8)
Y_val = model.predict(_X_train, num_iteration=model.best_iteration)
df_result = pd.DataFrame({
  'index': _X_train.index.tolist(),
  'train_score': Y_val,
})

Compare the samples with the highest predicted scores against those with the lowest to see which features drive each group.

top = df_result.sort_values(by='train_score', ascending=False).index.tolist()
bottom = df_result.sort_values(by='train_score').index.tolist()

Display the top 5 results.

for i in top[:5]:
  shap.plots.waterfall(shap_values[i], max_display=20)

Display the bottom 5 results.

for i in bottom[:5]:
  shap.plots.waterfall(shap_values[i], max_display=20)

Feature dependency check

We can visualize the relationship between each feature's value and its SHAP value.

feature_importances = pd.DataFrame({
  'feature': model.feature_name(),
  'importance': model.feature_importance(importance_type='gain'),
})
feature_list = list(feature_importances.groupby("feature")["importance"].mean().sort_values(ascending=False).index)
for feature in feature_list:
  shap.dependence_plot(feature, shap_values.values, _X_train)

This completes a template for using LightGBM on regression problems. I will update it whenever there are more items worth adding.
