Linear regression (multiple regression) of two variables with scikit-learn

scikit-learn makes linear regression easy, so I'm leaving this here as a reminder. In this section we run a linear regression with two explanatory variables. A regression with two or more explanatory variables is called multiple regression, while a regression with a single explanatory variable is called simple regression.

scikit-learn Table of Contents

  1. Official data set
  2. Creating the data
  3. [Linear regression](/article/library/sklearn/linear_regression/) <= this section
  4. Logistic regression

GitHub

Google Colaboratory

Author’s environment

The author's OS is macOS, so note that some command options differ from those of the Linux and Unix versions of the same commands.

!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
!python -V
Python 3.7.3

Load the required libraries.

import numpy as np
import scipy
from scipy.stats import binom

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print("numpy version :", np.__version__)
print("matplotlib version :", matplotlib.__version__)
print("sns version :",sns.__version__)
numpy version : 1.16.2
matplotlib version : 3.0.3
sns version : 0.9.0
import sklearn

sklearn.__version__
'0.20.3'

Get the data

We will run a linear regression with two explanatory variables. You could use one of sklearn's built-in datasets, but we will create our own dataset for practice.

A built-in dataset can be loaded as follows; as I recall, the official scikit-learn documentation uses this dataset in its examples.

from sklearn.datasets import load_boston

boston = load_boston()
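
For reference, the loaded object is a sklearn Bunch with data, target, and feature_names attributes; a quick sketch of how to peek at it:

# The Boston housing data: 506 samples, 13 features
print(boston.data.shape)      # (506, 13)
print(boston.target.shape)    # (506,)
print(boston.feature_names)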

Creating the data

Assume we have a linear regression equation of the following form, with two explanatory variables and one objective variable.

$$ y = a_0 + a_1 x_1 + a_2 x_2 $$

To create the data, we assume $a_0=5, a_1=2, a_2=1$ and add noise drawn from a standard normal distribution to each point.

from mpl_toolkits.mplot3d import Axes3D

x1 = np.linspace(-3, 3, 10)
x2 = np.linspace(1, 5, 10)

# We want to add an independent random draw to each element of the grid,
# so we build the array element by element instead of relying on broadcasting
# (see the vectorized sketch after the plot below).
# a_0=5, a_1=2, a_2=1
def get_y(x1, x2):
  return np.array([
    [2 * __x1 + __x2 + np.random.randn() + 5 for __x1, __x2 in zip(_x1,_x2)] for _x1, _x2 in zip(x1,x2)
  ])

X1, X2 = np.meshgrid(x1, x2)
Y = get_y(X1, X2)

X1 = X1.reshape(-1)
X2 = X2.reshape(-1)
Y = Y.reshape(-1)

fig = plt.figure()
ax = Axes3D(fig)

ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_zlabel("$f(x_1, x_2)$")

# ax.plot_wireframe(X1, X2, Y)
ax.plot(X1, X2, Y, "o", color="#ff0000", ms=4, mew=0.5)
[<mpl_toolkits.mplot3d.art3d.Line3D at 0x12689aac8>]
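
Incidentally, the per-element loop in get_y is not strictly required to get independent noise: np.random.randn(*shape) draws a separate sample for every element at once. Here is a vectorized sketch, under the same assumed parameters $a_0=5, a_1=2, a_2=1$:

# Vectorized alternative to get_y: broadcasting handles the linear part,
# and randn(*x1.shape) adds an independent noise draw per grid element
def get_y_vectorized(x1, x2):
  return 2 * x1 + x2 + 5 + np.random.randn(*x1.shape)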
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
X = np.array([
  X1,
  X2
]).T

print(X.shape)
print(Y.shape)
lr.fit(X,Y)

print('coefficient :', lr.coef_)
print('offset :', lr.intercept_)
(100, 2)
(100,)
coefficient : [2.01485573 1.03137701]
offset : 4.967149651408958

and we recover the parameters $a_0, a_1, a_2$: the estimates are close to the true values used to generate the data. In other words, the fitted model corresponds to the following linear regression equation.

$$ y = 5 + 2 x_1 + x_2 $$
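
As a quick sanity check (a minimal sketch using the fitted coef_ and intercept_ attributes), lr.predict is nothing more than the intercept plus the dot product of the features with the coefficients:

# Reconstruct the predictions manually from the fitted parameters
manual = lr.intercept_ + X @ lr.coef_
print(np.allclose(manual, lr.predict(X)))  # True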

Using the fitted model, let's first calculate the mean squared error on the training data (we will try a fresh test set afterwards).

from sklearn.metrics import mean_squared_error

Y_predict = lr.predict(X)
print("MSE : {:.2f}".format(mean_squared_error(Y_predict, Y)))
MSE : 0.81

The result is MSE : 0.81, roughly in line with the unit variance of the standard normal noise we added. It's very easy. sklearn is great.
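Since get_y draws fresh noise on every call, we can also generate a genuinely new test set with the same generator and evaluate on that. A minimal sketch, reusing the grid and helper defined above:

# Generate a fresh test set: same grid, new noise draws
X1_test, X2_test = np.meshgrid(np.linspace(-3, 3, 10), np.linspace(1, 5, 10))
Y_test = get_y(X1_test, X2_test).reshape(-1)
X_test = np.array([X1_test.reshape(-1), X2_test.reshape(-1)]).T

print("test MSE : {:.2f}".format(mean_squared_error(Y_test, lr.predict(X_test))))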

Summary

You can learn a lot from linear regression alone. Also, in practical data analysis, simple models are often preferred over complex ones (though I suppose it depends on the data...). If linear regression can explain the data, I think that is enough.