## Logistic regression with scikit-learn

Logistic regression is easy to perform with scikit-learn, so this post serves as a quick reference for fitting and predicting. Although it is called "regression", logistic regression is actually a method for solving classification problems.

### scikit-learn series table of contents

- official data set
- creating data
- linear regression
- Logistic regression <= this section

### github

- The file in jupyter notebook format is [here](https://github.com/hiroshi0530/wa-src/blob/master/article/library/sklearn/logistic_regression/lr_nb.ipynb)

### google colaboratory

- If you want to run it in google colaboratory, click [here](https://colab.research.google.com/github/hiroshi0530/wa-src/blob/master/article/library/sklearn/logistic_regression/lr_nb.ipynb)

### Author’s environment

The author's OS is macOS, so note that some command-line options differ from those of Linux and Unix.

```
! sw_vers
```

```
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
```

```
!python -V
```

```
Python 3.7.3
```

```
import sklearn
sklearn.__version__
```

```
'0.20.3'
```

Load the required libraries.

```
import numpy as np
import scipy
from scipy.stats import binom
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
print("numpy version :", np.__version__)
print("matplotlib version :", matplotlib.__version__)
print("sns version :",sns.__version__)
```

```
numpy version : 1.16.2
matplotlib version : 3.0.3
sns version : 0.9.0
```

## Overview

Logistic regression is applied to binary classification problems: predicting whether a customer will buy a certain product, whether a person will vote for a certain candidate, and so on. It is a regression method often used in fields such as marketing.

For example, let the probability of purchasing a certain product be $p$, and let the explanatory variables be $x_1,x_2,x_3 \cdots$, and apply the linear regression equation to the log odds of the probability $p$. The definition of log odds will be explained later.

$$ a_0 x_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = \log \frac{p}{1-p} $$

Solving this for $p$, we get

$$ \displaystyle p = \frac{1}{1 + \exp\left( -\sum_{i=0}^n a_i x_i \right)} $$

and the probability $p$ takes the form of a logistic function.
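As a quick numerical check of the derivation above, the following sketch (with arbitrary illustrative coefficients $a_i$ and inputs $x_i$) confirms that applying the logistic function to the linear predictor and then taking the log odds recovers the original value:

```python
import numpy as np

# Hypothetical coefficients a_i and inputs x_i (illustrative values only)
a = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.7])

# Linear predictor: sum_i a_i x_i
z = a @ x

# Logistic function: p = 1 / (1 + exp(-z))
p = 1.0 / (1.0 + np.exp(-z))

# Log odds log(p / (1 - p)) should recover z
log_odds = np.log(p / (1.0 - p))
print(np.isclose(log_odds, z))  # True
```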

## Implementation

Now we will implement the logistic regression in scikit-learn. First, we will create the appropriate data. For simplicity, we will restrict ourselves to the one-dimensional case.

```
# 30 evenly spaced points in [0, 1]
x = np.linspace(0, 1, 30)
# label 1 where x > 0.5, otherwise 0
y = np.array(list(map(lambda x: 1 if x > 0.5 else 0, x)))
plt.grid()
plt.plot(x, y, "o")
plt.show()
```

The data are a bit crude, but let's fit a logistic regression to them and see how it predicts.

```
from sklearn.linear_model import LogisticRegression

x = np.linspace(0, 1, 30)
y = np.array(list(map(lambda x: 1 if x > 0.5 else 0, x)))

# In recent versions, a warning is shown if the solver is not specified.
# See the official documentation for details; note that the default solver
# does not support L1 regularization.
lr = LogisticRegression(solver='lbfgs', penalty='l2')
x = x.reshape(30, -1)
lr.fit(x, y)

# Try to predict
for i in range(10):
    print('x = {:.1f}, predict ='.format(i * 0.1), lr.predict([[i * 0.1]])[0])
```

```
x = 0.0, predict = 0
x = 0.1, predict = 0
x = 0.2, predict = 0
x = 0.3, predict = 0
x = 0.4, predict = 0
x = 0.5, predict = 1
x = 0.6, predict = 1
x = 0.7, predict = 1
x = 0.8, predict = 1
x = 0.9, predict = 1
```

So the prediction is correct. Note that $x=0.5$ sits right on the decision boundary, so its predicted class may flip between runs.
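To see why the boundary is fuzzy, `predict_proba` shows the class probabilities directly; near $x=0.5$ the model is close to 50/50. This is a sketch that refits the same toy data as above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as above: label 1 where x > 0.5
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = (x.ravel() > 0.5).astype(int)

lr = LogisticRegression(solver='lbfgs', penalty='l2')
lr.fit(x, y)

# predict_proba returns [P(class 0), P(class 1)] for each sample;
# near the boundary x = 0.5 the two probabilities are close
for v in [0.3, 0.5, 0.7]:
    p0, p1 = lr.predict_proba([[v]])[0]
    print('x = {:.1f}, P(y=1) = {:.3f}'.format(v, p1))
```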

## Logit function

Is this "odds" the same as the odds quoted in horse racing? I don't follow horse racing myself, so I can't say; please let me know. In any case, if the probability of an event occurring is $p$, then

$$ \frac{p}{1-p} $$

is called the odds, and its logarithm

$$ \log \frac{p}{1-p} = \log p - \log(1-p) $$

is called the log odds.
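These definitions are easy to check numerically; the following sketch prints the odds and log odds for a few illustrative probabilities:

```python
import numpy as np

# Odds and log odds for a few probabilities (illustrative values)
for p in [0.2, 0.5, 0.8]:
    odds = p / (1 - p)
    log_odds = np.log(odds)
    print('p = {:.1f}, odds = {:.2f}, log odds = {:+.2f}'.format(p, odds, log_odds))
```

Note that $p=0.5$ gives odds of 1 and log odds of 0, and that the log odds of $p$ and $1-p$ differ only in sign.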

### Form of the logit function

In general,

$$ y = \log \frac{x}{1-x}, \quad (0 < x < 1) $$

is called the logit function. Let's take a look at its shape using scipy. It diverges at $x=0$ and $x=1$.

```
from scipy.special import logit
x = np.linspace(0,1,100)
y = logit(x)
plt.grid()
plt.plot(x,y)
```

```
[<matplotlib.lines.Line2D at 0x1213b5dd8>]
```

## Logistic functions (sigmoid functions)

In general,

$$ f(x)= \frac{1}{1+e^{-x}} $$

is called the logistic function, also known as the sigmoid function. It is the inverse function of the logit function. Since it is available as a module in scipy, let's graph it as well.

```
from scipy.special import expit
x = np.linspace(-8,8,100)
y = expit(x)
plt.grid()
plt.plot(x,y)
```

```
[<matplotlib.lines.Line2D at 0x11f01d080>]
```
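Since `expit` is the inverse of `logit`, the relationship can be verified numerically; this short check composes the two functions in both orders:

```python
import numpy as np
from scipy.special import expit, logit

# expit(logit(p)) should recover p for probabilities in (0, 1)
p = np.linspace(0.01, 0.99, 50)
print(np.allclose(expit(logit(p)), p))  # True

# logit(expit(x)) should recover x on the real line
x = np.linspace(-8, 8, 50)
print(np.allclose(logit(expit(x)), x))  # True
```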