Creating a scikit-learn dataset

scikit-learn provides functions that not only load the standard built-in datasets but can also generate datasets themselves, doing the sampling for you. You can generate whatever dataset you need for a regression or classification problem.

In most cases the data is already given to you, but sometimes you just need a simple dataset to try out an analysis, so these generators come in handy from time to time.

If you only have a few variables you can quickly build a dataset with numpy, but if you have many variables, or you want correlations between them, these generators are the easier option.
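For reference, a quick numpy-only version might look like the sketch below (the coefficients, sample size, and noise level are arbitrary values chosen for illustration):

import numpy as np

# hypothetical two-variable dataset built directly with numpy
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(scale=0.5, size=200)  # crude correlation with x1
y = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=1.0, size=200)
X = np.column_stack([x1, x2])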

scikit-learn index

  1. official data set
  2. Create data <= this section
  3. linear regression
  4. logistic regression

github

  • The file in Jupyter notebook format is here

google colaboratory

  • To run it in Google Colaboratory, open it here (datasets/ds_nb.ipynb)

Please refer to the official page for details.

Author’s environment

The author’s OS is macOS, so some command options may differ from those on Linux and other Unix-like systems.

! sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G6020
!python -V
Python 3.7.3
import sklearn

sklearn.__version__
'0.20.3'
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

print('matplotlib version :', matplotlib.__version__)
print('numpy version :', np.__version__)
matplotlib version : 3.0.3
numpy version : 1.16.2

Making a dataset for a classification problem

make_blobs

The official page is here. According to the description, it samples data from isotropic Gaussian distributions, producing a few blobs (clusters) with no correlation between the variables. It is very simple and convenient.

You can specify the number of samples, the number of features, the number of clusters, their standard deviations, etc.

from sklearn.datasets import make_blobs

# 200 samples, 2 features, 4 clusters, each with standard deviation 1
X, y = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1)

plt.grid()
plt.scatter(X[:, 0], X[:, 1], c=y, marker='o')
plt.show()

We now have a dataset with four clusters for a classification problem. This is useful.
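As a usage note, centers can also be passed as explicit coordinates and cluster_std as one value per cluster; the numbers below are arbitrary choices just to illustrate the parameters:

X2, y2 = make_blobs(
  n_samples=300,
  centers=[[-5, 0], [0, 5], [5, 0]],  # explicit cluster centers
  cluster_std=[0.5, 1.0, 2.0],  # a different spread for each cluster
  random_state=0)

plt.grid()
plt.scatter(X2[:, 0], X2[:, 1], c=y2, marker='o')
plt.show()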

make_classification

The data-generation algorithm behind make_classification seems a bit more involved: it appears that the initial points are drawn from Gaussian distributions and are then transformed to produce the final data. I will look into the details when I get a chance and add a note.

from sklearn.datasets import make_classification

X, y = make_classification(
  n_samples=200,
  n_features=2,
  n_informative=2,  # features that actually carry class information
  n_redundant=0,  # no redundant (linearly dependent) features
  n_clusters_per_class=1,
  n_classes=3)

plt.grid()
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
plt.show()
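To get a feel for the "transformed" part, one illustrative check (my own addition, not from the original text) is to request a redundant feature and look at the correlation matrix of the columns; the redundant column is built as a linear combination of the informative ones, so it typically shows a clear correlation with them:

# hypothetical check: one redundant feature derived from the informative ones
X3, y3 = make_classification(
  n_samples=200,
  n_features=3,
  n_informative=2,
  n_redundant=1,  # generated as a combination of the informative features
  n_clusters_per_class=1,
  n_classes=3,
  random_state=0)

print(np.corrcoef(X3, rowvar=False).round(2))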

Making a dataset for a regression problem

The official page is here. Similarly, it is simple and convenient.

You can flexibly set the number of samples, number of features, number of features that are strongly correlated with the target variable, noise, bias, etc.

from sklearn.datasets import make_regression

X, y, coef = make_regression(
  n_samples=200,
  n_features=2,
  n_informative=2,
  noise=6.0,  # standard deviation of the Gaussian noise added to the target
  bias=-2.0,  # intercept of the underlying linear model
  coef=True)  # also return the coefficients of the underlying model

plt.grid()
plt.plot(X[:, 0], y, "o", c="red")
plt.show()

The output is a data set that seems to be suitable for regression problems.
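Since coef=True also returns the true coefficients, one simple sanity check (my own addition, assuming the usual sklearn.linear_model API) is to fit an ordinary least-squares model and compare its estimates with coef and with bias:

from sklearn.linear_model import LinearRegression

# fit ordinary least squares to the generated data
reg = LinearRegression().fit(X, y)

print('true coefficients  :', coef)
print('fitted coefficients:', reg.coef_)
print('fitted intercept   :', reg.intercept_)  # should be close to bias=-2.0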

References