How to use the scikit-learn datasets
scikit-learn is a must-have library for machine learning and data analysis. In this section, I’ll write down how to use the datasets that come with scikit-learn by default.
scikit-learn contents
- Official datasets <= this section
- Create data
- [Linear regression](/article/library/sklearn/linear_regression/)
- Logistic regression
github
- Files in jupyter notebook format are available here
google colaboratory
- To run it in Google Colaboratory, see here (datasets/ds_nb.ipynb)
environment
The author's OS is macOS, so the commands and options may differ from those on Linux or Unix.
My environment
!sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
!python -V
Python 3.7.3
import sklearn
sklearn.__version__
'0.20.3'
We also import pandas to display the data.
import pandas as pd
pd.__version__
'1.0.3'
We will also import matplotlib to display images. The figures are rendered as svg for better appearance on the web.
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
Overview
scikit-learn provides us with the datasets we need for machine learning. Here is an overview of the sample data according to the official website.
- Toy datasets
- Real world datasets
Toy datasets
The name "toy" presumably means that these datasets are small and simple, more suited to practice than to building real machine learning models.
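As a quick way to see which loaders ship with scikit-learn, we can list the load_* functions in sklearn.datasets (a small sketch; the exact set depends on the installed version):

from sklearn import datasets

# Loader functions bundled with scikit-learn (version dependent).
print([name for name in dir(datasets) if name.startswith('load_')])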
Boston house price data
- target: house prices
- Regression problem
from sklearn.datasets import load_boston
boston = load_boston()
Since this is the first dataset, we'll look at the data a little more carefully.
type(boston)
Bunch
We can see that the data type is sklearn.utils.Bunch.
dir(boston)
['DESCR', 'data', 'feature_names', 'filename', 'target']
We can see that it has the attributes DESCR, data, feature_names, filename, and target. Let's look at them one by one. DESCR is a textual description of the dataset and filename is the absolute path to the data file, so we'll skip those two.
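For example, the description can be skimmed with a quick print (it's long, so here we show only the first part):

print(boston.DESCR[:500])  # first part of the dataset description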
boston.data
This is the actual data, containing the features to be analyzed. These are also called the explanatory variables.
boston.data
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ... , 1.5300e+01, 3.9690e+02,
4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, ... , 1.7800e+01, 3.9690e+02,
9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, ... , 1.7800e+01, 3.9283e+02,
4.0300e+00],
... ,
[6.0760e-02, 0.0000e+00, 1.1930e+01, ... , 2.1000e+01, 3.9690e+02,
5.6400e+00],
[1.0959e-01, 0.0000e+00, 1.1930e+01, ... , 2.1000e+01, 3.9345e+02,
6.4800e+00],
[4.7410e-02, 0.0000e+00, 1.1930e+01, ... , 2.1000e+01, 3.9690e+02,
7.8800e+00]])
boston.feature_names
The name of each feature.
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
boston.target
The target values to predict. According to the official website, for boston this is the median home value (MEDV) in units of $1000s.
boston.target
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,
17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,
25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,
23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,
32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,
34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,
20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,
26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,
31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,
22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,
42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,
36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,
32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,
20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,
20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,
22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,
21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,
19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,
32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,
18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,
16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,
13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8,
7.2, 10.5, 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,
12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5, 5. , 11.9,
27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3, 7. , 7.2, 7.5, 10.4,
8.8, 8.4, 16.7, 14.2, 20.8, 13.4, 11.7, 8.3, 10.2, 10.9, 11. ,
9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8,
10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,
15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,
19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,
29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,
20.6, 21.2, 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,
23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9])
Let's load the data into pandas.
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df['MV'] = pd.DataFrame(data=boston.target)
df.shape
(506, 14)
df.head()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MV
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
So the number of samples is 506. The statistics for each feature are as follows.
df.describe()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MV
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
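Since matplotlib is already set up, a quick histogram of the target gives a feel for its distribution (a minimal sketch; the bin count is arbitrary):

plt.hist(df['MV'], bins=30)  # distribution of median home values
plt.xlabel('MV')
plt.ylabel('count')
plt.show()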
Iris data
- target: iris species
- Classification problem
from sklearn.datasets import load_iris
iris = load_iris()
print(type(iris))
print(dir(iris))
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['IRIS'] = pd.DataFrame(data=iris.target)
df.shape
(150, 5)
df.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | IRIS
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Since the first five rows all have the target value 0, let's shuffle and sample the data instead.
df.sample(frac=1, random_state=0).reset_index().head()
 | index | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | IRIS
---|---|---|---|---|---|---
0 | 114 | 5.8 | 2.8 | 5.1 | 2.4 | 2 |
1 | 62 | 6.0 | 2.2 | 4.0 | 1.0 | 1 |
2 | 33 | 5.5 | 4.2 | 1.4 | 0.2 | 0 |
3 | 107 | 7.3 | 2.9 | 6.3 | 1.8 | 2 |
4 | 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
The features are as follows.
feature | meaning
---|---
sepal length | length of the sepal (cm)
sepal width | width of the sepal (cm)
petal length | length of the petal (cm)
petal width | width of the petal (cm)
The target IRIS takes the values 0, 1, and 2; the corresponding species names can be checked with iris.target_names.
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
Each value corresponds to an index in this array, as shown in the table below.
index | IRIS |
---|---|
0 | setosa |
1 | versicolor |
2 | virginica |
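If we want the species names directly in the DataFrame, we can map the indices ourselves (a small sketch; the IRIS_NAME column name is my own choice, not part of the dataset):

# Map target indices to species names: 0 -> setosa, 1 -> versicolor, 2 -> virginica
df['IRIS_NAME'] = df['IRIS'].map(dict(enumerate(iris.target_names)))
df.head()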
Diabetes data
- target: a quantitative measure of disease progression one year after baseline
- Regression problem
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(type(diabetes))
print(dir(diabetes))
print(diabetes.feature_names)
print(diabetes.data.shape)
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'data_filename', 'feature_names', 'target', 'target_filename']
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
(442, 10)
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df['QM'] = diabetes.target # QM : quantitative measure
df.head()
 | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | QM
---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
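As a minimal sketch of how such a regression dataset is typically consumed (my own example, not from the original article), we can fit an ordinary linear regression and check the R^2 score on held-out data:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out a test set, fit ordinary least squares, and report R^2.
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))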
Handwritten digits data
- target: digits from 0 to 9
- Classification problem
The data is stored in both digits.images and digits.data: each element of images is an 8x8 two-dimensional array, while each row of data is the same image flattened into a one-dimensional array of 64 values.
from sklearn.datasets import load_digits
digits = load_digits()
print(type(digits))
print(dir(digits))
print(digits.data.shape)
print(digits.images.shape)
print(digits.target_names)
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'images', 'target', 'target_names']
(1797, 64)
(1797, 8, 8)
[0 1 2 3 4 5 6 7 8 9]
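We can confirm the relationship between the two representations: each row of data should be the corresponding images entry flattened (a quick sanity check):

import numpy as np

# Each flattened 8x8 image should equal the corresponding 64-element row of data.
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))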
The first image in the dataset looks like this.
print(digits.images[0])
[[ 0. 0. 5. 13. 9. 1. 0. 0.]
[ 0. 0. 13. 15. 10. 15. 5. 0.]
[ 0. 3. 15. 2. 0. 11. 8. 0.]
[ 0. 4. 12. 0. 0. 8. 8. 0.]
[ 0. 5. 8. 0. 0. 9. 8. 0.]
[ 0. 4. 11. 0. 1. 12. 7. 0.]
[ 0. 2. 14. 5. 10. 12. 0. 0.]
[ 0. 0. 6. 13. 10. 0. 0. 0.]]
Let's render digits.images[0] as an image.
plt.imshow(digits.images[0], cmap='gray')
plt.grid(False)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x125974fd0>
It kind of looks like a zero, doesn't it? Let's try changing the colormap.
plt.imshow(digits.images[0])
plt.grid(False)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x12594af28>
Is it easier to see than grayscale? Maybe not much of a difference. Of course, the correct label is also provided for this data.
print(digits.target[0])
0
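For a broader look, we can tile the first ten images together with their labels (a small sketch reusing the matplotlib setup above):

# Plot the first ten digits; zip stops after the 10 axes.
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap='gray')
    ax.set_title(str(label))
    ax.axis('off')
plt.show()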
Physiological data and athletic performance data
- target: physiological measurements (weight, waist, pulse)
- Regression problem
The task is to predict physical characteristics such as weight and waist from exercise performance (chin-ups, sit-ups, jumps).
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
print(type(linnerud))
print(dir(linnerud))
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'data_filename', 'feature_names', 'target', 'target_filename', 'target_names']
df1 = pd.DataFrame(data=linnerud.data, columns=linnerud.feature_names)
df2 = pd.DataFrame(data=linnerud.target, columns=linnerud.target_names)
df1.head()
 | Chins | Situps | Jumps
---|---|---|---
0 | 5.0 | 162.0 | 60.0 |
1 | 2.0 | 110.0 | 60.0 |
2 | 12.0 | 101.0 | 101.0 |
3 | 12.0 | 105.0 | 37.0 |
4 | 13.0 | 155.0 | 58.0 |
df2.head()
 | Weight | Waist | Pulse
---|---|---|---
0 | 191.0 | 36.0 | 50.0 |
1 | 189.0 | 37.0 | 52.0 |
2 | 193.0 | 38.0 | 58.0 |
3 | 162.0 | 35.0 | 62.0 |
4 | 189.0 | 35.0 | 46.0 |
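If we want the exercise features and the physiological targets side by side, we can concatenate the two frames (a small sketch; whether to combine them depends on the analysis):

df = pd.concat([df1, df2], axis=1)  # features followed by targets
df.head()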
Wine data
- target: types of wine
- Classification problem
from sklearn.datasets import load_wine
wine = load_wine()
print(type(wine))
print(dir(wine))
print(wine.feature_names)
print(wine.target)
print(wine.target_names)
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'feature_names', 'target', 'target_names']
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['class_0' 'class_1' 'class_2']
Let's load it with pandas, adding wine.target as a column named WINE.
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['WINE'] = pd.DataFrame(data=wine.target)
df.head()
 | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | WINE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
The first five rows all have WINE equal to 0, so let's take a random sample instead.
df.sample(frac=1, random_state=0).reset_index().head()
 | index | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | WINE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 54 | 13.74 | 1.67 | 2.25 | 16.4 | 118.0 | 2.60 | 2.90 | 0.21 | 1.62 | 5.85 | 0.92 | 3.20 | 1060.0 | 0 |
1 | 151 | 12.79 | 2.67 | 2.48 | 22.0 | 112.0 | 1.48 | 1.36 | 0.24 | 1.26 | 10.80 | 0.48 | 1.47 | 480.0 | 2 |
2 | 63 | 12.37 | 1.13 | 2.16 | 19.0 | 87.0 | 3.50 | 3.10 | 0.19 | 1.87 | 4.45 | 1.22 | 2.87 | 420.0 | 1 |
3 | 55 | 13.56 | 1.73 | 2.46 | 20.5 | 116.0 | 2.96 | 2.78 | 0.20 | 2.45 | 6.25 | 0.98 | 3.03 | 1120.0 | 0 |
4 | 123 | 13.05 | 5.80 | 2.13 | 21.5 | 86.0 | 2.62 | 2.65 | 0.30 | 2.01 | 2.60 | 0.73 | 3.10 | 380.0 | 1 |
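To see how the samples are split across the three classes, value_counts gives a quick summary:

print(df['WINE'].value_counts())  # number of samples per wine class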
Breast cancer data
- target: benign or malignant
- Classification problem
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
print(type(bc))
print(dir(bc))
print(bc.feature_names)
print(bc.target_names)
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']
There are quite a few features. This is a binary classification problem: malignant or benign.
df = pd.DataFrame(data=bc.data, columns=bc.feature_names)
df['MorB'] = pd.DataFrame(data=bc.target) # MorB means malignant or benign
df.head()
 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | MorB
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
Let’s try random sampling.
df.sample(frac=1, random_state=0).reset_index().head()
 | index | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | MorB
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 512 | 13.40 | 20.52 | 88.64 | 556.7 | 0.11060 | 0.14690 | 0.14450 | 0.08172 | 0.2116 | ... | 29.66 | 113.30 | 844.4 | 0.15740 | 0.38560 | 0.51060 | 0.20510 | 0.3585 | 0.11090 | 0 |
1 | 457 | 13.21 | 25.25 | 84.10 | 537.9 | 0.08791 | 0.05205 | 0.02772 | 0.02068 | 0.1619 | ... | 34.23 | 91.29 | 632.9 | 0.12890 | 0.10630 | 0.13900 | 0.06005 | 0.2444 | 0.06788 | 1 |
2 | 439 | 14.02 | 15.66 | 89.59 | 606.5 | 0.07966 | 0.05581 | 0.02087 | 0.02652 | 0.1589 | ... | 19.31 | 96.53 | 688.9 | 0.10340 | 0.10170 | 0.06260 | 0.08216 | 0.2136 | 0.06710 | 1 |
3 | 298 | 14.26 | 18.17 | 91.22 | 633.1 | 0.06576 | 0.05220 | 0.02475 | 0.01374 | 0.1635 | ... | 25.26 | 105.80 | 819.7 | 0.09445 | 0.21670 | 0.15650 | 0.07530 | 0.2636 | 0.07676 | 1 |
4 | 37 | 13.03 | 18.42 | 82.61 | 523.8 | 0.08983 | 0.03766 | 0.02562 | 0.02923 | 0.1467 | ... | 22.81 | 84.46 | 545.9 | 0.09701 | 0.04619 | 0.04833 | 0.05013 | 0.1987 | 0.06169 | 1 |
5 rows × 32 columns
After shuffling, both malignant (MorB = 0) and benign (MorB = 1) samples can be seen.
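As a final check, we can count the samples per class, mapping the 0/1 codes back to target_names (a small sketch):

# Map 0/1 to 'malignant'/'benign' and count each class.
print(df['MorB'].map(dict(enumerate(bc.target_names))).value_counts())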