[pandas] One-Hot Encoding a pandas Column Whose Elements Are Lists

During data analysis, I ran into a pandas DataFrame in which each cell of one column held a list, and I wanted to expand those lists into one-hot encoded columns. It took me a while to work out, so I’m writing it down here.

GitHub

The Jupyter notebook file is available on GitHub here

Google Colaboratory

To run it on Google Colaboratory, use this link

Execution Environment

I am running macOS, so the shell commands and their options may differ from those on Linux or other Unix systems.

!sw_vers

ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90

!python -V

Python 3.9.17

Import the basic libraries and check their versions.

%matplotlib inline

import pandas as pd

print('pandas version :', pd.__version__)

pandas version : 2.0.3

Preparing Sample Data

df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    }
)

df.head()

	user_id	item_id
0	A	[PC, Book, Water]
1	B	[Book, Table]
2	C	[Desk, CD]

Using MultiLabelBinarizer

The short answer is to use the MultiLabelBinarizer class from scikit-learn.

Calling fit_transform on the column of lists returns the one-hot encoded array directly. The matching column names are available afterwards in the classes_ attribute, in the same order as the columns of the array.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit_transform(df.item_id)

array([[1, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 0]])

mlb.classes_

array(['Book', 'CD', 'Desk', 'PC', 'Table', 'Water'], dtype=object)
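To make the link between the encoded array and classes_ concrete, here is a minimal sketch that maps each row of the array back to its labels. The variable names items, encoded and decoded are introduced only for this illustration; inverse_transform is the MultiLabelBinarizer method that performs the same lookup.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same lists as the item_id column in the sample data
items = pd.Series([["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]])

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(items)

# Column i of `encoded` corresponds to mlb.classes_[i]
for row in encoded:
    print([label for label, flag in zip(mlb.classes_, row) if flag])

# inverse_transform does the same lookup and returns one tuple of labels per row
decoded = mlb.inverse_transform(encoded)
print(decoded)
```

Note that the labels come back in the sorted order of classes_, not in the order they appeared in the original lists.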

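Finally, combine the two. pop removes the item_id column from df (modifying df in place) and returns it, so we can encode it and join the resulting columns back onto what remains of df.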

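# index=df.index keeps the rows aligned even if df does not use the default 0..n-1 index
out_df = df.join(pd.DataFrame(mlb.fit_transform(df.pop("item_id")), columns=mlb.classes_, index=df.index))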

out_df

	user_id	Book	CD	Desk	PC	Table	Water
0	A	1	0	0	1	0	1
1	B	1	0	0	0	1	0
2	C	0	1	1	0	0	0
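Because pop removes item_id from df, running the join cell a second time raises a KeyError. If that is a concern, the same steps can be wrapped in a function that leaves its input untouched. The following is only a sketch: expand_list_column is a hypothetical helper name, not part of pandas or scikit-learn, and it assumes that none of the labels collide with an existing column name.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def expand_list_column(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new DataFrame with `column` replaced by one-hot columns.

    Hypothetical helper for illustration only.
    """
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(frame[column]),
        columns=mlb.classes_,
        index=frame.index,  # keep rows aligned with the input frame
    )
    # drop returns a new DataFrame, so the caller's frame is not modified
    return frame.drop(columns=[column]).join(encoded)


sample = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    }
)

result = expand_list_column(sample, "item_id")
print(result)

# The original frame still has its item_id column
print("item_id" in sample.columns)
```

Using drop instead of pop trades a little extra memory for a function that can be called repeatedly on the same DataFrame.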

Reference

https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list