Using pandas to Expand a List-Valued Column into One-Hot Encoded Columns

During data analysis, I ran into a DataFrame column whose values were lists, and I wanted to turn those lists into one-hot encoded columns. It took some effort to work out, so I am recording the approach here.

GitHub

  • The Jupyter notebook for this post is available on GitHub

Execution Environment

I ran this on macOS, so command options may differ from those on Linux or other Unix systems.

!sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90
!python -V
Python 3.9.17

Import the basic libraries and check their versions.

%matplotlib inline

import pandas as pd

print('pandas version :', pd.__version__)
pandas version : 2.0.3

Preparing Sample Data

df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    }
)

df.head()
  user_id            item_id
0       A  [PC, Book, Water]
1       B      [Book, Table]
2       C         [Desk, CD]

Using MultiLabelBinarizer

To give the conclusion first: use the MultiLabelBinarizer class from scikit-learn.

Calling fit_transform on the list column returns the one-hot encoded array in a single step, and the matching column names are available from the classes_ attribute.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit_transform(df.item_id)
array([[1, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 0]])
mlb.classes_
array(['Book', 'CD', 'Desk', 'PC', 'Table', 'Water'], dtype=object)
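As a quick check of how the array columns line up with classes_, here is a minimal sketch that decodes the array back into labels and encodes one extra row with the already fitted binarizer. The variable names and the new_rows data are made up for illustration and are not part of the original notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Same lists as the item_id column in the sample data.
item_lists = [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(item_lists)

# classes_ is sorted; column i of `encoded` corresponds to classes_[i].
labels = list(mlb.classes_)

# inverse_transform maps each 0/1 row back to a tuple of labels,
# listed in classes_ order rather than the original list order.
decoded = mlb.inverse_transform(encoded)

# transform (without refitting) encodes new rows using the same columns.
# `new_rows` is hypothetical data for illustration.
new_rows = [["Table", "CD"]]
new_encoded = mlb.transform(new_rows)

print(labels)
print(decoded)
print(new_encoded)
```

Fitting once and calling transform afterwards keeps the column order fixed, which matters if the same encoding has to be applied to data that arrives later.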

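Finally, combine the two. pop removes the item_id column from df and returns it; the encoded result is then wrapped in a DataFrame and joined back onto df. Note that pop modifies df in place, so df no longer contains item_id afterwards.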

out_df = df.join(pd.DataFrame(mlb.fit_transform(df.pop("item_id")), columns=mlb.classes_))

out_df
  user_id  Book  CD  Desk  PC  Table  Water
0       A     1   0     0   1      0      1
1       B     1   0     0   0      1      0
2       C     0   1     1   0      0      0
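The join above matches rows by index, so it relies on df having the default 0, 1, 2, ... index, and pop changes df as a side effect. The following is a minimal sketch of a variant that passes the original index to the encoded DataFrame and leaves the input untouched. The helper name expand_list_column and the custom index values are assumptions for illustration, not something from pandas or scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def expand_list_column(df, column):
    """Return a new DataFrame with `column` replaced by one-hot columns.

    Hypothetical helper; not part of pandas or scikit-learn.
    """
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(df[column]),
        columns=mlb.classes_,
        index=df.index,  # reuse the row labels so join aligns correctly
    )
    # drop returns a new DataFrame, so the caller's df is not modified.
    return df.drop(columns=[column]).join(encoded)


df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    },
    index=[10, 20, 30],  # deliberately not the default 0..n-1 index
)

out_df = expand_list_column(df, "item_id")
print(out_df)
```

One caveat with this sketch: join raises an error if a label has the same name as an existing column (for example, an item called "user_id"), so such cases would need a prefix or a rename first.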
