One-hot encoding list elements in pandas columns

I encountered a situation during data analysis where I had a pandas column containing lists and needed to create a one-hot encoded DataFrame from it. It was quite a struggle, so I’m leaving a note here.

GitHub

  • The Jupyter notebook file is available here

Google Colaboratory

  • To run in Google Colaboratory, click here

Execution Environment

The author’s OS is macOS, so some command options differ from their Linux and other Unix counterparts.

!sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90
!python -V
Python 3.9.17

Import basic libraries and check their versions.

%matplotlib inline

import pandas as pd

print('pandas version :', pd.__version__)
pandas version : 2.0.3

Preparation of Sample Data

df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    }
)

df.head()
  user_id            item_id
0       A  [PC, Book, Water]
1       B      [Book, Table]
2       C         [Desk, CD]

Using MultiLabelBinarizer

The short answer is to use the MultiLabelBinarizer class from scikit-learn.

As shown below, fit_transform one-hot encodes the lists in a single call, and the corresponding column names are available from the classes_ attribute.

from sklearn.preprocessing import MultiLabelBinarizer


mlb = MultiLabelBinarizer()
mlb.fit_transform(df.item_id)
array([[1, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 0]])
mlb.classes_
array(['Book', 'CD', 'Desk', 'PC', 'Table', 'Water'], dtype=object)
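To make the link between the encoded array and classes_ concrete, here is a minimal sketch that reuses the sample lists. The names item_lists, encoded, pc_col, and restored are only for illustration and are not part of the original notebook. It relies on classes_ holding the sorted labels when no explicit classes argument is given, and on inverse_transform mapping each row back to a tuple of labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Same lists as the item_id column in the sample data.
item_lists = [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(item_lists)

# Column j of `encoded` corresponds to mlb.classes_[j].
pc_col = list(mlb.classes_).index("PC")
print(encoded[:, pc_col])  # only the first row contains "PC"

# inverse_transform turns each 0/1 row back into a tuple of labels,
# ordered as in classes_ rather than as in the original lists.
restored = mlb.inverse_transform(encoded)
print(restored)
```

The round trip loses the original ordering inside each list, which is usually fine for one-hot features but worth knowing if the order carries meaning.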

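All that’s left is to combine them. pop removes the item_id column from df (modifying df in place) and returns it, and join attaches the encoded columns to what remains. Note that join aligns rows by index, so this works here because df has the default 0, 1, 2 index.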

out_df = df.join(pd.DataFrame(mlb.fit_transform(df.pop("item_id")), columns=mlb.classes_))

out_df
  user_id  Book  CD  Desk  PC  Table  Water
0       A     1   0     0   1      0      1
1       B     1   0     0   0      1      0
2       C     0   1     1   0      0      0
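Because pop modifies df in place and join aligns rows by index, the one-liner above assumes that item_id is no longer needed in df and that df has the default integer index. The following is a minimal sketch of a variant that leaves df untouched and passes df.index to the encoded frame. The index values 10, 20, 30 are a hypothetical non-default index chosen for illustration, and this is one possible approach rather than the author's method.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    },
    index=[10, 20, 30],  # hypothetical non-default index
)

mlb = MultiLabelBinarizer()

# Build the one-hot frame with the same index as df so that join
# lines the rows up correctly instead of producing NaN.
encoded = pd.DataFrame(
    mlb.fit_transform(df["item_id"]),
    columns=mlb.classes_,
    index=df.index,
)

# drop returns a new frame, so df still contains item_id afterwards.
out_df = df.drop(columns="item_id").join(encoded)
print(out_df)
```

Without index=df.index, the encoded frame would get a fresh 0, 1, 2 index, and joining it to a frame indexed 10, 20, 30 would match no rows.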
