One-Hot Encoding a pandas Column That Stores Lists
During data analysis, I ran into a pandas DataFrame column whose cells each held a list, and I wanted to one-hot encode those lists into separate columns. It took some effort to figure out, so I'm writing it down here.
GitHub
- The Jupyter notebook file is available on GitHub here
Google Colaboratory
- To run on Google Colaboratory, use this link
Execution Environment
My OS is macOS. Command options may differ on Linux and other Unix systems.
!sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
!python -V
Python 3.9.17
Import the basic libraries and check their versions.
%matplotlib inline
import pandas as pd
print('pandas version :', pd.__version__)
pandas version : 2.0.3
Preparing Sample Data
df = pd.DataFrame(
{
"user_id": ["A", "B", "C"],
"item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
}
)
df.head()
|   | user_id | item_id |
|---|---|---|
| 0 | A | [PC, Book, Water] |
| 1 | B | [Book, Table] |
| 2 | C | [Desk, CD] |
Using MultiLabelBinarizer
The short answer is to use the MultiLabelBinarizer class from scikit-learn.
Calling fit_transform on the list column returns the one-hot encoded array directly, and the matching column names are available afterwards from the classes_ attribute.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(df.item_id)
array([[1, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 0, 0]])
mlb.classes_
array(['Book', 'CD', 'Desk', 'PC', 'Table', 'Water'], dtype=object)
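As a quick sanity check on the encoding above, the array can be mapped back to labels with inverse_transform. This is only a minimal sketch on the same sample data. The names encoded and decoded are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same sample data as above.
df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    }
)

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df["item_id"])

# inverse_transform turns each 0/1 row back into a tuple of labels.
# The labels follow the order of mlb.classes_, not the original list order.
decoded = mlb.inverse_transform(encoded)

for original, restored in zip(df["item_id"], decoded):
    # Compare as sets because the original ordering is not preserved.
    assert set(original) == set(restored)

print(decoded)
```

The round trip recovers which items each user had, but not the order in which they were listed, so the encoding should be treated as a set membership representation.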
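Finally, combine the two. pop removes the item_id column from df (modifying df in place) and returns it, so it can be encoded and the resulting columns joined back onto df. Note that join aligns on the index, so this one-liner assumes df still has the default 0, 1, 2, ... index.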
out_df = df.join(pd.DataFrame(mlb.fit_transform(df.pop("item_id")), columns=mlb.classes_))
out_df
|   | user_id | Book | CD | Desk | PC | Table | Water |
|---|---|---|---|---|---|---|---|
| 0 | A | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | B | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | C | 0 | 1 | 1 | 0 | 0 | 0 |
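The join step relies on the encoded frame and df sharing the same index, and pop changes df in place. If either of those is a concern, one possible variation is sketched below. The helper name one_hot_list_column and the non-default index are hypothetical and chosen only for this illustration; the core calls are the same MultiLabelBinarizer and join used above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def one_hot_list_column(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `frame` with `column` replaced by one-hot columns."""
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(frame[column]),
        columns=mlb.classes_,
        index=frame.index,  # reuse the original row labels so join() aligns
    )
    # drop() returns a new DataFrame, so `frame` itself is left unchanged.
    return frame.drop(columns=[column]).join(encoded)


df = pd.DataFrame(
    {
        "user_id": ["A", "B", "C"],
        "item_id": [["PC", "Book", "Water"], ["Book", "Table"], ["Desk", "CD"]],
    },
    index=[10, 20, 30],  # deliberately not the default 0, 1, 2
)

out_df = one_hot_list_column(df, "item_id")
print(out_df)
```

Passing index=frame.index keeps the rows aligned even after filtering or sorting, and using drop instead of pop leaves the original item_id column available for later steps.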