How to Use Recommendation Systems and the implicit Library
Recommender systems analyze user preferences and behavior to suggest the most suitable items for each individual. Among the tools for building them, the Python library implicit is particularly well known for its efficient computation and ease of use. This article explains how to use the implicit library and walks through a concrete example with the MovieLens-100k dataset.
Source Code
The source code used in this article is as follows.
GitHub
- For the Jupyter notebook file, click here
Google Colaboratory
- To run it on Google Colaboratory, click here
Execution Environment
The OS is macOS, so some command options differ from those on Linux or other Unix systems.
!sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
!python -V
Python 3.9.17
To make pandas tables more readable, CSS settings are applied to HTML tables.
Import the basic libraries and use watermark to check their versions. Also, set the random seed.
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import random
import numpy as np
import pandas as pd
import implicit
seed = 123
random_state = 123
random.seed(seed)
np.random.seed(seed)
from watermark import watermark
print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.17.2
numpy : 1.25.2
pandas : 2.0.3
implicit: 0.7.0
Watermark: 2.4.3
Implementation Examples of implicit in Recommendation Systems
What is a Recommendation System?
A recommendation system is a system that recommends items based on user preferences. For example, Netflix recommends new movies based on the movies a user has watched. There are broadly two types of recommendation systems: collaborative filtering and content-based filtering.
Collaborative Filtering
Collaborative filtering is a method that recommends items based on users’ past actions and ratings. Specifically, it uses a user-item interaction matrix.
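As a rough illustration of the idea, the sketch below builds a tiny user-item interaction matrix and recommends an item to one user from the interactions of similar users. The data and variable names are hypothetical, and this user-based neighborhood approach is only one simple form of collaborative filtering, not what implicit's ALS or BPR models do internally.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items, 1 = interacted
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Cosine similarity between users, computed from the interaction rows
norms = np.linalg.norm(interactions, axis=1, keepdims=True)
user_sim = (interactions @ interactions.T) / (norms * norms.T)

# Score items for user 0 by the similarity-weighted interactions of the other users
target = 0
weights = user_sim[target].copy()
weights[target] = 0.0  # ignore the user's own row
scores = weights @ interactions
scores[interactions[target] > 0] = -np.inf  # do not recommend items already seen

recommended_item = int(np.argmax(scores))
print(recommended_item)  # 2
```

User 0 shares two items with user 1 and none with user 2, so the only unseen item that user 1 interacted with (item 2) is ranked first.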
Content-Based Filtering
Content-based filtering is a method that recommends items based on the features and attributes of the items.
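For contrast, here is a minimal content-based sketch. It assumes each item is described by a hypothetical genre vector and ranks items by cosine similarity to an item the user liked. No interaction data from other users is involved.

```python
import numpy as np

# Hypothetical item features: columns are (action, comedy, romance)
item_features = np.array([
    [1.0, 0.0, 0.0],  # item 0: action
    [1.0, 1.0, 0.0],  # item 1: action comedy
    [0.0, 0.0, 1.0],  # item 2: romance
])

def cosine_similarity_to(item_index, features):
    """Cosine similarity between one item and every item."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit[item_index]

# The user liked item 0; rank the remaining items by feature similarity
sims = cosine_similarity_to(0, item_features)
sims[0] = -np.inf  # exclude the liked item itself
most_similar = int(np.argmax(sims))
print(most_similar)  # 1
```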
Overview of the implicit Library
implicit is a Python library for collaborative filtering on implicit feedback datasets. This article uses the following two algorithms that it provides:
- ALS (Alternating Least Squares)
- BPR (Bayesian Personalized Ranking)
Overview of the Movielens-100k Dataset
MovieLens-100k is a movie rating dataset containing 100,000 ratings (on a scale of 1 to 5) from 943 users on 1,682 movies. It is widely used as a small benchmark for evaluating recommendation systems.
Details and Implementation of the ALS Algorithm
ALS (Alternating Least Squares) is a method for factorizing the rating matrix into a user matrix and an item matrix. It alternates between fixing one of the two matrices and solving a least-squares problem for the other, so that their product approximates the rating matrix.
ALS Formulation
The basic idea of ALS is to find the user matrix $\mathbf{U}$ and the item matrix $\mathbf{I}$. The rating matrix $\mathbf{R}$ is approximated as follows:
$$ \mathbf{R} \approx \mathbf{U} \mathbf{I}^T $$
Here, ALS solves the following minimization problem:
$$ \min_{\mathbf{U}, \mathbf{I}} || \mathbf{R} - \mathbf{U} \mathbf{I}^T ||^2_F + \lambda ( || \mathbf{U} ||^2_F + || \mathbf{I} ||^2_F ) $$
where $|| \cdot ||_F$ denotes the Frobenius norm and $\lambda$ is the regularization parameter.
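The alternating updates can be sketched in a few lines of NumPy. This is a minimal illustration of the objective above on a small dense matrix with hypothetical random data. It is not how implicit implements ALS, which follows the confidence-weighted formulation for implicit feedback by Hu et al. (2008) and works on sparse matrices. The helper names `als_step` and `objective` are made up for this example.

```python
import numpy as np

def als_step(R, fixed, lam):
    """Solve the ridge-regression update for one side while the other is fixed."""
    k = fixed.shape[1]
    # Normal equations: (F^T F + lam * Id) X^T = F^T R^T
    A = fixed.T @ fixed + lam * np.eye(k)
    return np.linalg.solve(A, fixed.T @ R.T).T

def objective(R, U, I, lam):
    """Regularized squared error from the minimization problem above."""
    return np.sum((R - U @ I.T) ** 2) + lam * (np.sum(U ** 2) + np.sum(I ** 2))

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(6, 5)).astype(float)  # hypothetical dense ratings
k, lam = 2, 0.1
U = rng.normal(size=(6, k))
I = rng.normal(size=(5, k))

losses = [objective(R, U, I, lam)]
for _ in range(10):
    U = als_step(R, I, lam)    # update users with items fixed
    I = als_step(R.T, U, lam)  # update items with users fixed
    losses.append(objective(R, U, I, lam))

print(losses[0], losses[-1])
```

Because each half-step minimizes the objective exactly over one matrix, the recorded loss never increases from one iteration to the next.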
ALS Implementation
Next, here is an example of implementing ALS using the implicit library.
import implicit
from scipy.sparse import coo_matrix
from pprint import pprint
# Load and preprocess data
df = pd.read_csv("./ml-100k/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
rows = df["user_id"].astype(int)
cols = df["item_id"].astype(int)
values = df["rating"].astype(float)
df.head()
| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
# Create the rating matrix
R = coo_matrix((values, (rows, cols)))
# Convert coo_matrix to csr_matrix
R = R.tocsr()
# Train the ALS model
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50)
model.fit(R)
# User and item matrices
U = model.user_factors
I = model.item_factors
# Display results
pprint(U.round(2))
pprint(I.round(2))
0%| | 0/50 [00:00<?, ?it/s]
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.65, 1.81, 0.38, ..., 1.04, -0.1 , 2. ],
[ 0.31, 0.4 , 0.77, ..., -0.69, 0.15, -0.02],
...,
[ 0.12, -0.03, 0.29, ..., -0.41, 0.63, -0.01],
[ 0.89, -0.79, -0.77, ..., -0.89, 1.31, 0.18],
[ 0.78, 0.97, 0.26, ..., 1.08, 0.38, 0.3 ]], dtype=float32)
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.08, -0.1 , -0.04, ..., -0.08, 0.07, -0. ],
[ 0.06, 0.07, 0.14, ..., 0.06, 0.08, 0.04],
...,
[ 0. , 0. , -0.01, ..., 0.01, 0. , -0. ],
[-0. , 0.01, 0.01, ..., -0.01, 0.01, 0. ],
[-0.01, 0.01, -0. , ..., 0.01, -0.01, 0.01]], dtype=float32)
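The all-zero first row in both arrays is consistent with MovieLens IDs starting at 1. When the raw IDs are used directly as matrix indices, row 0 and column 0 never receive any rating. A small sketch with hypothetical IDs shows the effect and one way to build a compact 0-based matrix instead:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical 1-based IDs, as in MovieLens
user_ids = np.array([1, 2, 2, 3])
item_ids = np.array([1, 1, 3, 2])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

# Using the raw IDs as indices leaves row 0 and column 0 empty
raw = coo_matrix((ratings, (user_ids, item_ids))).tocsr()
print(raw.shape)   # (4, 4)
print(raw[0].nnz)  # 0

# Subtracting 1 gives a compact 0-based matrix
compact = coo_matrix((ratings, (user_ids - 1, item_ids - 1))).tocsr()
print(compact.shape)  # (3, 3)
```

Keeping the raw IDs is convenient because factor row n corresponds to user or item n. The cost is one unused row and column.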
Details and Implementation of the BPR Algorithm
BPR is a method for optimizing ranking. BPR aims to maximize the pairwise preferences of users.
BPR Formulation
BPR maximizes the probability that a user prefers an item they have interacted with over one they have not. Specifically, it maximizes the following regularized log-likelihood:
$$ \sum_{(u,i,j) \in D} \ln \sigma (\hat{x}_{u,i} - \hat{x}_{u,j}) - \lambda || \Theta ||^2 $$
where $\sigma$ is the sigmoid function, $\hat{x}_{u,i}$ is the score of user $u$ for item $i$, $D$ is the set of training triples $(u, i, j)$ in which user $u$ has interacted with item $i$ but not with item $j$, $\lambda$ is the regularization parameter, and $\Theta$ denotes the model parameters.
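To make the pairwise objective concrete, the sketch below performs stochastic gradient ascent on a single triple $(u, i, j)$ with a matrix-factorization score $\hat{x}_{u,i} = \mathbf{u}_u \cdot \mathbf{v}_i$, treating the regularization term as a penalty that is subtracted. The factor matrices, learning rate, and the function name `bpr_sgd_step` are assumptions for illustration. implicit's own BPR implementation additionally samples triples from the data and uses an item bias term.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_sgd_step(U, V, u, i, j, lr=0.05, lam=0.01):
    """One gradient ascent step on ln sigma(x_ui - x_uj) - lam * ||Theta||^2."""
    x_uij = U[u] @ (V[i] - V[j])
    g = 1.0 - sigmoid(x_uij)  # derivative of ln sigma(x) with respect to x
    u_old = U[u].copy()       # use the pre-update user vector for the item updates
    U[u] += lr * (g * (V[i] - V[j]) - 2 * lam * U[u])
    V[i] += lr * (g * u_old - 2 * lam * V[i])
    V[j] += lr * (-g * u_old - 2 * lam * V[j])

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 4))  # hypothetical user factors
V = rng.normal(scale=0.1, size=(5, 4))  # hypothetical item factors

u, i, j = 0, 1, 2  # user 0 interacted with item 1 but not with item 2
before = U[u] @ (V[i] - V[j])
for _ in range(200):
    bpr_sgd_step(U, V, u, i, j)
after = U[u] @ (V[i] - V[j])

print(before, after)
```

After the updates, the score margin between the interacted item and the non-interacted item is larger than before, which is exactly what the objective rewards.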
BPR Implementation
Next, here is an example of implementing BPR using the implicit library.
import implicit
from scipy.sparse import coo_matrix
from pprint import pprint
# Load and preprocess data
df = pd.read_csv("./ml-100k/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
rows = df["user_id"].astype(int)
cols = df["item_id"].astype(int)
values = df["rating"].astype(float)
df.head()
| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
# Create the rating matrix
R = coo_matrix((values, (rows, cols)))
# Convert coo_matrix to csr_matrix
R = R.tocsr()
# Train BPR model
model = implicit.bpr.BayesianPersonalizedRanking(factors=20, regularization=0.1, iterations=50)
model.fit(R)
# User and item matrices
U = model.user_factors
I = model.item_factors
# Display results
pprint(U.round(2))
pprint(I.round(2))
0%| | 0/50 [00:00<?, ?it/s]
array([[ 0. , 0. , 0. , ..., 0. , 0. , 1. ],
[-0.02, -0.01, 0.13, ..., -0.12, 0.09, 1. ],
[-0.01, 0.03, -0.32, ..., 0.29, -0.21, 1. ],
...,
[ 0.1 , -0.03, -0.15, ..., 0.12, -0.18, 1. ],
[-0.2 , 0.08, -0.04, ..., 0.04, 0.15, 1. ],
[ 0.24, -0.1 , 0.26, ..., -0.24, -0. , 1. ]], dtype=float32)
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.12, -0.04, -0. , ..., -0.01, -0.1 , 0.61],
[ 0.2 , -0.07, 0.18, ..., -0.16, -0.02, -0.1 ],
...,
[-0. , 0. , -0.03, ..., 0.03, -0.03, -0.05],
[ 0. , 0. , -0.01, ..., 0. , -0.02, -0.07],
[-0.01, -0.02, -0.01, ..., -0. , -0.01, -0.12]], dtype=float32)
Implementation Example
As a specific example, build ALS and BPR models using the movielens-100k dataset. The steps are shown below.
Data Preparation
First, load and preprocess the data.
import implicit
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split
# Load and preprocess data
df = pd.read_csv("./ml-100k/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
# Split data into training and test sets
# Stratify on user_id so that each user's ratings are split between train and test in the same proportion
train, test = train_test_split(df, test_size=0.2, stratify=df["user_id"], shuffle=True, random_state=seed)
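# Create the rating matrices with a shared shape so that train and test stay aligned
shape = (df["user_id"].max() + 1, df["item_id"].max() + 1)
train_matrix = coo_matrix((train["rating"], (train["user_id"], train["item_id"])), shape=shape)
test_matrix = coo_matrix((test["rating"], (test["user_id"], test["item_id"])), shape=shape)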
# Convert coo_matrix to csr_matrix
train_matrix = train_matrix.tocsr()
test_matrix = test_matrix.tocsr()
Training and Evaluating the ALS Model
Next, train and evaluate the ALS model.
def get_precision(true_matrix, pred_matrix, k=10):
    """
    Function to calculate precision
    Parameters:
    - true_matrix (sparse matrix): The actual rating matrix (users x items)
    - pred_matrix (ndarray): Dense matrix of predicted scores (users x items)
    - k (int): The number of top_k items to consider for precision calculation
    Returns:
    - precision (float): The precision score
    """
    # Convert the actual rating matrix to per-user lists of item indices
    true_items = true_matrix.tolil().rows
    # Get the indices of the k highest-scoring items for each user
    pred_items = np.argsort(-pred_matrix, axis=1)[:, :k]
    # Calculate precision for each user that has at least one test item
    precisions = []
    for user_id in range(len(true_items)):
        true_set = set(true_items[user_id])
        pred_set = set(pred_items[user_id])
        if len(true_set) > 0:
            precision = len(true_set & pred_set) / min(len(true_set), k)
            precisions.append(precision)
    # Calculate average precision
    return np.mean(precisions)
# Train the ALS model
als_model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50)
als_model.fit(train_matrix)
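# Score every (user, item) pair from the learned factors.
# recommend_all returns top-N item IDs, not scores, so it cannot be passed to get_precision.
pred_matrix = als_model.user_factors @ als_model.item_factors.T
# Exclude items already seen in training so they are not recommended again
pred_matrix[train_matrix.nonzero()] = -np.inf
# Compare the top-k predictions with the held-out test interactions
true_matrix = test_matrix  # The actual rating matrix for the test data
precision = get_precision(true_matrix, pred_matrix)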
print(f"ALS Model Precision: {precision:.3f}")
0%| | 0/50 [00:00<?, ?it/s]
ALS Model Precision: 0.039
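To see what this metric computes, here is a small self-contained check on hypothetical scores. The standalone function `precision_at_k` mirrors the logic of get_precision but takes plain Python sets for the held-out items, which is an assumption made to keep the example short.

```python
import numpy as np

def precision_at_k(true_items, scores, k):
    """Mean over users of hits in the top-k divided by min(number of relevant items, k)."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    values = []
    for user, relevant in enumerate(true_items):
        if relevant:  # skip users without held-out items
            hits = len(set(top_k[user].tolist()) & relevant)
            values.append(hits / min(len(relevant), k))
    return float(np.mean(values))

# Hypothetical scores for 2 users and 4 items
scores = np.array([
    [0.9, 0.1, 0.8, 0.2],  # top-2 items: 0 and 2
    [0.1, 0.7, 0.3, 0.6],  # top-2 items: 1 and 3
])
true_items = [{0, 1}, {3}]  # held-out items per user

result = precision_at_k(true_items, scores, k=2)
print(result)  # 0.75
```

User 0 gets one hit out of two (0.5) and user 1 gets one hit with a denominator of min(1, 2) = 1 (1.0), so the mean is 0.75. Note that the input must be a users x items score matrix. A list of recommended item IDs would be ranked by ID value rather than by relevance.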
Training and Evaluating the BPR Model
Similarly, train and evaluate the BPR model.
# Train BPR model
bpr_model = implicit.bpr.BayesianPersonalizedRanking(factors=20, regularization=0.1, iterations=50)
bpr_model.fit(train_matrix)
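# Score every (user, item) pair from the learned factors
test_predictions = bpr_model.user_factors @ bpr_model.item_factors.T
# Exclude items already seen in training so they are not recommended again
test_predictions[train_matrix.nonzero()] = -np.inf
# Evaluate precision
precision = get_precision(test_matrix, test_predictions)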
print(f"BPR Model Precision : {precision:.3f}")
0%| | 0/50 [00:00<?, ?it/s]
BPR Model Precision : 0.039
Conclusion
In this article, I implemented ALS and BPR algorithms using the implicit library and introduced a specific example with the movielens-100k dataset. Although this is mainly a memo for myself, I hope it will be helpful to someone.
References
- “Collaborative Filtering for Implicit Feedback Datasets”, Hu, Y., Koren, Y., and Volinsky, C., 2008.
- “BPR: Bayesian Personalized Ranking from Implicit Feedback”, Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L., 2009.
- Movielens Dataset: https://grouplens.org/datasets/movielens/100k/
Memo
Creating a sparse matrix in LIL format (3x3 matrix)
import numpy as np
from scipy.sparse import lil_matrix
# Create a Numpy array
dense_array = np.array([[1, 0, 0], [0, 0, 3], [4, 0, 0]])
# Convert the Numpy array to a sparse matrix in LIL format
# Use a separate name so the imported lil_matrix class is not overwritten
sparse_lil = lil_matrix(dense_array)
print(sparse_lil)
(0, 0) 1
(1, 2) 3
(2, 0) 4
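The LIL format is what get_precision relies on when it reads `.rows`. A short sketch of the two attributes involved, using the same 3x3 array (the variable names are chosen for this example):

```python
import numpy as np
from scipy.sparse import lil_matrix

dense = np.array([[1, 0, 0], [0, 0, 3], [4, 0, 0]])
sparse_lil = lil_matrix(dense)

# rows[i] lists the column indices of the non-zero entries in row i
row_lists = [[int(c) for c in r] for r in sparse_lil.rows]
# data[i] holds the matching values
data_lists = [[int(v) for v in d] for d in sparse_lil.data]

print(row_lists)   # [[0], [2], [0]]
print(data_lists)  # [[1], [3], [4]]
```

For a rating matrix, `rows[u]` is therefore the list of item indices that user u has rated.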