How to Use Recommendation Systems and the implicit Library

Recommender systems analyze user preferences and behavior to suggest the most suitable items for each individual. Among the available tools, the Python library implicit is particularly well known for its efficient computation and ease of use. This article explains how to use the implicit library and walks through a specific example using the movielens-100k dataset.

Source Code

The source code used in this article is as follows.

GitHub

For the Jupyter notebook file, click here

Google Colaboratory

To run it on Google Colaboratory, click here

Execution Environment

The OS is macOS. Note that some command-line options differ from their Linux counterparts.

!sw_vers

ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90

!python -V

Python 3.9.17

To make pandas tables more readable, CSS settings are applied to HTML tables.

Import the basic libraries and use watermark to check their versions. Also, set the random seed.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import random
import numpy as np
import pandas as pd

import implicit

seed = 123
random_state = 123

random.seed(seed)
np.random.seed(seed)

from watermark import watermark

print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.17.2

numpy   : 1.25.2
pandas  : 2.0.3
implicit: 0.7.0

Watermark: 2.4.3

Implementation Examples of implicit in Recommendation Systems

What is a Recommendation System?

A recommendation system is a system that recommends items based on user preferences. For example, Netflix recommends new movies based on the movies a user has watched. There are broadly two types of recommendation systems: collaborative filtering and content-based filtering.

Collaborative Filtering

Collaborative filtering is a method that recommends items based on users’ past actions and ratings. Specifically, it uses a user-item interaction matrix.
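As a minimal sketch of what such a user-item interaction matrix looks like, the following builds one from a handful of made-up (user, item, rating) triples. The data and variable names are purely illustrative and are not taken from the dataset used later.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical interactions: (user index, item index, rating)
users = np.array([0, 0, 1, 2, 2])
items = np.array([0, 2, 1, 0, 3])
ratings = np.array([5.0, 3.0, 4.0, 2.0, 1.0])

# Rows are users, columns are items; unobserved pairs are simply not stored
interactions = csr_matrix((ratings, (users, items)), shape=(3, 4))

print(interactions.toarray())
```

Most entries of such a matrix are empty in practice, which is why a sparse format is used.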

Content-Based Filtering

Content-based filtering is a method that recommends items based on the features and attributes of the items.
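To illustrate the idea, here is a small sketch that ranks items by the cosine similarity of their feature vectors. The genre features and the helper name cosine_similarity_to are assumptions made for this example only.

```python
import numpy as np

# Hypothetical genre features per item: [action, comedy, romance]
item_features = np.array([
    [1.0, 0.0, 0.0],  # item 0: action
    [1.0, 1.0, 0.0],  # item 1: action comedy
    [0.0, 0.0, 1.0],  # item 2: romance
])

def cosine_similarity_to(item_index, features):
    """Cosine similarity between one item and every item in `features`."""
    norms = np.linalg.norm(features, axis=1)
    return features @ features[item_index] / (norms * norms[item_index])

similarity = cosine_similarity_to(0, item_features)

# Rank the other items by similarity to item 0 (most similar first)
ranked = [int(i) for i in np.argsort(-similarity) if i != 0]
print(ranked)
```

A user who liked item 0 would be shown item 1 before item 2, because item 1 shares the action genre.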

Overview of the implicit Library

implicit is a Python library (its performance-critical parts are written in Cython) used primarily for collaborative filtering on implicit-feedback data. It mainly supports the following algorithms:

ALS (Alternating Least Squares)
BPR (Bayesian Personalized Ranking)

Overview of the Movielens-100k Dataset

Movielens-100k is a movie rating dataset that contains 100,000 ratings on a 1-to-5 scale from 943 users on 1,682 movies. By using this dataset, we can evaluate the performance of recommendation systems.
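As a quick worked calculation, the density of the resulting user-item matrix can be estimated from the dataset's size. The user and movie counts below (943 and 1,682) come from the GroupLens description of the dataset.

```python
# Dataset size as described by GroupLens for ml-100k
n_users, n_items, n_ratings = 943, 1682, 100_000

# Fraction of user-item pairs that actually have a rating
density = n_ratings / (n_users * n_items)
print(f"{density:.4f}")  # 0.0630
```

Only about 6.3% of all possible user-movie pairs carry a rating.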

Details and Implementation of the ALS Algorithm

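ALS (Alternating Least Squares) is a method for factorizing the rating matrix into a user factor matrix and an item factor matrix. ALS fixes one of the two factor matrices and solves a least-squares problem for the other, then swaps their roles, repeating until their product approximates the rating matrix.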

ALS Formulation

The basic idea of ALS is to find the user matrix $\mathbf{U}$ and the item matrix $\mathbf{I}$. The rating matrix $\mathbf{R}$ is approximated as follows:

$$ \mathbf{R} \approx \mathbf{U} \mathbf{I}^T $$

Here, ALS solves the following minimization problem:

$$ \min_{\mathbf{U}, \mathbf{I}} || \mathbf{R} - \mathbf{U} \mathbf{I}^T ||^2_F + \lambda ( || \mathbf{U} ||^2_F + || \mathbf{I} ||^2_F ) $$

where $|| \cdot ||_F$ denotes the Frobenius norm and $\lambda$ is the regularization parameter.
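The alternating updates can be sketched in plain NumPy. This is a dense, toy illustration of the objective above, using the closed-form ridge-regression solution for each half-step. It is not how implicit implements ALS internally, and the helper names als_step and objective are made up for this sketch.

```python
import numpy as np

def als_step(R, fixed, lam):
    """Solve the regularized least-squares problem for one factor matrix
    while the other one (`fixed`) is held constant."""
    k = fixed.shape[1]
    gram = fixed.T @ fixed + lam * np.eye(k)
    # Each row x of the result solves (fixed^T fixed + lam * I) x = fixed^T r
    return np.linalg.solve(gram, fixed.T @ R.T).T

def objective(R, U, I, lam):
    """Squared Frobenius error plus the L2 penalty on both factor matrices."""
    return np.sum((R - U @ I.T) ** 2) + lam * (np.sum(U ** 2) + np.sum(I ** 2))

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(6, 5)).astype(float)  # toy dense rating matrix
U = rng.normal(size=(6, 2))                        # user factors
I = rng.normal(size=(5, 2))                        # item factors
lam = 0.1

losses = [objective(R, U, I, lam)]
for _ in range(10):
    U = als_step(R, I, lam)    # update users with items fixed
    I = als_step(R.T, U, lam)  # update items with users fixed
    losses.append(objective(R, U, I, lam))

print([round(x, 3) for x in losses[:3]])
```

Because each half-step minimizes the objective exactly over one factor matrix, the recorded loss never increases from one iteration to the next.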

ALS Implementation

Next, here is an example of implementing ALS using the implicit library.

import implicit
from scipy.sparse import coo_matrix
from pprint import pprint

# Load and preprocess data
df = pd.read_csv("./ml-100k/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
rows = df["user_id"].astype(int)
cols = df["item_id"].astype(int)
values = df["rating"].astype(float)

df.head()

	user_id	item_id	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

# Create the rating matrix (IDs in u.data start at 1, so row 0 and column 0 stay empty)
R = coo_matrix((values, (rows, cols)))

# Convert coo_matrix to csr_matrix
R = R.tocsr()

# Train the ALS model
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50)
model.fit(R)

# User and item matrices
U = model.user_factors
I = model.item_factors

# Display results
pprint(U.round(2))
pprint(I.round(2))

  0%|          | 0/50 [00:00<?, ?it/s]

array([[ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.65,  1.81,  0.38, ...,  1.04, -0.1 ,  2.  ],
       [ 0.31,  0.4 ,  0.77, ..., -0.69,  0.15, -0.02],
       ...,
       [ 0.12, -0.03,  0.29, ..., -0.41,  0.63, -0.01],
       [ 0.89, -0.79, -0.77, ..., -0.89,  1.31,  0.18],
       [ 0.78,  0.97,  0.26, ...,  1.08,  0.38,  0.3 ]], dtype=float32)
array([[ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.08, -0.1 , -0.04, ..., -0.08,  0.07, -0.  ],
       [ 0.06,  0.07,  0.14, ...,  0.06,  0.08,  0.04],
       ...,
       [ 0.  ,  0.  , -0.01, ...,  0.01,  0.  , -0.  ],
       [-0.  ,  0.01,  0.01, ..., -0.01,  0.01,  0.  ],
       [-0.01,  0.01, -0.  , ...,  0.01, -0.01,  0.01]], dtype=float32)
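Once the factors are learned, a user's predicted preference for an item is the dot product of the two factor vectors. The sketch below shows how a top-k list could be derived from that, using small made-up arrays in place of model.user_factors and model.item_factors. The helper top_k_items is hypothetical and not part of implicit.

```python
import numpy as np

def top_k_items(user_vector, item_factors, seen_items, k=3):
    """Rank items by the dot product of user and item factors, skipping seen items."""
    scores = item_factors @ user_vector
    scores[list(seen_items)] = -np.inf  # never recommend what the user already rated
    return np.argsort(-scores)[:k]

# Hypothetical factors standing in for the learned matrices
item_factors = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5]])
user_vector = np.array([1.0, 0.1])

print(top_k_items(user_vector, item_factors, seen_items={0}, k=2))  # [1 3]
```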

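Details and Implementation of the BPR Algorithm

BPR (Bayesian Personalized Ranking) is a method for optimizing rankings directly. Rather than predicting individual ratings, it learns from pairwise preferences: for each user, an item the user interacted with should score higher than one the user did not.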

BPR Formulation

BPR maximizes the probability that a user prefers an item they interacted with over one they did not. Specifically, it maximizes the following regularized log-likelihood:

$$ \sum_{(u,i,j) \in D} \ln \sigma (\hat{x}_{u,i} - \hat{x}_{u,j}) - \lambda || \Theta ||^2 $$

where $\sigma$ is the sigmoid function, $\hat{x}_{u,i}$ is the predicted score of user $u$ for item $i$, $D$ is the set of training triples $(u, i, j)$ in which user $u$ interacted with item $i$ but not with item $j$, $\Theta$ are the model parameters, and $\lambda$ is the regularization parameter.
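The following NumPy sketch evaluates this criterion (log-likelihood minus an L2 penalty, as in Rendle et al.) for a matrix-factorization scorer and performs plain stochastic gradient ascent on it. The triples, learning rate, and function names are assumptions chosen for illustration, and the code does not mirror implicit's own implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_objective(U, V, triples, lam):
    """Sum of ln sigma(x_ui - x_uj) over all triples, minus an L2 penalty."""
    log_likelihood = sum(np.log(sigmoid(U[u] @ (V[i] - V[j]))) for u, i, j in triples)
    return log_likelihood - lam * (np.sum(U ** 2) + np.sum(V ** 2))

def bpr_sgd_step(U, V, triple, lr, lam):
    """One in-place gradient ascent step on a single (u, i, j) triple."""
    u, i, j = triple
    user_vec = U[u].copy()
    diff = V[i] - V[j]
    g = 1.0 - sigmoid(user_vec @ diff)  # d/dx ln sigma(x) = 1 - sigma(x)
    U[u] += lr * (g * diff - 2 * lam * user_vec)
    V[i] += lr * (g * user_vec - 2 * lam * V[i])
    V[j] += lr * (-g * user_vec - 2 * lam * V[j])

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 2))  # toy user factors
V = rng.normal(scale=0.1, size=(4, 2))  # toy item factors
# Hypothetical triples: user u interacted with item i but not with item j
triples = [(0, 0, 1), (0, 0, 2), (1, 1, 3), (1, 1, 0), (2, 2, 0), (2, 3, 1)]

before = bpr_objective(U, V, triples, lam=0.01)
for step in range(300):
    bpr_sgd_step(U, V, triples[step % len(triples)], lr=0.1, lam=0.01)
after = bpr_objective(U, V, triples, lam=0.01)

print(before < after)
```

Each step pushes the score of the interacted item above the score of the other item for that user, so the objective rises over the course of training.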

BPR Implementation

Next, here is an example of implementing BPR using the implicit library.

import implicit
from scipy.sparse import coo_matrix
from pprint import pprint

# Load and preprocess data
df = pd.read_csv("./ml-100k/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
rows = df["user_id"].astype(int)
cols = df["item_id"].astype(int)
values = df["rating"].astype(float)

df.head()

	user_id	item_id	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

# Create the rating matrix
R = coo_matrix((values, (rows, cols)))

# Convert coo_matrix to csr_matrix
R = R.tocsr()

# Train BPR model
model = implicit.bpr.BayesianPersonalizedRanking(factors=20, regularization=0.1, iterations=50)
model.fit(R)

# User and item matrices
U = model.user_factors
I = model.item_factors

# Display results
pprint(U.round(2))
pprint(I.round(2))

  0%|          | 0/50 [00:00<?, ?it/s]

array([[ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [-0.02, -0.01,  0.13, ..., -0.12,  0.09,  1.  ],
       [-0.01,  0.03, -0.32, ...,  0.29, -0.21,  1.  ],
       ...,
       [ 0.1 , -0.03, -0.15, ...,  0.12, -0.18,  1.  ],
       [-0.2 ,  0.08, -0.04, ...,  0.04,  0.15,  1.  ],
       [ 0.24, -0.1 ,  0.26, ..., -0.24, -0.  ,  1.  ]], dtype=float32)
array([[ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.12, -0.04, -0.  , ..., -0.01, -0.1 ,  0.61],
       [ 0.2 , -0.07,  0.18, ..., -0.16, -0.02, -0.1 ],
       ...,
       [-0.  ,  0.  , -0.03, ...,  0.03, -0.03, -0.05],
       [ 0.  ,  0.  , -0.01, ...,  0.  , -0.02, -0.07],
       [-0.01, -0.02, -0.01, ..., -0.  , -0.01, -0.12]], dtype=float32)

Implementation Example

As a specific example, build ALS and BPR models using the movielens-100k dataset. The steps are shown below.

Data Preparation

First, load and preprocess the data.

import implicit
import pandas as pd
import numpy as np

from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split

# Load and preprocess data
df = pd.read_csv("./ml-100k/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])

# Split data into training and test sets
# Stratify by user_id so that every user's ratings are split in the same 80/20 proportion
train, test = train_test_split(df, test_size=0.2, stratify=df["user_id"], shuffle=True, random_state=seed)

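# Create the rating matrices with a shared shape so that both splits
# use the same user and item indices (IDs start at 1, so add 1)
shape = (df["user_id"].max() + 1, df["item_id"].max() + 1)
train_matrix = coo_matrix((train["rating"], (train["user_id"], train["item_id"])), shape=shape)
test_matrix = coo_matrix((test["rating"], (test["user_id"], test["item_id"])), shape=shape)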

# Convert coo_matrix to csr_matrix
train_matrix = train_matrix.tocsr()
test_matrix = test_matrix.tocsr()
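The matrices above index rows and columns directly by the 1-based IDs in u.data, which leaves row 0 and column 0 empty. One common alternative, sketched here on a made-up DataFrame, is to remap the IDs to contiguous zero-based codes with pandas.factorize and keep the returned index for decoding recommendations later.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings with non-contiguous IDs
toy = pd.DataFrame({
    "user_id": [10, 10, 42, 99],
    "item_id": [7, 300, 7, 15],
    "rating": [4, 5, 3, 2],
})

# factorize returns contiguous 0-based codes plus the original IDs for decoding
user_codes, user_index = pd.factorize(toy["user_id"])
item_codes, item_index = pd.factorize(toy["item_id"])

matrix = csr_matrix(
    (toy["rating"].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(len(user_index), len(item_index)),
)

print(matrix.shape)      # (3, 3)
print(list(user_index))  # original user IDs in code order
```

With this mapping, row n of the matrix corresponds to user_index[n], so no rows or columns are wasted on IDs that never occur.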

Training and Evaluating the ALS Model

Next, train and evaluate the ALS model.

def get_precision(true_matrix, pred_items, k=10):
    """
    Calculate precision at k, averaged over users with at least one test item.

    Parameters:
    - true_matrix (sparse matrix): The actual rating matrix (users x items)
    - pred_items (ndarray): Recommended item IDs per user, best first (users x N)
    - k (int): The number of top items to consider

    Returns:
    - precision (float): The mean precision score
    """
    # For each user, list the item IDs that have a rating in the true matrix
    true_items = true_matrix.tolil().rows

    # recommend_all already returns ranked item IDs, so just keep the first k
    pred_items = np.asarray(pred_items)[:, :k]

    # Calculate precision for each user
    precisions = []
    for user_id in range(min(len(true_items), len(pred_items))):
        true_set = set(true_items[user_id])
        pred_set = set(pred_items[user_id])

        if len(true_set) > 0:
            precision = len(true_set & pred_set) / min(len(true_set), k)
            precisions.append(precision)

    # Average over users
    return np.mean(precisions)

# Train the ALS model
als_model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50)
als_model.fit(train_matrix)

# Recommend items for every user; passing the training matrix filters out
# items the user already rated during training
test_predictions = als_model.recommend_all(train_matrix)

# Compare the recommendations with the held-out test ratings
precision = get_precision(test_matrix, test_predictions)
print(f"ALS Model Precision: {precision:.3f}")

  0%|          | 0/50 [00:00<?, ?it/s]

ALS Model Precision: 0.039
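To make the metric concrete, here is a worked example of precision at k for a single user, using the same definition as above (hits divided by the smaller of k and the number of held-out items). The item IDs are made up, and the function name precision_at_k is chosen for this sketch only.

```python
def precision_at_k(recommended, relevant, k):
    """Hits among the top-k recommendations divided by min(k, number of relevant items)."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / min(k, len(relevant))

recommended = [12, 5, 40, 7, 3]  # hypothetical ranked item IDs for one user
relevant = {5, 7, 99}            # hypothetical held-out items for that user

print(precision_at_k(recommended, relevant, k=5))  # 2 hits / min(5, 3)
print(precision_at_k(recommended, relevant, k=2))  # 1 hit / min(2, 3)
```

Dividing by the smaller of the two counts means a user with only three held-out items can still reach a score of 1.0 when k is 5.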

Training and Evaluating the BPR Model

Similarly, train and evaluate the BPR model.

# Train BPR model
bpr_model = implicit.bpr.BayesianPersonalizedRanking(factors=20, regularization=0.1, iterations=50)
bpr_model.fit(train_matrix)

# Recommend items for every user; passing the training matrix filters out
# items the user already rated during training
test_predictions = bpr_model.recommend_all(train_matrix)

# Evaluate precision
precision = get_precision(test_matrix, test_predictions)
print(f"BPR Model Precision : {precision:.3f}")

  0%|          | 0/50 [00:00<?, ?it/s]

BPR Model Precision : 0.039

Conclusion

In this article, I implemented ALS and BPR algorithms using the implicit library and introduced a specific example with the movielens-100k dataset. Although this is mainly a memo for myself, I hope it will be helpful to someone.

References

“Collaborative Filtering for Implicit Feedback Datasets”, Hu, Y., Koren, Y., and Volinsky, C., 2008.
“BPR: Bayesian Personalized Ranking from Implicit Feedback”, Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L., 2009.
Movielens Dataset: https://grouplens.org/datasets/movielens/100k/

Memo

Creating a sparse matrix in LIL format (3x3 matrix)

import numpy as np
from scipy.sparse import lil_matrix

# Create a Numpy array
dense_array = np.array([[1, 0, 0], [0, 0, 3], [4, 0, 0]])

# Convert the NumPy array to a sparse matrix in LIL format
# (use a different name so the imported lil_matrix class is not overwritten)
sparse_lil = lil_matrix(dense_array)

print(sparse_lil)

  (0, 0)	1
  (1, 2)	3
  (2, 0)	4
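The LIL format is what get_precision relies on when it reads the rows attribute. As a small sketch on the same 3x3 matrix, the following prints the per-row column indices and values that LIL stores.

```python
import numpy as np
from scipy.sparse import lil_matrix

dense = np.array([[1, 0, 0], [0, 0, 3], [4, 0, 0]])
sparse = lil_matrix(dense)

# .rows holds, for each row, the sorted column indices of its non-zero entries
print([list(r) for r in sparse.rows])  # [[0], [2], [0]]

# .data holds the matching values
print([list(d) for d in sparse.data])  # [[1], [3], [4]]
```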
