On Evaluation Methods in Recommender Systems
Overview
This article describes evaluation methods for recommender systems.
The definition, properties, and applications of each method are illustrated with equations and Python code.
The advantages and disadvantages of the evaluation methods are also discussed, and an implementation example using the “movielens-100k” dataset is introduced.
This is material I had been meaning to summarize as a memo, so I document it briefly and concisely here.
Source Code
github
- For the Jupyter notebook file, click here
google colaboratory
- To run on Google Colaboratory, click here
Execution Environment
The OS used is macOS. Note that command options may differ from those on Linux and other Unix systems.
!sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
!python -V
Python 3.9.17
We will import basic libraries and use watermark to check their versions. Additionally, we will set the seed for random numbers.
import random
import numpy as np
seed = 123
random_state = 123
random.seed(seed)
np.random.seed(seed)
from watermark import watermark
print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.17.2
numpy : 1.25.2
matplotlib: 3.8.1
pandas : 2.0.3
scipy : 1.11.2
Watermark: 2.4.3
Evaluation Methods in Recommender Systems
Overview
Recommender systems are technologies for providing users with personalized items. Accurately evaluating their performance is directly linked to system improvement and user satisfaction. Based on my experience in constructing recommender systems, this article comprehensively explains evaluation methods, from basic metrics to offline and online evaluations, and user experience assessments. Specific evaluation methods are explained with equations and Python code examples.
1. Introduction
Recommender systems are used in many fields, such as online shopping, video streaming services, and music streaming. Providing users with suitable items enhances user engagement and directly contributes to business success. Selecting and executing evaluation methods are essential to understanding and optimizing system performance. The following sections explain specific evaluation methods.
2. Basic Evaluation Metrics for Recommender Systems
Basic metrics for evaluating the performance of recommender systems include accuracy and error metrics.
2.1 Accuracy Evaluation
Basic accuracy metrics for recommender systems include Precision, Recall, and F1 Score. These metrics are also widely used in evaluating classification problems.
- Precision $$ \text{Precision} = \frac{TP}{TP + FP} $$ Here, $TP$ is True Positives, and $FP$ is False Positives.
from sklearn.metrics import precision_score
y_true_list = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred_list = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
precision = precision_score(y_true_list, y_pred_list)
print("Precision:", round(precision, 2))
Precision: 0.8
- Recall $$ \text{Recall} = \frac{TP}{TP + FN} $$ Here, $FN$ is False Negatives.
from sklearn.metrics import recall_score
recall = recall_score(y_true_list, y_pred_list)
print("Recall:", round(recall, 2))
Recall: 0.67
- F1 Score $$ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
from sklearn.metrics import f1_score
f1 = f1_score(y_true_list, y_pred_list)
print("F1 Score:", round(f1, 2))
F1 Score: 0.73
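For reference, the same values can be reproduced directly from the confusion counts used in the definitions above. This is just a sanity check on the same y_true_list and y_pred_list:
# Reproduce the metrics from the raw confusion counts
tp = sum(1 for t, p in zip(y_true_list, y_pred_list) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true_list, y_pred_list) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true_list, y_pred_list) if t == 1 and p == 0)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
print("TP:", tp, "FP:", fp, "FN:", fn)
print("Precision:", round(precision_manual, 2), "Recall:", round(recall_manual, 2), "F1:", round(f1_manual, 2))
TP: 4 FP: 1 FN: 2
Precision: 0.8 Recall: 0.67 F1: 0.73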
2.2 Error Metrics
Error metrics include Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). These metrics are important for rating predictions and regression problems.
- Mean Absolute Error (MAE) $$ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| $$ Here, $N$ is the number of samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
from sklearn.metrics import mean_absolute_error
y_true_list = [3.5, 2.0, 4.0, 3.0, 5.0]
y_pred_list = [3.7, 2.1, 3.9, 3.2, 4.8]
mae = mean_absolute_error(y_true_list, y_pred_list)
print("MAE:", round(mae, 2))
MAE: 0.16
- Root Mean Squared Error (RMSE) $$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} $$
from sklearn.metrics import mean_squared_error
import numpy as np
mse = mean_squared_error(y_true_list, y_pred_list)
rmse = np.sqrt(mse)
print("RMSE:", round(rmse, 2))
RMSE: 0.17
3. Specific Evaluation Metrics for Recommender Systems
Beyond the general accuracy and error metrics above, recommender systems have their own evaluation metrics that focus on user behavior and ranking quality.
3.1 Hit Rate
Hit Rate shows how often items of interest to the user appear in the recommendation list, and it is one of the basic success metrics for recommender systems. Across many users it is typically reported as the fraction of users whose list contains at least one relevant item; the code example below computes the hit fraction for a single user's list, and a multi-user sketch follows it.
- Definition of Hit Rate $$ \text{Hit Rate} = \frac{\text{Number of Hits}}{\text{Total Number of Users}} $$
def hit_rate(recommended_list, relevant_list):
    # Fraction of the recommended items that appear in the user's relevant items
    hits = sum([1 for rec in recommended_list if rec in relevant_list])
    return hits / len(recommended_list)
recommended_list = [1, 2, 3, 4, 5]
relevant_list = [1, 2, 3, 6, 7]
hr = hit_rate(recommended_list, relevant_list)
print("Hit Rate:", round(hr, 2))
Hit Rate: 0.6
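The definition above divides by the total number of users, so for completeness here is a minimal multi-user sketch: a user counts as a hit if at least one relevant item appears in that user's list. The per-user lists are made-up illustration data.
# Hit Rate across users: a user counts as a hit if at least one relevant item
# appears in that user's recommendation list (illustrative dummy data)
recommendations_by_user = {
    "u1": [1, 2, 3],
    "u2": [4, 5, 6],
    "u3": [7, 8, 9],
}
relevant_by_user = {
    "u1": [3, 10],   # hit: item 3 was recommended
    "u2": [11, 12],  # no hit
    "u3": [9],       # hit: item 9 was recommended
}
hits = sum(
    1
    for user, recs in recommendations_by_user.items()
    if any(item in relevant_by_user[user] for item in recs)
)
hit_rate_users = hits / len(recommendations_by_user)
print("Hit Rate (all users):", round(hit_rate_users, 2))
Hit Rate (all users): 0.67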
3.2 Mean Average Precision (MAP)
Mean Average Precision (MAP) is an evaluation metric that takes the ranking of the recommendation results into account. Because it rewards placing relevant items near the top of the list, it evaluates usefulness to users more faithfully than order-insensitive metrics.
- Definition of MAP $$ \text{MAP} = \frac{1}{|U|} \sum_{u \in U} \text{AP}(u) $$ Here, $\text{AP}(u)$ is the average precision of user $u$, and $|U|$ is the number of users.
def average_precision(recommended_list, relevant_list):
    # Accumulate precision at each rank where a relevant item appears,
    # then average over the number of relevant items.
    hits = 0
    sum_precisions = 0
    for i, rec in enumerate(recommended_list):
        if rec in relevant_list:
            hits += 1
            sum_precisions += hits / (i + 1)
    return sum_precisions / len(relevant_list)
recommended_list = [1, 2, 3, 4, 5]
relevant_list = [1, 2, 3]
ap = average_precision(recommended_list, relevant_list)
print("Average Precision:", round(ap, 2))
3.3 nDCG (Normalized Discounted Cumulative Gain)
nDCG (Normalized Discounted Cumulative Gain) is an evaluation metric that considers the importance of ranking. Items in higher positions are given more importance.
- Definition of nDCG $$ \text{nDCG} = \frac{DCG}{IDCG} $$ Here, $DCG$ is the discounted cumulative gain, and $IDCG$ is the ideal discounted cumulative gain.
def dcg(recommended_list, relevant_list):
    # Binary relevance: each relevant item contributes 1 / log2(rank + 1)
    return sum((1 if rec in relevant_list else 0) / np.log2(idx + 2) for idx, rec in enumerate(recommended_list))
def ndcg(recommended_list, relevant_list):
    dcg_val = dcg(recommended_list, relevant_list)
    # IDCG: DCG of an ideal list that places all relevant items at the top
    idcg_val = dcg(sorted(relevant_list, reverse=True), relevant_list)
    return dcg_val / idcg_val
recommended_list = [1, 2, 3, 4, 5]
relevant_list = [1, 2, 6]
ndcg_val = ndcg(recommended_list, relevant_list)
print("NDCG:", round(ndcg_val, 2))
NDCG: 0.77
4. Offline Evaluation and Online Evaluation
There are two types of evaluation for recommender systems: offline evaluation and online evaluation. Each method has its own advantages and disadvantages.
4.1 Offline Evaluation
Offline evaluation uses pre-collected data to evaluate the system.
Advantages and Disadvantages
Advantages include low cost and fast turnaround, since no live users are needed. Disadvantages include the risk that results diverge from actual user behavior in production.
Application Methods and Examples
System performance is evaluated by simulating recommendations on past data. For example, the MovieLens dataset can be split into training and test sets, as shown below.
import pandas as pd
from sklearn.model_selection import train_test_split
# Load data
ratings = pd.read_csv(
"https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
sep="\t",
names=["user_id", "item_id", "rating", "timestamp"],
)
# Split into training and test data
train_data, test_data = train_test_split(ratings, test_size=0.2)
# Display sample of training data
display(train_data.head())
| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 39002 | 76 | 1155 | 2 | 882607017 |
| 34031 | 506 | 772 | 1 | 874873247 |
| 58728 | 643 | 739 | 3 | 891449476 |
| 31812 | 21 | 854 | 5 | 874951657 |
| 15197 | 269 | 414 | 3 | 891449624 |
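To close the offline loop, a simple baseline can be scored on the held-out test data. The sketch below is a baseline of my choosing rather than a full recommender: it predicts each user's mean training rating (falling back to the global mean for unseen users) and measures RMSE.
from sklearn.metrics import mean_squared_error
# A minimal offline evaluation: predict each user's mean training rating
# and score the predictions with RMSE on the held-out test data
global_mean = train_data["rating"].mean()
user_means = train_data.groupby("user_id")["rating"].mean()
predictions = test_data["user_id"].map(user_means).fillna(global_mean)
rmse_baseline = np.sqrt(mean_squared_error(test_data["rating"], predictions))
print("Baseline RMSE on test data:", round(rmse_baseline, 2))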
4.2 Online Evaluation
Online evaluation measures system performance with actual users.
A/B Testing
A/B testing is used to compare different versions of the system. Users are randomly divided into groups, each provided with a different version, and their effects are compared.
Advantages and Disadvantages
Advantages include reflecting actual user behavior. Disadvantages include time and cost required for implementation.
# Example of A/B testing simulation
# Assume click rates for different user groups
group_a_clicks = np.random.binomial(1, 0.1, 1000) # Group A click rate 10%
group_b_clicks = np.random.binomial(1, 0.15, 1000) # Group B click rate 15%
# Calculate average click rates
click_rate_a = np.mean(group_a_clicks)
click_rate_b = np.mean(group_b_clicks)
print("Group A Click Rate:", round(click_rate_a, 2))
print("Group B Click Rate:", round(click_rate_b, 2))
5. User Experience Evaluation
User experience evaluation is also important for the success of recommender systems. This includes evaluating user satisfaction and engagement.
5.1 User Satisfaction
User satisfaction is evaluated through surveys and feedback. This provides direct information for system improvement.
Utilizing Surveys and Feedback
Collect opinions directly from users through surveys and use the results to improve the system.
# Sample survey data
feedback_data = {"user_id": [1, 2, 3, 4, 5], "satisfaction": [5, 4, 3, 4, 5]}
feedback_df = pd.DataFrame(feedback_data)
average_satisfaction = feedback_df["satisfaction"].mean()
print("Average User Satisfaction:", round(average_satisfaction, 2))
5.2 Engagement
Engagement metrics show how frequently users use the system. This helps measure user loyalty.
Definition and Importance of Engagement Metrics
Engagement metrics measure how often and for how long users use the system, which indicates how much users rely on and return to it.
import pandas as pd
# Sample engagement data
engagement_data = {
"user_id": [1, 2, 3, 4, 5],
"sessions": [10, 15, 5, 20, 25],
"time_spent": [300, 450, 150, 600, 750],
}
engagement_df = pd.DataFrame(engagement_data)
average_sessions = engagement_df["sessions"].mean()
average_time_spent = engagement_df["time_spent"].mean()
print("Average Sessions per User:", round(average_sessions, 2))
print("Average Time Spent per User (minutes):", round(average_time_spent, 2))
Average Sessions per User: 15.0
Average Time Spent per User (minutes): 450.0
6. Summary
Choosing and combining appropriate evaluation methods is crucial. Evaluating and improving continuously (a PDCA cycle) keeps system performance optimized; evaluation is essential to both system success and user satisfaction.
Conclusion
This article broadly explained evaluation methods for recommender systems, from basic metrics to offline and online evaluations and user experience assessments.
Choosing appropriate evaluation methods and continuously improving the system is essential for building a successful recommender system.
For comprehensive information on recommender systems and evaluation metrics, refer to the following references:
References:
- Aggarwal, C. C. (2016). Recommender Systems: The Textbook. Springer.