On Evaluation Methods in Recommender Systems

Overview

This article details the evaluation methods in recommender systems.

The definitions, properties, and applications of each evaluation method are illustrated with concrete examples using equations and Python code.

Additionally, the advantages and disadvantages of the evaluation methods are discussed, and an implementation example using the “movielens-100k” dataset is introduced.

This is material I had been meaning to write up as a memo, so I document it here briefly and concisely.

Source Code

github

  • For the Jupyter notebook file, click here

google colaboratory

  • To run on Google Colaboratory, click here

Execution Environment

The OS used is macOS. Please note that some command options differ from those on Linux and other Unix systems.

!sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90
!python -V
Python 3.9.17

We will import basic libraries and use watermark to check their versions. Additionally, we will set the seed for random numbers.

import random

import numpy as np

seed = 123
random_state = 123

random.seed(seed)
np.random.seed(seed)

from watermark import watermark

print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.17.2

numpy     : 1.25.2
matplotlib: 3.8.1
pandas    : 2.0.3
scipy     : 1.11.2

Watermark: 2.4.3

Evaluation Methods in Recommender Systems

Overview

Recommender systems are technologies for providing users with personalized items. Accurately evaluating their performance is directly linked to system improvement and user satisfaction. Based on my experience in constructing recommender systems, this article comprehensively explains evaluation methods, from basic metrics to offline and online evaluations, and user experience assessments. Specific evaluation methods are explained with equations and Python code examples.

1. Introduction

Recommender systems are used in many fields, such as online shopping, video streaming services, and music streaming. Providing users with suitable items enhances user engagement and directly contributes to business success. Selecting and applying appropriate evaluation methods is essential to understanding and optimizing system performance. The following sections explain specific evaluation methods.

2. Basic Evaluation Metrics for Recommender Systems

Basic metrics for evaluating the performance of recommender systems include accuracy and error metrics.

2.1 Accuracy Evaluation

Basic accuracy metrics for recommender systems include Precision, Recall, and F1 Score. These metrics are also widely used in evaluating classification problems.

  • Precision $$ \text{Precision} = \frac{TP}{TP + FP} $$ Here, $TP$ is True Positives, and $FP$ is False Positives.
from sklearn.metrics import precision_score

y_true_list = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred_list = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

precision = precision_score(y_true_list, y_pred_list)
print("Precision:", round(precision, 2))
Precision: 0.8
  • Recall $$ \text{Recall} = \frac{TP}{TP + FN} $$ Here, $FN$ is False Negatives.
from sklearn.metrics import recall_score

recall = recall_score(y_true_list, y_pred_list)
print("Recall:", round(recall, 2))
Recall: 0.67
  • F1 Score $$ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
from sklearn.metrics import f1_score

f1 = f1_score(y_true_list, y_pred_list)
print("F1 Score:", round(f1, 2))
F1 Score: 0.73

2.2 Error Metrics

Error metrics include Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). These metrics are important for rating predictions and regression problems.

  • Mean Absolute Error (MAE) $$ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| $$ Here, $N$ is the number of samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
from sklearn.metrics import mean_absolute_error

y_true_list = [3.5, 2.0, 4.0, 3.0, 5.0]
y_pred_list = [3.7, 2.1, 3.9, 3.2, 4.8]

mae = mean_absolute_error(y_true_list, y_pred_list)
print("MAE:", round(mae, 2))
MAE: 0.16
  • Root Mean Squared Error (RMSE) $$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} $$
from sklearn.metrics import mean_squared_error
import numpy as np

mse = mean_squared_error(y_true_list, y_pred_list)
rmse = np.sqrt(mse)
print("RMSE:", round(rmse, 2))
RMSE: 0.17

3. Specific Evaluation Metrics for Recommender Systems

Beyond general accuracy and error metrics, recommender systems also have evaluation metrics of their own that focus on user behavior and ranking quality.

3.1 Hit Rate

Hit Rate shows how often the items a user is actually interested in appear in the recommendation list, and it is one of the most basic success metrics for recommender systems.

  • Definition of Hit Rate $$ \text{Hit Rate} = \frac{\text{Number of Hits}}{\text{Total Number of Users}} $$ Here, a user counts as a hit when at least one relevant item appears in their recommendation list. The simplified single-user example below instead reports the fraction of recommended items that are relevant; a multi-user sketch of the definition follows after it.
def hit_rate(recommended_list, relevant_list):
    # For a single user: the fraction of recommended items that are also relevant
    hits = sum(1 for rec in recommended_list if rec in relevant_list)
    return hits / len(recommended_list)

recommended_list = [1, 2, 3, 4, 5]
relevant_list = [1, 2, 3, 6, 7]

hr = hit_rate(recommended_list, relevant_list)
print("Hit Rate:", round(hr, 2))
Hit Rate: 0.6
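
For reference, here is a minimal multi-user sketch of the definition above; the per-user recommendation and relevance lists are made up purely for illustration.

# Multi-user Hit Rate sketch: a user counts as a hit if at least one
# relevant item appears in that user's recommendation list.
user_recommendations = {
    "user_1": [1, 2, 3],
    "user_2": [4, 5, 6],
    "user_3": [7, 8, 9],
}
user_relevant_items = {
    "user_1": [2, 10],
    "user_2": [11, 12],
    "user_3": [7],
}

hits = sum(
    1
    for user, recs in user_recommendations.items()
    if any(item in user_relevant_items[user] for item in recs)
)
print("Hit Rate (over users):", round(hits / len(user_recommendations), 2))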

3.2 Mean Average Precision (MAP)

Mean Average Precision (MAP) is an evaluation metric that takes the ranking of the recommendation results into account, so it reflects usefulness to users more faithfully than order-agnostic metrics.

  • Definition of MAP $$ \text{MAP} = \frac{1}{|U|} \sum_{u \in U} \text{AP}(u) $$ Here, $\text{AP}(u)$ is the average precision of user $u$, and $|U|$ is the number of users.
def average_precision(recommended_list, relevant_list):
    # Precision is accumulated at each rank where a relevant item appears,
    # then normalized by the total number of relevant items.
    hits = 0
    sum_precisions = 0
    for i, rec in enumerate(recommended_list):
        if rec in relevant_list:
            hits += 1
            sum_precisions += hits / (i + 1)
    return sum_precisions / len(relevant_list)

recommended_list = [1, 2, 3, 4, 5]
relevant_list = [1, 2, 3]

ap = average_precision(recommended_list, relevant_list)
print("Average Precision:", round(ap, 2))

3.3 nDCG (Normalized Discounted Cumulative Gain)

nDCG (Normalized Discounted Cumulative Gain) is an evaluation metric that accounts for ranking position: items placed higher in the list are weighted more heavily.

  • Definition of nDCG $$ \text{nDCG} = \frac{DCG}{IDCG} $$ Here, $DCG$ is the discounted cumulative gain, and $IDCG$ is the ideal discounted cumulative gain.
def dcg(recommended_list, relevant_list):
    # Binary relevance: gain is 1 if the item is relevant, discounted by log2 of (rank + 1)
    return sum(
        (1 if rec in relevant_list else 0) / np.log2(idx + 2)
        for idx, rec in enumerate(recommended_list)
    )

def ndcg(recommended_list, relevant_list):
    dcg_val = dcg(recommended_list, relevant_list)
    # Ideal DCG: every relevant item ranked at the top (order is irrelevant for binary gains)
    idcg_val = dcg(sorted(relevant_list, reverse=True), relevant_list)
    return dcg_val / idcg_val

recommended_list = [1, 2, 3, 4, 5]
relevant_list = [1, 2, 6]

ndcg_val = ndcg(recommended_list, relevant_list)
print("NDCG:", round(ndcg_val, 2))
NDCG: 0.77

4. Offline Evaluation and Online Evaluation

There are two types of evaluation for recommender systems: offline evaluation and online evaluation. Each method has its own advantages and disadvantages.

4.1 Offline Evaluation

Offline evaluation uses pre-collected data to evaluate the system.

  • Advantages and Disadvantages

    Advantages include low cost and quick evaluation. Disadvantages include potential differences from actual user behavior.

  • Application Methods and Examples

    System performance is evaluated by simulating recommendations on previously collected data; the example below uses the MovieLens dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
ratings = pd.read_csv(
    "https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

# Split into training and test data
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=random_state)

# Display sample of training data
display(train_data.head())
(Sample of the first five rows of the training data, with columns user_id, item_id, rating, and timestamp.)
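
As a minimal sketch of how this split can be used for offline evaluation, the baseline below predicts each item's mean training rating and measures RMSE on the test data. The item-mean baseline is only an illustrative stand-in for a real recommendation model, and the resulting value depends on the random split.

import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative baseline: predict each item's mean rating observed in the training data
item_means = train_data.groupby("item_id")["rating"].mean()
global_mean = train_data["rating"].mean()

# Items that never appear in the training data fall back to the global mean rating
predictions = test_data["item_id"].map(item_means).fillna(global_mean)

rmse = np.sqrt(mean_squared_error(test_data["rating"], predictions))
print("Baseline RMSE on the test data:", round(rmse, 2))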

4.2 Online Evaluation

Online evaluation evaluates the system using actual users.

  • A/B Testing

    A/B testing is used to compare different versions of the system. Users are randomly divided into groups, each provided with a different version, and their effects are compared.

  • Advantages and Disadvantages

    Advantages include that the results reflect actual user behavior. Disadvantages include the time and cost required to run the tests.

# Example of A/B testing simulation

# Assume click rates for different user groups
group_a_clicks = np.random.binomial(1, 0.1, 1000)  # Group A click rate 10%
group_b_clicks = np.random.binomial(1, 0.15, 1000)  # Group B click rate 15%

# Calculate average click rates
click_rate_a = np.mean(group_a_clicks)
click_rate_b = np.mean(group_b_clicks)

print("Group A Click Rate:", round(click_rate_a, 2))
print("Group B Click Rate:", round(click_rate_b, 2))

5. User Experience Evaluation

User experience evaluation is also important for the success of recommender systems. This includes evaluating user satisfaction and engagement.

5.1 User Satisfaction

User satisfaction is evaluated through surveys and feedback. This provides direct information for system improvement.

  • Utilizing Surveys and Feedback

    Collect opinions directly from users through surveys and use the results to improve the system.

# Sample survey data

feedback_data = {"user_id": [1, 2, 3, 4, 5], "satisfaction": [5, 4, 3, 4, 5]}

feedback_df = pd.DataFrame(feedback_data)
average_satisfaction = feedback_df["satisfaction"].mean()

print("Average User Satisfaction:", round(average_satisfaction, 2))

5.2 Engagement

Engagement metrics show how frequently users use the system. This helps measure user loyalty.

  • Definition and Importance of Engagement Metrics

    Engagement metrics measure how frequently and for how long users use the system, which indicates how attached users are to the service.

import pandas as pd

# Sample engagement data
engagement_data = {
    "user_id": [1, 2, 3, 4, 5],
    "sessions": [10, 15, 5, 20, 25],
    "time_spent": [300, 450, 150, 600, 750],
}

engagement_df = pd.DataFrame(engagement_data)

average_sessions = engagement_df["sessions"].mean()
average_time_spent = engagement_df["time_spent"].mean()

print("Average Sessions per User:", round(average_sessions, 2))
print("Average Time Spent per User (minutes):", round(average_time_spent, 2))
Average Sessions per User: 15.0
Average Time Spent per User (minutes): 450.0

6. Summary

Choosing appropriate evaluation methods, and combining them, is crucial. System performance should be optimized by evaluating and improving continuously (a PDCA cycle). Evaluation is essential for system success and user satisfaction.

Conclusion

This article broadly explained evaluation methods for recommender systems, from basic metrics to offline and online evaluations and user experience assessments.

Choosing appropriate evaluation methods and continuously improving the system is essential for building a successful recommender system.

For comprehensive information on recommender systems and evaluation metrics, refer to the following references:

References:

  1. Aggarwal, C. C. (2016). Recommender Systems: The Textbook. Springer.