Content-Based Collaborative Filtering

Overview

In this article, I explain content-based collaborative filtering, covering its definition, characteristics, and application examples with mathematical formulas and Python code.

Additionally, I will discuss its advantages and disadvantages.

Furthermore, as a concrete example, I will show an implementation using the “movielens-100k” dataset.

Please note that this is a personal memorandum.

Source Code

GitHub

  • Jupyter notebook files can be found here

Google Colaboratory

  • To run on Google Colaboratory, click here

Execution Environment

The OS is macOS. Note that some command-line options differ from those of the equivalent Linux commands.

!sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90
!python -V
Python 3.9.17

We will import basic libraries and use watermark to check their versions. We will also set the seed for random numbers.

import random

import numpy as np
import pandas as pd

from pprint import pprint

seed = 123
random_state = 123

random.seed(seed)
np.random.seed(seed)

from watermark import watermark

print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version       : 3.9.17

numpy : 1.25.2
pandas: 2.0.3

Watermark: 2.4.3

Definition of Content-Based Collaborative Filtering

Content-based collaborative filtering is a method of making recommendations based on the features of items or users. Unlike traditional collaborative filtering, which relies only on the user-item interaction history, it combines item or user metadata with users’ past behavior to recommend items similar to those a user has already liked.

Formulas and Examples

In content-based collaborative filtering, we use feature vectors of items. For example, in a movie recommender system, information such as genres, actors, and directors of movies forms the feature vectors. The similarity between the user’s preference vector and the item’s feature vector is calculated, and items with high similarity are recommended.
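As a minimal sketch of how such vectors might be built, the snippet below encodes each movie as a multi-hot genre vector and averages the vectors of the movies a user liked to obtain a preference vector. The genre vocabulary, the movie names, and the helper `encode_genres` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical genre vocabulary; each position is one feature dimension.
GENRES = ["Action", "Comedy", "Drama"]


def encode_genres(genres):
    """Return a multi-hot vector with 1.0 for each genre the movie has."""
    return np.array([1.0 if g in genres else 0.0 for g in GENRES])


# Hypothetical movie metadata.
item_features = {
    "movie_a": encode_genres({"Action", "Drama"}),
    "movie_b": encode_genres({"Comedy"}),
    "movie_c": encode_genres({"Action"}),
}

# Assume the user liked movie_a and movie_c; average their feature vectors.
liked = ["movie_a", "movie_c"]
user_vector = np.mean([item_features[m] for m in liked], axis=0)

# Mean of [1, 0, 1] and [1, 0, 0]
print(user_vector)
```

Averaging is only one possible choice; weighting each movie by its rating is a common alternative.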

Representation of Feature Vectors

Let $\mathbf{x}_i$ be the feature vector of item $i$ and $\mathbf{y}_u$ be the preference vector of user $u$. Cosine similarity is used for the similarity calculation.

$$ \text{sim}(\mathbf{x}_i, \mathbf{y}_u) = \frac{\mathbf{x}_i \cdot \mathbf{y}_u}{|\mathbf{x}_i| |\mathbf{y}_u|} $$

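Here, $\mathbf{x}_i \cdot \mathbf{y}_u$ is the dot product, and $|\mathbf{x}_i|$ and $|\mathbf{y}_u|$ are the Euclidean (L2) norms of the vectors. The similarity ranges from $-1$ to $1$; for non-negative feature vectors such as the ones used in this article, it lies between $0$ and $1$.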
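To make the formula concrete, here is a small hand-written version of it in NumPy. The function name `cosine` and the two example vectors are assumptions for illustration; returning 0.0 for an all-zero vector is one possible convention, not part of the formula.

```python
import numpy as np


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    # Define the similarity as 0.0 when either vector is all zeros.
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0


x_i = np.array([1.0, 1.0, 0.0])  # hypothetical item feature vector
y_u = np.array([1.0, 0.0, 1.0])  # hypothetical user preference vector

# dot product = 1, both norms = sqrt(2), so the similarity is 1 / 2
sim = cosine(x_i, y_u)
print(round(sim, 2))  # 0.5
```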

Example Implementation in Python

Below is a simple implementation of a movie recommender system. Here, we calculate the cosine similarity using feature vectors of movies and the user’s preference vector.

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

from pprint import pprint

# Sample feature vectors of movies
# Set some arbitrary vectors
movies = {
    "movie_1": np.array([1, 0, 1]),
    "movie_2": np.array([0, 1, 0]),
    "movie_3": np.array([1, 1, 0]),
}

# User's preference vector
user_preference = np.array([1, 0, 1])

# Calculate cosine similarity
similarity_dict = {}
for movie, features in movies.items():
    similarity = cosine_similarity([user_preference], [features])[0][0]
    similarity_dict[movie] = round(similarity, 2)

pprint(similarity_dict)
{'movie_1': 1.0, 'movie_2': 0.0, 'movie_3': 0.5}

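This code computes the cosine similarity between the user’s preference vector and each movie’s feature vector. Movies with the highest similarity would be recommended first: here movie_1 (1.0), then movie_3 (0.5), while movie_2 (0.0) shares no features with the user’s preferences.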
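Turning the scores into a recommendation list is a matter of sorting. The sketch below reuses the scores from the output above; the helper `top_n` is a hypothetical name introduced for illustration.

```python
# Similarity scores copied from the output above.
similarity_dict = {"movie_1": 1.0, "movie_2": 0.0, "movie_3": 0.5}


def top_n(scores, n):
    """Return the n (name, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


recommendations = top_n(similarity_dict, n=2)
print(recommendations)  # [('movie_1', 1.0), ('movie_3', 0.5)]
```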

Application Examples

Content-based collaborative filtering is applied in the following areas:

  • Movie and Music Recommender Systems: Recommend new movies or music based on users’ viewing or listening history.
  • E-commerce Websites: Recommend related products by analyzing users’ purchase or browsing history.
  • News Article Recommendations: Recommend news articles of interest based on users’ past browsing history.

Advantages and Disadvantages

Advantages

  • Addressing the Cold Start Problem: New items that have no ratings yet can still be recommended, as long as their feature vectors are available.
  • Recommendations Based on User Preferences: Each user’s preference vector is built from that user’s own history, so recommendations reflect individual tastes.

Disadvantages

  • Risk of Over-Specialization: If too much reliance is placed on users’ past preferences, items unlike those a user has already seen are rarely recommended.
  • Computational Cost: Building feature vectors and computing similarities can be time-consuming when there are many items or high-dimensional features.
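One common way to keep the similarity step cheap is to normalize the item vectors once and then score all items with a single matrix-vector product. The following is a sketch under assumed sizes (1,000 items, 20 features) with random data; it is not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 items with 20 non-negative features each.
item_matrix = rng.random((1000, 20))
user_vector = rng.random(20)

# Normalize the item rows once; this can be cached and reused for every user.
item_unit = item_matrix / np.linalg.norm(item_matrix, axis=1, keepdims=True)
user_unit = user_vector / np.linalg.norm(user_vector)

# One matrix-vector product yields the cosine similarity to all items.
scores = item_unit @ user_unit

# argpartition selects the 10 best items without sorting all scores;
# only those 10 are then sorted by descending score.
top10 = np.argpartition(-scores, 10)[:10]
top10 = top10[np.argsort(-scores[top10])]
print(top10.shape)  # (10,)
```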

A Concrete Example

Here, I will implement a movie recommender system using the “movielens-100k” dataset.

Preparing the Dataset

First, load the “movielens-100k” dataset and prepare the feature vectors of movies and the preference vectors of users.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset
movies_df = pd.read_csv(
    "./ml-100k/u.item",
    sep="|",
    encoding="latin-1",
    header=None,
    names=[
        "movie_id",
        "title",
        "release_date",
        "video_release_date",
        "IMDb_URL",
        "unknown",
        "Action",
        "Adventure",
        "Animation",
        "Children's",
        "Comedy",
        "Crime",
        "Documentary",
        "Drama",
        "Fantasy",
        "Film-Noir",
        "Horror",
        "Musical",
        "Mystery",
        "Romance",
        "Sci-Fi",
        "Thriller",
        "War",
        "Western",
    ],
)
ratings_df = pd.read_csv(
    "./ml-100k/u.data", sep="\t", encoding="latin-1", header=None, names=["user_id", "movie_id", "rating", "timestamp"]
)

# Concatenate genre information into a string
movie_genres = movies_df.iloc[:, 6:]
movie_genres_str = movie_genres.apply(lambda x: " ".join(movie_genres.columns[x == 1]), axis=1)

# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Create TF-IDF vectors
try:
    tfidf_matrix = tfidf.fit_transform(movie_genres_str)
    print("TFIDF Matrix Shape:", tfidf_matrix.shape)

    # Display feature names
    feature_names = tfidf.get_feature_names_out()
    print("Feature Names:", feature_names)
except ValueError as e:
    print(e)
TFIDF Matrix Shape: (1682, 20)
Feature Names: ['action' 'adventure' 'animation' 'children' 'comedy' 'crime'
 'documentary' 'drama' 'fantasy' 'fi' 'film' 'horror' 'musical' 'mystery'
 'noir' 'romance' 'sci' 'thriller' 'war' 'western']
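The matrix has 20 columns although only 18 genre columns were used: the default tokenizer of `TfidfVectorizer` keeps tokens of two or more word characters, so `Sci-Fi` becomes `sci` and `fi`, `Film-Noir` becomes `film` and `noir`, and `Children's` becomes `children`. To illustrate what the weights themselves mean, the sketch below computes TF-IDF by hand on a tiny hypothetical genre matrix. It assumes the smoothed formula $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ followed by L2 normalization of each row, which is my understanding of scikit-learn's default settings.

```python
import numpy as np

# Hypothetical binary genre matrix: rows are movies, columns are genres.
# Columns: drama, comedy, western
counts = np.array(
    [
        [1, 0, 0],
        [1, 1, 0],
        [1, 0, 1],
        [0, 1, 0],
    ],
    dtype=float,
)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of movies containing each genre

# Smoothed inverse document frequency: rarer genres get larger weights.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale every row to unit L2 norm.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(idf, 3))
```

In this toy example the rare genre (western) receives a larger weight than the common one (drama), so it contributes more to the similarity between two movies that share it.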
# Collect the IDs of the movies each user has rated
user_preferences = ratings_df.groupby("user_id")["movie_id"].apply(list)

# Function to calculate cosine similarity
def calculate_similarity(user_pref, tfidf_matrix):
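    # Average the TF-IDF rows selected by user_pref to form the user's vector.
    # NOTE: user_pref holds 1-based movie IDs while matrix rows are 0-based,
    # so index k selects the movie with ID k + 1; subtract 1 from the IDs
    # to align them exactly.
    user_vector = np.asarray(np.mean(tfidf_matrix[user_pref], axis=0))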
    similarities = cosine_similarity(user_vector, tfidf_matrix)
    return similarities

# Calculate similarity between user_1's preference vector and movies
user_1_pref = user_preferences[1]
print("user_1 preferences length:", len(user_1_pref))

# Calculate similarities
similarities = calculate_similarity(user_1_pref, tfidf_matrix)

# Display movies with high similarity
similar_movies = np.argsort(similarities[0])[::-1][:10]
recommended_movies = movies_df.iloc[similar_movies]

print(recommended_movies[["movie_id", "title"]])
user_1 preferences length: 272
      movie_id                                title
3            4                    Get Shorty (1995)
73          74  Faster Pussycat! Kill! Kill! (1965)
1236      1237                       Twisted (1996)
521        522                   Down by Law (1986)
1456      1457          Love Is All There Is (1996)
1011      1012                 Private Parts (1997)
92          93      Welcome to the Dollhouse (1995)
1459      1460                     Sleepover (1995)
1271      1272             Talking About Sex (1994)
346        347                   Wag the Dog (1997)

Interpretation of Results

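The table lists the ten movies whose genre vectors are most similar to user 1’s preference vector; these are the movies that would be recommended to user 1. Because the preference vector is an unweighted average over all 272 movies the user rated, it ignores the rating values, and movies the user has already rated are not excluded from the list.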
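The following is a sketch of one possible refinement, not the method used in the notebook above. It converts the 1-based `movie_id` values to 0-based row positions, weights each rated movie by its rating, and skips movies the user has already rated. The function `recommend` and the toy data are hypothetical; the function expects a dense array, so with the notebook's sparse matrix one would presumably pass `tfidf_matrix.toarray()`.

```python
import numpy as np


def recommend(item_matrix, movie_ids, ratings, top_n=10):
    """Rank unseen items by cosine similarity to a rating-weighted profile.

    item_matrix: dense (n_items, n_features) array; row k is movie ID k + 1.
    movie_ids:   1-based IDs of the movies the user rated.
    ratings:     the user's ratings for those movies, in the same order.
    """
    rows = np.asarray(movie_ids) - 1  # convert 1-based IDs to 0-based rows
    weights = np.asarray(ratings, dtype=float)

    # Rating-weighted average of the rated movies' feature vectors.
    profile = weights @ item_matrix[rows] / weights.sum()

    # Cosine similarity between the profile and every item (0 for zero rows).
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(profile)
    scores = np.divide(
        item_matrix @ profile,
        norms,
        out=np.zeros(len(item_matrix)),
        where=norms > 0,
    )

    # Keep only movies the user has not rated, then sort by descending score.
    seen = np.zeros(len(item_matrix), dtype=bool)
    seen[rows] = True
    candidates = np.flatnonzero(~seen)
    order = candidates[np.argsort(-scores[candidates], kind="stable")][:top_n]
    return order + 1  # back to 1-based movie IDs


# Hypothetical toy data: 4 movies, 3 genre features.
items = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1]], dtype=float)
print(recommend(items, movie_ids=[1], ratings=[5], top_n=2))  # [2 3]
```

Weighting by the raw rating is only one option; centering ratings around the user's mean, so that disliked movies push the profile away from their genres, is another.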

Conclusion

In this article, I detailed content-based collaborative filtering. I showed concrete definitions, formulas, and examples using Python code, and discussed the advantages and disadvantages.

This method is applied in various fields such as movie and music recommendations, e-commerce, and news article recommendations.

There are many other aspects to consider, such as evaluation methods and hyperparameter tuning, but they are outside the scope of this personal note.