Content-Based Collaborative Filtering
Overview
In this article, I will explain content-based collaborative filtering. I will detail the definition, characteristics, and application examples of content-based collaborative filtering using mathematical formulas and Python code.
Additionally, I will discuss its advantages and disadvantages.
Furthermore, as a concrete example, I will show an implementation using the “movielens-100k” dataset.
Please note that this is a personal memorandum.
Source Code
GitHub
- Jupyter notebook files can be found here
Google Colaboratory
- To run on Google Colaboratory, click here
Execution Environment
The OS is macOS. Please note that some command options differ from those on Linux and other Unix systems.
!sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
!python -V
Python 3.9.17
We will import basic libraries and use watermark to check their versions. We will also set the seed for random numbers.
import random
import numpy as np
import pandas as pd
from pprint import pprint
seed = 123
random_state = 123
random.seed(seed)
np.random.seed(seed)
from watermark import watermark
print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version : 3.9.17
numpy : 1.25.2
pandas: 2.0.3
Watermark: 2.4.3
Definition of Content-Based Collaborative Filtering
Content-based collaborative filtering is a method that makes recommendations based on the features of items or users. Unlike traditional collaborative filtering, which relies only on user-item interactions such as ratings, it uses item or user metadata together with each user’s past behavior and preferences to recommend items similar to those the user liked.
Formulas and Examples
In content-based collaborative filtering, we use feature vectors of items. For example, in a movie recommender system, information such as genres, actors, and directors of movies forms the feature vectors. The similarity between the user’s preference vector and the item’s feature vector is calculated, and items with high similarity are recommended.
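As a minimal sketch of this idea, the following builds multi-hot genre vectors for a few hypothetical movies and forms a user preference vector as the mean of the vectors of the movies the user liked. The titles, the genre vocabulary, and the averaging rule are illustrative assumptions, not taken from any dataset.

```python
import numpy as np

# Hypothetical genre vocabulary and movies (illustrative only)
genres = ["action", "comedy", "drama"]
movie_genres = {
    "movie_a": ["action", "drama"],
    "movie_b": ["comedy"],
    "movie_c": ["action", "comedy"],
}


def to_multi_hot(tags, vocabulary):
    """Return a 0/1 vector with a 1 for every vocabulary entry present in tags."""
    return np.array([1.0 if g in tags else 0.0 for g in vocabulary])


item_vectors = {title: to_multi_hot(tags, genres) for title, tags in movie_genres.items()}

# Assumption: the preference vector is the mean of the liked movies' vectors
liked = ["movie_a", "movie_c"]
user_vector = np.mean([item_vectors[t] for t in liked], axis=0)

print(item_vectors["movie_a"])  # [1. 0. 1.]
print(user_vector)              # [1.  0.5 0.5]
```

Averaging is only one possible choice; weighting each movie by its rating is a common alternative.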
Representation of Feature Vectors
Let $\mathbf{x}_i$ be the feature vector of item $i$ and $\mathbf{y}_u$ be the preference vector of user $u$. Cosine similarity is used for the similarity calculation.
$$ \text{sim}(\mathbf{x}_i, \mathbf{y}_u) = \frac{\mathbf{x}_i \cdot \mathbf{y}_u}{\|\mathbf{x}_i\| \|\mathbf{y}_u\|} $$
Here, $\mathbf{x}_i \cdot \mathbf{y}_u$ is the dot product, and $\|\mathbf{x}_i\|$ and $\|\mathbf{y}_u\|$ are the Euclidean norms of the vectors. The similarity is 1 when the two vectors point in the same direction and 0 when they are orthogonal.
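To make the formula concrete, here is a small sketch that evaluates it directly with NumPy. The function name `cosine`, the example vectors, and the choice to return 0 for a zero vector are assumptions made for illustration.

```python
import numpy as np


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of Euclidean norms."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    # Assumption: a zero vector is treated as having similarity 0 to everything
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0


x_i = np.array([1.0, 1.0, 0.0])  # hypothetical item feature vector
y_u = np.array([1.0, 0.0, 1.0])  # hypothetical user preference vector

# The dot product is 1 and both norms are sqrt(2), so the similarity is 1 / 2
print(round(cosine(x_i, y_u), 2))  # 0.5
```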
Example Implementation in Python
Below is a simple implementation of a movie recommender system. Here, we calculate the cosine similarity using feature vectors of movies and the user’s preference vector.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from pprint import pprint
# Sample feature vectors of movies
# Set some arbitrary vectors
movies = {
    "movie_1": np.array([1, 0, 1]),
    "movie_2": np.array([0, 1, 0]),
    "movie_3": np.array([1, 1, 0]),
}
# User's preference vector
user_preference = np.array([1, 0, 1])
# Calculate cosine similarity
similarity_dict = {}
for movie, features in movies.items():
    similarity = cosine_similarity([user_preference], [features])[0][0]
    similarity_dict[movie] = round(similarity, 2)
pprint(similarity_dict)
{'movie_1': 1.0, 'movie_2': 0.0, 'movie_3': 0.5}
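This code calculates the cosine similarity between the user’s preference vector and each movie’s feature vector. movie_1 matches the preference vector exactly (1.0), movie_3 shares one of its two features with it (0.5), and movie_2 shares none (0.0), so movie_1 would be recommended first.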
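Turning the scores into a recommendation list only requires sorting them. A minimal sketch, assuming the three scores printed above and a hypothetical helper named `top_n`:

```python
# Similarity scores taken from the printed result above
similarity_dict = {"movie_1": 1.0, "movie_2": 0.0, "movie_3": 0.5}


def top_n(scores, n):
    """Return the n (title, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


print(top_n(similarity_dict, 2))  # [('movie_1', 1.0), ('movie_3', 0.5)]
```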
Application Examples
Content-based collaborative filtering is applied in the following areas:
- Movie and Music Recommender Systems: Recommend new movies or music based on users’ viewing or listening history.
- E-commerce Websites: Recommend related products by analyzing users’ purchase or browsing history.
- News Article Recommendations: Recommend news articles of interest based on users’ past browsing history.
Advantages and Disadvantages
Advantages
- Addressing the Cold Start Problem: Recommendations are possible even for new items because feature vectors can be used.
- Recommendations Based on User Preferences: Recommendations can reflect individual user preferences.
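The cold start advantage can be illustrated with a toy comparison. In the sketch below, the rating matrix and genre features are made up: the last item has no ratings yet, so a rating-based item similarity carries no signal, while a feature-based similarity can still be computed.

```python
import numpy as np


def cosine(x, y):
    """Cosine similarity, returning 0 when either vector is all zeros."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0


# Hypothetical ratings (rows: users, columns: items); item 2 is new and unrated
ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
])
# Hypothetical genre features per item (action, comedy, drama)
features = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])

# Rating-based similarity between item 0 and the new item: no signal at all
rating_sim = cosine(ratings[:, 0], ratings[:, 2])
# Feature-based similarity is still available for the new item
content_sim = cosine(features[0], features[2])

print(rating_sim, round(content_sim, 3))  # 0.0 0.707
```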
Disadvantages
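- Risk of Overfitting: Because recommendations depend on the user’s past preferences, items unlike anything the user has already rated are rarely recommended, which limits novelty and diversity.
- Computational Cost: Building feature vectors and computing similarities grows with the number of users, items, and features, so it can be time-consuming for large catalogs.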
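One common way to keep the similarity step manageable is to compute all user-item scores with a single matrix product instead of a Python loop. The sketch below uses random vectors and made-up sizes purely for illustration; `l2_normalize` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: 4 users, 6 items, 3 features
user_vectors = rng.random((4, 3))
item_vectors = rng.random((6, 3))


def l2_normalize(m):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)


# One matrix product yields every user-item cosine similarity;
# the work is proportional to users * items * features
scores = l2_normalize(user_vectors) @ l2_normalize(item_vectors).T

# Indices of the 2 best items per user, found without fully sorting each row
top2 = np.argpartition(-scores, 2, axis=1)[:, :2]
print(scores.shape, top2.shape)  # (4, 6) (4, 2)
```

The two indices returned per user are the best two items, but they are not ordered among themselves; sort them afterwards if the order matters.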
Calculation of Concrete Examples
Here, I will implement a movie recommender system using the “movielens-100k” dataset.
Preparing the Dataset
First, load the “movielens-100k” dataset and prepare the feature vectors of movies and the preference vectors of users.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Load the dataset
movies_df = pd.read_csv(
    "./ml-100k/u.item",
    sep="|",
    encoding="latin-1",
    header=None,
    names=[
        "movie_id",
        "title",
        "release_date",
        "video_release_date",
        "IMDb_URL",
        "unknown",
        "Action",
        "Adventure",
        "Animation",
        "Children's",
        "Comedy",
        "Crime",
        "Documentary",
        "Drama",
        "Fantasy",
        "Film-Noir",
        "Horror",
        "Musical",
        "Mystery",
        "Romance",
        "Sci-Fi",
        "Thriller",
        "War",
        "Western",
    ],
)
ratings_df = pd.read_csv(
    "./ml-100k/u.data",
    sep="\t",
    encoding="latin-1",
    header=None,
    names=["user_id", "movie_id", "rating", "timestamp"],
)
# Concatenate genre information into a string
movie_genres = movies_df.iloc[:, 6:]
movie_genres_str = movie_genres.apply(lambda x: " ".join(movie_genres.columns[x == 1]), axis=1)
# Initialize TFIDF vectorizer
tfidf = TfidfVectorizer()
# Create TF-IDF vectors
try:
    tfidf_matrix = tfidf.fit_transform(movie_genres_str)
    print("TFIDF Matrix Shape:", tfidf_matrix.shape)
    # Display feature names
    feature_names = tfidf.get_feature_names_out()
    print("Feature Names:", feature_names)
except ValueError as e:
    print(e)
TFIDF Matrix Shape: (1682, 20)
Feature Names: ['action' 'adventure' 'animation' 'children' 'comedy' 'crime'
'documentary' 'drama' 'fantasy' 'fi' 'film' 'horror' 'musical' 'mystery'
'noir' 'romance' 'sci' 'thriller' 'war' 'western']
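# Collect the list of rated movie IDs for each user
user_preferences = ratings_df.groupby("user_id")["movie_id"].apply(list)
# Function to calculate cosine similarity
def calculate_similarity(user_pref, tfidf_matrix):
    # Average the TF-IDF rows selected by user_pref to form the preference vector
    user_vector = np.asarray(np.mean(tfidf_matrix[user_pref], axis=0))
    # Cosine similarity between the preference vector and every movie
    similarities = cosine_similarity(user_vector, tfidf_matrix)
    return similarities
# Calculate similarity between user_1's preference vector and movies
# Note: movie_id starts at 1 while matrix rows start at 0, so indexing rows by
# movie_id as done here selects the row of the following movie
user_1_pref = user_preferences[1]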
print("user_1 preferences length:", len(user_1_pref))
# Calculate similarities
similarities = calculate_similarity(user_1_pref, tfidf_matrix)
# Display movies with high similarity
similar_movies = np.argsort(similarities[0])[::-1][:10]
recommended_movies = movies_df.iloc[similar_movies]
print(recommended_movies[["movie_id", "title"]])
user_1 preferences length: 272
movie_id title
3 4 Get Shorty (1995)
73 74 Faster Pussycat! Kill! Kill! (1965)
1236 1237 Twisted (1996)
521 522 Down by Law (1986)
1456 1457 Love Is All There Is (1996)
1011 1012 Private Parts (1997)
92 93 Welcome to the Dollhouse (1995)
1459 1460 Sleepover (1995)
1271 1272 Talking About Sex (1994)
346 347 Wag the Dog (1997)
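The implementation above averages the selected rows without using the rating values, indexes rows directly by movie_id, and does not exclude movies the user has already rated. The following sketch shows one way these points could be handled on a small made-up example. The feature matrix, the ratings, and the function `recommend` are hypothetical, and this is not the code that produced the output above.

```python
import numpy as np

# Hypothetical item feature matrix: row k belongs to movie_id k + 1
item_features = np.array([
    [1.0, 0.0, 0.0],  # movie_id 1
    [0.0, 1.0, 0.0],  # movie_id 2
    [1.0, 0.0, 1.0],  # movie_id 3
    [0.0, 1.0, 1.0],  # movie_id 4
])
# Hypothetical ratings of one user: movie_id -> rating (1 to 5)
user_ratings = {1: 5, 2: 1}


def recommend(user_ratings, item_features, n):
    """Rank unseen movies by cosine similarity to a rating-weighted profile."""
    rows = np.array(list(user_ratings)) - 1  # 1-based ids -> 0-based rows
    weights = np.array(list(user_ratings.values()), dtype=float)
    profile = weights @ item_features[rows] / weights.sum()

    norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
    scores = item_features @ profile / np.where(norms == 0, 1.0, norms)

    scores[rows] = -np.inf  # exclude movies the user has already rated
    best_rows = np.argsort(scores)[::-1][:n]
    return [int(r) + 1 for r in best_rows]  # convert rows back to movie_id


print(recommend(user_ratings, item_features, 2))  # [3, 4]
```

Because the user rated the first movie highly and the second one poorly, the profile leans toward the first feature, so movie_id 3 ranks above movie_id 4.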
Interpretation of Results
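The output lists the ten movies whose genre vectors have the highest cosine similarity to user 1’s preference vector, which is built from the 272 movies that user rated. These are the movies that would be recommended to user 1. Note that this simple implementation ignores the rating values and does not filter out movies the user has already rated, so the list can contain titles the user has already seen.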
Conclusion
In this article, I detailed content-based collaborative filtering. I showed concrete definitions, formulas, and examples using Python code, and discussed the advantages and disadvantages.
This method is applied in various fields such as movie and music recommendations, e-commerce, and news article recommendations.
Many other aspects, such as evaluation methods and hyperparameter tuning, remain to be considered, but I will stop here since this is a personal memo.