RFM Analysis

RFM analysis is a tool for analyzing customer purchasing behavior and formulating marketing strategies based on three indicators: “Recency,” “Frequency,” and “Monetary.”

Recency represents the time since the customer’s last purchase, Frequency is the number of purchases within a certain period, and Monetary is the total amount spent in the past. By analyzing these factors, customers can be segmented, and customized marketing strategies can be developed for each segment.

This analysis is particularly effective for retail and e-commerce sites, leading to improved customer satisfaction, increased sales, and enhanced customer loyalty.

In this article, we will create a dummy database of user purchase history using Poisson and Gamma distributions. For why these distributions are well suited to modeling purchase behavior, see “The Power of Mathematical Marketing Proven at USJ.” We will also use the Faker library to generate dummy user data.

We will then perform an RFM analysis on this dummy data.

GitHub

  • The Jupyter notebook file is available here

Google Colaboratory

  • To run it on Google Colaboratory, click here

Execution Environment

The OS used is macOS. Please note that the options may differ from those in Linux or Unix commands.

!sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90
!python -V
Python 3.9.17

Import basic libraries and use watermark to check their versions. Also, set the seed for random numbers.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import pandas as pd
import scipy

import calendar
import random

import matplotlib.pyplot as plt

import faker
from faker import Faker

seed = 123
random_state = 123

random.seed(seed)
np.random.seed(seed)

from watermark import watermark

print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.17.2

faker     : 23.2.0
scipy     : 1.11.2
matplotlib: 3.8.1
numpy     : 1.25.2
pandas    : 2.0.3

Watermark: 2.4.3

Creating Dummy Purchase History

We will use SQLite, a lightweight file-based database, together with Python’s sqlite3 module to execute SQL commands and build a virtual database.

The dummy data will be created based on the following ER diagram. We will create user, item, and purchase history tables. Refer to database design books for more details on ER diagrams.

Python’s sqlite3 module is a built-in library for manipulating file-based SQLite databases. It allows easy creation, insertion, updating, deletion, and querying of data.
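
As a quick, self-contained illustration of that workflow (separate from the database built later in this article), here is a minimal sketch using an in-memory SQLite database:

import sqlite3

# Create an in-memory database, one table, insert a row, and read it back.
with sqlite3.connect(":memory:") as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE Demo (ID INTEGER PRIMARY KEY, Name TEXT);")
    cur.execute("INSERT INTO Demo (Name) VALUES (?);", ("sample",))
    conn.commit()
    print(cur.execute("SELECT * FROM Demo;").fetchall())  # [(1, 'sample')]
    cur.close()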

SQLite is file-based and serverless, making it easy to manage and back up as it stores data in a single file. It supports ACID transactions and many SQL standards.

On the other hand, SQLite is not suited to large-scale workloads with heavy concurrent access or very high read/write volume, and complex queries can become a performance bottleneck. It is best suited for small to medium-sized databases and local environments.

In this article, we will create a virtual database using Python’s sqlite3 module.

Creating the Database and Tables

First, we create an SQLite file named test_data.db. Since SQLite is file-based, the database is stored as a single file in the current directory; we use Python’s os module to remove any existing file before creating a new one.

We will create three tables in this database.

  1. Users table to store user information.
  2. Items table to store item information.
  3. PurchaseHistory table to store purchase history.

The Users table has a UserID as the primary key, and stores the user’s name, birthday, and occupation.

The Items table has an ItemID as the primary key, and stores the item’s name and price.

The PurchaseHistory table stores the IDs of the user and item purchased, the quantity purchased, and the purchase date.

import os
import sqlite3

db_file_name = "test_data.db"

# Remove old file if it exists
if os.path.isfile(db_file_name):
    os.remove(db_file_name)

with sqlite3.connect(db_file_name) as conn:
    cur = conn.cursor()

    # Create Users table
    cur.execute(
        """
    CREATE TABLE Users (
        UserID INTEGER PRIMARY KEY AUTOINCREMENT,
        Name TEXT NOT NULL,
        Birthday TEXT,
        Job TEXT
    );
    """
    )

    # Create Items table
    cur.execute(
        """
    CREATE TABLE Items (
        ItemID INTEGER PRIMARY KEY AUTOINCREMENT,
        Name TEXT,
        Price INTEGER
    );
    """
    )

    # Create PurchaseHistory table
    cur.execute(
        """
    CREATE TABLE PurchaseHistory (
        PurchaseID INTEGER PRIMARY KEY AUTOINCREMENT,
        UserID INTEGER NOT NULL,
        ItemID INTEGER NOT NULL,
        PurchaseDate TEXT NOT NULL,
        Quantity INTEGER NOT NULL,
        FOREIGN KEY (UserID) REFERENCES Users(UserID),
        FOREIGN KEY (ItemID) REFERENCES Items(ItemID)
    );
    """
    )

    cur.close()
    conn.commit()

Verify that the tables have been created.

with sqlite3.connect(db_file_name) as conn:
    cur = conn.cursor()

    # Get list of tables
    cur.execute(
        """
        SELECT name FROM sqlite_master WHERE type='table';
    """
    )

    table_list = cur.fetchall()

    cur.close()

for i, table in enumerate(table_list):
    print(f"{i + 1}. {table[0]}")
1. Users
2. sqlite_sequence
3. Items
4. PurchaseHistory

The sqlite_master table in SQLite is a system table that stores metadata about all the tables, indexes, views, and triggers in the database.

This table has the following columns:

  • type: The type of object, such as table, index, view, or trigger.
  • name: The name of the object (table, index, view, trigger).
  • tbl_name: The name of the table associated with the object.
  • sql: The SQL statement used to create the object.

By querying the sqlite_master table, you can easily obtain information about the database structure and specific objects.

with sqlite3.connect(db_file_name) as conn:
    sqlite_master_df = pd.read_sql_query("SELECT * FROM sqlite_master", conn)

sqlite_master_df.head()
    type             name         tbl_name  rootpage                                                sql
0  table            Users            Users         2  CREATE TABLE Users (\n    UserID INTEGER P...
1  table  sqlite_sequence  sqlite_sequence         3             CREATE TABLE sqlite_sequence(name,seq)
2  table            Items            Items         4  CREATE TABLE Items (\n    ItemID INTEGER P...
3  table  PurchaseHistory  PurchaseHistory         5  CREATE TABLE PurchaseHistory (\n    Purcha...

This table is read-only and cannot be edited directly. When the database schema is changed, the sqlite_master table is automatically updated.

Creating the Users and Items Tables

To create dummy user data, we will use the Faker library.

Faker is a useful tool for generating dummy data for software testing, machine learning training data, and data anonymization.

It can generate various types of dummy data, such as names, addresses, phone numbers, email addresses, and passwords, which can be used to streamline software testing, improve machine learning model accuracy, and enhance personal information protection.
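
As a small example of the kind of values Faker produces for the fields used below (the exact output depends on the locale, seed, and Faker version):

from faker import Faker

Faker.seed(123)
fake = Faker("ja_JP")

print(fake.name())           # a Japanese full name
print(fake.date_of_birth())  # a datetime.date object
print(fake.job())            # an occupation string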

Here, we will use Faker to randomly generate user names, birthdays, and jobs. We will also create items such as toothpaste and toothbrushes.

We will set the number of users to 100 and the number of items to 3.

Faker.seed(seed)
fake = Faker("ja_JP")

user_num = 100
item_num = 3

with sqlite3.connect(db_file_name) as conn:
    cur = conn.cursor()

    # Create user table
    for _ in range(user_num):
        name = fake.name()
        birthday = fake.date_of_birth()
        job = fake.job()

        cur.execute(
            f"INSERT INTO Users (Name, Birthday, Job) VALUES (?, ?, ?);",
            (name, birthday, job),
        )

    # Create item table
    cur.execute(f"INSERT INTO Items (Name, Price) VALUES (?, ?);", ("歯ブラシ", 100))
    cur.execute(f"INSERT INTO Items (Name, Price) VALUES (?, ?);", ("歯磨き粉", 200))
    cur.execute(f"INSERT INTO Items (Name, Price) VALUES (?, ?);", ("柔軟剤", 500))

    cur.close()
    conn.commit()

Creating Purchase History

Next, we will create the purchase history for each user.

Poisson Distribution

The Poisson distribution is used to model the number of discrete events that occur in a fixed period of time with an average rate $\lambda$.

Here, we will model the average number of purchases a user makes in a given period as $\lambda$.

The Poisson distribution is used to model various phenomena.

For example, it can model the number of traffic accidents in a certain area within a certain period, the number of defective products produced on a factory line within a certain period, and the number of calls received at a call center within a certain period.

The Poisson distribution is defined as follows:

$$ P(X=k)= \frac{e^{-\lambda} \lambda^k}{k !} $$

The expected value $E[X]$ of the Poisson distribution is the weighted sum of the probability mass function over all possible values of $k$.

$$ E[X] =\sum_{k=0}^{\infty} k \cdot P(X=k)=\sum_{k=0}^{\infty} k \cdot \frac{e^{-\lambda} \lambda^k}{k !} =\sum_{k=1}^{\infty} \frac{e^{-\lambda} \lambda^k}{(k-1) !} =\lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1) !} =\lambda e^{-\lambda} \cdot e^{\lambda} = \lambda $$

Thus, the mean (expected value) of the Poisson distribution is equal to $\lambda$.
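
As a quick numerical sanity check (not part of the original analysis), we can confirm with scipy that the sample mean of Poisson draws is close to $\lambda$:

from scipy.stats import poisson

# Draw many samples with lambda = 3 and compare the sample mean to lambda.
_lambda = 3.0
samples = poisson.rvs(_lambda, size=100_000, random_state=123)
print(samples.mean())                       # approximately 3.0
print(poisson.stats(_lambda, moments="m"))  # exactly 3.0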

Using the scipy library, we can visualize how the Poisson distribution changes with different values of $\lambda$.

from scipy.stats import poisson

for _lambda in [0.5, 1, 3]:
    x = np.arange(10)
    mean, var, skew, kurt = poisson.stats(_lambda, moments="mvsk")
    plt.plot(x, poisson.pmf(x, _lambda), marker="o", label=f"$\\lambda$={_lambda}, mean={mean}")

plt.xlabel("$k$")
plt.ylabel("$\\mathrm{Poisson}(k|\\lambda)$")
plt.grid()
plt.legend()
plt.show()

As $\lambda$ increases, the peak of the distribution shifts to the right.

Gamma Distribution

Next, we assume a probability distribution for the average number of purchases $\lambda$ that a user makes in a given period.

That is, we assume that there are many users who purchase a certain item, and User A has an average purchase frequency of $\lambda_1$, while User B has an average purchase frequency of $\lambda_2$. In this article, we assume that the distribution of $\lambda$ follows a gamma distribution.

The reason for assuming that the distribution of $\lambda$ follows a gamma distribution is detailed in the reference “The Power of Mathematical Marketing Proven at USJ.”

The gamma distribution is a continuous probability distribution defined for $x \geqq 0$. It has two parameters $\alpha$ and $\beta$, and their product $\alpha \times \beta$ is the mean. It is defined as follows:

$$ \operatorname{Gamma}(x \mid \alpha, \beta)=\frac{x^{\alpha-1} e^{-\frac{x}{\beta}}}{\Gamma(\alpha) \beta^\alpha} $$

Here, $\Gamma(\alpha)$ is the gamma function, defined as:

$$ \Gamma(\alpha)=\int_0^{\infty} x^{\alpha-1} e^{-x} d x, \quad \alpha>0 $$
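
As a quick check (not part of the original analysis), scipy's gamma distribution with shape $\alpha$ and scale $\beta$ indeed has mean $\alpha \beta$ and variance $\alpha \beta^2$:

from scipy.stats import gamma

# alpha = 3, beta = 0.5 gives mean 1.5 and variance 0.75.
alpha, beta = 3.0, 0.5
mean, var = gamma.stats(alpha, scale=beta, moments="mv")
print(mean)  # 1.5 == alpha * beta
print(var)   # 0.75 == alpha * beta**2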

Using scipy, we will show a graph of the gamma distribution with different parameters while keeping the mean constant at 2.

import numpy as np
import matplotlib.pyplot as plt

from scipy.stats import gamma

for a, b in zip([1, 3, 15], [2, 2 / 3, 2 / 15]):
    x = np.linspace(gamma.ppf(0.0001, a, scale=b), gamma.ppf(0.9999, a, scale=b), 100)
    mean, var, skew, kurt = gamma.stats(a, scale=b, moments="mvsk")
    plt.plot(x, gamma.pdf(x, a, scale=b), label=f"$\\alpha$={a}, $\\beta$={b:.2f}, mean={mean}")

plt.xlim([0, 7])
plt.xlabel("$k$")
plt.ylabel("$\\mathrm{Gamma}(k|\\alpha, \\beta)$")
plt.grid()
plt.legend()
plt.show()

Although the mean is the same, the shape varies significantly depending on the values of $\alpha$ and $\beta$.

Creating Purchase History

We will set the current date and create purchase history for the past 60 months. If the quantity of purchased items is 0, we do not insert it into the database.

For each item, we set the parameters of the gamma distribution. From this gamma distribution, we generate $\lambda$, the average number of purchases for each user. Using this $\lambda$, we generate the number of purchases per month from a Poisson distribution. If this number is 0, it means that the user did not purchase anything.
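
The sketch below illustrates this two-stage sampling for a single hypothetical item (with $\alpha = 3$, $\beta = 0.5$) and a handful of users, separately from the full simulation that follows:

from scipy.stats import gamma, poisson

# Hypothetical item-level gamma parameters and a small number of users/months.
alpha, beta = 3, 0.5
n_users, n_months = 5, 12

# Each user's average purchase rate lambda is drawn from the gamma distribution,
# then that user's monthly purchase counts are drawn from a Poisson with that lambda.
lambdas = gamma.rvs(alpha, scale=beta, size=n_users, random_state=123)
for user_id, lam in enumerate(lambdas, start=1):
    monthly_counts = poisson.rvs(lam, size=n_months, random_state=123)
    print(user_id, round(lam, 2), monthly_counts)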

The following code simulates the number of purchases per month; the specific purchase date within each month is chosen at random.

import numpy as np
import matplotlib.pyplot as plt

from scipy.stats import gamma
from scipy.stats import poisson

import datetime
from dateutil.relativedelta import relativedelta

# Number of months of purchase history to generate
purchase_history_size = 60

# Set the current date
current_date = datetime.date(2024, 7, 1)

# List of dates for the past 60 months
datetime_list = [current_date - relativedelta(months=i + 1) for i in range(purchase_history_size)]

# Parameters of the gamma distribution for three items
# - Toothbrush
#   - alpha: 1
#   - beta : 1
# - Toothpaste
#   - alpha: 3
#   - beta : 1.5 / 3
# - Fabric softener
#   - alpha: 5
#   - beta : 0.5 / 5

gamma_a_list = [1, 3, 5]
gamma_b_list = [1, 1.5 / 3, 0.5 / 5]

with sqlite3.connect(db_file_name) as conn:
    cur = conn.cursor()

    # Create purchase history for each item
    for a, b, _item_id in zip(gamma_a_list, gamma_b_list, range(1, item_num + 1)):

        # Calculate the number of purchases for each user
        # Generate random numbers from the gamma distribution for each user's lambda
        lambda_list = gamma.rvs(a, scale=b, size=user_num, random_state=random_state)

        # Generate random numbers from the Poisson distribution for each user's lambda
        for _lambda, _user_id in zip(lambda_list, range(1, user_num + 1)):

            # Generate random numbers from the Poisson distribution
            history_list = poisson.rvs(_lambda, size=purchase_history_size, random_state=random_state)

            for _created_at, _history in zip(datetime_list[::-1], history_list[::-1]):
                # If the number of purchased items is 1 or more, insert it into the database
                if _history > 0:
                    _year = pd.to_datetime(_created_at).year
                    _month = pd.to_datetime(_created_at).month
                    _random_day = random.choice([i + 1 for i in range(calendar.monthrange(_year, _month)[1])])

                    # Set the date randomly
                    _random_created_at = f"{_year}-{_month}-{_random_day}"

                    # Insert the purchase history into the database
                    cur.execute(
                        f"INSERT INTO PurchaseHistory (UserID, ItemID, Quantity, PurchaseDate) VALUES (?, ?, ?, ?);",
                        (_user_id, _item_id, int(_history), _random_created_at),
                    )

    cur.close()
    conn.commit()

Check the contents of the user table.

with sqlite3.connect(db_file_name) as conn:
    users_df = pd.read_sql_query("SELECT * FROM Users", conn).set_index("UserID", drop=True)

users_df.head()
           Name    Birthday          Job
UserID
1       渡辺 康弘  1955-05-14  イラストレーター
2       小林 英樹  1946-08-29     YouTuber
3       高橋 太一  1973-03-13     大学教授
4       高橋 涼平  1958-09-25       医師
5       鈴木 花子  1993-01-15  イラストレーター

Check the contents of the items table.

with sqlite3.connect(db_file_name) as conn:
    items_df = pd.read_sql_query("SELECT * FROM Items", conn).set_index("ItemID", drop=True)

items_df.head()
            Name  Price
ItemID
1       歯ブラシ    100
2       歯磨き粉    200
3       柔軟剤      500

Check the contents of the purchase history table.

Each record shows which user purchased which item, in what quantity, and on what date. We also add a diff_days column giving the number of days between each purchase date and the current date.

with sqlite3.connect(db_file_name) as conn:
    purchase_history_df = pd.read_sql_query("SELECT * FROM PurchaseHistory", conn).set_index("PurchaseID", drop=True)

purchase_history_df.PurchaseDate = pd.to_datetime(purchase_history_df.PurchaseDate)
purchase_history_df["diff_days"] = purchase_history_df.PurchaseDate.apply(
    lambda x: (pd.to_datetime(current_date) - x).days
)
purchase_history_df.head()
            UserID  ItemID PurchaseDate  Quantity  diff_days
PurchaseID
1                1       1   2019-07-02         3       1826
2                1       1   2019-08-09         1       1788
3                1       1   2019-09-03         2       1763
4                1       1   2019-10-25         3       1711
5                1       1   2019-12-14         2       1661

Using groupby, check the total number of each item purchased by each user.

Display the mean, standard deviation, and count of purchases.

purchase_history_df.groupby(["UserID", "ItemID"])["Quantity"].agg(["mean", "std", "count"]).head(15)
                   mean       std  count
UserID ItemID
1      1       1.583333  0.767237     48
       2       1.250000  0.440959     28
       3       1.125000  0.353553      8
2      1       1.066667  0.258199     15
       2       2.491228  1.297288     57
       3       1.310345  0.470824     29
3      1       1.125000  0.353553      8
       2       1.406250  0.559918     32
       3       1.066667  0.258199     15
4      1       1.344828  0.483725     29
       2       3.372881  1.400718     59
       3       1.371429  0.546955     35
5      1       1.653061  0.751416     49
       2       1.166667  0.408248      6
       3       1.000000  0.000000      3

RFM Analysis

Based on the created dummy database, we will perform a concrete RFM analysis.

  1. Recency: The number of days since the customer’s last purchase. Customers who have purchased recently are considered more likely to make repeat purchases.
  2. Frequency: The number of purchases a customer made within a certain period. Customers who make frequent purchases are considered loyal and more valuable.
  3. Monetary: The total amount spent by a customer within a certain period. Customers who spend more money are considered more valuable.

Using these indicators, we can classify customers into different groups and develop tailored marketing strategies for each group. For example, customers who have not purchased recently but made frequent purchases in the past can be given special offers (such as discount coupons) to encourage them to make repeat purchases.
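
One common way to form such groups (not done in this article) is to bin each metric into quantile-based scores with pandas.qcut. The sketch below assumes an RFM table with recency, frequency, and monetary columns, like the one built later in this article:

import pandas as pd

def add_rfm_scores(rfm_df: pd.DataFrame, n_bins: int = 4) -> pd.DataFrame:
    """Attach quantile-based R/F/M scores (1..n_bins) to an RFM table."""
    scored = rfm_df.copy()
    # rank(method="first") keeps qcut from failing on heavily tied values.
    # Lower recency (a more recent purchase) gets a higher score.
    scored["R_score"] = pd.qcut(
        scored["recency"].rank(method="first"), n_bins, labels=list(range(n_bins, 0, -1))
    )
    # Higher frequency and monetary get higher scores.
    scored["F_score"] = pd.qcut(
        scored["frequency"].rank(method="first"), n_bins, labels=list(range(1, n_bins + 1))
    )
    scored["M_score"] = pd.qcut(
        scored["monetary"].rank(method="first"), n_bins, labels=list(range(1, n_bins + 1))
    )
    return scored

Concatenating or summing the three scores then yields compact segment labels, for example treating users with the highest R, F, and M scores as the most valuable segment.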

RFM analysis is a highly effective method for gaining a deeper understanding of customer behavior, especially in business models where repeat purchases are important.

By conducting this analysis, we can increase the ROI of marketing and improve customer satisfaction.

Using pandas, we will calculate Recency, Frequency, and Monetary for each user and each item.

We will create DataFrames for each of the three categories.

Recency

Recency: The number of days since the customer’s last purchase. Customers who have purchased recently are considered more likely to make repeat purchases.
recnecy_df = (
    purchase_history_df.sort_values(by=["UserID", "ItemID", "PurchaseDate"])
    .groupby(["UserID", "ItemID"])
    .tail(1)
    .reset_index(drop=True)[["UserID", "ItemID", "diff_days"]]
)

recnecy_df = recnecy_df.rename(columns={"diff_days": "recency"})
recnecy_df.head()
   UserID  ItemID  recency
0       1       1        1
1       1       2       29
2       1       3      210
3       2       1      126
4       2       2       26

Frequency

Frequency: The number of purchases a customer made within a certain period. Customers who make frequent purchases are considered loyal and more valuable.
frequency_df = (
    purchase_history_df[purchase_history_df.diff_days < 100]
    .groupby(["UserID", "ItemID"])
    .Quantity.agg(["count"])
    .reset_index()
)

frequency_df = frequency_df.rename(columns={"count": "frequency"})
frequency_df.head()
   UserID  ItemID  frequency
0       1       1          2
1       1       2          2
2       2       2          3
3       2       3          2
4       3       2          2

Monetary

Monetary: The total amount spent by a customer within a certain period. Customers who spend more money are considered more valuable.
monetary_df = pd.merge(purchase_history_df, items_df, on="ItemID", how="left")
monetary_df["total"] = monetary_df.apply(lambda x: x.Quantity * x.Price, axis=1)
monetary_df = monetary_df[monetary_df.diff_days < 100].groupby(["UserID", "ItemID"]).total.agg(["sum"]).reset_index()

monetary_df = monetary_df.rename(columns={"sum": "monetary"})
monetary_df.head()

   UserID  ItemID  monetary
0       1       1       300
1       1       2       400
2       2       2      2000
3       2       3      1000
4       3       2       400

RFM Analysis

We will merge the three tables created above and visualize the RFM analysis results using graphs.

_temp = pd.merge(recnecy_df, frequency_df, on=["UserID", "ItemID"], how="left").fillna(0)
rfm_df = pd.merge(_temp, monetary_df, on=["UserID", "ItemID"], how="left").fillna(0)
rfm_df.head()
   UserID  ItemID  recency  frequency  monetary
0       1       1        1        2.0     300.0
1       1       2       29        2.0     400.0
2       1       3      210        0.0       0.0
3       2       1      126        0.0       0.0
4       2       2       26        3.0    2000.0

Frequency Distribution

First, we will visualize the frequency distribution for the three analysis targets.

target_ItemID = 2

rfm_df[rfm_df.ItemID == target_ItemID].recency.hist()
plt.xlabel(f"Recency (ItemID = {target_ItemID})")
plt.show()

rfm_df[rfm_df.ItemID == target_ItemID].frequency.hist()
plt.xlabel(f"Frequency (ItemID = {target_ItemID})")
plt.show()

rfm_df[rfm_df.ItemID == target_ItemID].monetary.hist()
plt.xlabel(f"Monetary (ItemID = {target_ItemID})")
plt.show()

Frequency VS Recency

As an example, we will plot a 2D scatter plot of Frequency vs. Recency for each item.

In the following scatter plots, for example, marketing measures can be considered for users in the bottom right who have not purchased for a long time and have a low frequency of purchases.
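
Such a segment can be pulled out of rfm_df with a simple boolean filter; the thresholds below are hypothetical and would be tuned to the actual business:

# Hypothetical thresholds: users who have not bought ItemID 1 for over 180 days
# and bought it at most once in the recent window.
dormant_low_freq = rfm_df[
    (rfm_df.ItemID == 1) & (rfm_df.recency > 180) & (rfm_df.frequency <= 1)
]
print(dormant_low_freq[["UserID", "recency", "frequency", "monetary"]].head())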

plt.scatter(x=rfm_df[rfm_df.ItemID == 1].recency, y=rfm_df[rfm_df.ItemID == 1].frequency)
plt.xlabel(f"Recency (ItemID = 1)")
plt.ylabel(f"Frequency")
plt.grid()
plt.show()

plt.scatter(x=rfm_df[rfm_df.ItemID == 2].recency, y=rfm_df[rfm_df.ItemID == 2].frequency)
plt.xlabel(f"Recency (ItemID = 2)")
plt.ylabel(f"Frequency")
plt.grid()
plt.show()

plt.scatter(x=rfm_df[rfm_df.ItemID == 3].recency, y=rfm_df[rfm_df.ItemID == 3].frequency)
plt.xlabel(f"Recency (ItemID = 3)")
plt.ylabel(f"Frequency")
plt.grid()
plt.show()

3D Visualization

Finally, we will plot the RFM analysis results for each item on a 3D graph.

fig = plt.figure(figsize=(12, 8))

###################################################
ax1 = fig.add_subplot(131, projection="3d")

target_ItemID = 1

_rfm_df = rfm_df[rfm_df.ItemID == target_ItemID]
x = _rfm_df.recency
y = _rfm_df.frequency
z = _rfm_df.monetary

ax1.scatter(x, y, z, s=20, color="blue", label="RFM analysis")

ax1.set_title(f"RFM (ItemID = {target_ItemID})")
ax1.set_xlabel("Recency")
ax1.set_ylabel("Frequency")
ax1.set_zlabel("Monetary")

###################################################
ax2 = fig.add_subplot(132, projection="3d")

target_ItemID = 2

_rfm_df = rfm_df[rfm_df.ItemID == target_ItemID]
x = _rfm_df.recency
y = _rfm_df.frequency
z = _rfm_df.monetary

ax2.scatter(x, y, z, s=20, color="blue", label="RFM analysis")

ax2.set_title(f"RFM (ItemID = {target_ItemID})")
ax2.set_xlabel("Recency")
ax2.set_ylabel("Frequency")
ax2.set_zlabel("Monetary")

###################################################
ax3 = fig.add_subplot(133, projection="3d")

target_ItemID = 3

_rfm_df = rfm_df[rfm_df.ItemID == target_ItemID]
x = _rfm_df.recency
y = _rfm_df.frequency
z = _rfm_df.monetary

ax3.scatter(x, y, z, s=20, color="blue", label="RFM analysis")

ax3.set_title(f"RFM (ItemID = {target_ItemID})")

ax3.set_xlabel("Recency")
ax3.set_ylabel("Frequency")
ax3.set_zlabel("Monetary")

plt.show()

From the graphs, we can see that there is a group of users who have not purchased for a long time, purchase infrequently, and have spent little money.

First, it is worth considering offering discounts or coupons to encourage repeat purchases for these users.

Additionally, for ItemID 2, there is a cluster of users with high frequency but low spending. Special discounts or exclusive offers can be considered for frequent buyers.

Conclusion

Using Poisson and Gamma distributions, we created a dummy database and performed an RFM analysis based on that data.

We focused on Recency, Frequency, and Monetary, visualizing where each user stands on the graph.

Based on each user’s position on the graph, we can devise marketing strategies for them.

References

  1. The Power of Mathematical Marketing Proven at USJ