Implementation of Apriori Algorithm and FP-Growth Algorithm Using mlxtend (Association Analysis)
Association analysis is a data mining technique used to understand customer purchase patterns and formulate marketing plans, especially in the retail sector.
The goal of association analysis is to identify correlations and patterns between items.
For example, the famous story of “diapers and beer” demonstrates the importance of association analysis, revealing the tendency for fathers to purchase diapers and beer simultaneously. (Although this story may not be true, it is a well-known anecdote illustrating the significance of association analysis.)
This analysis can help with product placement in stores and promotional strategies.
The process of association analysis is as follows:
- Analyze consumer behavior to identify frequently purchased product combinations.
- Create and evaluate association rules.
To identify the frequent occurrence of these item sets, algorithms such as:
- Apriori
- FP-Growth
are used.
These algorithms analyze the relationships between products and identify sets of products frequently purchased together.
The effectiveness of association rules is evaluated using three metrics: support, confidence, and lift.
Support
Support indicates how frequently a particular item set appears in all transactions.
Support is calculated as follows:
$$ \begin{equation} \operatorname{Supp}(X) = \frac{\text{Number of transactions where X appears}}{\text{Total number of transactions}} = P(X) \end{equation} $$
It can also be calculated for combinations of multiple products and is defined as:
$$ \begin{equation} \operatorname{Supp}(X, Y) = P(X \cap Y) \end{equation} $$
Confidence
Confidence is a measure of the strength of a particular association rule.
It quantitatively defines the probability of another item set being purchased when one item set is purchased.
This is interpreted as the conditional probability between items. The confidence $(X \Rightarrow Y)$ that item $Y$ is purchased when item $X$ is purchased is calculated as follows:
$$ \begin{equation} \operatorname{Conf}(X \Rightarrow Y) = \frac{\operatorname{Supp}(X, Y)}{\operatorname{Supp}(X)} = \frac{P(X \cap Y)}{P(X)} = P(Y | X) \end{equation} $$
Lift
Lift indicates how effective a particular association rule is.
Lift measures the strength of the relationship between item sets and helps determine whether the rule is meaningful or merely coincidental. Lift is calculated as follows:
$$ \begin{equation} \operatorname{Lift}(X \Rightarrow Y) = \frac{\operatorname{Conf}(X \Rightarrow Y)}{\operatorname{Supp}(Y)} = \frac{P(X \cap Y)}{P(X) P(Y)} \end{equation} $$
A lift value greater than 1 indicates a positive correlation between the item sets, a lift value of 1 indicates independence, and a lift value less than 1 indicates a negative correlation.
Association analysis is also applied in recommendation systems, utilized for customized product recommendations, related product suggestions, bundled sales, inventory management, and marketing plan optimization.
Source Code and Execution Environment
GitHub
- The Jupyter notebook file can be found here
Google Colaboratory
- To run on Google Colaboratory, click here
Execution Environment
The OS is macOS. Note that the options differ from Linux or Unix commands.
!sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
!python -V
Python 3.9.17
We import the basic libraries and use watermark to check their versions. We also set the seed for random numbers.
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import random
import scipy
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
seed = 123
random_state = 123
random.seed(seed)
np.random.seed(seed)
from watermark import watermark
print(watermark(python=True, watermark=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.17.2
numpy : 1.25.2
matplotlib: 3.8.1
scipy : 1.11.2
Watermark: 2.4.3
Loading Data
We use a dataset published on Kaggle as sample data. The dataset can be downloaded from the following URL:
Save it in the “./data/” directory.
First, we read it into a DataFrame using pandas.
import pandas as pd
# Specify the file
# Save it in the ./data/ directory.
file_name = "./data/GroceryProductsPurchase.csv"
# Read it into a DataFrame using pandas
df = pd.read_csv(file_name)
df.head()
Product 1 | Product 2 | Product 3 | Product 4 | Product 5 | Product 6 | Product 7 | Product 8 | Product 9 | Product 10 | ... | Product 23 | Product 24 | Product 25 | Product 26 | Product 27 | Product 28 | Product 29 | Product 30 | Product 31 | Product 32 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | citrus fruit | semi-finished bread | margarine | ready soups | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | tropical fruit | yogurt | coffee | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2 | whole milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | pip fruit | yogurt | cream cheese | meat spreads | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | other vegetables | whole milk | condensed milk | long life bakery product | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 32 columns
The column names are “Product1,” “Product2,” and so on, with the purchased product names listed in each row.
For example, the first row indicates that a user purchased four products simultaneously: “citrus fruit,” “semi-finished bread,” “margarine,” and “ready soups.” Since the user purchased only four products, columns “Product5” and beyond show NaN.
Let’s modify this into a commonly used transaction data format.
In other words, a two-dimensional array (list) where each sublist contains the purchased product names.
# Remove NaN from each row and convert it to a list
df["transaction_list"] = df.apply(lambda x: x.dropna().to_list(), axis=1)
transactions = df.transaction_list.to_list()
# Display the first 5 rows
transactions[:5]
[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
['tropical fruit', 'yogurt', 'coffee'],
['whole milk'],
['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'],
['other vegetables',
'whole milk',
'condensed milk',
'long life bakery product']]
Convert this transaction data into a format that mlxtend can read (one-hot encoding).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# One-hot encoding
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()
Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | baby food | bags | baking powder | bathroom cleaner | beef | ... | turkey | vinegar | waffles | whipped/sour cream | whisky | white bread | white wine | whole milk | yogurt | zwieback | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
5 rows × 169 columns
Let’s run the Apriori algorithm using this DataFrame in mlxtend.
Apriori Algorithm
The Apriori algorithm is the most famous and fundamental algorithm in association analysis, identifying frequently occurring item sets in large datasets.
The Apriori algorithm first generates all single item sets, calculates their support, removes item sets below the minimum support threshold, and then combines the remaining item sets to form larger item sets. By repeating this process, it efficiently extracts frequently purchased item sets while limiting the search space.
The Apriori algorithm is fundamental in association analysis and is useful for extracting valuable information from large datasets, but it generally has a high computational cost.
We will implement the Apriori algorithm using mlxtend. Set the minimum support to 0.02.
# Import the apriori module
from mlxtend.frequent_patterns import apriori
# Extract frequently co-occurring item sets
frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)
frequent_itemsets.head()
support | itemsets | |
---|---|---|
0 | 0.033452 | (UHT-milk) |
1 | 0.052466 | (beef) |
2 | 0.033249 | (berries) |
3 | 0.026029 | (beverages) |
4 | 0.080529 | (bottled beer) |
From the item sets extracted by the Apriori algorithm, filter the combinations of item sets that satisfy the association rules.
Here, we extract item sets with a confidence of 0.02 or higher.
from mlxtend.frequent_patterns import association_rules
rules_df = association_rules(
frequent_itemsets,
metric="confidence",
min_threshold=0.02,
)
# Set the column names to be used
columns = [
"antecedents",
"consequents",
"support",
"confidence",
"lift",
]
# Sort by support, confidence, and lift in descending order
rules_df.sort_values(by=["support", "confidence", "lift"], ascending=False)[columns].reset_index(drop=True).round(
4
).head()
antecedents | consequents | support | confidence | lift | |
---|---|---|---|---|---|
0 | (other vegetables) | (whole milk) | 0.0748 | 0.3868 | 1.5136 |
1 | (whole milk) | (other vegetables) | 0.0748 | 0.2929 | 1.5136 |
2 | (rolls/buns) | (whole milk) | 0.0566 | 0.3079 | 1.2050 |
3 | (whole milk) | (rolls/buns) | 0.0566 | 0.2216 | 1.2050 |
4 | (yogurt) | (whole milk) | 0.0560 | 0.4016 | 1.5717 |
FP-Growth Algorithm
The FP-Growth algorithm (Frequent Pattern Growth) is an efficient algorithm used in association analysis, known for its high-speed processing of large datasets.
Unlike the Apriori algorithm, the FP-Growth algorithm does not go through the candidate item set generation process. Instead, it uses a special tree structure called an FP tree (Frequent Pattern Tree) to compress data and efficiently extract frequently purchased item sets.
The FP tree shares common parts of item sets, generates new FP trees using a conditional pattern base, and recursively constructs frequent item sets.
The FP-Growth algorithm simplifies the extraction of frequent item sets in large datasets and is generally considered advantageous for analyzing large datasets compared to the Apriori algorithm.
from mlxtend.frequent_patterns import fpgrowth
frequent_itemsets = fpgrowth(df, min_support=0.02, use_colnames=True)
frequent_itemsets.head()
support | itemsets | |
---|---|---|
0 | 0.082766 | (citrus fruit) |
1 | 0.058566 | (margarine) |
2 | 0.139502 | (yogurt) |
3 | 0.104 931 | (tropical fruit) |
4 | 0.058058 | (coffee) |
Similar to the Apriori algorithm, filter the combinations of item sets that satisfy the association rules from the item sets extracted by the FP-Growth algorithm.
Here too, we extract item sets with a confidence of 0.02 or higher.
from mlxtend.frequent_patterns import association_rules
rules_df = association_rules(
frequent_itemsets,
metric="confidence",
min_threshold=0.02,
)
# Set the column names to be used
columns = [
"antecedents",
"consequents",
"support",
"confidence",
"lift",
]
# Sort by support, confidence, and lift in descending order and display up to 4 decimal places
rules_df.sort_values(by=["support", "confidence", "lift"], ascending=False)[columns].reset_index(drop=True).round(
4
).head()
antecedents | consequents | support | confidence | lift | |
---|---|---|---|---|---|
0 | (other vegetables) | (whole milk) | 0.0748 | 0.3868 | 1.5136 |
1 | (whole milk) | (other vegetables) | 0.0748 | 0.2929 | 1.5136 |
2 | (rolls/buns) | (whole milk) | 0.0566 | 0.3079 | 1.2050 |
3 | (whole milk) | (rolls/buns) | 0.0566 | 0.2216 | 1.2050 |
4 | (yogurt) | (whole milk) | 0.0560 | 0.4016 | 1.5717 |
Conclusion
Association analysis is a data mining technique used to find relationships between items in transaction data.
In this article, we implemented the well-known Apriori algorithm and the FP-Growth algorithm using the Python library mlxtend (Machine Learning Extensions).