Preparation
In the textbook, calculations are performed mainly with Excel functions. Although Excel has an excellent GUI, it lacks the API libraries needed to connect to external web systems and data-analysis tools. We will therefore use Python to perform the same calculations as in the textbook. Here are the preparations.
GitHub
- The Jupyter Notebook file on GitHub is here .
Google Colaboratory
- If you want to run it on Google Colaboratory, it is here
Author’s environment
This is the author's environment.
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
python -V
Python 3.7.3
Load the required libraries.
import numpy as np
import scipy
from scipy.stats import binom
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
print("numpy version :", np.__version__)
print("matplotlib version :", matplotlib.__version__)
print("sns version :",sns.__version__)
numpy version : 1.16.2
matplotlib version : 3.0.3
sns version : 0.9.0
Overview
In Explanation 2 at the end of the book, the authors explain six tools they use frequently, with specific examples. You can refer to these and apply them when the need arises in a real business situation.
- Gamma-Poisson recency model
- Based on data about "when did you last buy" and "when did you last visit" (the recent purchase period: recency), it tells us which brands, facilities, and periods we should concentrate our resources on.
- Negative binomial distribution
- This tool corrects for the difference between the consumer household panel data for your brand and the actual sales figures. It is very useful for benchmarking trial rates, repeat rates, and purchase frequency when forecasting.
- Category entry order model
- It tells you how much market share you can expect in a newly created category, and lets you simulate that share based on your marketing plan.
- Trial model / repeat model
- Using data from concept tests, concept-use tests, and household panel data, you can predict the sales of a new product in the first year after its launch.
- VPP model (Volume per Purchase)
- Helps you determine the appropriate size of your product.
- Dirichlet NBD model
- It provides a concrete example of how the Dirichlet $S$ and the category NBD $K$ are calculated. It predicts the quarterly purchase rate, the quarterly number of purchases, and the percentage of 100%-loyal customers for Colgate in Table 1-4 of the textbook, as explained in Explanation 1 and section 1-6.
2-1. Gamma-Poisson Recency Model
We can build an NBD model by calculating $m$ and $k$ from data about "when did you last buy" and "when did you last visit". The formula describing the NBD model is as follows.
$$ P\left(r \right) = \frac{\left(1 + \frac{M}{K} \right)^{-K} \cdot \Gamma\left(K + r \right)}{\Gamma\left(r + 1 \right)\cdot \Gamma\left(K \right)} \cdot \left(\frac{M}{M+K} \right)^r $$
Let $M = m \times t$, where $m$ is the average number of purchases per unit period and $t$ is the length of the period, and let $K = k$. The penetration rate up to period $t$ is 100% minus the probability of never buying during that period:

$$ P\left(t\right) = 1 - \left(1+\frac{m\times t}{k}\right)^{-k} $$

Thus, the share of people whose most recent purchase falls between periods $t-1$ and $t$ is

$$ P\left(t \right) - P\left(t-1 \right) = \left(1+\frac{m\times \left(t-1 \right)}{k}\right)^{-k} - \left(1+\frac{m\times t}{k}\right)^{-k} $$
To apply this to an arbitrary period, we use two variables $t_1$ and $t_2$ and define $f\left(t_1,t_2,m,k \right)$ as follows.
$$ f\left(t_1,t_2,m,k \right) = \left(1+\frac{m\times t_1}{k} \right)^{-k} - \left(1+\frac{m\times t_2}{k} \right)^{-k} $$
Table 10-1 in the textbook can then be expressed using this single function $f$ as follows.
| Gamma distribution | Actual values |
|---|---|
| $\displaystyle f\left(t_1=0,t_2= \frac{14}{31}\right) $ | 43.9% |
| $\displaystyle f\left(t_1=\frac{14}{31},t_2=1 \right) $ | 25.6% |
| $\displaystyle f\left(t_1=1,t_2=2 \right) $ | 19.1% |
| $\displaystyle f\left(t_1=2,t_2=3 \right) $ | 5.1% |
| $\displaystyle f\left(t_1=3,t_2=4 \right) $ | 1.5% |
| $\displaystyle f\left(t_1=4,t_2=5 \right) $ | 0.7% |
| $\displaystyle f\left(t_1=5,t_2=6 \right) $ | 1.4% |
| $\displaystyle f\left(t_1=6,t_2=\infty \right) $ | 2.7% |
Derivation of m, k with scipy's curve_fit
In general, to perform least-squares fitting of nonlinear functions, we can use the curve_fit function in scipy.optimize. According to the scipy documentation, its signature is
scipy.optimize.curve_fit(f, xdata, ydata, p0=None, sigma=None, absolute_sigma=False, check_finite=True, bounds=(-inf, inf), method=None, jac=None, **kwargs)
The arguments xdata and ydata are defined as
xdata : array_like
The independent variable where the data is measured. Must be an M-length sequence or an (k,M)-shaped array for functions with k predictors.
ydata : array_like
The dependent data, a length M array - nominally f(xdata, ...)
To solve the fitting problem in the textbook, we fit the function defined above:

$$ f\left(t_1,t_2,m,k \right) =\left(1+\frac{m\times t_1}{k} \right)^{-k} -\left(1+\frac{m\times t_2}{k} \right)^{-k} $$
This is a fitting problem for a function of two time variables that specify the period (for example, to find the share of purchases between two weeks and one month, we use $\displaystyle t_1=\frac{14}{31}, t_2=1$), so the independent variable is defined as the following two-dimensional array.
x = np.array([
    [0.0, 14/31, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [14/31, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10000.0]
])
x[0] is the array of $t_1$ values, and x[1] is the array of $t_2$ values. x[1,7] = 10000.0 stands in for $\infty$: infinity cannot be used directly in numerical calculations, so 10000 serves as effectively infinite. A value as small as 100 would also work.
The actual code for fitting is as follows
import numpy as np
from scipy.optimize import curve_fit

def _get_delta_nbd(x, m, k):
    # share of people whose most recent purchase lies between t1 = x[0] and t2 = x[1]
    return (1 + m * x[0] / k)**(-k) - (1 + m * x[1] / k)**(-k)

x = np.array([
    [0.0, 14/31, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [14/31, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10000.0]
])
y = [0.439, 0.256, 0.191, 0.051, 0.015, 0.007, 0.014, 0.027]

parameters, covariances = curve_fit(_get_delta_nbd, x, y)
print('parameters : ', parameters)
print('covariances : ', covariances)
parameters :  [1.37824241 4.14429889]
covariances :  [[ 0.00284656 -0.03699629]
 [-0.03699629  1.57449471]]
The resulting values, $m = 1.378$ and $k = 4.144$, are almost equal to the $m$ and $k$ used by the author of the textbook.
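As a quick sanity check (my own addition, not in the textbook), we can plug the fitted parameters back into _get_delta_nbd and compare with the actual values; this assumes the fitting code above has been run.

fitted_m, fitted_k = parameters
for t1, t2, actual in zip(x[0], x[1], y):
    model = _get_delta_nbd(np.array([t1, t2]), fitted_m, fitted_k)
    print("t1={:.3f} t2={:.3f} : model={:.3f} actual={:.3f}".format(t1, t2, model, actual))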
2-2. Negative binomial distribution
This section explains how to correct panel data using the ratio between the actual sales figures and the sales computed from the panel data.
First of all, the panel data gives us the following information
- (A) : Number of households
- (B) : Penetration rate
- (C) : Average number of purchases
- (D) : Average number of items purchased
- (E) : Average purchase price
Please refer to Table 10-2 below for specific values (p. 281 in the textbook). In addition, we know the following actual result:
- Sales amount
Using the ratio of panel-based sales to actual sales, we can correct the panel data and the various parameters. To do this, the textbook makes a number of important assumptions. Keeping them in mind will help you follow the subsequent calculations smoothly.
Assumptions
- Actual current sales: 5.89 billion yen
- Sales based on panel data: 4.12 billion yen (actual sales ratio: 70%, calculated by AxBxCxDxE)
- Average number of items purchased per purchase is the same in reality as in the panel data
- Average unit price per purchase is the same in reality as in the panel data
- $K$ is the same in reality as in the panel data
Again, only “sales” are known as actual results. In the textbook example, we only know that the sales are 5.89 billion yen.
| | Item | Before correction | After correction |
|---|---|---|---|
| (A) | Total number of households in 2008 (thousands) | 49973 | 49973 |
| (B) | Penetration rate | 15.0% | 17.4% |
| (C) | Average number of purchases | 2.50 | 3.07 |
| (D) | Average number of items purchased per transaction | 1.10 | 1.10 |
| (E) | Average unit price per purchase | 200 yen | 200 yen |
| (F) | Percentage of customers who purchased two or more times | 50% | 55% |
| (G) | Annual sales (AxBxCxDxE) | 4.12 billion yen | 5.89 billion yen |
| (H) | Ratio of G to actual sales | 70% | 100% |
| | Item | Before correction | After correction |
|---|---|---|---|
| (I) | Brand $m$: (BxCxD) | 0.4125 | 0.5893 |
| (J) | Brand $k$ | 0.09899 | 0.09899 |
| (K) | $P_0$ (probability of never buying) | 85.00% | 82.53% |
| (L) | $P_1$ (probability of buying once) | 6.79% | 7.00% |
| (M) | $P_{2+}=100\%-P_0-P_1$ | 8.21% | 10.47% |
| (N) | Percentage of buyers who purchased two or more times, by model: $\left(\frac{M}{B}\right)$ | 54.76% | 59.95% |
Steps in the correction
1. The brand's $m$ = penetration rate x average number of purchases x average number of items purchased
2. The brand's $k$
$$ P\left(r \right) = \frac{\left(1 + \frac{M}{K} \right)^{-K} \cdot \Gamma\left(K + r \right)}{\Gamma\left(r + 1 \right)\cdot \Gamma\left(K \right)} \cdot \left(\frac{M}{M+K} \right)^r $$
Substituting $\displaystyle K=k, M=m=0.4125, r=0$ into the formula above gives

$$ P_0=\frac{\left(1+\frac{m}{k} \right)^{-k}\cdot \Gamma\left(k+0 \right)}{\Gamma\left(0+1 \right)\cdot \Gamma\left(k \right)}=\left(1+\frac{0.4125}{k} \right)^{-k} =0.85 $$

a nonlinear equation in $k$. Here $\displaystyle P_0$ is the probability of never having made a purchase, so it can be calculated as (1 - penetration rate):

$$ P_0=1 - 0.15 = 0.85 $$
Solving nonlinear equations numerically
The equation for $k$,
$$ \left(1+\frac{0.4125}{k} \right)^{-k} =0.85 $$
is nonlinear and cannot be solved analytically, so we solve it numerically on a computer. Here we use Python's Newton method (scipy.optimize.newton); the textbook obtains the value of $k$ with Excel, but either method is fine. The result is $k=0.09899$.
Python code
The Python code to obtain $k$ is as follows.
from scipy.optimize import newton

MIN_K = 0.0
MAX_K = 1.0

def check_k(k):
    # accept only roots in the plausible range (0, 1)
    return MIN_K < k < MAX_K

def get_k(m, P0):
    def func(k):
        return (1 + m / k) ** (-k) - P0
    # Newton's method can diverge or land outside the valid range,
    # so sweep over a range of initial values
    for initial_k in [(i + 1) * 0.01 for i in range(100)]:
        try:
            k = newton(func, initial_k)
        except RuntimeError:
            continue
        if check_k(k):
            return k
    return None
m = 0.4125
P0 = 0.85
print("k = {:,.5f}".format(get_k(m, P0)))
k = 0.09893
The value is almost equal to the textbook's, even using Python.
3. $P_1$, the probability of buying once

$P_1$ is obtained in the same way as $P_0$: just substitute $\displaystyle k=0.09899, m=0.4125, r=1$ into

$$ P\left(r \right) = \frac{\left(1 + \frac{m}{k} \right)^{-k} \cdot \Gamma\left(k + r \right)}{\Gamma\left(r + 1 \right)\cdot \Gamma\left(k \right)} \cdot \left(\frac{m}{m+k} \right)^r $$

However, since the formula contains the gamma function, the calculation requires Python or Excel. The following example shows how to do this.
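Here is a minimal sketch of that calculation (my own code; the helper name get_p is mine, not the textbook's). It also prints the quantities needed in steps 4 and 5 below.

from scipy.special import gamma

def get_p(r, m, k):
    # NBD probability of exactly r purchases
    return ((1 + m / k) ** (-k)) * (gamma(k + r) / (gamma(r + 1) * gamma(k))) * ((m / (m + k)) ** r)

m, k = 0.4125, 0.09899
P0 = get_p(0, m, k)  # ~0.8500
P1 = get_p(1, m, k)  # ~0.0679
print("P1 = {:.4f}".format(P1))
print("P2+ = {:.4f}".format(1 - P0 - P1))                 # used in step 4
print("ratio = {:.4f}".format((1 - P0 - P1) / (1 - P0)))  # used in step 5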
4. Probability of buying two or more times, $P_{2+}$

Since $P_{2+}$ is obtained by subtracting $P_0$ and $P_1$ from 1,

$$ P_{2+}=1-0.85-0.0679=0.0821 $$
5. Percentage of buyers who bought two or more times, according to the model (uncorrected)

This is simply a ratio:

$$ \frac{P_{2+}}{1-P_0} =\frac{0.0821}{1-0.85} =\frac{0.0821}{0.1500}=0.5476 $$
Calculating the specific corrections
6. Calculation of $P_0$
The $m$ is corrected by dividing it by the ratio of panel-based sales to actual sales (0.7). Let $m'$ denote the corrected $m$:

$$ m'=\frac{m}{0.7}=\frac{0.4125}{0.7}=0.5893 $$
The correction itself is simple. By the prior assumption, $k$ is common to the panel data and the actual data, so $k'=k=0.09899$, where $k'$ denotes the corrected $k$. Using $m'$ and $k'$, the corrected $P_0$ is

$$ P_0'=\left(1+\frac{0.5893}{0.09899}\right)^{-0.09899}=0.8253 $$

The corrected penetration rate (call it $\tau'$, with $\tau$ the uncorrected rate) can then be calculated as

$$ \tau'=1-0.8253=0.1747 $$
7. Average number of purchases after correction

Since $m$ = penetration rate x average number of purchases x average number of items purchased (as in step 1), the corrected average number of purchases is

$$ \frac{m'}{\tau' \times 1.10} = \frac{0.5893}{0.1747 \times 1.10} = 3.07 $$
8. Percentage of purchasers who bought two or more times

Similarly, we just calculate the corrected $P_0'$ and $P_1'$ using

$$ P\left(r \right) = \frac{\left(1 + \frac{m^\prime}{k^\prime} \right)^{-k^\prime} \cdot \Gamma\left(k^\prime+ r \right)}{\Gamma\left(r + 1 \right)\cdot \Gamma\left(k^\prime \right)} \cdot \left(\frac{m^\prime}{m^\prime+k^\prime} \right)^r $$

where each corrected value is marked with a prime. Thus,

$$ \frac{P_{2+}^\prime}{1-P_0^\prime} =\frac{0.1047}{1-0.8253} =0.5995 $$
9. Ratio of buyers who purchased two or more times

This simply corrects the panel-data value (F) by the model's ratio of two-or-more-time buyers:

$$ F' = 50\% \times \frac{0.5995}{0.5476} = 54.7\% \approx 55\% $$
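Pulling the corrected numbers together in Python (again a sketch of my own; it reuses the get_p helper defined in the example for step 3):

k = 0.09899
m_corrected = 0.4125 / 0.7      # m' = 0.5893
P0c = get_p(0, m_corrected, k)  # ~0.8253
P1c = get_p(1, m_corrected, k)  # ~0.0700
P2c = 1 - P0c - P1c             # ~0.1047
ratio = P2c / (1 - P0c)         # ~0.5995
print("P0' = {:.4f}, P1' = {:.4f}, P2+' = {:.4f}".format(P0c, P1c, P2c))
print("ratio' = {:.4f}".format(ratio))
print("F' = {:.3f}".format(0.50 * ratio / 0.5476))  # corrected (F), ~0.55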
2-3. Category entry order model

In this section, we use the category entry order model, which predicts the market share a brand can obtain based on its order of entry into the category. The formula is shown below. The product categories it supports are

- Fabric softener
- Liquid detergent for clothes
- Freeze-dried coffee
Formula

$$ \left(\text{Predicted share}\right) = \left(\text{Pioneer's share}\right) \times a^{-0.49} \times b^{1.11} \times c^{0.28} \times d^{0.07} $$

Here,

- a : entry order
- b : relative favorability
- c : ratio of advertising expenditure
- d : number of intervening years between entries
Example
The textbook gives a specific example.
- Pioneer brand (the brand with the highest market share): 35% share
- Entry rank: 4
- Relative favorability: 0.9
- Advertising Expenditure Rate: 0.7
- Entered in the same year as the third product (intervening years): 1

With these values, the predicted share comes out to 14%.
python code
Not strictly necessary, but here is the Python code for the calculation.
pioneer_share = 0.35
order = 4           # entry order
favorability = 0.9  # relative favorability
cost = 0.7          # ratio of advertising expenditure
entry = 1           # intervening years

predicted_share = pioneer_share * order**(-0.49) * favorability**(1.11) * cost**(0.28) * entry**(0.07)
print('Predicted share ratio = {:,.3f}'.format(predicted_share))
Predicted share ratio = 0.143
Setting aside whether the prediction is actually accurate, the formula is quite meaningful in that it lets us predict the share of a future market entrant.
2-4. Trial and Repeat Models

This section explains how to predict the first-year sales of a new product based on the values from

- Concept tests
- Concept-use tests
- Household panel data
a) Trial model, repeat model
Definitions:
- Sales from trial = (Pop) x (Trial rate) x (Trial VPP)
- Sales from repeat customers = (Pop) x (Trial rate) x (Repeat rate) x (Number of repeat customers) x (Repeat VPP)
b) Explanation of each item
- Pop: total number of consumers or households
- Trial rate: percentage of Pop who purchased the target product for the first time within one year
- Repeat rate: percentage of first-time purchasers who purchased the product again within the year
- Number of repeat purchases: average number of purchases by repeat customers, minus one (the trial purchase)
- Trial VPP: average purchase amount at trial
- Repeat VPP: average purchase amount for repeat purchases
c) Example
Conditions
- 10% of all households purchased a certain new shampoo product in the first year after launch
- 30% of purchasers buy at least one more time within the period
- Average number of repeat purchases is 2.5
- The average purchase price for a trial is 383 yen (365 yen x 1.05)
- Average purchase price for repeat customers is 475 yen (431 yen x 1.10)
Annual sales
= 49.97 million households x 10% x 383 yen + 49.97 million households x 10% x 30% x 1.5 x 475 yen
= 1.91 billion yen + 1.07 billion yen = 2.98 billion yen

Here 1.5 is the average number of purchases by repeat customers (2.5) minus the one trial purchase.
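The same arithmetic in Python, as a minimal sketch (my own code; the variable names are mine):

pop = 49.97e6               # total number of households
trial_rate = 0.10
repeat_rate = 0.30
repeat_purchases = 2.5 - 1  # average purchases by repeaters minus the trial purchase
trial_vpp = 383             # yen
repeat_vpp = 475            # yen

trial_sales = pop * trial_rate * trial_vpp
repeat_sales = pop * trial_rate * repeat_rate * repeat_purchases * repeat_vpp
print("trial  : {:.2f} billion yen".format(trial_sales / 1e9))
print("repeat : {:.2f} billion yen".format(repeat_sales / 1e9))
print("total  : {:.2f} billion yen".format((trial_sales + repeat_sales) / 1e9))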
This section is not hard to follow, provided the trial rate can be derived from the panel data.
2-5. VPP Model (Volume per Purchase)
This section is omitted as there is no need to explain the mathematical aspects of the model.
2-6. The Dirichlet NBD Model

As explained in 1-6, the Dirichlet NBD model is useful for predicting and analyzing the purchase rate and the number of purchases for every brand in a category, starting from the brands' shares.

Based on Colgate's purchase data in the U.K., this section describes in detail how to find the purchase rate, the percentage of 100%-loyal customers, and the average number of purchases, starting from the derivation of the key parameters $K$ and $S$.
Calculation of K
The Dirichlet NBD model is shown again below; this is equation (6) in the textbook.

$$ P\left(r_j = r\right) = \sum_{R} p_R\left(\mathrm{NBD}\right) \cdot p\left(r_j = r \mid R\right) $$

Here $p_R(\mathrm{NBD})$ is the category-level NBD probability of making $R$ category purchases, and $p\left(r_j \mid R\right)$ is the conditional probability that $r_j$ of those $R$ purchases are of brand $j$. The parameter $K$ is calculated at the category level.
As in 2-2, $K$ is obtained from the category-level NBD equation. Although it is nonlinear, the solution can be found numerically with the Newton method.

Because we fit to the percentage of households that never purchased the category at all, the term $p(r_j \mid R)$ in equation (6) equals 1, which greatly simplifies the calculation: $S$ drops out, so we do not need to know it at this point.

For the toothpaste category in the Colgate example ($m = 1.46$, category non-purchase rate of about 44%), it comes out as

$$ \left(1+\frac{1.46}{K}\right)^{-K} = 0.44 \quad\Rightarrow\quad K = 0.78 $$
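This can be computed with the same Newton approach as in 2-2; a minimal sketch of my own (the values m = 1.46 and P0 = 0.439, the 43.9% category non-purchase share in Table 10-10, are taken from the code and tables below):

from scipy.optimize import newton

def get_category_K(m, P0):
    # solve the category NBD zero-purchase equation (1 + m/K)^(-K) = P0 for K
    return newton(lambda K: (1 + m / K) ** (-K) - P0, 0.5)

print("K = {:.2f}".format(get_category_K(1.46, 0.439)))  # -> roughly 0.78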
Calculating S
To find $S$, we use the figure from Table 1-4 that 80% of households have never bought Colgate.

It's a little complicated, but the households that have never bought Colgate include: $R=0, r=0$, households that never bought toothpaste (the category) at all; $R=1, r=0$, households that bought toothpaste once but never Colgate; $R=2, r=0$, households that bought toothpaste twice but never Colgate; $R=3, r=0$, households that bought toothpaste three times but never Colgate; and so on. All of these households must be counted.
Therefore, we have to solve the following equation for $S$:

$$ \sum_{R=0}^{\infty} p_R\left(\mathrm{NBD}\right) \cdot p\left(r_j=0 \mid R\right) = 0.80 \qquad (\ast) $$

Ideally, the sum runs to infinity: some households might have bought toothpaste arbitrarily many times without ever buying Colgate, and that is what the formula expresses. In reality there is no such data, and once $R$ becomes reasonably large the terms are effectively zero (no household makes infinitely many purchases to begin with), so we truncate the sum at some point. The textbook (p. 289) truncates it at $R=10$:

$$ \sum_{R=0}^{10} p_R\left(\mathrm{NBD}\right) \cdot p\left(r_j=0 \mid R\right) = 0.80 $$

and this is sufficient for practical purposes.
Also, for $r_j=0$ the expression $p(r_j=0 \mid R)$ simplifies to

$$ p\left(r_j=0 \mid R\right) = \frac{\Gamma\left(S\right)\cdot\Gamma\left(S-a+R\right)}{\Gamma\left(S-a\right)\cdot\Gamma\left(S+R\right)}, \qquad a = S \times \left(\text{brand share}\right) $$

which is a function of $S$ and $R$ only, so the expression $(\ast)$ is a function of $S$ alone.
However, the expression $(\ast)$ is quite complex; to find the root numerically, you need a reasonable initial value. As when finding $K$, we can use the Newton method. The textbook obtains the final value $S=1.2$.
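Here is a sketch of that computation (my own code; M = 1.46, K = 0.78, and Colgate's 25% share are the values used in the textbook code below):

import math
from scipy.special import gamma
from scipy.optimize import newton

M, K = 1.46, 0.78  # category NBD parameters
share = 0.25       # Colgate's brand share, so a = S * share

def p_R(R):
    # category NBD probability of R category purchases
    return ((1 + M / K) ** (-K)) * (gamma(K + R) / math.factorial(R) / gamma(K)) * ((M / (M + K)) ** R)

def p_r0(S, R):
    # probability of zero Colgate purchases given R category purchases
    a = S * share
    return gamma(S) * gamma(S - a + R) / (gamma(S - a) * gamma(S + R))

def func(S):
    # truncated equation (*): the sum up to R=10 should equal 0.80
    return sum(p_R(R) * p_r0(S, R) for R in range(11)) - 0.80

print("S = {:.2f}".format(newton(func, 1.0)))  # -> close to the textbook's S = 1.2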
About Table 10-9

Table 10-9 shows $p(r_j \mid R)$ calculated for specific values of $r_j$ and $R$. For example, $p(r_j=1 \mid R=2)=20.5\%$ is the probability that a household that bought toothpaste twice bought Colgate exactly once. Since $p(r_j \mid R)$ is a conditional probability, keep in mind that it refers only to the households that bought toothpaste twice.
About Table 10-10

Table 10-10 multiplies each entry of Table 10-9 by $p_R(\mathrm{NBD})$. Each cell is the joint probability that a household buys the toothpaste category $R$ times and buys Colgate $r_j$ of those times.
Percentage of 100%-loyal Colgate customers

The numbers on the diagonal of Table 10-10 (for $R \geq 1$) are the shares of households that bought toothpaste but only ever bought Colgate, so summing them and dividing by the share of Colgate buyers gives the percentage of 100%-loyal Colgate customers.
Average number of purchases of Colgate

Multiplying each possible number of Colgate purchases by its probability and summing gives the expected number of purchases (the average number of purchases).
Python code

Here is the Python code used to calculate Tables 10-8, 10-9, and 10-10. The derivation of the negative binomial and Dirichlet distributions is difficult, but computing the results themselves is not very complicated.
import math
from scipy.special import gamma

def get_nbd(M, T, K, R):
    # category-level NBD: probability of R category purchases in period T
    return ((1 + M * T / K)**(-1 * K)) * \
        (gamma(K + R) / math.factorial(R) / gamma(K)) * \
        ((M * T / (M * T + K)) ** R)

def get_p_rj_0(r, a, S, R):
    # Dirichlet conditional: probability of r brand purchases out of R category purchases
    return (math.factorial(R) / math.factorial(r) / math.factorial(R - r)) * \
        (gamma(S) / gamma(a) / gamma(S - a)) * \
        (gamma(a + r) * gamma(S - a + R - r) / gamma(S + R))

def print01():
    for R in range(0, 11):
        print('R={} | '.format(R), end='')
        for r in range(R + 1):
            print('{:.3f} | '.format(round(get_p_rj_0(r=r, a=1.2 * 0.25, S=1.2, R=R), 3)), end='')
        print()

def print02():
    for R in range(0, 11):
        print('R={} | '.format(R), end='')
        for r in range(R + 1):
            print('{:.1f} % | '.format(round(100 * get_nbd(M=1.46, T=1, K=0.78, R=R) * get_p_rj_0(r=r, a=1.2 * 0.25, S=1.2, R=R), 3)), end='')
        print()

print('Table 10-9 Percentage of category purchases by number of purchases when S=1.2')
print01()
print()
print()
print('Table 10-10 Percentage of category and brand purchases by number of purchases when S=1.2')
print02()
Table 10-9 Percentage of category purchases by number of purchases when S=1.2
R=0 | 1.000 |
R=1 | 0.750 | 0.250 |
R=2 | 0.648 | 0.205 | 0.148 |
R=3 | 0.587 | 0.182 | 0.125 | 0.106 |
R=4 | 0.545 | 0.168 | 0.113 | 0.091 | 0.083 |
R=5 | 0.514 | 0.157 | 0.105 | 0.083 | 0.072 | 0.069 |
R=6 | 0.489 | 0.149 | 0.099 | 0.078 | 0.066 | 0.060 | 0.059 |
R=7 | 0.468 | 0.143 | 0.094 | 0.074 | 0.062 | 0.055 | 0.052 | 0.052 |
R=8 | 0.451 | 0.137 | 0.090 | 0.070 | 0.059 | 0.052 | 0.048 | 0.045 | 0.046 |
R=9 | 0.437 | 0.132 | 0.087 | 0.068 | 0.057 | 0.050 | 0.045 | 0.042 | 0.040 | 0.041 |
R=10 | 0.424 | 0.128 | 0.084 | 0.066 | 0.055 | 0.048 | 0.043 | 0.040 | 0.038 | 0.037 | 0.038 |
Table 10-10 Percentage of category and brand purchases by number of purchases when S=1.2
R=0 | 43.9 % |
R=1 | 16.7 % | 5.6 % |
R=2 | 8.4 % | 2.6 % | 1.9 % |
R=3 | 4.6 % | 1.4 % | 1.0 % | 0.8 % |
R=4 | 2.6 % | 0.8 % | 0.5 % | 0.4 % | 0.4 % |
R=5 | 1.5 % | 0.5 % | 0.3 % | 0.2 % | 0.2 % | 0.2 % |
R=6 | 0.9 % | 0.3 % | 0.2 % | 0.1 % | 0.1 % | 0.1 % | 0.1 % |
R=7 | 0.6 % | 0.2 % | 0.1 % | 0.1 % | 0.1 % | 0.1 % | 0.1 % | 0.1 % |
R=8 | 0.3 % | 0.1 % | 0.1 % | 0.1 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % |
R=9 | 0.2 % | 0.1 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % |
R=10 | 0.1 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % |
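Finally, the two quantities discussed above, the share of 100%-loyal customers and the average number of purchases, can be computed from the same functions. This is my own sketch, not the textbook's code; it assumes get_nbd and get_p_rj_0 from the block above have been defined, truncates at R=10 as before, and normalizes by the share of Colgate buyers.

def joint(r, R):
    # entry of Table 10-10: probability of R category purchases, r of them Colgate
    return get_nbd(M=1.46, T=1, K=0.78, R=R) * get_p_rj_0(r=r, a=1.2 * 0.25, S=1.2, R=R)

colgate_buyers = 1 - sum(joint(0, R) for R in range(11))  # ~0.20, since ~80% never buy Colgate
loyal = sum(joint(R, R) for R in range(1, 11))            # diagonal of Table 10-10
expected = sum(r * joint(r, R) for R in range(11) for r in range(R + 1))

print("Colgate buyers : {:.1%}".format(colgate_buyers))
print("100% loyal share of buyers : {:.1%}".format(loyal / colgate_buyers))
print("average purchases per buyer : {:.2f}".format(expected / colgate_buyers))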
Summary
The above is my attempt to work through the explanations at the end of "Strategic Thinking in Probability" in my own way. I am not a marketing expert, nor do I have any practical marketing experience; I usually work in IT-related fields such as web system development and machine learning model development. I use the Poisson and gamma distributions regularly in that work, but I had never imagined they were applied to marketing in this way.
It all started when a friend of mine who specializes in marketing asked me, "What is the negative binomial distribution?" According to my friend, overseas companies such as P&G routinely use mathematics in their marketing, while Japan still seems to have a long way to go. Dr. Ehrenberg, a great authority on marketing, published the papers this book builds on many decades ago. Even so, I believe that probability and statistics will increasingly be applied to marketing in Japan as well, thanks in part to "Strategic Thinking in Probability".