This analysis is a practical implementation of the Apriori algorithm in Python.
Apriori is a data mining algorithm used for mining frequent itemsets and the association rules derived from them. It is designed to operate on a database containing transactions, such as the items bought by a customer in a store.
An itemset is considered frequent if it meets a user-specified support threshold. For example, if the support threshold is set to 0.5 (50%), a frequent itemset is a set of items that are purchased together in at least 50% of all transactions.
Association rules are a set of rules derived from a database that help determine relationships among variables in a large transactional database.
For example, let I = {i(1), i(2), ..., i(m)} be a set of m attributes called items, and T = {t(1), t(2), ..., t(n)} be the set of transactions. Every transaction t(i) in T has a unique transaction ID and contains a subset of the items in I.
Association rules are usually written as i(j) -> i(k), meaning there is a strong relationship between the purchase of item i(j) and the purchase of item i(k): the two items are frequently purchased together in the same transaction.
In the above example, i(j) is the antecedent and i(k) is the consequent.
Please note that both antecedents and consequents can have multiple items. For example, {Diaper, Gum} -> {Beer, Chips} is also valid.
Since multiple rules are possible even from a very small database, we use constraints on various measures of interest to select the most relevant ones. The most important measures are discussed below; a worked example follows the list.
- **Support**: The support of an itemset X, supp(X), is the proportion of transactions in the database in which itemset X appears. It signifies the popularity of an itemset.
supp(X) = (Number of transactions in which X appears) / (Total number of transactions)
Itemsets whose support exceeds a user-specified minimum support threshold are identified as significant itemsets.
- **Confidence**: The confidence of a rule signifies the likelihood of itemset Y being purchased when itemset X is purchased.
Thus, conf(X -> Y) = supp(X U Y) / supp(X)
If conf(X -> Y) is 75%, then for 75% of the transactions containing X, the rule is correct: Y also appears. Confidence is analogous to the conditional probability P(Y|X), the probability of finding itemset Y in transactions given that the transaction already contains itemset X.
- **Lift**: The lift of a rule signifies the likelihood of itemset Y being purchased when itemset X is purchased, while taking into account the popularity of Y.
Thus, lift(X -> Y) = supp(X U Y) / (supp(X) * supp(Y))
If the value of lift is greater than 1, itemset Y is likely to be bought together with itemset X, while a value less than 1 implies that itemset Y is unlikely to be bought if itemset X is bought.
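To make these measures concrete, here is a minimal, self-contained sketch on a toy transaction list (the items and numbers below are illustrative only, not taken from this project's dataset):

```python
# Toy transaction database -- hypothetical data for illustration only
transactions = [
    {"matcha", "whisk"},
    {"matcha", "hojicha"},
    {"matcha", "whisk", "hojicha"},
    {"hojicha"},
]

def supp(itemset):
    """supp(X): proportion of transactions containing every item of X."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(X, Y):
    """conf(X -> Y) = supp(X U Y) / supp(X)"""
    return supp(X | Y) / supp(X)

def lift(X, Y):
    """lift(X -> Y) = supp(X U Y) / (supp(X) * supp(Y))"""
    return supp(X | Y) / (supp(X) * supp(Y))

X, Y = {"matcha"}, {"whisk"}
print(supp(X | Y))  # 0.5  -> frequent at a 50% support threshold
print(conf(X, Y))   # 0.67 -> 2 of the 3 matcha baskets also contain a whisk
print(lift(X, Y))   # 1.33 -> greater than 1, so the pair is positively associated
```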
Major frameworks/libraries used to bootstrap the project.
Instructions for setting up the project locally. To get a local copy up and running, follow these simple steps.
- pip
pip install -r requirements.txt
Installing and setting up your app.
- Run a Jupyter notebook on SageMaker at https://bcg-rise-bda.awsapps.com/start#/
- Clone the repo
git clone https://github.com/JohnTan38/Best-README.git
- Install packages
pip install mlxtend
- Import libraries
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
Data preprocessing and transformation use the TransactionEncoder class from the mlxtend library:
- To find unique items, flatten the dataframe and convert it into a set; the transformation removes any duplicate items.
- Fit an object of the class on the list and convert the result to a dataframe.
- For every item in a transaction, record True if it was purchased and False otherwise.
import pandas as pd

# fit the encoder on the transaction list and convert it to a True/False array
encoder = TransactionEncoder()
transactions = encoder.fit(matcha_list).transform(matcha_list)
# convert the transactions array to a dataframe
df = pd.DataFrame(transactions, columns=encoder.columns_)
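With the one-hot dataframe in place, mining frequent itemsets and rules takes two mlxtend calls. A minimal sketch follows; the 3% support floor and the lift metric mirror the thresholds discussed later, but the exact values are tuning choices, not fixed by this project:

```python
from mlxtend.frequent_patterns import apriori, association_rules

# mine itemsets appearing in at least 3% of transactions (threshold is an example)
frequent_itemsets = apriori(df, min_support=0.03, use_colnames=True)

# derive association rules, keeping only those with lift greater than 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())
```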
Market Basket Analysis is a data mining tool used by retailers to increase sales by better understanding customer purchasing patterns.
Purchase history and items bought together are analyzed to reveal product groupings, as well as products that are likely to be purchased together.
Association analysis looks for relationships in large datasets. These relationships can take two forms: frequent itemsets or association rules. Frequent itemsets are collections of items that frequently occur together; association rules suggest that a strong relationship exists between two items.
> Matcha latte and Hojicha latte pair with high levels of support and lift. Lift > 1 indicates that higher sales of the antecedent are associated with higher sales of the consequent.
> Awakening Matcha Whisk set and Matcha Starter kit bundle with high levels of support and lift.
Closely associated products were identified with a minimum support of 3% and lift greater than 2. Customers who add an item to their cart could have closely associated items suggested to them before checkout. Different permutations and thresholds of support and lift return different association rules.
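A filter matching those thresholds might look like this (assuming the `rules` dataframe produced by `association_rules` in the sketch above):

```python
# closely associated products: support of at least 3% and lift greater than 2
closely_associated = rules[(rules["support"] >= 0.03) & (rules["lift"] > 2)]
print(closely_associated.sort_values("lift", ascending=False).head())
```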
Customers' recency, frequency & monetary values (transaction values) are analyzed, and K-Means clustering is used to group customers into distinct segments.
Customer segmentation fine-tuned with detailed analysis and RFM segments identified.
For example, top customers who buy frequently and with high ticket values in RFM segment '144' could be offered a bundle of 'Awakening Matcha Whisk set' with 'Ceremonial Uji Matcha Powder'.
Customers' RFM segments and closely associated products provide opportunities for targeted cross-selling. Customers in RFM segment '444' who bought 'Awakening Matcha Whisk Set' could have the 'Matcha Starter Kit' recommended.
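The RFM-plus-K-Means step could be sketched as follows. The `orders` dataframe, its column names, and the cluster count are assumptions made for illustration, not the project's actual schema:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy order data -- hypothetical schema; replace with the project's transaction table
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3],
    "InvoiceDate": pd.to_datetime(["2023-01-05", "2023-03-01", "2023-02-10",
                                   "2023-02-20", "2023-01-15", "2023-03-10"]),
    "Amount": [120.0, 80.0, 35.0, 50.0, 300.0, 260.0],
})

# Build the RFM table: days since last purchase, purchase count, total spend
snapshot = orders["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = orders.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceDate", "count"),
    monetary=("Amount", "sum"),
)

# Standardize features so no single dimension dominates the distance metric
scaled = StandardScaler().fit_transform(rfm)

# k=2 suits this toy set; in practice choose k via the elbow method or silhouette score
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(scaled)
print(rfm)
```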
The Matcha Starter Kit enjoys high support and lift. A sales campaign could smooth out the sales trend during the 2nd and 3rd quarters; gross profit would increase with a successful campaign.
Potential uplift of 35% in gross sales of Awakening Matcha Whisk Set.
Potential uplift of 52% in gross sales of Ceremonial Uji Matcha Powder.
Potential uplift of 18% in gross sales of Barista Uji Matcha Powder.
Easy to understand
Suitable for large itemsets
Computationally expensive if there are many association rules
Calculating support is expensive as the algorithm scans the entire dataset
_For more examples, please refer to the Documentation_
- Data collection - customers' demographic profile
- Search Engine Optimization (SEO) & click-through rates (CTR)
- Google Analytics 360 - data driven attribution
- Fine tune threshold values for Support and Lift
- Multi-language Support
- Chinese
- Bahasa Indonesia
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
This project follows the all-contributors specification. Contributions of any kind welcome!
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Project Link: https://github.com/JohnTan38/Best-README