- This is program consists of 3 tasks
- All these code files were my personal submissions for this program. Except the data files which were assigned by Quantium.
Python Version: 3.7
Packages: pandas, numpy, seaborn, sklearn, matplotlib, datetime, scipy
Conduct analysis on your client's transaction dataset and identify customer purchasing behaviours to generate insights and provide commercial recommendations.
We need to present a strategic recommendation to Julia that is supported by data which she can then use for the upcoming category review however to do so we need to analyse the data to understand the current purchasing trends and behaviours. The client is particularly interested in customer segments and their chip purchasing behaviour. Consider what metrics would help describe the customersβ purchasing behaviour.
- Examine transaction data - check for missing data, anomalies, outliers and clean them
- Examine customer data - similar to above transaction data
- Data analysis and customer segments - create charts and graphs, note trends and insights
- Deep dive into customer segments - determine which segments should be targetted
- Date column was in integer format. So the date column was changed to date time format.
- There are 365 days in a year but in the DATE column there are only 364 unique values so one was missing. As it was a Christmas day and store was closed there was no anomaly. Value was kept as zero transaction for "TOT_SALES".
- Checked if all the products in given data are chips.
- Some product names are written in more than one way. Example : Dorito and Doritos, Grains and GrnWves, Infusions and Ifzns, Natural and NCC, Red and RRD, Smith and Smiths and Snbts and Sunbites. It was cleaned thereafter.
- Split and frequency of each word in "PROD_NAME" column. Removed all rows containing "salsa" in "PROD_QTY" column.
- Checked for outliers and removed outliers rows in "PROD_QTY" column.
- Each word value was counted in "PROD_NAME" column to extract the brand name. Combined brands written in multiple ways. Created a new column "Cleaned_Brand_Names".
- The 4 main questions answered in data analysis were:
- Who spends the most on chips (total sales), describing customers by lifestage and how premium their general purchasing behaviour is
- How many customers are in each segment
- How many chips are bought per customer by segment
- What's the average chip price by customer segment
- Groupby sum TOT_SALES column and identified top 3 highest total sales contributing segments.
- Plot the groupby into stacked bar chart with percentage text on each segment stack.
- The high sales amount by segment "Young Singles/Couples - Mainstream" and "Retirees - Mainstream" are due to their large number of unique customers, but not for the "Older - Budget" segment. Next we'll explore if the "Older - Budget" segment has: High Frequency of Purchase and, Average Sales per Customer compared to the other segment.
- Used p-value calculation and found statistically significant TOT_SALES difference (pval < 5%) between "Mainstream Young Midage" to "Budget and Premium Young Midage" segment.
- Divided groupby sum to groupby nunique to get average amount of chips bought per customer segment. Older and Young Families bought the highest average amount of chips.
- Unstacked the groupby and plotted it by segment:
- Top 3 total sales contributor segment are
- Older families (Budget) $156,864
- Young Singles/Couples (Mainstream) $147,582
- Retirees (Mainstream) $145,169
- Young Singles/Couples (Mainstream) has the highest population, followed by Retirees (Mainstream). Which explains their high total sales.
- Despite Older Families not having the highest population, they have the highest frequency of purchase, which contributes to their high total sales.
- Older Families followed by Young Families has the highest average quantity of chips bought per purchase.
- The Mainstream category of the "Young and Midage Singles/Couples" have the highest spending of chips per purchase. And the difference to the non-Mainstream "Young and Midage Singles/Couples" are statistically significant.
- Chips brand Kettle is dominating every segment as the most purchased brand.
- Observing the 2nd most purchased brand, "Young and Midage Singles/Couples" is the only segment with a different preference (Doritos) as compared to others' (Smiths).
- Most frequent chip size purchased is 175gr followed by the 150gr chip size for all segments.
- Older Families: Focus on the Budget segment. Strength: Frequent purchase. We can give promotions that encourages more frequency of purchase. Strength: High quantity of chips purchased per visit. We can give promotions that encourage them to buy more quantity of chips per purchase.
- Young Singles/Couples: Focus on the Mainstream segment. This segment is the only segment that had Doritos as their 2nd most purchased brand (after Kettle). To specifically target this segment it might be a good idea to collaborate with Doritos merchant to do some branding promotion catered to "Young Singles/Couples - Mainstream" segment. Strength: Population quantity. We can spend more effort on making sure our promotions reach them, and it reaches them frequently.
- Retirees: Focus on the Mainstream segment. Strength: Population quantity. Again, since their population quantity is the contributor to the high total sales, we should spend more effort on making sure our promotions reaches as many of them as possible and frequent.
- General: All segments has Kettle as the most frequently purchased brand, and 175gr (regardless of brand) followed by 150gr as the preferred chip size. When promoting chips in general to all segments it is good to take advantage of these two points.
Extend your analysis from Task 1 to help you identify benchmark stores that allow you to test the impact of the trial store layouts on customer sales.
Julia has asked us to evaluate the performance of a store trial which was performed in stores 77, 86 and 88.
This can be broken down by:
- total sales revenue
- total number of customers
- average number of transactions per customer
Create a measure to compare different control stores to each of the trial stores to do this write a function to reduce having to re-do the analysis for each trial store. Consider using Pearson correlations or a metric such as a magnitude distance e.g. 1- (Observed distance β minimum distance)/(Maximum distance β minimum distance) as a measure. Once you have selected your control stores, compare each trial and control pair during the trial period. You want to test if total sales are significantly different in the trial period and if so, check if the driver of change is more purchasing customers or more purchases per customers etc.
- Select control stores β Explore data, define metrics, visualize graphs
- Assessment of the trial β insights/trends by comparing trial stores with control stores
- Collate findings β summarize and provide recommendations.
- Compile each store's monthly:
- Total sales
- Number of customers,
- Average transactions per customer
- Average chips per customer
- Average price per unit
- Check significance of Trial minus Control stores TOT_SALES Percentage Difference Pre-Trial vs Trial.
Step 1: Check null hypothesis of 0 difference between control store's Pre-Trial and Trial period performance.
Step 2: Proof control and trial stores are similar statistically
Check p-value of control store's Pre-Trial vs Trial store's Pre-Trial. If <5%, it is significantly different. If >5%, it is not significantly different (similar).
Step 3: After checking Null Hypothesis of first 2 step to be true, we can check Null Hypothesis of Percentage Difference between Trial and Control stores during pre-trial is the same as during trial.
-
Check T-Value of Percentage Difference of each Trial month (Feb, March, April 2019).
-
Mean is mean of Percentage Difference during pre-trial.
-
Standard deviation is stdev of Percentage Difference during pre-trial.
-
Formula is Trial month's Percentage Difference minus Mean, divided by Standard deviation.
-
Compare each T-Value with 95% percentage significance critical t-value of 6 degrees of freedom (7 months of sample - 1)
-
Null hypothesis is true. There isn't any statistically significant difference between control store's scaled Pre-Trial and Trial period sales.
-
There are 3 months' increase in performance that are statistically significant (Above the 95% confidence interval t-score):
- March and April trial months for trial store 77
- March trial months for trial store 86
- Check significance of Trial minus Control stores nCustomers Percentage Difference Pre-Trial vs Trial.
Step 1: Check null hypothesis of 0 difference between control store's Pre-Trial and Trial period performance.
Step 2: Proof control and trial stores are similar statistically
Step 3: After checking Null Hypothesis of first 2 step to be true, we can check Null Hypothesis of Percentage Difference between Trial and Control stores during pre-trial is the same as during trial.
- There are 5 months' increase in performance that are statistically significant (Above the 95% confidence interval t-score):
March and April trial months for trial store 77
Feb, March and April trial months for trial store 86
- We can see that Trial store 77 sales for Feb, March, and April exceeds 95% threshold of control store. Same goes to store 86 sales for all 3 trial months.
- Trial store 77: Control store 233
- Trial store 86: Control store 155
- Trial store 88: Control store 40
- Both trial store 77 and 86 showed significant increase in Total Sales and Number of Customers during trial period. But not for trial store 88. Perhaps the client knows if there's anything about trial 88 that differs it from the other two trial.
- Overall the trial showed positive significant result.
- Use your analytics and insights from Task 1 and 2 to prepare a report for your client, the Category Manager.
- With our project coming to an end its time for us to send a report to Julia, based on our analytics from the previous tasks. We want to provide her with insights and recommendations that she can use when developing the strategic plan for the next half year.
- As best practice at Quantium, we like to use the βPyramid Principlesβ framework when putting together a report for our clients. If you are not already familiar with this framework you can find quick introductions on by searching form them on the internet.
- For this report, we need to include data visualisations, key callouts, insights as well as recommendations and/or next steps.
- We recommend you use a tool like PowerPoint (or similar) to create your report.
- Data literacy level of your audience
- Table of contents / agenda
- Problem statement / purpose
- Overview and context
- Content balance
- Layout and content display
- Summary / next steps
- Use analytics and insights from Task 1 and 2 to prepare a report for the client, the Category Manager.
- Delivered the insights and recommendaions to the client in PowerPoint presentation along with easy to understand data visualizations.