Project for the Udacity Data Scientist Nanodegree Program
See here for a more detailed description: https://kangle-chen1103.medium.com/predicting-churn-rates-using-pyspark-54aa757bd408
Required packages are listed in requirement.txt.
This project demonstrates an end-to-end application of data science to a real-world problem, including the following steps:
- Defining the project, analysis, and modeling following the CRISP-DM process
- Using Spark DataFrames and Spark ML to manipulate data and build a machine learning model
File Sparkify contains the pyspark scripts.
File Sparkify_IBMWatson contains the pyspark scripts employed on IBM Watson.
Through data processing and feature generation, an accurate machine learning model has been trained. The model indicates that most users who churned stopped paying after using the service for 1000 to 2000 hours. Coupons and discounts for users in this period might be effective, though this would still need validation, for example through an A/B test.
Due to limited computation power, CrossValidator and paramGrid are used here only to demonstrate how to set up the tuning pipeline, not to provide optimized training results.
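To illustrate why a full grid search is expensive, here is a minimal sketch rather than the code used in the scripts. The estimator (LogisticRegression), the grid values, and the column names "features" and "label" are assumptions chosen for illustration. CrossValidator fits every parameter combination once per fold and then refits the best combination on the full training set, so the number of fits grows quickly with the grid size.

```python
from math import prod


def count_model_fits(grid_sizes, num_folds):
    """Number of model fits CrossValidator performs for a full grid.

    Each parameter combination is fitted once per fold, and the best
    combination is then refitted once on the whole training set.
    """
    return prod(grid_sizes) * num_folds + 1


def build_cross_validator(num_folds=3):
    """Return a CrossValidator over a deliberately tiny, illustrative grid."""
    # Imported here so count_model_fits works without a Spark installation.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Assumes an active SparkSession and columns named "features" and "label".
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    param_grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.1])          # illustrative values
        .addGrid(lr.elasticNetParam, [0.0, 0.5])   # illustrative values
        .build()
    )
    return CrossValidator(
        estimator=lr,
        estimatorParamMaps=param_grid,
        evaluator=BinaryClassificationEvaluator(labelCol="label"),
        numFolds=num_folds,
    )
```

With two values for each of two parameters and three folds, `count_model_fits([2, 2], 3)` gives 13 fits. Given a training DataFrame (here a hypothetical `train_df`), the search would be run with `build_cross_validator().fit(train_df)`.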
Finally, since the dataset is not large, Spark has not actually shown its advantage over plain Python and pandas. A further task is to deploy the model on AWS with a larger dataset.
Credit goes to Udacity for the project. Otherwise, feel free to use the code here as you like!