This repository demonstrates how to use Apache Spark to process large datasets and build machine learning models at scale.
## Goals
- Practice processing and cleaning datasets to get comfortable with Spark's SQL and DataFrame APIs (Spark SQL, PySpark); see the first sketch below.
- Debug and mitigate data skew when running on a cluster; see the salting sketch below.
- Use Spark's Machine Learning Library (MLlib) to train machine learning models at scale; see the pipeline sketch below.
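
As a taste of the DataFrame and Spark SQL APIs, here is a minimal cleaning sketch. The input path, column names, and transformations are illustrative assumptions, not files shipped with this repository.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()

# Hypothetical input: a CSV of user events with columns
# `user_id`, `event_time`, and `duration`.
events = spark.read.csv("data/events.csv", header=True, inferSchema=True)

cleaned = (
    events
    .dropDuplicates(["user_id", "event_time"])           # remove exact repeats
    .filter(F.col("duration").isNotNull())               # drop rows missing a duration
    .withColumn("event_date", F.to_date("event_time"))   # derive a date column
)

# The same kind of aggregation expressed with Spark SQL.
cleaned.createOrReplaceTempView("events")
daily_counts = spark.sql(
    "SELECT event_date, COUNT(*) AS n_events "
    "FROM events GROUP BY event_date"
)
```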
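
One common way to debug and mitigate skew is key salting before a join: spread a hot key over several partitions by appending a random salt, and replicate the smaller side once per salt value. The tables, column names, and salt factor below are assumed for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-sketch").getOrCreate()

# Hypothetical skewed fact table (u1 is a hot key) and a small dimension table.
orders = spark.createDataFrame(
    [("u1", 10.0)] * 1000 + [("u2", 5.0)], ["user_id", "amount"]
)
users = spark.createDataFrame([("u1", "US"), ("u2", "DE")], ["user_id", "country"])

# Debug step: inspect how unbalanced the join key is.
orders.groupBy("user_id").count().orderBy(F.desc("count")).show(5)

SALT_BUCKETS = 8  # assumed salt factor; tune based on the observed skew

# Salt the skewed side so rows for the hot key land in different partitions.
orders_salted = orders.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate each dimension row once per salt value so salted keys still match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
users_salted = users.crossJoin(salts)

joined = orders_salted.join(users_salted, on=["user_id", "salt"]).drop("salt")
```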
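
And a minimal MLlib pipeline sketch, assuming a hypothetical dataset with two numeric feature columns and a binary `label` column; in practice you would evaluate on a held-out split rather than the training data.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training data: two numeric features and a binary label.
data = spark.createDataFrame(
    [(0.0, 1.2, 0), (1.5, 0.3, 1), (2.1, 0.8, 1), (0.2, 1.9, 0)],
    ["f1", "f2", "label"],
)

# Assemble features, scale them, and fit a logistic regression in one pipeline.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(data)
predictions = model.transform(data)

auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"AUC: {auc:.3f}")
```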