This project builds a data lake on Amazon Web Services (AWS) for a music application. The data architecture is shown in the image below. The output files generated by the music application are placed on S3 in CSV or JSON format. We process this data on EMR using Spark, a distributed data processing engine that facilitates big data processing.
The application's files comprise event logs (users' music listening history) and song library data. For this particular project, the data is stored in JSON format in AWS S3. Completing the project involves three main tasks:
- Load data from S3 into EMR and distribute it across the Spark cluster
- Transform the data and load it into dimension and fact tables, which are kept as Spark DataFrames
- Write the DataFrames back to S3 in Parquet format
- Configure your AWS setup, including credentials, in `dl.cfg` and update the file storage paths in S3:
  `input_data = "s3a://udacity-dend/"`
  `output_data = "s3a://udacity-dend/output/"`
- Run `python etl.py` and enjoy the result.
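
As a rough illustration of the configuration step, the sketch below reads credentials from text in the style of `dl.cfg` using Python's standard `configparser`. The `[AWS]` section name, the two key names, the placeholder values, and the helper `load_aws_credentials` are assumptions made for this example; the actual layout of `dl.cfg` may differ.

```python
import configparser
import os

# Hypothetical contents of dl.cfg; the section and key names are assumptions.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = your-access-key-id
AWS_SECRET_ACCESS_KEY = your-secret-access-key
"""


def load_aws_credentials(cfg_text, env=os.environ):
    """Parse dl.cfg-style text and copy the credentials into `env`."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        env[key] = config["AWS"][key]
    return env


# A plain dict is passed here so the example leaves the real environment alone.
env = load_aws_credentials(SAMPLE_CFG, env={})
print(sorted(env))
```

In a real run one would likely read the file from disk with `config.read("dl.cfg")` and leave `env` at its default so that the credentials end up in the process environment.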
- `dl.cfg` contains the configuration parameters for accessing AWS EMR and the S3 bucket.
- `etl.py` loads data from S3 into the Spark cluster on EMR. The processed data is saved back to S3 for further use.
- `Sparkify Data Lake.ipynb` contains experiments in data exploration, transformation, and function testing on local Spark with a small set of data in the `data` folder. The output tables are saved in `./output_data`. Experimenting in the Jupyter notebook is helpful before deploying on AWS.
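
The load, transform, and write flow described above could look roughly like the following PySpark sketch. It is not the content of `etl.py`: the function names (`table_path`, `create_spark_session`, `run_etl`), the application name, the glob pattern, and the selected column names are assumptions for illustration, and only two of the output tables are shown.

```python
def table_path(output_data, table_name):
    """Build the output location for one table, e.g. .../output/songs/."""
    return output_data.rstrip("/") + "/" + table_name + "/"


def create_spark_session():
    """Create (or reuse) a Spark session; pyspark is imported lazily."""
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName("sparkify-data-lake-sketch").getOrCreate()


def run_etl(spark, input_data, output_data):
    """Read raw song JSON, derive two example tables, write them as Parquet."""
    # 1. Load: Spark reads the JSON files and distributes them over the cluster.
    song_df = spark.read.json(input_data + "song-data/*/*/*/*.json")

    # 2. Transform: pick columns for two dimension tables.
    #    The column names below are assumed, not taken from etl.py.
    songs = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])
    artists = song_df.select("artist_id", "artist_name").dropDuplicates(["artist_id"])

    # 3. Write: store each table as Parquet under the output location.
    songs.write.mode("overwrite").parquet(table_path(output_data, "songs"))
    artists.write.mode("overwrite").parquet(table_path(output_data, "artists"))


# The path helper runs without Spark installed.
print(table_path("./output_data", "songs"))
```

For a local experiment, one might call `run_etl(create_spark_session(), "./data/", "./output_data/")`; on EMR the two paths would be the `s3a://` locations instead.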
.
├── data
│ ├── log-data
│ └── song-data
├── dl.cfg
├── etl.py
├── images
│ └── data architechture.png
├── output_data
│ ├── artists
│ ├── songplays
│ ├── songs
│ ├── time
│ └── users
├── README.md
└── Sparkify Data Lake.ipynb
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is one object from this dataset as listed in the S3 bucket:
s3.ObjectSummary(bucket_name='udacity-dend', key='song-data/A/A/A/TRAAAAK128F9318786.json')
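
To make the key layout concrete, here is a small sketch that splits the key shown above into its partition directories and track ID. The helper name `parse_song_key` is hypothetical, and the sketch assumes every song key follows the same `song-data/<dir>/<dir>/<dir>/<track_id>.json` shape as this one example.

```python
from pathlib import PurePosixPath


def parse_song_key(key):
    """Split a song-data key into its partition directories and track ID."""
    path = PurePosixPath(key)
    # For the example key, path.parts is
    # ('song-data', 'A', 'A', 'A', 'TRAAAAK128F9318786.json').
    return {
        "partition": "/".join(path.parts[1:-1]),
        "track_id": path.stem,
    }


info = parse_song_key("song-data/A/A/A/TRAAAAK128F9318786.json")
print(info)
```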
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from a music streaming app based on specified configurations. For example:
s3.ObjectSummary(bucket_name='udacity-dend', key='log-data/2018/11/2018-11-01-events.json')
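
The log key above appears to encode the event date in both its directories and its file name. As a hedged sketch, the date can be recovered from the file name as follows; `parse_log_key` is a hypothetical helper, and it assumes all log files are named `YYYY-MM-DD-events.json` like the single example shown.

```python
from datetime import date, datetime
from pathlib import PurePosixPath


def parse_log_key(key):
    """Return the calendar date encoded in a log-data key's file name."""
    name = PurePosixPath(key).name  # e.g. '2018-11-01-events.json'
    # The first ten characters hold the date in YYYY-MM-DD form.
    return datetime.strptime(name[:10], "%Y-%m-%d").date()


log_day = parse_log_key("log-data/2018/11/2018-11-01-events.json")
print(log_day.isoformat())
```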