vdn-projects/Data-Lake-with-AWS-S3-and-Spark

Data Lake with AWS S3 & Spark

Summary of project

This project builds a data lake on Amazon Web Services (AWS) for a music application. The data architecture is outlined in the image below. The output files generated by the music application are placed on S3 in CSV or JSON format. We process this data on EMR using Spark, a distributed data processing engine that facilitates big data processing.

The application's files comprise event logs (users' music listening history) and song library data. For this particular project, the data are stored in JSON format on AWS S3. Completing the project involves three main tasks:

  • Load data from S3 to EMR and distribute it across the Spark cluster
  • Transform the data and load it into dimension and fact tables, held as Spark DataFrames
  • Write the DataFrames back to S3 in Parquet format
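The three tasks above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the contents of etl.py: the function names are hypothetical, the path globs are inferred from the example keys in the "Dataset used in S3" section, and the column names (song_id, title, userId, and so on) are assumptions about the JSON fields that this README does not confirm.

```python
def song_data_glob(input_data: str) -> str:
    """Glob for song files, which sit three directories deep (song-data/A/A/A/...)."""
    return input_data.rstrip("/") + "/song-data/*/*/*/*.json"


def log_data_glob(input_data: str) -> str:
    """Glob for log files, which sit under year/month (log-data/2018/11/...)."""
    return input_data.rstrip("/") + "/log-data/*/*/*.json"


def table_path(output_data: str, table: str) -> str:
    """Output location for one table, e.g. <output_data>/songs."""
    return output_data.rstrip("/") + "/" + table


def run_etl(spark, input_data: str, output_data: str) -> None:
    """Run the load, transform, and write steps with an existing SparkSession."""
    # 1. Load: Spark reads the JSON files from S3 and distributes them across the cluster.
    song_df = spark.read.json(song_data_glob(input_data))
    log_df = spark.read.json(log_data_glob(input_data))

    # 2. Transform: build two dimension tables (column names are assumed, see above).
    songs = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])
    users = log_df.select(
        "userId", "firstName", "lastName", "gender", "level"
    ).dropDuplicates(["userId"])

    # 3. Write: save each DataFrame back to S3 as Parquet.
    songs.write.mode("overwrite").parquet(table_path(output_data, "songs"))
    users.write.mode("overwrite").parquet(table_path(output_data, "users"))
```

Given a session created with `SparkSession.builder.getOrCreate()`, the sketch would be invoked as `run_etl(spark, "s3a://udacity-dend/", "s3a://udacity-dend/output/")`. The remaining tables (artists, time, songplays) would follow the same select-and-write pattern.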

How to run the Python scripts

  1. Configure your AWS setup, including credentials, in dl.cfg, and update the S3 storage paths:
input_data = "s3a://udacity-dend/"
output_data = "s3a://udacity-dend/output/"
  2. Run python etl.py and enjoy the result.
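As an illustration of the configuration step, the sketch below reads credentials with Python's standard configparser module and exports them as environment variables. The section name [AWS], the key names, and the placeholder values are assumptions; this README does not show the actual layout of dl.cfg.

```python
import configparser
import os

# Hypothetical dl.cfg contents; the real file's section and key names may differ.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = example-key-id
AWS_SECRET_ACCESS_KEY = example-secret
"""


def load_aws_credentials(cfg_text: str) -> dict:
    """Parse the config text and return the two credential values."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    return {
        "AWS_ACCESS_KEY_ID": config["AWS"]["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": config["AWS"]["AWS_SECRET_ACCESS_KEY"],
    }


# Export the credentials so that AWS client libraries can pick them up.
creds = load_aws_credentials(SAMPLE_CFG)
os.environ.update(creds)
```

In a real run, `config.read("dl.cfg")` would replace `read_string`, and the file should be kept out of version control because it holds secrets.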

Files in the repository

  1. dl.cfg contains the configuration parameters for accessing AWS EMR and the S3 bucket.
  2. etl.py loads data from S3 into the Spark cluster on EMR. The processed data are saved back to S3 for further use.
  3. Sparkify Data Lake.ipynb is an experiment in data exploration, transformation, and function testing on local Spark, using a small data set in the data folder. The output tables are saved in ./output_data. Experimenting in a Jupyter notebook is helpful before deploying on AWS.
.
├── data
│   ├── log-data
│   └── song-data
├── dl.cfg
├── etl.py
├── images
│   └── data architechture.png
├── output_data
│   ├── artists
│   ├── songplays
│   ├── songs
│   ├── time
│   └── users
├── README.md
└── Sparkify Data Lake.ipynb

Dataset used in S3

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the three letters that follow the TR prefix of each song's track ID. For example, here is the key of one file in this dataset.

s3.ObjectSummary(bucket_name='udacity-dend', key='song-data/A/A/A/TRAAAAK128F9318786.json')
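To make the partitioning scheme concrete, here is a small sketch that rebuilds the key above from a track ID. The function name is hypothetical, and the rule (skip the leading "TR", then use the next three letters as directory levels) is inferred from the example key rather than taken from the project code.

```python
def song_key(track_id: str) -> str:
    """Build the S3 key for a song file from its track ID."""
    if len(track_id) < 5 or not track_id.startswith("TR"):
        raise ValueError(f"unexpected track ID: {track_id!r}")
    # Characters 3 to 5 of the ID become the three directory levels.
    a, b, c = track_id[2:5]
    return f"song-data/{a}/{b}/{c}/{track_id}.json"


print(song_key("TRAAAAK128F9318786"))
# song-data/A/A/A/TRAAAAK128F9318786.json
```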

The second dataset consists of log files in JSON format, generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from a music streaming app, based on specified configurations. For example, here is the key of one log file.

s3.ObjectSummary(bucket_name='udacity-dend', key='log-data/2018/11/2018-11-01-events.json')
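The log key above appears to follow a year/month/date layout. The sketch below reproduces it for a given day; the function name is hypothetical and the layout is inferred from this single example key, so other files in the bucket may not follow it.

```python
from datetime import date


def log_key(day: date) -> str:
    """Build the S3 key of the event log for one day (layout assumed from the example)."""
    return f"log-data/{day:%Y}/{day:%m}/{day.isoformat()}-events.json"


print(log_key(date(2018, 11, 1)))
# log-data/2018/11/2018-11-01-events.json
```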
