Data Lake with AWS S3 & Spark

Summary of project

This project builds a data lake on Amazon Web Services (AWS) for a music application. The data architecture is shown in images/data architechture.png. The output files generated by the music application are placed on S3 in CSV or JSON format. We process this data on EMR using Spark, a distributed data processing engine that facilitates big data processing.

The application's files comprise event logs (users' historical music listening activity) and song library data. For this particular project, the data is stored in JSON format in AWS S3. Three main tasks are involved in completing the project:

Load the data from S3 into EMR and distribute it across the Spark cluster

Transform the data and load it into dimension and fact tables, which are kept as Spark DataFrames

Write the DataFrames back to S3 in Parquet format
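The three tasks above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the repository's etl.py: the function names, the application name, the song-data glob pattern, and the column names (song_id, title, artist_id, year, duration) are assumptions made for illustration. Only the songs table name and the s3a path style come from this README.

```python
# Column names below are assumed for illustration; adjust to the real schema.
SONG_COLUMNS = ["song_id", "title", "artist_id", "year", "duration"]


def table_path(output_data, table_name):
    """Join the output root and a table name into one path."""
    return output_data.rstrip("/") + "/" + table_name


def create_spark_session():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName("sparkify-data-lake").getOrCreate()


def process_song_data(spark, input_data, output_data):
    # Task 1: load the raw JSON song files from S3 into a Spark DataFrame.
    song_df = spark.read.json(input_data + "song-data/*/*/*/*.json")

    # Task 2: transform into a dimension table (one row per song).
    songs_table = song_df.select(*SONG_COLUMNS).dropDuplicates(["song_id"])

    # Task 3: write the DataFrame back to S3 in Parquet format.
    songs_table.write.mode("overwrite").parquet(table_path(output_data, "songs"))
    return songs_table
```

On a cluster this would be called as process_song_data(create_spark_session(), input_data, output_data), with the two paths set as described in the run instructions.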

How to run the python scripts

Configure your AWS setup, including your credentials, in dl.cfg and update the S3 storage paths:

input_data = "s3a://udacity-dend/"
output_data = "s3a://udacity-dend/output/"
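Loading the credentials from dl.cfg could look like the sketch below. The README does not show the contents of dl.cfg, so the [AWS] section name and the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY key names are assumptions, and load_aws_credentials is a hypothetical helper rather than a function from etl.py.

```python
import configparser
import os


def load_aws_credentials(cfg_path="dl.cfg"):
    """Read AWS credentials from a config file and export them as env vars."""
    # interpolation=None keeps '%' characters in secret keys from being
    # treated as configparser interpolation markers.
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(cfg_path):
        raise FileNotFoundError(f"config file not found: {cfg_path}")

    # Assumed layout: an [AWS] section holding the two keys below.
    aws = config["AWS"]
    os.environ["AWS_ACCESS_KEY_ID"] = aws["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = aws["AWS_SECRET_ACCESS_KEY"]
```

Exporting the keys as environment variables is one common way to make them visible to the S3 connector. On EMR, an instance role may make this step unnecessary.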

Run python etl.py; the output tables are written to the output_data path on S3.

Files in the repository

dl.cfg contains the configuration parameters for accessing AWS EMR and the S3 bucket.
etl.py loads data from S3 into the Spark cluster on EMR. The processed data is saved back to S3 for further use.
Sparkify Data Lake.ipynb is an experiment in data exploration, transformation, and function testing on local Spark, using a small set of data in the data folder. The output tables are saved in ./output_data. Experimenting in the Jupyter notebook is helpful before deploying on AWS.

.
├── data
│   ├── log-data
│   └── song-data
├── dl.cfg
├── etl.py
├── images
│   └── data architechture.png
├── output_data
│   ├── artists
│   ├── songplays
│   ├── songs
│   ├── time
│   └── users
├── README.md
└── Sparkify Data Lake.ipynb

Dataset used in S3

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and its artist. The files are partitioned by the first three letters of each song's track ID after the TR prefix. For example, here is the path to one file in this dataset:

s3.ObjectSummary(bucket_name='udacity-dend', key='song-data/A/A/A/TRAAAAK128F9318786.json')
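The partitioning scheme can be illustrated with a small helper that rebuilds the key from a track ID. This is a sketch under one assumption drawn from the example key above: the three partition letters are the third to fifth characters of the track ID, that is, the letters following the leading TR. The name song_key is hypothetical.

```python
def song_key(track_id, prefix="song-data"):
    """Build the S3 key for a song file from its track ID."""
    # Characters 3-5 of the track ID (after the "TR" prefix) give the folders.
    a, b, c = track_id[2:5]
    return f"{prefix}/{a}/{b}/{c}/{track_id}.json"
```

For the track ID TRAAAAK128F9318786 this yields the key shown above.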

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. For example, here is the path to one log file:

s3.ObjectSummary(bucket_name='udacity-dend', key='log-data/2018/11/2018-11-01-events.json')
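The log key above suggests a year/month folder layout with one file per day. A minimal sketch of that layout, assuming every log file follows the same pattern as this single example (log_key is a hypothetical helper):

```python
from datetime import date


def log_key(day, prefix="log-data"):
    """Build the S3 key for the event log of a given day."""
    # Layout inferred from the example: <prefix>/<year>/<month>/<date>-events.json
    return f"{prefix}/{day:%Y}/{day:%m}/{day:%Y-%m-%d}-events.json"
```

Calling log_key(date(2018, 11, 1)) reproduces the key shown above.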

vdn-projects/Data-Lake-with-AWS-S3-and-Spark
