The purpose of this project is to demonstrate the creation of a relational database schema and ETL pipeline used to analyze user activity for 'Sparkify', a fictional music streaming app. The analytical goal is to understand which songs users are listening to. Song metadata and user activity logs in JSON format are transformed and loaded into a PostgreSQL database with a schema designed for this analytical goal.
Requirements:
- Python 3
- psycopg2
- pandas
- PostgreSQL
The file structure is as follows:

- data/ - a folder containing song metadata and user activity logs in JSON format.
- create_tables.py - creates the database and required tables according to the schema below.
- sql_queries.py - defines strings containing all the required SQL queries.
- etl.py - performs the ETL process.
- README.md - this readme file.
Song metadata files should be placed in data/song_data/ and user activity logs should be placed in data/log_data/.
An example of the song dataset format is:

```json
{
    "num_songs": 1,
    "artist_id": "ARGCY1Y1187B9A4FA5",
    "artist_latitude": 36.16778,
    "artist_longitude": -86.77836,
    "artist_location": "Nashville, TN.",
    "artist_name": "Gloriana",
    "song_id": "SOQOTLQ12AB01868D0",
    "title": "Clementina Santaf\u00e8",
    "duration": 153.33832,
    "year": 0
}
```
An example of the activity log dataset format is:

```json
{
    "artist": "Joshua Radin",
    "auth": "Logged In",
    "firstName": "Aleena",
    "gender": "F",
    "itemInSession": 19,
    "lastName": "Kirby",
    "length": 161.69751,
    "level": "paid",
    "location": "Waterloo-Cedar Falls, IA",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1541022995796.0,
    "sessionId": 237,
    "song": "We Are Okay",
    "status": 200,
    "ts": 1541385750796,
    "userAgent": "Mozilla\/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko\/20100101 Firefox\/31.0",
    "userId": "44"
}
```
The datasets used for testing and development of this software are:
- A subset of real song metadata from the Million Song Dataset.
- Log files generated from that metadata using an event simulator.
The relational database has a star schema, optimized for OLAP workloads. It consists of a single fact table, songplays, which references four dimension tables: users, songs, artists, and time. In the tables below, * marks the primary key.

The songplays fact table has attributes:

| Column | Data Type |
| --- | --- |
| *songplay_id | integer |
| start_time | timestamp |
| user_id | integer |
| level | varchar |
| song_id | varchar |
| artist_id | varchar |
| session_id | integer |
| location | varchar |
| user_agent | varchar |

The dimension tables are:

1. The users table has attributes:

| Column | Data Type |
| --- | --- |
| *user_id | integer |
| first_name | varchar |
| last_name | varchar |
| gender | char(1) |
| level | varchar |

2. The songs table has attributes:

| Column | Data Type |
| --- | --- |
| *song_id | varchar |
| title | varchar |
| artist_id | varchar |
| year | integer |
| duration | numeric |

3. The artists table has attributes:

| Column | Data Type |
| --- | --- |
| *artist_id | varchar |
| name | varchar |
| location | varchar |
| latitude | numeric |
| longitude | numeric |

4. The time table has attributes:

| Column | Data Type |
| --- | --- |
| *start_time | timestamp |
| hour | integer |
| day | integer |
| week | integer |
| month | integer |
| year | integer |
| weekday | integer |
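As an illustrative sketch of how these tables can be declared, the statement below is in the style of the strings defined in sql_queries.py; the exact string name and constraints in the actual file may differ, and the use of SERIAL to generate the surrogate songplay_id key is an assumption:

```python
# Hypothetical excerpt in the style of sql_queries.py: a CREATE TABLE
# statement for the songplays fact table. Columns and types follow the
# schema above; SERIAL (an assumption) auto-generates songplay_id.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  timestamp NOT NULL,
        user_id     integer NOT NULL,
        level       varchar,
        song_id     varchar,
        artist_id   varchar,
        session_id  integer,
        location    varchar,
        user_agent  varchar
    );
"""
```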
Song and artist information is loaded from the song metadata JSON files into a pandas DataFrame. The data is cleaned by replacing invalid values with nulls, and is then used to populate the songs and artists tables according to the schema above.
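A minimal sketch of this step for a single song file, assuming insert-statement names song_table_insert and artist_table_insert in sql_queries.py:

```python
import pandas as pd

from sql_queries import song_table_insert, artist_table_insert  # hypothetical names

def process_song_file(cur, filepath):
    """Load one song metadata file and insert into songs and artists."""
    # Each song file holds a single JSON record.
    df = pd.read_json(filepath, lines=True)

    # Replace invalid values (e.g. NaN coordinates) with None so they
    # are stored as SQL NULLs on insert.
    df = df.astype(object).where(pd.notnull(df), None)

    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(song_table_insert, song_data)

    artist_data = df[["artist_id", "artist_name", "artist_location",
                      "artist_latitude", "artist_longitude"]].values[0].tolist()
    cur.execute(artist_table_insert, artist_data)
```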
User activity events are loaded from the event log JSON files into a pandas DataFrame, which is filtered to contain only song play events ('page' = 'NextSong'). This DataFrame is then used to populate the songplays, users, and time tables according to the schema above.
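A corresponding sketch for deriving the time table from the filtered events, assuming a time_table_insert statement in sql_queries.py (the users and songplays inserts follow the same pattern):

```python
import pandas as pd

from sql_queries import time_table_insert  # hypothetical name

def process_time_data(cur, filepath):
    """Derive time-table rows from the song play events in one log file."""
    df = pd.read_json(filepath, lines=True)

    # Keep only song play events.
    df = df[df["page"] == "NextSong"]

    # The ts field is a Unix timestamp in milliseconds.
    t = pd.to_datetime(df["ts"], unit="ms")

    time_df = pd.DataFrame({
        "start_time": t,
        "hour": t.dt.hour,
        "day": t.dt.day,
        "week": t.dt.isocalendar().week,
        "month": t.dt.month,
        "year": t.dt.year,
        "weekday": t.dt.weekday,  # 0 = Monday in pandas
    })

    for _, row in time_df.iterrows():
        cur.execute(time_table_insert, list(row))
```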
Usage steps:

- Copy the dataset files (JSON format) into the data/ directory.
- Ensure that there is a PostgreSQL server instance running.
- Modify the database connection credentials HOST, USERNAME, and PASSWORD in create_tables.py as appropriate for your PostgreSQL server instance.
- From the project's root directory, execute the following from the command line:

```
python3 create_tables.py
python3 etl.py
```
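For reference, the connection logic in create_tables.py is along these lines (a sketch; the database name sparkifydb and the default credential values shown here are assumptions):

```python
import psycopg2

# Hypothetical defaults; edit to match your PostgreSQL instance.
HOST = "127.0.0.1"
USERNAME = "student"
PASSWORD = "student"

conn = psycopg2.connect(
    host=HOST,
    dbname="sparkifydb",  # assumed database name
    user=USERNAME,
    password=PASSWORD,
)
conn.set_session(autocommit=True)
cur = conn.cursor()
```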
Once the ETL pipeline has been run, you can query the PostgreSQL database to answer analytical questions.
For example, to find the three days of the week with the highest total play count, a sample query is:
```sql
SELECT
    to_char(to_date(weekday||'-2017', 'IDDD-IYYY'), 'Day') AS day,
    count(*) AS play_count
FROM
    songplays JOIN time ON songplays.start_time = time.start_time
GROUP BY 1
ORDER BY 2 DESC
LIMIT 3;
```
Result:

```
    day    | play_count
-----------+------------
 Sunday    |       1370
 Tuesday   |       1364
 Thursday  |       1295
```
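If you prefer to issue such queries from Python rather than psql, a minimal psycopg2 sketch (reusing the assumed connection settings above) looks like:

```python
import psycopg2

# Assumed connection settings; adjust for your instance.
conn = psycopg2.connect(host="127.0.0.1", dbname="sparkifydb",
                        user="student", password="student")
cur = conn.cursor()
cur.execute("""
    SELECT to_char(to_date(weekday||'-2017', 'IDDD-IYYY'), 'Day') AS day,
           count(*) AS play_count
    FROM songplays JOIN time ON songplays.start_time = time.start_time
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 3;
""")
for day, play_count in cur.fetchall():
    print(day, play_count)
conn.close()
```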
Copyright (c) 2022 Gary Jones
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Project template provided by Udacity.