project-dezoomcamp

Project for the Data Engineering Zoomcamp (DE Zoomcamp) by DataTalks.Club

Background Issue

(This is a hypothetical and synthetic requirement formulated for the zoomcamp project).

Build a data pipeline from the Spotify tracks dataset, derive popularity information for tracks, artists, and so on, and present the results on a dashboard for stakeholders.

Project high level design

This project produces a pipeline which:

Builds the cloud infrastructure using Terraform
Pulls the raw data into GCP
Transforms the raw data
Joins the artists and tracks tables to provide popularity information and writes the result back into BigQuery
Produces dashboard tiles in Google Data Studio

This allows analysts to view the combined track and artist popularity information for quick review.
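The join step can be illustrated with a small plain-Python sketch. The project itself runs the transformation as a PySpark job on Dataproc; the code below is not that job, only an illustration of the idea. The sample rows and the column names (`artist_id`, `popularity`, `followers`) are assumptions made for this example and may not match the real dataset.

```python
from collections import defaultdict

# Hypothetical sample rows; the real dataset's columns may differ.
tracks = [
    {"name": "Song A", "popularity": 80, "artist_id": "a1"},
    {"name": "Song B", "popularity": 65, "artist_id": "a1"},
    {"name": "Song C", "popularity": 90, "artist_id": "a2"},
]
artists = [
    {"id": "a1", "name": "Artist One", "followers": 1200},
    {"id": "a2", "name": "Artist Two", "followers": 300},
]


def join_tracks_with_artists(tracks, artists):
    """Inner-join tracks to artists on the artist id, one row per track."""
    artists_by_id = {a["id"]: a for a in artists}
    joined = []
    for t in tracks:
        artist = artists_by_id.get(t["artist_id"])
        if artist is None:
            continue  # inner join: skip tracks without a matching artist
        joined.append({
            "track": t["name"],
            "track_popularity": t["popularity"],
            "artist": artist["name"],
            "artist_followers": artist["followers"],
        })
    return joined


def tracks_per_artist(joined):
    """Count how many joined tracks belong to each artist."""
    counts = defaultdict(int)
    for row in joined:
        counts[row["artist"]] += 1
    return dict(counts)


joined = join_tracks_with_artists(tracks, artists)
most_popular_track = max(joined, key=lambda r: r["track_popularity"])
most_followed_artist = max(artists, key=lambda a: a["followers"])
```

The joined rows carry both track and artist fields, which is the shape a dashboard tile such as "most popular song" or "tracks per artist" would read from.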

Dataset

Spotify Dataset

Technology choices

Cloud: GCP
Data lake: Google Cloud Storage (GCS) bucket
Infrastructure as code (IaC): Terraform
Workflow orchestration: Airflow
Data Warehouse: BigQuery
Transformation: Google Cloud Dataproc

Installation Google Cloud Infrastructure Using Terraform

# Refresh service-account's auth-token for this session
gcloud auth application-default login --no-launch-browser

# Initialize state file (.tfstate)
cd terraform/
terraform init

# Check changes to new infra plan
terraform plan -var="project=<your project id>" \
-var="region=<your region>" \
-var="BQ_DATASET=<dataset name on bigquery>" \
-var="DATAPROC_CLUSTERNAME=<dataproc clustername>" \
-var="SERVICE_ACCOUNT=<service-account-from-iam>"

# Create new infra
terraform apply -var="project=<your project id>" \
-var="region=<your region>" \
-var="BQ_DATASET=<dataset name on bigquery>" \
-var="DATAPROC_CLUSTERNAME=<dataproc clustername>" \
-var="SERVICE_ACCOUNT=<service-account-from-iam>"

# Delete infra after your work, to avoid costs on any running services
terraform destroy -var="project=<your project id>"
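The plan and apply commands above repeat the same five variables, so a typo in one of them is easy to miss. One possible way to keep them in sync is to build the argument list from a single mapping. This is a minimal sketch, not part of the repository: the helper name `terraform_command` is hypothetical, and the values are the same placeholders used above.

```python
import shlex

# Placeholder values, mirroring the commands above; replace with your own.
TF_VARS = {
    "project": "<your project id>",
    "region": "<your region>",
    "BQ_DATASET": "<dataset name on bigquery>",
    "DATAPROC_CLUSTERNAME": "<dataproc clustername>",
    "SERVICE_ACCOUNT": "<service-account-from-iam>",
}


def terraform_command(action, tf_vars):
    """Return the argument list for `terraform <action>` with -var flags."""
    if action not in {"plan", "apply", "destroy"}:
        raise ValueError(f"unsupported terraform action: {action}")
    args = ["terraform", action]
    for name, value in tf_vars.items():
        # Equivalent to -var="name=value" once the shell strips the quotes.
        args.append(f"-var={name}={value}")
    return args


plan_cmd = terraform_command("plan", TF_VARS)
apply_cmd = terraform_command("apply", TF_VARS)

# Print a shell-safe version of the command for inspection.
print(shlex.join(plan_cmd))
```

To actually run a command, the list can be passed to `subprocess.run(plan_cmd, cwd="terraform", check=True)`, assuming the `terraform/` directory from the steps above.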

Installation Google Dataproc

I was able to create the Dataproc cluster with Terraform, but the PySpark job still failed on that cluster. When I created the cluster with the wizard in the GCP console instead, the same PySpark job ran successfully. I therefore recommend creating the Dataproc cluster with the GCP console wizard.

dataproc-setup

Setup Airflow

## Installation Airflow
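This section asks for `GCP_PROJECT_ID` and `GCP_GCS_BUCKET` to be set in `.env`. A quick check before starting the containers can catch an empty value. The following is a minimal sketch and not part of the repository: the helper names `parse_env` and `missing_keys` are hypothetical, and the parser only handles simple `KEY=value` lines.

```python
REQUIRED_KEYS = ("GCP_PROJECT_ID", "GCP_GCS_BUCKET")


def parse_env(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


# Example content with one value left empty (the project id is made up).
sample = "# project settings\nGCP_PROJECT_ID=my-project\nGCP_GCS_BUCKET=\n"
print(missing_keys(sample))  # ['GCP_GCS_BUCKET']
```

To check the real file, read it first, for example `missing_keys(open(".env").read())`.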

```shell
docker-compose up
```

Airflow webserver: http://localhost:8090

user: admin
password: admin

Don't forget to set your Google project ID and GCS bucket in .env:

GCP_PROJECT_ID=
GCP_GCS_BUCKET=

Dashboard

Total number of tracks
Total number of artists
Most popular song - by popularity
Most popular artist - by followers
Most tracks - by artist

View-Dashboard-On-Google-Datastudio
