# Project for DE Zoomcamp with DataTalks.Club

(This is a hypothetical, synthetic requirement formulated for the Zoomcamp project.)

Build a data pipeline from the Spotify tracks dataset, extract popularity information for tracks, artists, and so on, and present the results on a dashboard for stakeholders.

This project produces a pipeline which:
- Builds the cloud infrastructure using Terraform
- Pulls the raw data into GCP
- Transforms the raw data
- Joins the artists and tracks tables to derive popularity information and writes the result back into BigQuery
- Produces dashboard tiles in Google Data Studio

This lets analysts view the combined track and artist popularity information for quick review.

Technologies used:
- Cloud: GCP
- Data lake: Google Cloud Storage (GCS) bucket
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Airflow
- Data Warehouse: BigQuery
- Transformation: Google Cloud Dataproc
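
The join step of the pipeline can be tried out locally before running it on Dataproc and BigQuery. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for BigQuery. The table layout, the column names (`artist_id`, `followers`, `popularity`), and the sample rows are assumptions made for illustration; they are not the real Spotify schema or this project's PySpark job.

```python
import sqlite3

# Hypothetical sample rows; the column layout is an assumption for illustration.
ARTISTS = [
    ("a1", "Artist One", 1000, 80),
    ("a2", "Artist Two", 500, 60),
]
TRACKS = [
    ("t1", "Song A", "a1", 90),
    ("t2", "Song B", "a1", 70),
    ("t3", "Song C", "a2", 85),
]


def join_tracks_artists(artists, tracks):
    """Join tracks to their artists and sort by track popularity (descending)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artists (id TEXT, name TEXT, followers INTEGER, popularity INTEGER)"
    )
    conn.execute(
        "CREATE TABLE tracks (id TEXT, name TEXT, artist_id TEXT, popularity INTEGER)"
    )
    conn.executemany("INSERT INTO artists VALUES (?, ?, ?, ?)", artists)
    conn.executemany("INSERT INTO tracks VALUES (?, ?, ?, ?)", tracks)
    rows = conn.execute(
        """
        SELECT t.name, a.name, t.popularity, a.followers
        FROM tracks AS t
        JOIN artists AS a ON a.id = t.artist_id
        ORDER BY t.popularity DESC, t.name
        """
    ).fetchall()
    conn.close()
    return rows


joined = join_tracks_artists(ARTISTS, TRACKS)
for track_name, artist_name, popularity, followers in joined:
    print(track_name, artist_name, popularity, followers)
```

The same `SELECT ... JOIN` shape should carry over to a BigQuery or Spark SQL query once the real table and column names are substituted.
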
```shell
# Refresh service-account's auth-token for this session
gcloud auth application-default login --no-launch-browser

# Initialize state file (.tfstate)
cd terraform/
terraform init

# Check changes to new infra plan
terraform plan -var="project=<your project id>" \
  -var="region=<your region>" \
  -var="BQ_DATASET=<dataset name on bigquery>" \
  -var="DATAPROC_CLUSTERNAME=<dataproc clustername>" \
  -var="SERVICE_ACCOUNT=<service-account-from-iam>"

# Create new infra
terraform apply -var="project=<your project id>" \
  -var="region=<your region>" \
  -var="BQ_DATASET=<dataset name on bigquery>" \
  -var="DATAPROC_CLUSTERNAME=<dataproc clustername>" \
  -var="SERVICE_ACCOUNT=<service-account-from-iam>"

# Delete infra after your work, to avoid costs on any running services
terraform destroy -var="project=<your project id>"
```

I was able to create the Dataproc cluster with Terraform, but the PySpark job still failed when run on that cluster. A cluster created with the wizard in the GCP console ran the same PySpark job successfully, so I recommend creating the Dataproc cluster with the GCP console wizard.

## Airflow Installation
```shell
docker-compose up

# Airflow webserver: http://localhost:8090
#   user: admin
#   password: admin

# Don't forget to set your Google project values in .env
GCP_PROJECT_ID=
GCP_GCS_BUCKET=
```

## Dashboard tiles
- Total number of tracks
- Total number of artists
- Most popular song - by popularity
- Most popular artist - by followers
- Most tracks - by artist
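
As a rough illustration of what each tile computes, here is a minimal sketch in plain Python. The field names (`artist_id`, `followers`, `popularity`) and the sample rows are assumptions for illustration only; in the project itself these figures would come from the BigQuery tables behind the dashboard.

```python
from collections import Counter

# Hypothetical sample rows; field names are assumptions for illustration.
artists = [
    {"id": "a1", "name": "Artist One", "followers": 1000},
    {"id": "a2", "name": "Artist Two", "followers": 5000},
]
tracks = [
    {"id": "t1", "name": "Song A", "artist_id": "a1", "popularity": 90},
    {"id": "t2", "name": "Song B", "artist_id": "a1", "popularity": 70},
    {"id": "t3", "name": "Song C", "artist_id": "a2", "popularity": 85},
]


def dashboard_tiles(artists, tracks):
    """Compute one value per dashboard tile from in-memory rows."""
    names = {a["id"]: a["name"] for a in artists}
    # Count tracks per artist name, most prolific artist first.
    tracks_per_artist = Counter(names[t["artist_id"]] for t in tracks)
    return {
        "total_tracks": len(tracks),
        "total_artists": len(artists),
        "most_popular_track": max(tracks, key=lambda t: t["popularity"])["name"],
        "most_followed_artist": max(artists, key=lambda a: a["followers"])["name"],
        "tracks_per_artist": tracks_per_artist.most_common(),
    }


tiles = dashboard_tiles(artists, tracks)
for tile, value in tiles.items():
    print(f"{tile}: {value}")
```
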