A project that demonstrates data storage, preprocessing, and analysis using tools like HDFS, Apache Pig, and Hive, executed in an Azure virtual machine environment. The project includes cleaning and aggregating a Spotify dataset and running Hive queries to extract meaningful insights.

sxhixho/Preprocessing_Analysis

Dataset Description

The dataset is a Spotify track dataset stored as a CSV file (`spotify_data.csv`). Its key attributes include the track name, the artist, and a popularity score, which the preprocessing and queries below rely on.

Workflow and Techniques Used

1. Data Storage

  • The dataset is stored in HDFS (/user/hadoop/dataset/spotify_data.csv) to enable distributed storage and processing.
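
The upload can be sketched with the standard HDFS shell commands (the local filename mirrors the path above; adjust paths to your cluster):

```shell
# Create the target directory in HDFS (including parent directories)
hdfs dfs -mkdir -p /user/hadoop/dataset

# Copy the local CSV into HDFS
hdfs dfs -put spotify_data.csv /user/hadoop/dataset/spotify_data.csv

# Verify the upload
hdfs dfs -ls /user/hadoop/dataset
```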

2. Data Preprocessing with Apache Pig

  • The raw dataset is loaded into Pig for preprocessing.
  • Cleaning steps include filtering out records with null or inconsistent values.
  • Aggregation is performed to compute average track popularity per artist.
  • The cleaned and aggregated data is stored back in HDFS (/user/hadoop/output/spotify_cleaned).
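
A minimal Pig Latin sketch of these steps might look like the following (the column names `track_name`, `artist`, and `popularity` are assumptions for illustration; the actual schema may differ):

```pig
-- Load the raw CSV from HDFS (schema assumed for illustration)
raw = LOAD '/user/hadoop/dataset/spotify_data.csv'
      USING PigStorage(',')
      AS (track_name:chararray, artist:chararray, popularity:int);

-- Filter out records with null or inconsistent values
cleaned = FILTER raw BY track_name IS NOT NULL
                    AND artist IS NOT NULL
                    AND popularity IS NOT NULL
                    AND popularity >= 0 AND popularity <= 100;

-- Aggregate: average track popularity per artist
grouped = GROUP cleaned BY artist;
avg_popularity = FOREACH grouped GENERATE
                 group AS artist,
                 AVG(cleaned.popularity) AS avg_popularity;

-- Store the cleaned, aggregated data back in HDFS
STORE avg_popularity INTO '/user/hadoop/output/spotify_cleaned'
      USING PigStorage(',');
```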

3. Data Analysis with Hive

  • A Hive table is created to store the structured dataset.
  • Data is loaded from HDFS into Hive for querying.
  • Three key queries are executed to analyze the dataset:
  1. Identifying the top 5 most popular tracks.
  2. Computing the average popularity of tracks per artist.
  3. Finding the total number of tracks per artist.
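
In HiveQL, the table definition and the three queries could be sketched as follows (the table and column names are assumptions based on the workflow above):

```sql
-- Table over the dataset in HDFS (schema assumed for illustration)
CREATE EXTERNAL TABLE IF NOT EXISTS spotify (
  track_name STRING,
  artist     STRING,
  popularity INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/dataset/';

-- 1. Top 5 most popular tracks
SELECT track_name, artist, popularity
FROM spotify
ORDER BY popularity DESC
LIMIT 5;

-- 2. Average popularity of tracks per artist
SELECT artist, AVG(popularity) AS avg_popularity
FROM spotify
GROUP BY artist;

-- 3. Total number of tracks per artist
SELECT artist, COUNT(*) AS track_count
FROM spotify
GROUP BY artist;
```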

4. Query Results and Export

  • The results of Hive queries are stored back into HDFS for further use.
  • The processed data can be exported and visualized for deeper insights.
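
Writing query results back to HDFS can be sketched with Hive's `INSERT OVERWRITE DIRECTORY` (the output path here is an assumption):

```sql
-- Export a query result to an HDFS directory as CSV
INSERT OVERWRITE DIRECTORY '/user/hadoop/output/avg_popularity'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT artist, AVG(popularity) AS avg_popularity
FROM spotify
GROUP BY artist;
```

The exported files can then be copied to the local filesystem with `hdfs dfs -get` for visualization in external tools.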
