A project that demonstrates data storage, preprocessing, and analysis using tools like HDFS, Apache Pig, and Hive, executed in an Azure virtual machine environment. The project includes cleaning and aggregating a Spotify dataset and running Hive queries to extract meaningful insights.

sxhixho/Preprocessing_Analysis

Dataset Description

The dataset is a Spotify track dataset stored as a CSV file (`spotify_data.csv`). Its key attributes include the track name, the artist, and a popularity score, which the preprocessing and queries below rely on.

Workflow and Techniques Used

1. Data Storage

  • The dataset is stored in HDFS (/user/hadoop/dataset/spotify_data.csv) to enable distributed storage and processing.
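
The upload can be sketched with the standard HDFS shell commands (the local filename mirrors the path above; adjust paths to your cluster):

```shell
# Create the target directory in HDFS (including parent directories)
hdfs dfs -mkdir -p /user/hadoop/dataset

# Copy the local CSV into HDFS
hdfs dfs -put spotify_data.csv /user/hadoop/dataset/spotify_data.csv

# Verify the upload
hdfs dfs -ls /user/hadoop/dataset
```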

2. Data Preprocessing with Apache Pig

  • The raw dataset is loaded into Pig for preprocessing.
  • Cleaning steps include filtering out records with null or inconsistent values.
  • Aggregation is performed to compute average track popularity per artist.
  • The cleaned and aggregated data is stored back in HDFS (/user/hadoop/output/spotify_cleaned).
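
A minimal Pig Latin sketch of these steps might look like the following (the column names `track_name`, `artist`, and `popularity` are assumptions for illustration; the actual schema may differ):

```pig
-- Load the raw CSV from HDFS (schema assumed for illustration)
raw = LOAD '/user/hadoop/dataset/spotify_data.csv'
      USING PigStorage(',')
      AS (track_name:chararray, artist:chararray, popularity:int);

-- Filter out records with null or inconsistent values
cleaned = FILTER raw BY track_name IS NOT NULL
                    AND artist IS NOT NULL
                    AND popularity IS NOT NULL
                    AND popularity >= 0 AND popularity <= 100;

-- Aggregate: average track popularity per artist
grouped = GROUP cleaned BY artist;
avg_popularity = FOREACH grouped GENERATE
                 group AS artist,
                 AVG(cleaned.popularity) AS avg_popularity;

-- Store the cleaned, aggregated data back in HDFS
STORE avg_popularity INTO '/user/hadoop/output/spotify_cleaned'
      USING PigStorage(',');
```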

3. Data Analysis with Hive

  • A Hive table is created to store the structured dataset.
  • Data is loaded from HDFS into Hive for querying.
  • Three key queries are executed to analyze the dataset:
  1. Identifying the top 5 most popular tracks.
  2. Computing the average popularity of tracks per artist.
  3. Finding the total number of tracks per artist.
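
In HiveQL, the table definition and the three queries could be sketched as follows (the table and column names are assumptions based on the workflow above):

```sql
-- Table over the dataset in HDFS (schema assumed for illustration)
CREATE EXTERNAL TABLE IF NOT EXISTS spotify (
  track_name STRING,
  artist     STRING,
  popularity INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/dataset/';

-- 1. Top 5 most popular tracks
SELECT track_name, artist, popularity
FROM spotify
ORDER BY popularity DESC
LIMIT 5;

-- 2. Average popularity of tracks per artist
SELECT artist, AVG(popularity) AS avg_popularity
FROM spotify
GROUP BY artist;

-- 3. Total number of tracks per artist
SELECT artist, COUNT(*) AS track_count
FROM spotify
GROUP BY artist;
```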

4. Query Results and Export

  • The results of Hive queries are stored back into HDFS for further use.
  • The processed data can be exported and visualized for deeper insights.
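
Writing query results back to HDFS can be sketched with Hive's `INSERT OVERWRITE DIRECTORY` (the output path here is an assumption):

```sql
-- Export a query result to an HDFS directory as CSV
INSERT OVERWRITE DIRECTORY '/user/hadoop/output/avg_popularity'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT artist, AVG(popularity) AS avg_popularity
FROM spotify
GROUP BY artist;
```

The exported files can then be copied to the local filesystem with `hdfs dfs -get` for visualization in external tools.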
