## Dataset Description

The dataset consists of the following key attributes:

## Workflow and Techniques Used

### 1. Data Storage

- The dataset is stored in HDFS (`/user/hadoop/dataset/spotify_data.csv`) to enable distributed storage and processing.
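
The upload step above can be sketched with the HDFS shell; this is a minimal example assuming the CSV sits in the current local directory (the local filename is an assumption, the HDFS path is from this README):

```shell
# Create the target directory and upload the raw CSV to HDFS
hdfs dfs -mkdir -p /user/hadoop/dataset
hdfs dfs -put spotify_data.csv /user/hadoop/dataset/spotify_data.csv

# Verify the file landed
hdfs dfs -ls /user/hadoop/dataset
```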

### 2. Data Preprocessing with Apache Pig

- The raw dataset is loaded into Pig for preprocessing.
- Cleaning steps include filtering out null values and inconsistent data.
- Aggregation is performed to compute average track popularity per artist.
- The cleaned and aggregated data is stored back in HDFS (`/user/hadoop/output/spotify_cleaned`).
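
The preprocessing steps above could look roughly like the following Pig Latin script. The column names and types are assumptions (this README does not list the dataset schema); only the HDFS paths come from the text:

```pig
-- Load the raw CSV (schema below is assumed for illustration)
raw = LOAD '/user/hadoop/dataset/spotify_data.csv'
      USING PigStorage(',')
      AS (track_name:chararray, artist:chararray, popularity:int);

-- Filter out null / inconsistent rows
clean = FILTER raw BY artist IS NOT NULL AND popularity IS NOT NULL;

-- Average track popularity per artist
by_artist = GROUP clean BY artist;
avg_pop   = FOREACH by_artist GENERATE
              group AS artist,
              AVG(clean.popularity) AS avg_popularity;

-- Store the cleaned, aggregated data back to HDFS
STORE avg_pop INTO '/user/hadoop/output/spotify_cleaned' USING PigStorage(',');
```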

### 3. Data Analysis with Hive

- A Hive table is created to store the structured dataset.
- Data is loaded from HDFS into Hive for querying.
- Three key queries are executed to analyze the dataset:
  1. Identifying the top 5 most popular tracks.
  2. Computing the average popularity of tracks per artist.
  3. Finding the total number of tracks per artist.
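
A sketch of the Hive table and the three queries, assuming the same illustrative schema as above (the actual column names may differ):

```sql
-- Table schema is an assumption; adjust to the real CSV columns
CREATE TABLE IF NOT EXISTS spotify (
  track_name STRING,
  artist     STRING,
  popularity INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the data from HDFS into the table
LOAD DATA INPATH '/user/hadoop/dataset/spotify_data.csv' INTO TABLE spotify;

-- 1. Top 5 most popular tracks
SELECT track_name, popularity
FROM spotify
ORDER BY popularity DESC
LIMIT 5;

-- 2. Average popularity of tracks per artist
SELECT artist, AVG(popularity) AS avg_popularity
FROM spotify
GROUP BY artist;

-- 3. Total number of tracks per artist
SELECT artist, COUNT(*) AS track_count
FROM spotify
GROUP BY artist;
```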

### 4. Query Results and Export

- The results of Hive queries are stored back into HDFS for further use.
- The processed data can be exported and visualized for deeper insights.
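
Writing query results back to HDFS can be done with Hive's `INSERT OVERWRITE DIRECTORY`; the output directory name below is an assumption:

```sql
-- Export the top-5 query result to HDFS as comma-delimited text
INSERT OVERWRITE DIRECTORY '/user/hadoop/output/top_tracks'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT track_name, popularity
FROM spotify
ORDER BY popularity DESC
LIMIT 5;
```

The exported files can then be pulled locally with `hdfs dfs -get` for visualization.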