With the goal of launching a data-driven advertising campaign on YouTube, the customer needed insights to categorize videos based on comments and statistics and to identify factors affecting video popularity. Key metrics like views, likes, shares, and comments were crucial for analyzing user engagement. The dataset, sourced from Kaggle, included structured (CSV) and semi-structured (JSON) data on top trending YouTube videos and their categories. The challenge lay in building an efficient pipeline to process, clean, and analyze this data for actionable insights.
- Build a data lake to store and manage large volumes of structured and semi-structured data.
- Develop a lightweight ETL pipeline for data transformation and cleaning.
- Perform SQL-based data integration to create a unified dataset.
- Build interactive BI dashboards to visualize insights for decision-making.
- Enable the customer to identify trends and launch a successful ad campaign.
- Configured Amazon S3 to serve as the foundation for the data lake.
- A landing bucket was created to store raw datasets, including:
- A CSV file containing video statistics (views, likes, shares, etc.).
- A JSON file defining video categories.
- Ensured S3 bucket permissions and structure followed best practices for scalability and security.
- A landing bucket was created to store raw datasets, including:
- Utilized AWS Glue to create a Data Catalog, organizing metadata for seamless querying.
- Developed a light ETL pipeline using AWS Lambda:
- Converted nested JSON files to Parquet format for optimized storage and querying.
- Addressed challenges in nested JSON structure by parsing complex data fields.
- Used AWS Athena to perform SQL joins on the cleaned Parquet files.
- Integrated data from JSON and CSV files into a unified dataset.
- Generated intermediate datasets containing key metrics like:
- Total views, likes, and dislikes by genre.
- Engagement patterns by channel and category.
- Stored the transformed data in a new S3 reporting bucket.
- Built an interactive BI dashboard using AWS QuickSight:
- Bar Graphs:
- Sum of likes and dislikes by genre.
- Sum of views by genre.
- Donut Charts:
- Distribution of views by genre.
- Distribution of views by channel titles.
- Configured dynamic filters for drill-down analysis by regions and genres.
- Bar Graphs:
- Amazon S3: For data storage and lake creation.
- AWS Glue: For data cataloging and metadata management.
- AWS Lambda: For serverless ETL processing and JSON to Parquet conversion.
- AWS Athena: For SQL querying and data integration.
- AWS QuickSight: For building dashboards and visualizing insights.
- IAM: For creating secure roles and policies to manage access across AWS services.
- Handling Nested JSON Files:
- Directly querying nested JSON structures was infeasible. This required implementing a custom ETL step using AWS Lambda to flatten and convert JSON to Parquet.
- Data Cleansing:
- Identifying and resolving issues with join keys between the JSON and CSV datasets was critical for successful integration.
- Optimization:
- Ensuring that the data pipeline minimized latency and optimized storage without compromising query performance.
- The pipeline processed and analyzed large volumes of YouTube data efficiently.
- Key insights provided through dashboards:
- Top genres by views, likes, and dislikes.
- Leading channels driving engagement in specific regions.
- Factors influencing video popularity, such as comment patterns and interaction rates.
- The customer launched a targeted advertising campaign on YouTube, leveraging these insights to maximize user engagement and ROI.
This project was a significant learning experience in:
- Data Engineering:
- Designing and implementing a scalable data lake.
- Building ETL pipelines to process and transform large datasets.
- AWS Services:
- Leveraging S3, Glue, Lambda, Athena, and QuickSight for end-to-end data management and visualization.
- Data Integration:
- Handling structured and semi-structured data to create a unified dataset.
- Visualization:
- Creating interactive BI dashboards to communicate key insights effectively.
While the BI dashboard was an outcome, the focus was on mastering the ETL pipeline and data warehousing techniques, making this an excellent portfolio project for roles in Data Engineering, Data Analytics, and Data Science.
- Expand the pipeline to handle real-time streaming data from YouTube APIs.
- Incorporate advanced analytics, such as sentiment analysis of comments using AWS Comprehend.
- Enhance dashboards with predictive modeling insights using Amazon SageMaker.