Stack Overflow Big Data Processing

Project Description

Spun an Elastic MapReduce (EMR) cluster based on Spark and created a Spark application written in Python.
Implemented Python API for Apache Spark (PySpark) and performed spark-submit to process data from the Stack Overflow Annual Developer Survey 2020.
Created an S3 bucket to upload "survey_results_public.csv" file so EMR can access it for data processing.
Locally issued Linux commands (Amazon Linux 2) to the EMR cluster's master node by connecting to an Elastic Compute Cloud (EC2) instance using Secure Shell (SSH) connection.

Overview

Created an EMR cluster with cluster launch mode and an initial S3 bucket was created automatically to store logs.
- Software Configuration
  - emr-5.36.0
  - Spark: Spark 2.4.8 on Hadoop 2.10.1 YARN and Zeppelin 0.10.0
- Hardware Configuration
  - m5.xlarge
  - Number of instances: 3
- Security and access
  - EC2 key pair (used Amazon EC2 to create a RSA key pair)
Set-up a new S3 bucket to upload the file "survey_results_public.csv" so EMR can access it for data processing.
Inserted a new folder within the same S3 bucket called "data-source" that contains the CSV file.
Created a Spark application in a Python file called "main.py" for Spark storage job to process data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_SOURCE_PATH = 's3://stackoverflow-123456/data-source/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://stackoverflow-123456/data-output'

def main ():
    spark = SparkSession.builder.appName('StackoverflowApp').getOrCreate()
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in the source data: %s' % all_data.count())
    selected_data = all_data.where((col('Country') == 'United States') & (col('WorkWeekHrs') > 45))
    print('The number of engineers who work more than 45 hours a week in the US is: %s' % selected_data.count())
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
    print('Selected data was was successfully saved to S3: %s' % S3_DATA_OUTPUT_PATH)

if __name__ == '__main__':
    main()

Opened port 22 to SSH into the EMR cluster using IP address and ran spark-submit command for "main.py" data processing.

Result

A new folder called "data-output" with parquet files was created in the same S3 bucket, after executing commands written in "main.py".

Language & Tools

Python
SQL
Spark (PySpark)
AWS (EMR, EC2, S3)
Bash (Amazon Linux 2)

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
README.md		README.md
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stack Overflow Big Data Processing

Project Description

Overview

Result

Language & Tools

About

Releases

Packages

Languages

dpghazi-zz/stack-overflow-big-data-processing

Folders and files

Latest commit

History

Repository files navigation

Stack Overflow Big Data Processing

Project Description

Overview

Result

Language & Tools

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages