
Code for creating a Spark application written in Python and Big Data Processing with Spark (PySpark) and AWS (EMR)


dpghazi-zz/stack-overflow-big-data-processing


Stack Overflow Big Data Processing


Project Description

  • Spun up an Amazon Elastic MapReduce (EMR) cluster running Spark and created a Spark application written in Python.
  • Used the Python API for Apache Spark (PySpark) and ran spark-submit to process data from the Stack Overflow Annual Developer Survey 2020.
  • Created an S3 bucket and uploaded the "survey_results_public.csv" file so EMR could access it for data processing.
  • Issued Linux (Amazon Linux 2) commands on the EMR cluster's master node by connecting to its Elastic Compute Cloud (EC2) instance over a Secure Shell (SSH) connection.

Overview

  • Created an EMR cluster in cluster launch mode; an initial S3 bucket was created automatically to store logs.
    • Software Configuration
      • emr-5.36.0
      • Spark: Spark 2.4.8 on Hadoop 2.10.1 YARN and Zeppelin 0.10.0
    • Hardware Configuration
      • m5.xlarge
      • Number of instances: 3
    • Security and access
      • EC2 key pair (created an RSA key pair with Amazon EC2)
  • Set up a new S3 bucket and uploaded the file "survey_results_public.csv" so EMR can access it for data processing.
  • Added a new folder called "data-source" within the same S3 bucket to hold the CSV file.
  • Created a Spark application in a Python file called "main.py" that processes the data and stores the results.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_SOURCE_PATH = 's3://stackoverflow-123456/data-source/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://stackoverflow-123456/data-output'

def main():
    spark = SparkSession.builder.appName('StackoverflowApp').getOrCreate()
    # header=True treats the first CSV row as column names.
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in the source data: %s' % all_data.count())
    # Spark implicitly casts the string-typed WorkWeekHrs column for the numeric comparison.
    selected_data = all_data.where((col('Country') == 'United States') & (col('WorkWeekHrs') > 45))
    print('The number of engineers who work more than 45 hours a week in the US is: %s' % selected_data.count())
    # Write the filtered rows back to S3 as Parquet, replacing any previous output.
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
    print('Selected data was successfully saved to S3: %s' % S3_DATA_OUTPUT_PATH)

if __name__ == '__main__':
    main()
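The filtering step in "main.py" can be sanity-checked locally without a cluster. A minimal sketch in plain Python on a few hypothetical sample rows (the two column names match the survey CSV; everything else here is illustrative). Since spark.read.csv with header=True yields string columns, the float() cast stands in for Spark's implicit numeric cast:

```python
import csv
import io

# Hypothetical sample rows mimicking two columns of survey_results_public.csv.
SAMPLE_CSV = """Country,WorkWeekHrs
United States,50
United States,40
Germany,60
United States,48
"""

def select_us_over_45(rows):
    """Mirror the PySpark filter: Country == 'United States' AND WorkWeekHrs > 45."""
    return [r for r in rows
            if r['Country'] == 'United States' and float(r['WorkWeekHrs']) > 45]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
selected = select_us_over_45(rows)
print('Matching rows: %d' % len(selected))  # 2 of the 4 sample rows match
```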
  • Opened port 22, connected to the EMR cluster's master node over SSH using its public IP address, and ran the spark-submit command on "main.py" to process the data.

Result


  • After executing "main.py", a new folder called "data-output" containing the Parquet files appeared in the same S3 bucket.

Language & Tools

  • Python
  • SQL
  • Spark (PySpark)
  • AWS (EMR, EC2, S3)
  • Bash (Amazon Linux 2)
