- Spun up an Elastic MapReduce (EMR) cluster running Spark and created a Spark application written in Python.
- Used PySpark, the Python API for Apache Spark, and ran spark-submit to process data from the Stack Overflow Annual Developer Survey 2020.
- Created an S3 bucket and uploaded the "survey_results_public.csv" file so that EMR can access it for data processing.
- Issued Linux commands (Amazon Linux 2) from a local machine to the EMR cluster's master node, an Elastic Compute Cloud (EC2) instance, over a Secure Shell (SSH) connection.
- Created an EMR cluster in cluster launch mode; an initial S3 bucket was created automatically to store logs.
- Set up a new S3 bucket and uploaded the file "survey_results_public.csv" so that EMR can access it for data processing.
- Created a new folder called "data-source" within the same S3 bucket to hold the CSV file.
- Created a Spark application in a Python file called "main.py" that defines the Spark job to process the data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Input CSV and output location in the same S3 bucket.
S3_DATA_SOURCE_PATH = 's3://stackoverflow-123456/data-source/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://stackoverflow-123456/data-output'


def main():
    spark = SparkSession.builder.appName('StackoverflowApp').getOrCreate()
    # Read the survey CSV; the first row holds the column names.
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in the source data: %s' % all_data.count())
    # Keep respondents in the United States who work more than 45 hours a week.
    selected_data = all_data.where((col('Country') == 'United States') & (col('WorkWeekHrs') > 45))
    print('The number of engineers who work more than 45 hours a week in the US is: %s' % selected_data.count())
    # Write the filtered rows to S3 as Parquet, replacing any previous output.
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
    print('Selected data was successfully saved to S3: %s' % S3_DATA_OUTPUT_PATH)


if __name__ == '__main__':
    main()
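
The filter in "main.py" can be sanity-checked locally before submitting the job. The sketch below is a rough plain-Python approximation, not Spark: it applies the same two conditions to a few made-up rows. The sample values and the `works_long_hours_in_us` helper are illustrative assumptions and are not part of the project. Non-numeric entries such as "NA" are treated as non-matching, similar to how a Spark filter drops rows whose comparison evaluates to null.

```python
import csv
import io

# Made-up sample rows that mimic two columns of survey_results_public.csv.
SAMPLE_CSV = """Country,WorkWeekHrs
United States,50
United States,40
United States,NA
Canada,60
"""


def works_long_hours_in_us(row, min_hours=45):
    """Return True if the row is from the US and reports more than min_hours."""
    if row["Country"] != "United States":
        return False
    try:
        hours = float(row["WorkWeekHrs"])
    except ValueError:
        # Non-numeric entries such as "NA" never match.
        return False
    return hours > min_hours


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
selected = [row for row in rows if works_long_hours_in_us(row)]

print("Total number of records in the sample: %s" % len(rows))
print("Rows matching the filter: %s" % len(selected))
```

Only the first sample row satisfies both conditions, so a mismatch between this count and expectations would point to a problem in the filter logic rather than in the cluster setup.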
- Opened port 22 to allow SSH access to the EMR cluster by IP address, then ran the spark-submit command on "main.py" to process the data.
- After "main.py" ran, a new folder called "data-output" containing Parquet files was created in the same S3 bucket.
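
The SSH and spark-submit steps above might look roughly like the commands this sketch prints. It assumes "main.py" is already present on the master node and that the cluster uses the default EMR login user `hadoop`. The key file name and the master node's DNS name are hypothetical placeholders, and the helper functions are introduced here only for illustration.

```python
import shlex

# Hypothetical placeholders: substitute your own key pair file and the
# master node's public DNS name shown in the EMR console.
KEY_FILE = "my-key-pair.pem"
MASTER_DNS = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"


def ssh_command(key_file, master_dns, user="hadoop"):
    """Build the SSH command for logging in to the EMR master node."""
    return ["ssh", "-i", key_file, "%s@%s" % (user, master_dns)]


def spark_submit_command(script="main.py"):
    """Build the spark-submit command that is run on the master node."""
    return ["spark-submit", script]


# Print the commands as they would be typed in a terminal.
print(shlex.join(ssh_command(KEY_FILE, MASTER_DNS)))
print(shlex.join(spark_submit_command()))
```

The first command is run on the local machine; the second is run in the SSH session on the master node.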
- Python
- SQL
- Spark (PySpark)
- AWS (EMR, EC2, S3)
- Bash (Amazon Linux 2)