Skip to content

Rahulub3r/Spark-and-Kafka-for-Information-Retrieval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

6347_miniproject03_student

The goal of this exercise is to develop a structured streaming spark application that reads particulate matter readings from Kafka, extract values from the the readings and then calculate aggregate statistics.

The excercise is based on data from the following study: Liang, X., S. Li, S. Zhang, H. Huang, and S. X. Chen (2016), PM2.5 data reliability, consistency, and air quality assessment in five Chinese cities, J. Geophys. Res. Atmos., 121, 10220–10236.

You will do so by implementing the three methods below:

  • ingestKafkaTopic
  • extractValues
  • calculateTopPollutionEventsPerWeek

You're done when all test cases pass.

The data will be streaming from five sources:

  • US Post in Beijing
  • US Post in Chengdu
  • US Post in Guangzhou
  • US Post in Shanghai
  • US Post in Shenyang

The first reading is from 1/1/2010 at midnight, the last reading is from 12/31/2015 at 11pm.

The key for each kafka record is the name of the source. The value for each kafka record is a CSV string with the following columns:

  • year, numeric year, e.g. 2015
  • month, numeric month, January is 1 etc.
  • day, numeric day, the first of the month is 1 etc.
  • hour, numeric 24h hour, 11pm is 23 etc.
  • season, ignore
  • PM_US Post, a double, the particulate matter reading at the US Post for the source,
  • DEWP, a double, the dew point
  • HUMI, a double, the humidity
  • PRES, a double, the pressure
  • TEMP, a double, the temperature
  • cbwd, a double, combined wind direction
  • Iws, a double, cumulated wind speed
  • precipitation, a double, hourly precipication in mm
  • Iprec, a double, cumulated precipication

Example kafka record: key: "BeijingPM20100101_20151231.csv" value: "2010,1,1,0,4,null,-21.0,43.0,1021.0,-11.0,NW,1.79,0.0,0.0"

Good luck!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages