The goal of this course project is to gain experience building a streaming data pipeline with cloud computing technologies by designing and implementing an actor-model service using Akka that ingests log-file data in real time and delivers it, via the Kafka event-streaming platform, to Spark for further processing. This is a group project; each group consists of one to six students, and no student can participate in more than one group.
- Santhanagopalan Krishnamoorthy
- Thiruvenkadam S R
- Ashwin Bhaskar Srivatsa
- Rahul Sai Samineni
- Language: Scala
- IDE: IntelliJ IDEA Community Edition
- Build Tool: SBT 1.5.2
- Frameworks Used: Cloudflow, Akka, Spark, Kafka
- Deployment: AWS EKS, AWS Lambda
- Install and build Cloudflow
- Install Kafka and Strimzi
- Add Spark support
- Install and set up Docker
To run a Cloudflow application, we first have to install Cloudflow locally along with the Cloudflow kubectl plugin (Ref).
To deploy the application, we first create a Kubernetes cluster in EKS. Cloudflow also provides a [script](https://cloudflow.io/docs/2.0.0/get-started/setup-k8s-cluster.html) to spin up a Kubernetes cluster with the required configuration. Once the cluster is ready, we add the context of the AWS cluster to our local kubectl:
aws eks update-kubeconfig --name clusterName
Now we use Helm to install Cloudflow within the Kubernetes cluster along with [Strimzi-Kafka](https://cloudflow.io/docs/dev/administration/how-to-install-and-use-strimzi.html) and the [Cloudflow Spark operator](https://cloudflow.io/docs/dev/administration/how-to-install-and-use-strimzi.html).
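A rough sketch of the installation commands is shown below; the repository URLs, chart names, and namespaces are assumptions based on the Strimzi and Cloudflow documentation, so verify them against the versions you install.

```sh
# Strimzi operator (manages the Kafka cluster); chart name taken from the Strimzi docs
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi strimzi/strimzi-kafka-operator --namespace cloudflow --create-namespace

# Cloudflow operator; repository URL and chart name are assumptions from the Cloudflow docs
helm repo add cloudflow-helm-charts https://lightbend.github.io/cloudflow-helm-charts/
helm repo update
helm install cloudflow cloudflow-helm-charts/cloudflow --namespace cloudflow
```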
Once the installation is complete, all the operators can be seen running as pods in the cluster, as shown below:
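For example, the operator pods can be listed with kubectl; the namespace below is an assumption, so use whichever namespace the operators were installed into.

```sh
# List the Cloudflow/Strimzi operator pods (namespace assumed to be "cloudflow")
kubectl get pods -n cloudflow
```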
We also have to create a PersistentVolume in the Kubernetes cluster using a storage provisioner, and a PersistentVolumeClaim named "cloudflow-spark" for the Spark operator to store its data, as explained in the Cloudflow documentation.
In this case we use an open-source NFS storage provisioner to create the PersistentVolume that the Spark operator will use.
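A minimal sketch of such a claim is shown below; the claim name "cloudflow-spark" comes from the Cloudflow documentation, while the namespace placeholder and the `nfs-client` storage class are assumptions that depend on how the NFS provisioner was installed.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloudflow-spark          # name expected by the Spark operator
  namespace: <app-namespace>     # placeholder: the application's namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client   # assumption: storage class created by the NFS provisioner
  resources:
    requests:
      storage: 5Gi
```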
Once the application is developed, we first have to package it into Docker containers by adding the registry and repository details to the file `target-env.sbt`:
- Create a docker repo
ThisBuild / cloudflowDockerRegistry := Some("docker.io")
ThisBuild / cloudflowDockerRepository := Some("<your docker repository>")
Once this is done, the Cloudflow sbt build will automatically create the Docker images and push them to the specified repository.
- Build and publish the app using
sbt buildApp
- To deploy the application to the Kubernetes cluster
$ cat dockerpwd.txt | kubectl cloudflow deploy /path/to/CS441-CourseProject/target/CS441-CourseProject.json -u dockeruser --password-stdin
If using AWS ECR, provide those credentials instead.
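For example, a hedged sketch of an ECR-based deploy is shown below; the region is a placeholder, ECR uses `AWS` as the username, and this assumes the kubectl-cloudflow plugin accepts the credentials the same way as above.

```sh
aws ecr get-login-password --region <region> | kubectl cloudflow deploy /path/to/CS441-CourseProject/target/CS441-CourseProject.json -u AWS --password-stdin
```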
- After executing these commands, you can see the streamlets running in different pods.
- Logprocessor - Writes the logs to S3.
- akka-log-ingestor - Contains the ingestor streamlet, which receives the key from the Lambda function and adds the records to Kafka streams.
- log-processor-pipeline - Contains the blueprint that defines the flow of the streamlets.
- spark-aggregator - Contains the Spark aggregation program that performs data aggregation tasks.
- EmailProc.scala - Contains the code to dispatch email using AWS SES (see the sketch below).
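As an illustration of the kind of call EmailProc.scala makes, here is a minimal sketch using the SES client from the AWS SDK for Java v1; the object name, addresses, subject, and region are placeholders, not the project's actual values.

```scala
import com.amazonaws.regions.Regions
import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClientBuilder
import com.amazonaws.services.simpleemail.model._

// Hypothetical sketch: dispatch a plain-text report through AWS SES.
object EmailSketch {
  def sendReport(body: String): Unit = {
    val client = AmazonSimpleEmailServiceClientBuilder.standard()
      .withRegion(Regions.US_EAST_1)                               // placeholder region
      .build()

    val request = new SendEmailRequest()
      .withSource("sender@example.com")                            // must be an SES-verified address
      .withDestination(new Destination().withToAddresses("recipient@example.com"))
      .withMessage(new Message()
        .withSubject(new Content("WARN/ERROR log summary"))
        .withBody(new Body().withText(new Content(body))))

    client.sendEmail(request)                                      // send the email
  }
}
```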
The complete flow, along with the loggers, can be seen here
Cloudflow provides an easy way to package and deploy structured streaming applications as streaming pipelines. Cloudflow uses a blueprint to orchestrate the streaming pipeline, whose connections are physically implemented as Kafka streams and made type-safe at compile time using Avro schemas.
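A minimal sketch of what such a blueprint looks like is shown below; the streamlet names and class names are placeholders, not the project's actual blueprint.

```hocon
blueprint {
  streamlets {
    // hypothetical streamlet references: logical name = canonical class name
    log-ingress      = com.example.LogIngress
    spark-aggregator = com.example.LogAggregator
  }
  topics {
    // each topic is backed by Kafka and connects producers' outlets to consumers' inlets
    log-records {
      producers = [log-ingress.out]
      consumers = [spark-aggregator.in]
    }
  }
}
```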
Some key Cloudflow concepts:
Streamlets are the processing units of a Cloudflow application, referenced by their canonical class names. A streamlet can have more than one inlet and outlet. There are different streamlet shapes:
- Ingress - An Ingress is a streamlet with zero inlets and one or more outlets.
- Processor - A Processor has one inlet and one outlet.
- FanOut - FanOut-shaped streamlets have a single inlet and two or more outlets.
- FanIn - FanIn-shaped streamlets have a single outlet and two or more inlets.
- Egress - An Egress has one or more inlets and zero outlets.
The inlets and outlets of a streamlet handle data described by Avro schemas. The streamlets, together with the connections between their inlets and outlets, are specified in a blueprint.
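As an illustration, here is a minimal sketch of a Processor-shaped Akka streamlet with one Avro inlet and one Avro outlet; the class and the `LogRecord` type are hypothetical stand-ins for the project's generated Avro classes.

```scala
import cloudflow.akkastream._
import cloudflow.akkastream.scaladsl._
import cloudflow.streamlets._
import cloudflow.streamlets.avro._

// Hypothetical Processor-shaped streamlet: one Avro inlet, one Avro outlet.
// `LogRecord` stands in for an Avro-generated class with `level` and `message` fields.
class WarnErrorFilter extends AkkaStreamlet {
  val in    = AvroInlet[LogRecord]("in")
  val out   = AvroOutlet[LogRecord]("out", _.level)   // partition outgoing records by level
  val shape = StreamletShape(in, out)

  override def createLogic = new RunnableGraphStreamletLogic() {
    def runnableGraph =
      sourceWithCommittableContext(in)                              // read from the inlet, tracking Kafka offsets
        .filter(r => r.level == "WARN" || r.level == "ERROR")       // keep only WARN/ERROR records
        .to(committableSink(out))                                   // write to the outlet, committing offsets
  }
}
```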
The image below gives a general overview of the Cloudflow model.
The figure shows the structure of Akka streamlets and Kafka streams in a Cloudflow project.

## Implementation

The overall architecture of our model is given by the image below.
We have written a cron job to add files to an S3 bucket every 2 minutes. This triggers an AWS Lambda function, which sends the key of the updated file to an Akka streamlet that performs some file processing on "WARN" and "ERROR" messages.
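A hypothetical crontab entry for such a job is sketched below; the log path and bucket name are placeholders.

```sh
# Every 2 minutes, upload the current log file to S3
# (% is escaped because cron treats it specially)
*/2 * * * * aws s3 cp /var/log/app/LogFileGenerator.log s3://<your-log-bucket>/logs/log-$(date +\%H\%M).log
```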
The filtered contents are transferred to Kafka streams, which are consumed by the Spark streamlet; it performs some aggregations on these messages and sends the output to clients using the AWS email service (SES).
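A minimal sketch of the kind of aggregation the spark-aggregator performs is shown below, counting "WARN" and "ERROR" records per 2-minute window; `LogRecord` and its field names are assumptions standing in for the project's Avro-generated type.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

// Placeholder for the project's Avro-generated record type.
case class LogRecord(level: String, message: String, timestamp: java.sql.Timestamp)

object LogAggregationSketch {
  // Count WARN and ERROR records per 2-minute window, grouped by level.
  def countByLevel(spark: SparkSession, logs: Dataset[LogRecord]) = {
    import spark.implicits._
    logs
      .filter($"level".isin("WARN", "ERROR"))
      .groupBy(window($"timestamp", "2 minutes"), $"level")
      .count()
  }
}
```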
- Execute
sbt buildApp
- Deploy the application to Amazon EKS as described above
- Run the cron job to add files to S3. After files have been added to S3, you should see the following output:
This is the output produced by the Spark program: a generated email that contains the counts of "WARN" and "ERROR" messages.
To run the tests, execute the command `sbt test`.
`test/Test.scala` contains all the tests to be performed:
- Test that the bucket exists in S3 (a sketch of this style of test is shown below)
- Test the name of the S3 bucket key of the file
- Test the name of the S3 region
- Test the email dispatching function
- Test the Spark window
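A minimal sketch of what the bucket-existence test might look like with ScalaTest and the AWS SDK for Java v1 is given below; the class name, bucket, and region are placeholders, and the project's real assertions live in `test/Test.scala`.

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical test class: checks that the configured log bucket exists in S3.
class S3BucketSpec extends AnyFunSuite {
  private val bucketName = "<your-log-bucket>"   // placeholder bucket name

  test("the log bucket exists in S3") {
    val s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build()
    assert(s3.doesBucketExistV2(bucketName))
  }
}
```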