A koan is an incomplete test. Complete it, and find enlightenment.
This is an interactive tutorial on Apache Spark with Scala. It consists of a series of unit tests: some already pass, while others require you to fill in the gaps to make them pass. Where you see __, replace it with the correct value, and where you see ???, replace it with a function body. Each test class has a Spark context called sc, created by the TestSparkContext trait, which gives you access to Spark's functionality.
While it may be possible to complete these exercises with no knowledge of Scala, it is assumed that you already have some familiarity with Scala and Scala collections.
This project is inspired by the many other koan-style projects, which I guess all started with the Ruby koans.
It should be possible to complete these exercises with only Scala and SBT installed. All dependencies, including Spark itself, should be downloaded by SBT.
Apache Spark is an open source (Apache license) cluster computing engine. Put plainly, it's a tool for analysing large amounts of data in order to learn something about that data, and its strengths lie in its speed, versatility and language bindings. It can be used standalone or within Apache Hadoop and comes with bindings for Scala, Java and Python. Spark is designed to unify batch processing, stream processing and interactive (query-based) analytics in one framework, which it does through its built-in libraries:
- Spark SQL - a SQL interface for querying structured data
- Spark Streaming - tools for processing real-time data streams
- MLlib - a collection of machine learning algorithms: classification, regression, clustering, etc
- GraphX - tools for analysing graphs (the vertex-edge kind)
sbt "testOnly AboutRDDs"
- Build an RDD from a parallelized collection
- Build an RDD from a file
- Partitioning
- Map, reduce and filter
- Counting
- Zipping
- House prices
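As a warm-up for the RDD topics listed above, here is a minimal sketch of parallelizing a collection, partitioning, map/filter/reduce, counting and zipping. It is not taken from the koans themselves: the object and method names are hypothetical, and it builds its own local SparkContext, whereas in the koans sc is supplied by the TestSparkContext trait. It assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names; an illustration, not the koans' own code.
object RddSketch {
  // Sum of the even squares of 1..n, using map, filter and reduce.
  def evenSquareSum(sc: SparkContext, n: Int): Int =
    sc.parallelize(1 to n, 4) // RDD from a local collection, in 4 partitions
      .map(x => x * x)
      .filter(_ % 2 == 0)
      .reduce(_ + _)

  // Pair each number with its square. zip requires both RDDs to have the
  // same number of partitions and of elements per partition, which holds
  // here because map preserves both.
  def numbersWithSquares(sc: SparkContext, n: Int): Array[(Int, Int)] = {
    val numbers = sc.parallelize(1 to n, 4)
    val squares = numbers.map(x => x * x)
    numbers.zip(squares).collect()
  }

  def main(args: Array[String]): Unit = {
    // Local, in-process Spark; the app name is arbitrary.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 10, 4)
      println(numbers.partitions.length)        // 4
      println(numbers.count())                  // 10
      println(evenSquareSum(sc, 10))            // 220
      println(numbersWithSquares(sc, 3).toList) // List((1,1), (2,4), (3,9))
      // An RDD of lines can be built from a file with sc.textFile(path).
    } finally sc.stop()
  }
}
```

Transformations such as map and filter are lazy; nothing runs until an action such as reduce, count or collect is called.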
sbt "testOnly AboutKeyValuePairs"
- Key-value pairs
- Mapping values; reducing keys
- Grouping by key
- Sorting by key
- Counting words
- Joins
- Subtract by key (set difference)
- Co-group
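The key-value topics above can be sketched roughly as follows. The data, object name and method names are made up for illustration, and the sketch creates its own local SparkContext rather than using the koans' sc. It assumes Spark 1.3 or later; older versions also need import org.apache.spark.SparkContext._ to bring the pair-RDD operations into scope.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names and data; an illustration, not the koans' own code.
object PairSketch {
  // Classic word count: one (word, 1) pair per word, then sum per key.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // Co-group two pair RDDs and report how many values each side has per key.
  def cogroupSizes(sc: SparkContext,
                   left: Seq[(String, Int)],
                   right: Seq[(String, String)]): Map[String, (Int, Int)] =
    sc.parallelize(left)
      .cogroup(sc.parallelize(right))
      .mapValues { case (ls, rs) => (ls.size, rs.size) }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pair-sketch")
    val sc = new SparkContext(conf)
    try {
      val ages = sc.parallelize(Seq(("bob", 25), ("alice", 30)))
      val cities = sc.parallelize(Seq(("alice", "Leeds"), ("carol", "York")))

      println(wordCounts(sc, Seq("to be or", "not to be")))
      println(ages.join(cities).collect().toList)          // inner join on key
      println(ages.subtractByKey(cities).collect().toList) // keys only in ages
      println(ages.sortByKey().keys.collect().toList)      // keys, ascending
    } finally sc.stop()
  }
}
```

reduceByKey combines values within each partition before shuffling, so it is usually preferable to grouping by key and then reducing the groups.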
sbt "testOnly AboutVectors"
- Local vectors
- Local matrices
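For orientation, here is a small sketch of local vectors and matrices. It assumes the koans use the RDD-based MLlib types in org.apache.spark.mllib.linalg; the object name and values are illustrative only. Local vectors and matrices live on a single machine, so no SparkContext is needed.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}

// Hypothetical names and values; an illustration, not the koans' own code.
object VectorSketch {
  // A dense vector stores every entry.
  val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

  // A sparse vector stores its size plus the indices and values of the
  // non-zero entries. This one represents the same vector as `dense`.
  val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

  // A 3 x 2 dense matrix. Values are given in column-major order, so this is
  //   1.0  2.0
  //   3.0  4.0
  //   5.0  6.0
  val matrix: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

  def main(args: Array[String]): Unit = {
    println(dense.size)                       // 3
    println(sparse.toArray.toList)            // List(1.0, 0.0, 3.0)
    println((matrix.numRows, matrix.numCols)) // (3,2)
  }
}
```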
sbt "testOnly AboutStatistics"
- Summary statistics
- Correlations
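A rough sketch of summary statistics and correlations, assuming the koans rely on MLlib's org.apache.spark.mllib.stat.Statistics. The object name, method names and data are hypothetical, and the sketch creates its own local SparkContext rather than using the koans' sc.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Hypothetical names and data; an illustration, not the koans' own code.
object StatsSketch {
  // Column-wise means of a small data set, one Array per row.
  def columnMeans(sc: SparkContext, rows: Seq[Array[Double]]): Array[Double] = {
    val vectors = sc.parallelize(rows.map(r => Vectors.dense(r)))
    Statistics.colStats(vectors).mean.toArray
  }

  // Pearson correlation between two series of equal length. Both RDDs use
  // the same number of partitions so that they line up element by element.
  def pearson(sc: SparkContext, xs: Seq[Double], ys: Seq[Double]): Double =
    Statistics.corr(sc.parallelize(xs, 2), sc.parallelize(ys, 2), "pearson")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("stats-sketch")
    val sc = new SparkContext(conf)
    try {
      val rows = Seq(Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0))
      // Means of the two columns: approximately 2.0 and 20.0.
      println(columnMeans(sc, rows).toList)
      // The second series is an exact multiple of the first, so the
      // correlation is approximately 1.0.
      println(pearson(sc, Seq(1.0, 2.0, 3.0), Seq(10.0, 20.0, 30.0)))
    } finally sc.stop()
  }
}
```

The summary returned by colStats also exposes variance, count, min, max and the number of non-zeros per column.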