Structured vs Unstructured Data
Let's say we are the CodeAward organization, offering scholarships to programmers who have overcome adversity. We have the following two datasets:
case class Demographic(id: Int,
                       age: Int,
                       codingBootcamp: Boolean,
                       country: String,
                       gender: String,
                       isEthnicMinority: Boolean,
                       servedInMilitary: Boolean)
val demographics = sc.textFile(...)... // Pair RDD: (id, demographic)
case class Finances(id: Int,
                    hasDebt: Boolean,
                    hasFinancialDependents: Boolean,
                    hasStudentLoans: Boolean,
                    income: Int)
val finances = sc.textFile(...)... // Pair RDD: (id, finances)
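For concreteness, here is a minimal sketch of what the elided loading code for demographics might look like, assuming each input line is a comma-separated record whose first field is the id (the file name and field order here are assumptions, not part of the original example):
// Hypothetical sketch: parse CSV-like lines into a pair RDD keyed by id.
sc.textFile("demographics.csv")
  .map(_.split(","))
  .map(f => (f(0).toInt,
             Demographic(f(0).toInt, f(1).toInt, f(2).toBoolean, f(3),
                         f(4), f(5).toBoolean, f(6).toBoolean)))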
Our goal is to tally up and select students for a specific scholarship. For example, let's count:
- Swiss students
- With debt and financial dependents
How might we implement this Spark program?
Possibility 1
demographics.join(finances) // Pair RDD: (Int, (Demographic, Finances))
  .filter { p =>
    p._2._1.country == "Switzerland" &&
    p._2._2.hasFinancialDependents &&
    p._2._2.hasDebt
  }.count
- Inner Join first
- Filter to select people in Switzerland
- Filter to select people with debt and financial dependents
Possibility 2
val filtered = finances.filter(p => p._2.hasFinancialDependents && p._2.hasDebt)
demographics.filter(p => p._2.country == "Switzerland")
  .join(filtered)
  .count
- Filter down the finances dataset first (all people with debt and financial dependents)
- Filter down the demographics dataset to people whose country is Switzerland
- Inner join on the smaller, filtered-down RDDs
Possibility 3
Recall that a cartesian product pairs every element of one collection with every element of another; for example, the cartesian product of the ranks {A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2} with the suits {♠, ♥, ♦, ♣} gives all 52 cards.
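As a quick illustrative sketch of the same operation with RDDs (the ranks and suits names here are made up for illustration):
val ranks = sc.parallelize(Seq("A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"))
val suits = sc.parallelize(Seq("♠", "♥", "♦", "♣"))
ranks.cartesian(suits).count() // 13 * 4 = 52 pairs, one per card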
val cartesian = demographics.cartesian(finances)
cartesian.filter {
  case (p1, p2) => p1._1 == p2._1
}.filter {
  case (p1, p2) => (p1._2.country == "Switzerland") &&
                   (p2._2.hasFinancialDependents) &&
                   (p2._2.hasDebt)
}.count
- Cartesian product of both RDDs
- Filter to select pairs in the cartesian product with matching IDs
- Filter to select people in Switzerland who have debt and financial dependents
While all three of these solutions produce the same result, the time it takes to execute the job is vastly different.
It turns out that possibility 1 is 3.6 times slower than possibility 2, and possibility 3 is 177 times slower than possibility 2.
Wouldn't it be nice if Spark automatically knew, if we wrote the code in possibility 3, that it could rewrite our code to possibility 2?
Given a bit of extra structural information, Spark can do many optimizations!
Not all data is equal structurally; it falls on a spectrum from unstructured to structured.
An example of semi-structured data is JSON: it defines its own structure.
Spark and RDDs don't know anything about the schema of the data they're dealing with.
Given an arbitrary RDD, Spark only knows that the RDD is parameterized with some arbitrary type, such as:
- Person
- Account
- Demographic
- etc
But it does not know anything about these types' structure.
Suppose we have a case class Account:
case class Account(name: String, balance: Double, risk: Boolean)
and we have an RDD of Account objects, i.e. RDD[Account]:
Here Spark/RDDs see only blobs of objects called Account. Spark cannot see inside these objects or analyze how they may be used, and therefore cannot optimize based on that usage. They're opaque to Spark.
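A minimal sketch of the idea (the RDD contents here are made up):
import org.apache.spark.rdd.RDD

val accounts: RDD[Account] = sc.parallelize(Seq(
  Account("Alice", 1000.0, risk = false),
  Account("Bob", -200.0, risk = true)
))
// To Spark, each element is just an opaque Account object in memory; it has
// no notion of, say, a "balance" column that a later computation might be
// the only thing it reads.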
On the other hand, a database/Hive performs computations on columns of named and typed values. Everything is known about the structure of the data, and hence these systems can be heavily optimized.
So if Spark could see data this way, it could break up the data and optimize on the available information.
The same can be said about computation.
In Spark:
- we do functional transformations on data.
- we pass user-defined function literals to higher-order functions like map, flatMap, and filter.
Just like the data, the function literals are completely opaque to Spark. There is no structure to these user-defined functions as far as Spark is concerned. A user can do anything inside one of these functions, and all Spark can see is something like: $anon$1@604f1a67
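For example, continuing the Account sketch above (the predicate here is hypothetical, not from the original notes):
// To us this closure clearly only reads the risk and balance fields, but
// Spark just ships it as a serialized function object and cannot look inside it.
val lowRisk = accounts.filter(a => !a.risk && a.balance > 0)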
In Database/Hive:
- we do declarative transformations on data.
- Specialized, structured, pre-defined operations, e.g. SELECT * FROM * WHERE *
Databases know the operations that are being done on the data.
RDDs operate on unstructured data, and there are few limits on computation; your computations are defined as functions that you have written yourself, on your own data types.
But as we saw, we have to do all the optimization work ourselves.
Wouldn't it be nice if Spark could do some of these optimizations for us?
Spark SQL makes this possible. We do have to give up some of the freedom, flexibility, and generality of the functional collections API in order to give Spark more opportunities to optimize, though.
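As a hedged preview, here is roughly what the scholarship query could look like with Spark SQL's DataFrame API (this sketch assumes a SparkSession named spark is available; the DataFrame names are made up). Even though it is written join-first like possibility 1, the Catalyst optimizer can push the single-table filters below the join, much like possibility 2:
import spark.implicits._

val demographicsDF = demographics.map(_._2).toDF // columns come from the Demographic case class
val financesDF     = finances.map(_._2).toDF     // columns come from the Finances case class

demographicsDF
  .join(financesDF, "id")
  .filter($"country" === "Switzerland" &&
          $"hasFinancialDependents" && $"hasDebt")
  .count()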