-
Notifications
You must be signed in to change notification settings - Fork 59
Pair RDDs
Here, we're going to focus on distributed key values which are a popular way of representing and organizing large quantities of data in practice. In Spark, we called these distributed key-value pairs Pair RDDs.
In single-node Scala, key-value pairs can be thought of as maps, or in someother programming languages as dicitonaries. Map-Reduce is also based on key-value pairs of large distributed datasets.
They are useful because they allow us to act on each key in parallel or regroup data across the network.
An RDD parameterized by a pair are treated as Pair RDD by spark:
RDD[(k, v)] // <------- treated specially by Spark
Pair RDDs can be created from already existing regular RDDs for example by using the map
operation on the regular RDD:
val rdd: RDD[WikipediaPage] = ...
val pairRdd = rdd.map( wikipediapage => (wikipediapage.title, wikipediapage.text) )
Pair RDDs have additional, specialized methods for working with data associated with keys. Some of the commonly used are:
def groupByKey(): RDD[(K, Iterable[V])
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Consider an RDD which has elements like this json:
"definitions": {
"firstname": "string",
"lastname": "string",
"address": {
"type": "object",
"properties": {
"street": {
"type: string"
},
"city": {
"type": "string"
},
"state": {
"type": "string"
}
},
"required": ["street", "city", "state"]
}
}
Here we can create a Pair RDD such that it maps all city names to a particular value (like the full address), etc.
Week 1
- Introduction
- Data Parallel to Distributed Data Parallel
- Latency
- RDDs: Spark's Distributed Collection
- RDDs: Transformation and Action
- Evaluation in Spark: Unlike Scala Collections!
- Cluster Topology Matters!
Week 2
- Reduction Operations (fold, foldLeft, aggregate)
- Pair RDDs
- Pair RDDs: Transformations and Actions
- Pair RDDs: Joins
Week 3
- Shuffling: What it is and why it's important
- Partitioning
- Optimizing with Partitioners
- Wide vs Narrow Dependencies
Week 4