Pair RDDs

Here, we're going to focus on distributed key-value pairs, a popular way of representing and organizing large quantities of data in practice. In Spark, these distributed key-value pairs are called Pair RDDs.

In single-node Scala, key-value pairs can be thought of as maps (or, in some other programming languages, as dictionaries). MapReduce is also built around key-value pairs over large distributed datasets.

They are useful because they allow us to act on each key in parallel or regroup data across the network.

An RDD parameterized by a pair is treated as a Pair RDD by Spark:

RDD[(K, V)] // <------- treated specially by Spark

Pair RDDs can be created from existing regular RDDs, for example by using the map operation on a regular RDD:

val rdd: RDD[WikipediaPage] = ...
val pairRdd = rdd.map(page => (page.title, page.text))
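As a minimal, self-contained sketch of the same idea (assuming a SparkContext named sc is in scope, as in the Spark shell, and using a hypothetical Page case class in place of WikipediaPage):

case class Page(title: String, text: String) // hypothetical stand-in for WikipediaPage

val pages = sc.parallelize(Seq(
  Page("Scala", "Scala runs on the JVM."),
  Page("Spark", "Spark is a cluster computing engine.")
))

// Mapping each element to a (key, value) tuple yields RDD[(String, String)],
// which Spark treats as a Pair RDD.
val pairs = pages.map(page => (page.title, page.text))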

Pair RDDs have additional, specialized methods for working with data associated with keys. Some of the most commonly used are:

def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
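As a small sketch of how these three behave (the keys and values here are made up for illustration, again assuming a SparkContext sc):

val sales = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))

// reduceByKey merges all values for the same key with the given function.
val totals = sales.reduceByKey(_ + _)   // RDD[(String, Int)]: ("apple", 5), ("pear", 1)

// groupByKey collects all values for each key into an Iterable.
val grouped = sales.groupByKey()        // RDD[(String, Iterable[Int])]

// join pairs up values that share a key across two Pair RDDs.
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.3)))
val joined = sales.join(prices)         // RDD[(String, (Int, Double))]

Note that reduceByKey is usually preferred over groupByKey followed by a reduce, since it can combine values on each node before shuffling data across the network.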

Example

Consider an RDD whose elements look like this JSON:

"definitions": {
    "firstname": "string",
    "lastname": "string",
    "address": {
        "type": "object",
        "properties": {
            "street": {
                "type: string"
            },
            "city": {
                "type": "string"
            },
            "state": {
                "type": "string"
            }
        },
        "required": ["street", "city", "state"]
    }
}

Here we can create a Pair RDD that maps each city name to an associated value (such as the full address).
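A sketch of that idea, assuming the records have been parsed into a hypothetical Address case class:

case class Address(street: String, city: String, state: String) // hypothetical parsed record

val addresses = sc.parallelize(Seq(
  Address("1 Main St", "Springfield", "IL"),
  Address("9 Elm St", "Springfield", "IL"),
  Address("5 Oak Ave", "Portland", "OR")
))

// Key each address by its city name; grouping then collects all addresses per city.
val byCity = addresses.map(a => (a.city, a)) // RDD[(String, Address)] -- a Pair RDD
val grouped = byCity.groupByKey()            // RDD[(String, Iterable[Address])]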