Advanced Pig

Advanced Join Fu

Pig has three special-purpose join strategies: the "map-side" (aka 'fragment replicate') join, the merge join, and the skew join.

Each of these joins has strong restrictions on the properties of its input data.

A dataflow designed to take advantage of them can produce order-of-magnitude scalability improvements.

They’re also a great illustration of three key scalability patterns. Once you have a clear picture of how these joins work, you can be confident you understand the map/reduce paradigm deeply.

Map-side Join

A map-side (aka 'fragment replicate') join holds the small tables in memory and does all its work in the map phase, with no reduce step at all.

In a normal JOIN, the largest dataset goes on the right. In a fragment-replicate join, the largest dataset goes on the left, and everything to the right must be tiny.

The Pig manual calls this a "fragment replicate" join, because that is how Pig thinks about it: the tiny datasets are duplicated to each machine. Throughout the book, I’ll refer to it as a map-side join, because that’s how you should think about it when you’re using it. The other common name for it is a Hash join — and if you want to think about what’s going on inside it, that’s the name you should use.

How a Map-side (Hash) join works

If you’ve been to enough large conferences you’ve seen at least one registration-day debacle. Everyone leaves their hotel to wait in a long line at the convention center, where they have set up different queues for some fine-grained partition of attendees by last name and conference track. Registration is a direct join of the set of attendees on the set of badges; those check-in debacles are basically the stuck reducer problem come to life.

If it’s a really large conference, the organizers will instead set up registration desks at each hotel. Now you don’t have to move very far, and you can wait with your friends. As attendees stream past the registration desk, the 'A-E' volunteer decorates the Arazolos and Eliotts with badges, the 'F-K' volunteer decorates the Gaspers and Kellys, and so forth. Note these important differences: a) the registration center was duplicated in full to each site b) you didn’t have to partition the attendees; Arazolos and Kellys and Zarebas can all use the same registration line.

To do a map-side join, Pig holds the tiny table in a hash table (aka hashmap or dictionary), indexed by the full join key.

    .-------------.      |
    | tiny table  |      |    ... huge table ...
    +--+----------+      |
    |A | ...a...  |      | Q | ...
    |  | ...a...  |      | B | ...
    |Q | ...q...  |      | B | ...
    |F | ...f...  |      | B | ...
      ...                | A |  ...
    |Z | ...z...  |      | B | ...
    |  | ...z...  |      | B | ...
    |P | ...p...  |      | C | ...
    |_____________|      | Z | ...
                         | A | ...

As each row in the huge table flies past, it is decorated with the matching rows from the tiny table and emitted. Holding the data fully in memory in a hash table gives you constant-time lookup speed for each key, and lets you access rows at the speed of RAM.

A single map-only pass through the data is enough to do the join.

See distribution of weather measurements for an example.

Example: map-side join of wikipedia page metadata with wikipedia pageview stats
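Here is a minimal sketch of how that example might look. The file paths and field names are placeholders, but `USING 'replicated'` is the real Pig directive for a fragment-replicate join:

    -- Paths and schemas are illustrative.
    pageviews = LOAD 'wikipedia_pageviews'     AS (page_id:long, views:long);      -- huge
    metadata  = LOAD 'wikipedia_page_metadata' AS (page_id:long, title:chararray); -- tiny
    -- The huge table goes first; every table after it is loaded into a hash on each mapper.
    joined = JOIN pageviews BY page_id, metadata BY page_id USING 'replicated';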

Merge Join

How a merge join works

In a merge join, both inputs must already be sorted on the join key. Pig first runs a lightweight job that builds an index over the right-hand table's part files; the join itself is then a single map-only pass that streams the left-hand table and uses the index to seek directly to the matching keys on the right. Neither table has to fit in memory; they only have to be sorted.

Quoting Pig docs:

You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data).

Example: merge join of user graph with page rank iteration
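A sketch of what that might look like (names are placeholders; `USING 'merge'` is the real directive, and both inputs must already be sorted on the join key):

    -- Names illustrative; both relations must be pre-sorted on user_id.
    graph = LOAD 'user_graph_sorted'    AS (user_id:long, follower_id:long);
    ranks = LOAD 'pagerank_iter_sorted' AS (user_id:long, rank:double);
    joined = JOIN graph BY user_id, ranks BY user_id USING 'merge';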

Skew Join

Reach for a skew join when a handful of keys are drastically overrepresented: with a standard join, a single reducer would still be grinding through the popular keys long after every other reducer had finished. That is the stuck-reducer problem from the registration-line story above.

How a skew join works

Pig first runs a sampling job over the skewed table to build a histogram of key frequencies. Keys with too many records to fit in a single reducer's memory are split across several reducers; the matching records from the other table are duplicated to each of those reducers, so every fragment can complete its share of the join.
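The syntax follows the same pattern as the other specialized joins. A sketch (names are placeholders; `USING 'skewed'` is the real directive, with the skewed table listed first):

    -- Names illustrative; list the relation with the skewed keys first.
    pageviews = LOAD 'pageview_counts' AS (page_id:long, views:long);     -- a few wildly popular pages
    pages     = LOAD 'page_metadata'   AS (page_id:long, title:chararray);
    joined = JOIN pageviews BY page_id, pages BY page_id USING 'skewed';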

Example: ? counting triangles in wikipedia page graph ? OR ? Pageview counts ?

TBD

Efficiency and Scalability

Do’s and Don’ts

The Pig Documentation has a comprehensive section on Performance and Efficiency in Pig. We won’t try to improve on it, but here are some highlights:

  • As early as possible, reduce the size of your data:

    • LIMIT

    • Use a FOREACH to reject unnecessary columns

    • FILTER

  • Filter out `Null`s before a join. In a join, all the records rendezvous at the reducer; if you reject nulls at the map side, you will reduce network load (see the sketch below).
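Here is a minimal sketch of those do's in one script (paths and field names are placeholders):

    -- Paths and schemas illustrative: trim rows and columns as early as possible.
    logs = LOAD 'weblogs' AS (ip:chararray, url:chararray, status:int, bytes:long);
    slim = FOREACH logs GENERATE ip, url;    -- reject unnecessary columns up front
    good = FILTER slim BY ip IS NOT NULL;    -- null keys get shipped to a reducer but never match
    some = LIMIT good 100000;                -- if you only need a sample, say so early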

Join Optimizations

"Make sure the table with the largest number of tuples per key is the last table in your query. In some of our tests we saw 10x performance improvement as the result of this optimization.

small = load 'small_file' as (t, u, v);
large = load 'large_file' as (x, y, z);
C = join small by t, large by x;

Why? For each key, Pig buffers the records from every table except the last one in memory, and streams the last table's records past them. Putting the table with the most records per key last means the bulky side is streamed, not buffered, which keeps reducer memory small.

(come up with a clever mnemonic that doesn’t involve sex, or get permission to use the mnemonic that does.)

Magic Combiners

TBD

Turn off Optimizations

After you’ve been using Pig for a while, you might enjoy learning about all those wonderful optimizations, but it’s rarely necessary to think about them.

In rare cases, you may suspect that the optimizer is working against you or affecting results.

To turn off an optimization, pass the `-t` (`-optimizer_off`) flag when you launch Pig, naming the rule you want disabled.
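For example (rule names vary by Pig version; run `pig -help` for the list your release supports). The first line disables two specific rules; `-t All` disables the whole rule-based logical optimizer:

    pig -t PushUpFilter -t MergeFilter myscript.pig
    pig -t All myscript.pig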

Exercises

  1. Quoting Pig docs: "You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data)."

    Why is this?
  2. Each of the following snippets goes against the Pig documentation’s recommendations in one clear way.

    • Rewrite it according to best practices

    • Compare the run time of your improved script against the bad version shown here.

      things like this from http://pig.apache.org/docs/r0.9.2/perf.html --
      1. (fails to use a map-side join)

      2. (join large on small, when it should join small on large)

      3. (many `FOREACH`es instead of one expanded-form `FOREACH`)

      4. (expensive operation before LIMIT)

For each exercise, use the weather data on weather stations.

Pig and HBase

TBD

Pig and JSON

TBD