Skip to content

Latest commit

 

History

History
46 lines (28 loc) · 2.21 KB

fu08-hadoop_api.asciidoc

File metadata and controls

46 lines (28 loc) · 2.21 KB

The Hadoop Java API

When to use the Hadoop Java API

Don’t.

How to use the Hadoop Java API

The Java API provides direct access but requires you to write the entirety of your program from boilerplate

Decompose your problem into small isolable transformations. Implement each as a Pig Load/StoreFunc or UDF (User-Defined Function) making calls to the Hadoop API.[^1]

The Skeleton of a Hadoop Java API program

I’ll trust that for very good reasons — to interface with an outside system, performance, a powerful library with a Java-only API — Java is the best choice to implement

[ Your breathtakingly elegant, stunningly performant solution to a novel problem ]

When this happens around the office, we sing this little dirge[^2]:

Life, sometimes, is Russian Novel. Is having unhappy marriage and much snow and little vodka.
But when Russian Novel it is short, then quickly we finish and again is Sweet Valley High.

What we don’t do is write a pure Hadoop-API Java program. In practice, those look like this:

HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT PARSING PARAMS
HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT READING FILES
COBBLED-TOGETHER CODE TO DESERIALIZE THE FILE, HANDLE SPLITS, ETC
[ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S FLATTEN COMMAND DOES
COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S CROSS COMMAND DOES
A SIMPLE COMBINER COPIED FROM TOM WHITE'S BOOK
1000 LINES OF CODE TO DO WHAT RUBY COULD IN THREE LINES OF CODE
HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT WRITING FILES
UGLY BUT NECESSARY CODE TO GLUE THIS TO THE REST OF THE ECOSYSTEM

The surrounding code is ugly and boring; it will take more time, produce more bugs, and carry a higher maintenance burden than the important stuff. More importantly, the high-level framework provides an implementation far better than it’s worth your time to recreate.[^3]

Instead, we write

A SKELETON FOR A PIG UDF DEFINITION
[ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
A PIG SCRIPT