The Java API provides direct access, but requires you to build your entire program up from boilerplate.
Decompose your problem into small isolable transformations. Implement each as a Pig Load/StoreFunc or UDF (User-Defined Function) making calls to the Hadoop API.[^1]
I’ll trust that, for very good reasons (interfacing with an outside system, performance, a powerful library with a Java-only API), Java is the best choice to implement:
[ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
When this happens around the office, we sing this little dirge[^2]:
Life, sometimes, is Russian Novel. Is having unhappy marriage and much snow and little vodka. But when Russian Novel it is short, then quickly we finish and again is Sweet Valley High.
What we don’t do is write a pure Hadoop-API Java program. In practice, those look like this:
HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICEABLE JOB AT PARSING PARAMS
HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICEABLE JOB AT READING FILES
COBBLED-TOGETHER CODE TO DESERIALIZE THE FILE, HANDLE SPLITS, ETC
[ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S FLATTEN COMMAND DOES
COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S CROSS COMMAND DOES
A SIMPLE COMBINER COPIED FROM TOM WHITE'S BOOK
1000 LINES OF CODE TO DO WHAT RUBY COULD IN THREE LINES OF CODE
HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICEABLE JOB AT WRITING FILES
UGLY BUT NECESSARY CODE TO GLUE THIS TO THE REST OF THE ECOSYSTEM
The surrounding code is ugly and boring; it will take more time, produce more bugs, and carry a higher maintenance burden than the important stuff. More importantly, the high-level framework already provides a far better implementation of that surrounding code than it is worth your time to recreate.[^3]
Instead, we write:
A SKELETON FOR A PIG UDF DEFINITION
[ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
A PIG SCRIPT
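
As a rough illustration of that shape, here is a minimal sketch in which the interesting part lives in a plain Java method and the Pig UDF is only a thin shell around it. Everything named here (`DomainNormalizer`, `normalize`, the `NormalizeDomain` wrapper) is hypothetical and invented for this example, and the host-extraction logic is deliberately simplistic. The wrapper appears as a comment because it needs the Pig jar on the classpath to compile; `EvalFunc` and `Tuple` are Pig's standard UDF types.

```java
import java.util.Locale;

/**
 * Hypothetical stand-in for "your elegant solution": plain Java with no
 * Hadoop or Pig dependency, so it can be unit-tested without a cluster.
 */
public class DomainNormalizer {

    /** Returns a lowercased host name from a URL-like string, or null if none is found. */
    public static String normalize(String url) {
        if (url == null) {
            return null;
        }
        String s = url.trim().toLowerCase(Locale.ROOT);

        // Drop a leading scheme such as "http://" if one is present.
        int scheme = s.indexOf("://");
        if (scheme >= 0) {
            s = s.substring(scheme + 3);
        }

        // Cut at the first port, path, query, or fragment delimiter.
        int end = s.length();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ':' || c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        s = s.substring(0, end);

        if (s.startsWith("www.")) {
            s = s.substring(4);
        }
        return s.isEmpty() ? null : s;
    }

    // The Pig-facing skeleton is a thin wrapper around the method above.
    // Shown as a comment (an assumption-level sketch) since it requires the Pig jar:
    //
    //   public class NormalizeDomain extends org.apache.pig.EvalFunc<String> {
    //       @Override
    //       public String exec(org.apache.pig.data.Tuple input) throws java.io.IOException {
    //           if (input == null || input.size() == 0) return null;
    //           return DomainNormalizer.normalize((String) input.get(0));
    //       }
    //   }
}
```

On the Pig side, the script only has to `REGISTER` the jar containing the class and call the function inside a `FOREACH ... GENERATE`; parsing parameters, reading files, handling splits, and writing output all stay Pig's problem.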