Note: this assumes you have a working Hadoop cluster, however large or small.
As you’ve surely guessed, Hadoop is organized very much like the Chimpanzee & Elephant team. Let’s dive in and see it in action.
First, copy the data onto the cluster:
hadoop fs -mkdir ./data
hadoop fs -put wukong_example_data/text ./data/
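If you’d like to confirm the copy landed where you expect, a quick listing should do it (these paths simply mirror the commands above):

hadoop fs -ls ./data          # should show the new text/ directory
hadoop fs -ls ./data/text     # should list the files copied from wukong_example_data/text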
The hadoop fs commands above understand ./data/text to be a path on the HDFS, not your local disk; the dot . is treated as your HDFS home directory (use it as you would ~ in Unix). The hadoop fs -put command, which takes a list of local paths and copies them to the HDFS, treats its final argument as an HDFS path and all the preceding paths as local.
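For example, you could copy several local files in one shot; everything before the final argument is local, and the last argument names the HDFS destination directory (the filenames here are made up for illustration):

hadoop fs -put notes.txt draft_one.txt draft_two.txt ./data/text/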
Let’s first test on the same tiny little file we used at the commandline. Make sure to notice how much longer it takes this elephant to squash a flea than it took to run without Hadoop.
wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
After outputting a bunch of happy robot-ese to your screen, the job should appear on the jobtracker window within a few seconds. The whole job should complete in far less time than it took to set it up. You can compare its output to that of the earlier run with
hadoop fs -cat ./output/latinized_magi/*
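If you kept the output of that earlier commandline run in a local file, a quick diff will confirm the two agree; this sketch assumes you saved it as latinized_magi_local.txt and that the records come out in the same order:

hadoop fs -cat ./output/latinized_magi/* | diff - latinized_magi_local.txt && echo "outputs match"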
Now let’s run it on the full Shakespeare corpus. Even this is hardly enough data to make Hadoop break a sweat, but it does show off the power of distributed computing.
wukong launch examples/text/pig_latin.rb ./data/text ./output/latinized_text
We’ll go into much more detail in (TODO: ref), but here are the essentials of what you just performed.
When you ran the hadoop fs -mkdir command, the namenode (Nanette’s Hadoop counterpart) simply made a notation in its directory: no data was stored. If you’re familiar with the term, think of the namenode as a 'File Allocation Table (FAT)' for the HDFS.
When you run hadoop fs -put …, the putter process does the following for each file:

1. Contacts the namenode to create the file. This also just makes a note of the file; the namenode doesn’t ever have actual data pass through it.
2. Instead, the putter process asks the namenode to allocate a new data block. The namenode designates a set of datanodes (typically three), along with a permanently-unique block ID.
3. The putter process transfers the file over the network to the first datanode in the set; that datanode transfers its contents to the next one, and so forth. The putter doesn’t consider its job done until a full set of replicas has acknowledged successful receipt (the fsck sketch after this list lets you see where those replicas landed).
4. As soon as each HDFS block fills, even if it is mid-record, it is closed; steps 2 and 3 are repeated for the next block.
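If you’re curious where those blocks and their replicas actually ended up, fsck will report them; a sketch, assuming your HDFS home directory is /user/chimpy (substitute your own username):

hadoop fsck /user/chimpy/data/text/magi.txt -files -blocks -locations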
Now let’s look at what happens when you run your job.
(TODO: verify this is true in detail. @esammer?)
- Runner: The program you launch sends the job and its assets (code files, etc.) to the jobtracker. The jobtracker hands back a job_id (something like job_201204010203_0002: the datetime the jobtracker started plus the count of jobs launched so far); you’ll use this to monitor and, if necessary, kill the job, as shown after this list.
- Jobtracker: As tasktrackers "heartbeat" in, the jobtracker hands them a set of 'tasks': the code to run and the data segment to process (the "split", typically an HDFS block).
- Tasktracker: Each tasktracker launches a set of 'mapper child processes', each one an 'attempt' of the tasks it received. (TODO verify:) It periodically reassures the jobtracker with progress and in-app metrics.
- Jobtracker: The jobtracker continually updates the job progress and app metrics. As each tasktracker reports a completed attempt, it receives a new one from the jobtracker.
- Tasktracker: After some progress, the tasktrackers also fire off a set of reducer attempts, similar to the mapper step.
- Runner: The runner stays alive, reporting progress, for the full duration of the job. As soon as the job_id is delivered, though, the Hadoop job itself doesn’t depend on the runner; even if you stop the process or disconnect your terminal, the job will continue to run.
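In particular, armed with the job_id you can monitor or kill the job from the commandline; a sketch using the stock hadoop job commands (the job_id below is the made-up one from the example above):

hadoop job -list                            # jobs currently running
hadoop job -status job_201204010203_0002    # completion percentage and counters
hadoop job -kill   job_201204010203_0002    # stop the job for good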
Warning: Please keep in mind that the tasktracker does not run your code directly; it forks a separate process in a separate JVM with its own memory demands. The tasktracker rarely needs more than a few hundred megabytes of heap, and you should not see it consuming significant I/O or CPU.
I’ve danced around a minor but important detail that the workers take care of. For the chimpanzees, books are chopped up into blocks of a set number of pages, but the chimps translate sentences, not pages, and a block boundary might fall mid-sentence.
The Hadoop equivalent, of course, is that a data record may cross an HDFS block boundary. (In fact, you can force map-reduce splits to happen anywhere in the file, but the default and typically most efficient choice is to split at HDFS blocks.)
A mapper will skip the first record of a split if it’s partial and carry on from there. Since there are many records in each split, that’s no big deal. When it gets to the end of the split, the task doesn’t stop processing until it completes the current record; the framework makes the overhanging data seamlessly appear.
In practice, Hadoop users only need to worry about record splitting when writing a custom InputFormat or when practicing advanced magick. You’ll see lots of reference to it, though; it’s a crucial subject for those inside the framework, but for regular users the story I just told is more than enough detail.