Herding `cat`s

Moving things to and fro

TODO: timings; show `hdp-shovel`; show `distcp`.

To put something on the HDFS directly from a pipe:

hdp-mkdir infochimps.com
curl 'http://infochimps.com' | hdp-put - infochimps.com/index.html
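
The `hdp-mkdir` and `hdp-put` commands are (presumably) thin wrappers around the stock `hadoop fs` commands; if you don’t have them on your path, a rough equivalent is below. Note that `hadoop fs -put` reads from standard input when you give it `-` as the source:

hadoop fs -mkdir infochimps.com
curl 'http://infochimps.com' | hadoop fs -put - infochimps.com/index.html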

Don’t use NFS

Stupid Hadoop Tricks

Mappers that process filenames, not file contents

You can output anything you want in your mappers.

Every once in a while, you need to do something where getting the content onto the HDFS is almost more work than it’s worth. For instance, say you had to process a whole bunch of files located in no convenient place or organization. You could either:

  • pull in all the files

  • transfer the files to the HDFS

  • start the job to process them

  • transfer them back off

or:

  • send, as the mapper input, the files to fetch

  • each mapper fetches the page contents and emits them (see the sketch below)
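
Here’s a minimal sketch of such a mapper for Hadoop Streaming, written as a shell script. The tab-separated output layout and the curl options are illustrative choices, not anything Hadoop requires:

#!/usr/bin/env bash
# Streaming mapper: each input line is a URL to fetch.
# Emits "url <TAB> page contents", with newlines squashed so each page is one record.
while read -r url; do
  contents=$(curl -sS --max-time 30 "$url" | tr '\n' ' ')
  printf '%s\t%s\n' "$url" "$contents"
done

Run it with zero reducers and the job’s output is simply the fetched pages, one per line.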

Be careful: Hadoop has no rate limiting. It will quite happily obliterate any system you point it at; from the target’s point of view, there’s no apparent difference between Hadoop and a concentrated Distributed Denial of Service (DDoS) attack.


Benign DDOS

Speaking of which… So you have an API, and you think it’s working well; in fact, you think it’s working really well. Want to simulate a 200x load spike? Replay a week’s worth of request logs at your server, accelerated so they all arrive within an hour. Each mapper reads a section of the logs and makes the corresponding request, setting its browser string and referer URL to match the original. It emits the response duration, HTTP status code, and content size. There are dedicated tools for this kind of HTTP benchmarking, but they typically make the same request over and over; replaying a real load at higher speed means your caching strategy is properly exercised.
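
A minimal sketch of that replay mapper, again as a shell script for Hadoop Streaming; the input layout (tab-separated URL, user-agent, and referer) is an assumption for illustration:

#!/usr/bin/env bash
# Streaming mapper: each input line is "url <TAB> user_agent <TAB> referer".
# Emits "url <TAB> http_status <TAB> seconds <TAB> bytes" for each replayed request.
while IFS=$'\t' read -r url agent referer; do
  printf '%s\t' "$url"
  # -A sets the User-Agent header, -e the Referer; the response body is discarded.
  curl -sS -o /dev/null -A "$agent" -e "$referer" \
    -w '%{http_code}\t%{time_total}\t%{size_download}\n' "$url"
done

The acceleration comes from parallelism: hand each mapper a slice of the logs and the cluster will happily issue them all at once.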