Skip to content

Latest commit

 

History

History
287 lines (189 loc) · 14.8 KB

ha2a-tuning-wise_and_lazy.asciidoc

File metadata and controls

287 lines (189 loc) · 14.8 KB

Hadoop Tuning for the wise and lazy

There are enough knobs and twiddles on a hadoop installation to fully stock the cockpit of a 747. Many of them interact surprisingly, and many settings improve some types of jobs while impeding others. This chapter will help you determin

  • Baseline constraints of system components: CPU, disk, memory, network

  • Baseline constraints of Atomic operations: stream, join, sort, filter

  • Baseline constraints of each stage: setup, read, mapper, spill/combine, midflight, shuffle, reducer, write, replicate, commit.

Hadoop is designed to put its limiting resource at full utilization.

Baseline Performance

best-case scenario:

  • all-local mapper tasks

  • mapper throughput at baseline rate

  • low mapper setup overhead

  • one mapper spill per record

  • low variance in mapper finish time

  • shuffle is largely complete when last merge segments come in

  • reducer throughput at baseline rate

  • low replication overhead

Performance constraints: job stages

Raw ingredients:

  • scripts:

    • nullify  — 

    • identity  — 

    • faker  — generates address records deterministically.

    • should have a partition key we can make dance (see below)

    • should have a total-ordered line number

  • files:

    • zeros  — 512 zero-byte files

    • oneline  — 512 files, each with only its index

    • fakered  — faker.rb-generated 64-GB dataset as 1 64-GB file, 8 8-GB, 64 1-GB, 512 128-MB files. Re-running faker script will recreate fakered dataset. -

  • setups:

    • free-shuf -- set up reduce-slotted-only workers, with max-sized shuffle buffers, no shuffle flush (i.e as close as we can get to zero shuffle)

    • baseline  — large output block size, replication factor 1

  • setup

    • zeros -- mapper-only -- swallow

    • oneline -- mapred -- identity

  • read

    • fakered-128 -- mapper-only -- emit nothing

  • mapper

    • fakered-128 -- mapper-only -- split fields, regexp, but don’t emit

    • fakered-128 -- mapper-only -- split fields, regexp, emit

    • oneline -- mapper-only -- faker

  • spill/combine

    • fakered-128 -- mapred -- identity

    • oneline -- mapred -- faker

  • midflight:

    • xx -- free-shuffle -- swallow

  • shuffle; with various sizes of data per reducer

    • fakered -- lo-skew -- swallow

    • fakered -- hi-skew -- swallow

  • reducer

    • fakered -- mapred -- identity -- identity -- replication factor 1

    • oneline -- mapred -- identity -- faker -- replication factor 1

    • fakered -- mapred -- identity -- split fields, regexp, but don’t emit -- replication factor 1

    • fakered -- mapred -- identity -- split fields, regexp, emit -- replication factor 1

  • write

    • oneline -- mapred -- identity -- faker

  • replicate

    • oneline -- mapred -- identity -- faker -- replication factor 1

    • oneline -- mapred -- identity -- faker -- replication factor 2

    • oneline -- mapred -- identity -- faker -- replication factor 3

    • oneline -- mapred -- identity -- faker -- replication factor 5

  • commit

    • oneline -- mapred -- identity -- identity

    • oneline -- mapred -- identity -- swallow

Variation
  • non-local map tasks

  • EBS volumes

  • slowstart

  • more reducers than slots

  • S3 vs EBS vs HBase vs Elasticsearch vs ephemeral HDFS

Performance constraints: by operation

mapper-only performance

disk-cpu-disk only

  • FOREACH only

  • FILTER on a numeric column only

  • MATCH only

  • decompose region into tiles

midflight

Active vs Passive Benchmarks

When tuning, you should engage in active benchmarking. Passive benchmarking would be to start a large job run, time it on the wall clock (plus some other global measures) and call that a number. Active benchmarking means that while that job is running you watch the fine-grained metrics (following the [use_method]) — validate that the limiting resource is what you believe it to be, and understand how the parameters you are varying drive tradeoffs among other resources.