
Hadoop Metrics

The USE Method applied to Hadoop

There’s an excellent methodology for performance analysis known as the "USE Method": "For every resource, check Utilization, Saturation and Errors" [1]:

  • utilization — How close to capacity is the resource? (ex: CPU usage, % disk used)

  • saturation  — How often is work waiting? (ex: network drops or timeouts)

  • errors  — Are there error events?
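To make that concrete, here is a minimal sketch of a USE snapshot for a single resource (CPU) on a Linux host; the sources and thresholds are illustrative choices, not part of the method itself:

[source,bash]
----
#!/usr/bin/env bash
# Minimal USE snapshot for one resource (CPU) on a Linux host.
# The data sources here are illustrative, not part of the method.

# Utilization: % non-idle CPU time over a one-second sample of /proc/stat.
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
total=$(( busy + (i2 - i1) ))
echo "utilization: $(( 100 * busy / total ))% busy"

# Saturation: runnable tasks waiting for a CPU (vmstat's "r" column).
runq=$(vmstat 1 2 | tail -1 | awk '{print $1}')
echo "saturation: run queue $runq vs $(nproc) CPUs"

# Errors: machine-check events in the kernel log (may require root).
echo "errors: $(dmesg | grep -ci 'mce\|hardware error') suspect kernel log lines"
----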

For example, USE metrics for a supermarket cashier would include:

  • checkout utilization: flow of items across the belt; constantly stopping to look up the price of tomatoes or check coupons will harm utilization.

  • checkout saturation: number of customers waiting in line

  • checkout errors: calling for the manager

The cashier is an I/O resource; there are also capacity resources, such as available stock of turkeys:

  • utilization: amount of remaining stock (zero remaining would be 100% utilization)

  • saturation: in the US, you may need to sign up for turkeys near the Thanksgiving holiday.

  • errors: spoiled or damaged product

As you can see, it’s possible to have high utilization and low saturation (steady stream of customers but no line, or no turkeys in stock but people happily buying ham), low utilization and high saturation (a cashier in training with a long line), or any other combination.

Why it’s interesting

It may not be obvious why the USE method is novel — "Hey, it’s good to record metrics" isn’t profound advice.

The USE method is novel in its negative space — "Hey, it’s good enough to record this limited set of metrics" is quite surprising advice.

So if you prefer, here’s "USE method the extended remix": For each resource, check Utilization, Saturation and Errors. This is the necessary and sufficient foundation of a system performance analysis, and advanced effort is only justifiable once you have done so. There are further benefits:

  • An elaborate solution space becomes a finite, parallelizable, delegatable list of activities.

  • Blind spots become obvious. We once had a client issue where datanodes would go dark in difficult-to-reproduce circumstances. After much work, we found that persistent connections from Flume and chatter around numerous small files created by Hive were consuming all available datanode handler threads. We learned by pain what we could have learned by making a list — there was no visibility into the number of available handler threads.

  • Saturation metrics are often non-obvious, and high saturation can have non-obvious consequences (for example, packet drops under network saturation force TCP retransmission backoff, which puts a hard bound on throughput; the sketch after this list shows where those signals surface).

  • It’s known that Teddy Bears make superb level-1 support techs, because being forced to deliver a clear, uninhibited problem statement often suggests its solution. To some extent any framework this clear and simple will carry benefits, simply by forcing an organized problem description.
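For instance, two of the non-obvious network saturation signals above can be pulled from standard Linux sources (field positions assume the classic /proc/net/dev layout, and yes, "retransmited" really is how netstat spells it):

[source,bash]
----
# TCP retransmissions: a saturation signal that never shows up in "% used".
netstat -s | grep -i retransmit

# Per-interface drop counts from /proc/net/dev (RX drop = field 5,
# TX drop = field 13 once the "iface:" colon is split off).
awk 'NR > 2 { sub(":", " "); print $1, "rx_drop=" $5, "tx_drop=" $13 }' /proc/net/dev
----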

Look for the Bounding Resource

The USE metrics described below help you to identify the limiting resource of a job; to diagnose a faulty or misconfigured system; or to guide tuning and provisioning of the base system.

Improve / Understand Job Performance

Hadoop is designed to drive max utilization of its bounding resource by coordinating manageable saturation of the resources in front of it.

The "bounding resource" is the fundamental limit on performance — you can’t process a terabyte of data from disk faster than you can read a terabyte of data from disk. k

  • disk read throughput

  • disk write throughput

  • job process CPU

  • child process RAM, with efficient utilization of internal buffers

  • If you don’t have the ability to specify hardware, you may need to accept "network read/write throughput" as a bounding resource during replication.

At each step of a job, what you’d like to see is very high utilization of exactly one bounding resource from that list, with reasonable headroom and managed saturation for everything else. What’s "reasonable"? As a rule of thumb, utilization above 70% in a non-bounding resource deserves a closer look.
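A sketch of that rule of thumb, checking a few non-bounding resources and flagging anything over 70%; the probes are simplified stand-ins for the full checklist below:

[source,bash]
----
#!/usr/bin/env bash
# Flag non-bounding resources running above the 70% rule-of-thumb line.
THRESHOLD=70

flag() {  # $1 = name, $2 = percent used
  if [ "$2" -gt "$THRESHOLD" ]; then echo "LOOK CLOSER: $1 at $2%"
  else echo "ok: $1 at $2%"; fi
}

# Memory, excluding caches and buffers (the "used" column of modern free).
flag "memory" "$(free | awk '/^Mem:/ { printf "%d", 100 * $3 / $2 }')"

# Root filesystem capacity.
flag "root disk" "$(df --output=pcent / | tail -1 | tr -dc '0-9')"

# File handles for a given daemon PID, against that process's own limit.
pid=$1
if [ -n "$pid" ]; then
  used=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/Max open files/ { print $4 }' "/proc/$pid/limits")
  flag "open files (pid $pid)" $(( 100 * used / limit ))
fi
----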

Diagnose Flaws
Balanced Configuration/Provisioning of base system

Resource List

Please see the [glossary] for definitions of terms. I’ve borrowed many of the system-level metrics from Brendan Gregg’s Linux Checklist; visit there for a more-detailed list.

Table 1. USE Method Checklist (grouped by concern; a short measurement sketch follows each group)

CPU-like concerns

resource      | type        | metric                | instrument
CPU           | utilization | system CPU            | top/htop: CPU %, overall
CPU           | utilization | job process CPU *     | top/htop: CPU %, each child process
CPU           | saturation  | max user processes    | ulimit -u ("nproc" in /etc/security/limits.d/…)
mapper slots  | utilization | mapper slots used     | jobtracker console; impacted by mapred.tasktracker.map.tasks.maximum
mapper slots  | saturation  | mapper tasks waiting  | jobtracker console; impacted by scheduler and by speculative execution settings
mapper slots  | saturation  | task startup overhead | ???
mapper slots  | saturation  | combiner activity     | jobtracker console (TODO table cell name)
reducer slots | utilization | reducer slots used    | jobtracker console; impacted by mapred.tasktracker.reduce.tasks.maximum
reducer slots | saturation  | reducer tasks waiting | jobtracker console; impacted by scheduler and by speculative execution and slowstart settings
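Most of these rows are read off the jobtracker console, but the process-count row scripts easily. A possible spot-check — the "mapred" username is an assumption, so adjust it to your install:

[source,bash]
----
# Compare running processes for the task user against its nproc limit.
# Run as root so su can switch users; "mapred" is a guess at your task user.
user=mapred
running=$(ps -u "$user" --no-headers | wc -l)
limit=$(su -s /bin/sh "$user" -c 'ulimit -u')
echo "$user: $running processes of $limit allowed"
----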

Memory concerns

resource           | type        | metric                | instrument
memory capacity    | utilization | total non-OS RAM      | free, htop; you want the total excluding caches+buffers
memory capacity    | utilization | child process RAM *   | free, htop: "RSS"; impacted by mapred.map.child.java.opts and mapred.reduce.child.java.opts
memory capacity    | utilization | JVM old-gen used      | JMX
memory capacity    | utilization | JVM new-gen used      | JMX
memory capacity    | saturation  | swap activity         | vmstat 1 - look for nonzero "si"/"so" (pages swapped in/out)
memory capacity    | saturation  | old-gen gc count      | JMX, gc logs (must be specially enabled)
memory capacity    | saturation  | old-gen gc pause time | JMX, gc logs (must be specially enabled)
memory capacity    | saturation  | new-gen gc pause time | JMX, gc logs (must be specially enabled)
mapper sort buffer | utilization | record size limit     | announced in job process logs; controlled indirectly by io.sort.record.percent and io.sort.spill.percent
mapper sort buffer | utilization | record count limit    | announced in job process logs; controlled indirectly by io.sort.record.percent and io.sort.spill.percent
mapper sort buffer | saturation  | spill count           | spill counters (jobtracker console)
mapper sort buffer | saturation  | sort streams          | io sort factor tunable (io.sort.factor)
shuffle buffers    | utilization | buffer size           | child process logs
shuffle buffers    | utilization | buffer %used          | child process logs
shuffle buffers    | saturation  | spill count           | spill counters (jobtracker console)
shuffle buffers    | saturation  | sort streams          | merge parallel copies tunable mapred.reduce.parallel.copies (TODO: also io.sort.factor?)
OS caches/buffers  | utilization | total caches+buffers  | free, htop
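The gc logs flagged as "must be specially enabled" are switched on through the child JVM options named in this table. One hedged example: the property names are the Hadoop 1.x ones used throughout this table, the heap size and log flags are illustrative, the jar and class are hypothetical, and -D options require your job to use ToolRunner:

[source,bash]
----
# Enable verbose GC logging in map and reduce child JVMs for one job run.
# More typically these land in mapred-site.xml; myjob.jar is a placeholder.
hadoop jar myjob.jar com.example.MyJob \
  -Dmapred.map.child.java.opts="-Xmx1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  -Dmapred.reduce.child.java.opts="-Xmx1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
----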

disk concerns

resource        | type        | metric          | instrument
system disk I/O | utilization | req/s, read     | iostat -xz 1 (system-wide); iotop (per process); /proc/{PID}/sched "se.statistics.iowait_sum"
system disk I/O | utilization | req/s, write    | iostat -xz 1 (system-wide); iotop (per process); /proc/{PID}/sched "se.statistics.iowait_sum"
system disk I/O | utilization | MB/s, read *    | iostat -xz 1 (system-wide); iotop (per process); /proc/{PID}/sched "se.statistics.iowait_sum"
system disk I/O | utilization | MB/s, write *   | iostat -xz 1 (system-wide); iotop (per process); /proc/{PID}/sched "se.statistics.iowait_sum"
system disk I/O | saturation  | queued requests | iostat -xz 1; look for "avgqu-sz" > 1, or high "await"
system disk I/O | errors      | device errors   | /sys/devices/…/ioerr_cnt; smartctl; /var/log/messages
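Beyond eyeballing iostat, queue depth can be read directly: field 12 of /proc/diskstats is "I/Os currently in progress", a direct saturation reading. The sd[a-z] device pattern assumes plain SCSI/SATA naming:

[source,bash]
----
# Sample in-flight I/O depth per disk once a second; sustained values
# above ~1 per spindle suggest a saturated device.
while sleep 1; do
  awk '$3 ~ /^sd[a-z]$/ { print $3, "inflight=" $12 }' /proc/diskstats
done
----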

network concerns

resource        | type        | metric                       | instrument
network I/O     | utilization | RX/TX throughput             | netstat; ip -s link; /proc/net/dev — throughput as fraction of max bandwidth
network I/O     | saturation  | drops, overruns, retransmits | ifconfig ("overruns", "dropped"); netstat -s ("segments retransmited"); /proc/net/dev (RX/TX "drop")
network I/O     | errors      | interface-level              | ifconfig ("errors", "dropped"); netstat -i ("RX-ERR"/"TX-ERR"); /proc/net/dev ("errs", "drop")
network I/O     | errors      | request timeouts             | daemon and child process logs
handler threads | utilization | nn handlers                  | (TODO: how to measure) vs dfs.namenode.handler.count
handler threads | utilization | jt handlers                  | (TODO: how to measure) vs (TODO: tunable)
handler threads | utilization | dn handlers                  | (TODO: how to measure) vs dfs.datanode.handler.count
handler threads | utilization | dn xceivers                  | (TODO: how to measure) vs dfs.datanode.max.xcievers
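The handler rows above are marked TODO; one workable approximation, until a proper metric exists, is to count IPC handler threads in a jstack dump. The thread-name pattern matches Hadoop's IPC server in the 1.x line, but verify it against your version before trusting the numbers:

[source,bash]
----
# Approximate datanode handler utilization from a thread dump.
dn_pid=$(jps | awk '/DataNode/ { print $1 }')
total=$(jstack "$dn_pid" | grep -c 'IPC Server handler')
busy=$(jstack "$dn_pid" | grep -A1 'IPC Server handler' | grep -c 'RUNNABLE')
echo "datanode handlers: $busy busy of $total (cap: dfs.datanode.handler.count)"
----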

framework concerns

resource      | type        | metric               | instrument
disk capacity | utilization | system disk used     | df -BM
disk capacity | utilization | HDFS directories     | du -smc on each directory in dfs.data.dir, dfs.name.dir, fs.checkpoint.dir
disk capacity | utilization | mapred scratch space | du -smc /path/to/mapred_scratch_dirs (TODO scratch dir tunable)
disk capacity | utilization | total HDFS free      | namenode console
disk capacity | utilization | open file handles    | ulimit -n ("nofile" in /etc/security/limits.d/…)
job process   | errors      | stderr log           |
job process   | errors      | stdout log           |
job process   | errors      | counters             |
datanode      | errors      | daemon log           |
namenode      | errors      | daemon log           |
secondarynn   | errors      | daemon log           |
tasktracker   | errors      | daemon log           |
jobtracker    | errors      | daemon log           |
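A capacity sweep over the framework rows might look like the following; every path here is hypothetical, so substitute the directories from your own *-site.xml files:

[source,bash]
----
# Disk capacity and file-handle sweep. Paths are placeholders.
for d in /data/1/hdfs/data /data/2/hdfs/data /data/1/mapred/local; do
  du -smc "$d"
done
df -BM /data/1 /data/2

# Open files for the datanode, against its own soft limit.
dn_pid=$(jps | awk '/DataNode/ { print $1 }')
echo "datanode fds: $(ls /proc/$dn_pid/fd | wc -l) of" \
     "$(awk '/Max open files/ { print $4 }' /proc/$dn_pid/limits)"
----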

Metrics marked with an asterisk (*) are critical resources: you would like to have exactly one of these at its full sustainable level.


1. developed by Brendan Gregg for system performance tuning, modified here for Hadoop