diff --git a/README.md b/README.md
index a31cf9a..4e3c82f 100644
--- a/README.md
+++ b/README.md
@@ -3,22 +3,34 @@ AnarchyApe
 Fault injection tool for hadoop cluster from yahoo anarchyape
+Pre-requisites
+-----------
+
+- Java JDK >= 1.7
+- pdsh
+- stress-ng for CPU and Memory hog
 Compilation
 -----------
-[Java]
-Required version: 1.7.0.21 or later
+[Java with maven]
+```
+mvn package
+mv target/anarchyape.jar ape.jar
+
+```
+
+[Java manual]
 ```
 cd src/main/java
-download log4j.ar and commons-cli.jar
+# download log4j-1.4.12.jar and apache commons-cli-1.2.jar
 java -cp . ape/Main.java
 rm ape.jar
+# Use either (preferred):
 jar cfm ape.jar META-INF/MANIFEST.MF ape META-INF/services org
-
-(old way)
+# either :
 javac -cp .:log4j-1.4.12.jar:commons-cli-1.2.jar ape/*.java
 ```
@@ -40,7 +52,7 @@ Running
 ```
 java -jar ape.jar [commands]
-log file: /var/log/ape.log
+log file: anarchyape.log
 (old way)
 java -cp .:log4j-1.4.12.jar ape/Main
@@ -58,8 +70,11 @@ Install pdsh:
 yum install pdsh
 apt-get install pdsh
+cp ape /usr/local/bin/ape
+chmod go-rwx /usr/local/bin/ape
+
 java -jar ape.jar -R node1,node2,node3 -S 100 5
-creates a script to run on the remote hosts:
+# creates a script to run on the remote hosts:
 pdsh -Rssh -w node1,node2,node3 '/usr/local/bin/ape -L -S 100 5'
 ```
@@ -91,82 +106,131 @@ Available Commands
 ------------------
 Here are some common failures in Hadoop environments:
 ```
-• Data node is killed
-• Application Master (AM) is killed
-• Application Master is suspended
-• Node Manager (NM) is killed
-• Node Manager is suspended
-• Data node is suspended
-• Tasktracker is suspended
-• Node panics and restarts
-• Node hangs and does not restart
-• Random thread within data node is killed
-• Random thread within data node is suspended
-• Random thread within tasktracker is killed
-• Random thread within tasktracker is suspended
-• Network becomes slow
-• Network is dropping significant numbers of packets
-• Network disconnect (simulate cable pull)
-• One disk gets VERY slow
-• CPU hog consumes x% of CPU cycles
-• Mem hog consumes x% of memory
-• Corrupt ext3 data block on disk
-• Corrupt ext3 metadata block on disk
+# Data node is killed
+# Application Master (AM) is killed
+# Application Master is suspended
+# Node Manager (NM) is killed
+# Node Manager is suspended
+# Data node is suspended
+# Tasktracker is suspended
+# Node panics and restarts
+# Node hangs and does not restart
+# Random thread within data node is killed
+# Random thread within data node is suspended
+# Random thread within tasktracker is killed
+# Random thread within tasktracker is suspended
+# Network becomes slow
+tc qdisc add dev eth0 root netem delay 100.0ms && sleep 30.0 && tc qdisc del dev eth0 root netem
+# Network is dropping significant numbers of packets
+# Network disconnect (simulate cable pull)
+iptables -A INPUT -p tcp --dport 8080 -j DROP # block port
+iptables -D INPUT -p tcp --dport 8080 -j DROP # unblock port
+# One disk gets VERY slow
+# CPU hog consumes x% of CPU cycles, for example by running stress-ng command remotely with :
+stress-ng -c 1 --verify -t 1m -v
+# Mem hog consumes x% of memory, for example by running stress-ng command remotely with :
+stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t 1m -v
+# Corrupt blocks of a given file on rw-mounted disk
 ```
 Command line options:
 ```
 usage: ape [options] ...
 options:
- -c,--corrupt-file          Corrupt the file given
-                            the address as the first
-                            argument, size as the 2nd
-                            arg, and offset as the
-                            3rd argument
- -C,--corrupt-block         Corrupt a random HDFS
-                            block file with a size in
-                            bytes as the 2nd arg and
-                            offset in bytes as the
-                            3rd argument
- -d,--network-disconnect