Getting Started

okram edited this page Aug 15, 2012 · 49 revisions

Faunus requires access to a Hadoop cluster. If one is readily available, Faunus is easy to get up and running. If not, the provided Whirr recipe can be used to spawn a Hadoop cluster on Amazon EC2 (see the following tutorial), or a local instance of Hadoop can be run in pseudo-cluster mode (see the following tutorial). This section discusses the pseudo-cluster approach for users just getting started with Hadoop (and Faunus). More experienced Hadoop users can easily adapt the examples below to a non-local Hadoop cluster (e.g. by changing the Hadoop configuration, $HADOOP_CONF, to point to the accessible cluster).

Installing Hadoop

Faunus has been written and tested with Hadoop 1.0. To use Faunus locally on a single machine, a Hadoop pseudo-cluster can be used. Instructions for setting up a pseudo-cluster are provided in the Hadoop documentation. Once the pseudo-cluster has been set up, it can be started at any time using $HADOOP_HOME/bin/start-all.sh. Once running, Hadoop can be issued jobs (e.g. Faunus jobs).

~$ start-all.sh
starting namenode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-namenode-markolaptop.local.out
localhost: starting datanode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-datanode-markolaptop.local.out
localhost: starting secondarynamenode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-secondarynamenode-markolaptop.local.out
starting jobtracker, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-jobtracker-markolaptop.local.out
localhost: starting tasktracker, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-tasktracker-markolaptop.local.out

To ensure that the installation is running properly, do an ls on Hadoop’s distributed file system, HDFS.

faunus$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - marko supergroup          0 2012-07-26 11:55 /tmp
drwxr-xr-x   - marko supergroup          0 2012-07-26 11:55 /user

Installing Faunus

Faunus can be downloaded from the downloads section of this project. Faunus makes use of a DSL that has a look-and-feel similar to Gremlin. The shell script bin/faunus.sh takes two arguments:

  • Faunus script: a graph derivation and/or statistic in Faunus’ DSL
  • Faunus configuration: a java.util.Properties formatted file with various connectivity parameters (optional — defaults to bin/faunus.properties)

The simple example below will calculate the outgoing degree distribution of the father graph extracted from the original input graph g. Note that before running this particular command, the graph g must be put into the cluster.

faunus$ bin/faunus.sh 'g.V.edgeLabelFilter(KEEP,"father").degreeDistribution(OUT)'
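To make the computation concrete, here is a minimal single-machine Python sketch of what an outgoing degree distribution over father edges means. It is not Faunus code and does not run on Hadoop. The toy edge list and the function name `out_degree_distribution` are hypothetical, and including vertices with zero outgoing father edges is an assumption about the output, not something the Faunus documentation confirms.

```python
# Hypothetical toy graph: each edge is (out_vertex, label, in_vertex).
vertices = {"hercules", "jupiter", "saturn", "alcmene", "neptune"}
edges = [
    ("hercules", "father", "jupiter"),
    ("jupiter", "father", "saturn"),
    ("hercules", "mother", "alcmene"),
    ("jupiter", "brother", "neptune"),
]


def out_degree_distribution(vertices, edges, label):
    """Map each out-degree to the number of vertices that have it,
    counting only edges carrying the given label."""
    # Start every vertex at degree 0 so isolated vertices are counted too
    # (an assumption of this sketch).
    degree = {v: 0 for v in vertices}
    for out_v, edge_label, _in_v in edges:
        if edge_label == label:
            degree[out_v] += 1

    distribution = {}
    for d in degree.values():
        distribution[d] = distribution.get(d, 0) + 1
    return distribution


distribution = out_degree_distribution(vertices, edges, "father")
print(sorted(distribution.items()))  # [(0, 3), (1, 2)]
```

In this toy graph, two vertices have one outgoing father edge and three have none. Faunus computes the equivalent result as MapReduce jobs over the graph stored in HDFS.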

Graph of the Gods Examples

Faunus deploys with a toy graph called The Graph of the Gods (represented in Faunus’ GraphSON format). This graph has 12 vertices and 17 edges and denotes people, places, monsters and their various types of relationships to one another. A diagrammatic representation is provided above and the raw GraphSON representation is provided below.

{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}
{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
{"name":"pluto","type":"god","_id":3,"_outE":[{"_label":"pet","_id":23,"_inV":11},{"_label":"lives","_id":21,"_inV":6},{"_label":"brother","_id":17,"_inV":1},{"_label":"brother","_id":18,"_inV":2}],"_inE":[{"_label":"brother","_id":19,"_outV":2},{"_label":"brother","_id":16,"_outV":1}]}
{"name":"sky","type":"location","_id":4,"_inE":[{"_label":"lives","_id":13,"_outV":1}]}
{"name":"sea","type":"location","_id":5,"_inE":[{"_label":"lives","_id":20,"_outV":2}]}
{"name":"tartarus","type":"location","_id":6,"_inE":[{"_label":"lives","_id":21,"_outV":3},{"_label":"lives","_id":22,"_outV":11}]}
{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"mother","_id":25,"_inV":8},{"time":1,"_label":"battled","_id":26,"_inV":9},{"time":2,"_label":"battled","_id":27,"_inV":10},{"time":12,"_label":"battled","_id":28,"_inV":11},{"_label":"father","_id":24,"_inV":1}]}
{"name":"alcmene","type":"human","_id":8,"_inE":[{"_label":"mother","_id":25,"_outV":7}]}
{"name":"nemean","type":"monster","_id":9,"_inE":[{"time":1,"_label":"battled","_id":26,"_outV":7}]}
{"name":"hydra","type":"monster","_id":10,"_inE":[{"time":2,"_label":"battled","_id":27,"_outV":7}]}
{"name":"cerberus","type":"monster","_id":11,"_outE":[{"_label":"lives","_id":22,"_inV":6}],"_inE":[{"_label":"pet","_id":23,"_outV":3},{"time":12,"_label":"battled","_id":28,"_outV":7}]}
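The listing stores one vertex per line as a JSON object, with each edge repeated in the `_outE` list of its tail vertex and the `_inE` list of its head vertex. As a rough illustration of that layout, the Python sketch below parses three lines copied from the listing and counts distinct vertices and edges by `_id`. The function name `summarize` is hypothetical, and the reading of the format is inferred from the listing rather than from a format specification.

```python
import json

# Three lines copied verbatim from the listing above (a subset of the full file).
sample = '''\
{"name":"sea","type":"location","_id":5,"_inE":[{"_label":"lives","_id":20,"_outV":2}]}
{"name":"tartarus","type":"location","_id":6,"_inE":[{"_label":"lives","_id":21,"_outV":3},{"_label":"lives","_id":22,"_outV":11}]}
{"name":"cerberus","type":"monster","_id":11,"_outE":[{"_label":"lives","_id":22,"_inV":6}],"_inE":[{"_label":"pet","_id":23,"_outV":3},{"time":12,"_label":"battled","_id":28,"_outV":7}]}
'''


def summarize(lines):
    """Return (vertex count, edge count) for line-per-vertex JSON input."""
    vertex_ids = set()
    edge_ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        vertex = json.loads(line)
        vertex_ids.add(vertex["_id"])
        # An edge may appear on both of its endpoints, so deduplicate by _id.
        for key in ("_outE", "_inE"):
            for edge in vertex.get(key, []):
                edge_ids.add(edge["_id"])
    return len(vertex_ids), len(edge_ids)


vertex_count, edge_count = summarize(sample.splitlines())
print(vertex_count, edge_count)  # 3 5
```

Edge 22 (cerberus lives in tartarus) appears on both endpoints but is counted once. Run over the complete listing, the same counting should give the 12 vertices and 17 edges stated above.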

To use this sample graph for the following example, it must first be placed into HDFS.

faunus$ hadoop fs -put data/graph-of-the-gods.json graph-of-the-gods.json
faunus$ hadoop fs -ls
Found 1 item
-rw-r--r--   1 marko supergroup       2028 2012-07-26 11:55 /user/marko/graph-of-the-gods.json

With the graph in HDFS, a reference is made to it in bin/faunus.properties. Faunus already has all the properties configured to point to graph-of-the-gods.json, so it is possible to simply run the desired Faunus script. To determine the grandfathers of the entities in the graph, a two-step walk over father edges is enacted.

faunus$ bin/faunus.sh 'g.V.edgeLabelFilter(KEEP,"father").traverse(OUT,"father",OUT,"father","grandfather",DROP)'
12/07/19 15:32:27 INFO faunus.FaunusGraph:         ,
12/07/19 15:32:27 INFO faunus.FaunusGraph:     ,   |\ ,__
12/07/19 15:32:27 INFO faunus.FaunusGraph:     |\   \/   `\
12/07/19 15:32:27 INFO faunus.FaunusGraph:     \ `-.:.     `\
12/07/19 15:32:27 INFO faunus.FaunusGraph:      `-.__ `\/\/\|
12/07/19 15:32:27 INFO faunus.FaunusGraph:         / `'/ () \
12/07/19 15:32:27 INFO faunus.FaunusGraph:       .'   /\     )  Faunus: A Library of Hadoop-Based Graph Tools
12/07/19 15:32:27 INFO faunus.FaunusGraph:    .-'  .'| \  \__
12/07/19 15:32:27 INFO faunus.FaunusGraph:  .'  __(  \  '`(()
12/07/19 15:32:27 INFO faunus.FaunusGraph: /_.'`  `.  |    )(
12/07/19 15:32:27 INFO faunus.FaunusGraph:          \ |
12/07/19 15:32:27 INFO faunus.FaunusGraph:           |/
12/07/19 15:32:27 INFO faunus.FaunusGraph: Generating job chain: g.V.edgeLabelFilter(KEEP,"father").traverse(OUT,"father",OUT,"father","grandfather",DROP)
12/07/19 15:32:27 INFO faunus.FaunusGraph: Compiled to 1 MapReduce job(s)
12/07/19 15:32:27 INFO faunus.FaunusGraph: Executing job 1 out of 1: MapReduceSequence[com.thinkaurelius.faunus.mapreduce.steps.EdgeLabelFilter.Map, com.thinkaurelius.faunus.mapreduce.steps.Traverse.Map, com.thinkaurelius.faunus.mapreduce.steps.Traverse.Reduce]
12/07/19 15:32:28 INFO mapred.JobClient: Running job: job_201207161225_0210
12/07/19 15:32:29 INFO mapred.JobClient:  map 0% reduce 0%
12/07/19 15:32:47 INFO mapred.JobClient:  map 100% reduce 0%
12/07/19 15:32:59 INFO mapred.JobClient:  map 100% reduce 100%
12/07/19 15:33:04 INFO mapred.JobClient: Job complete: job_201207161225_0210
...

Let's examine the command more closely.

g.V.edgeLabelFilter(KEEP,"father").traverse(OUT,"father",OUT,"father","grandfather",DROP)
  • g: the graph pointed to by bin/faunus.properties
  • V: for all the vertices in the graph
  • edgeLabelFilter: keep only those edges incident to each vertex that have a label of father
  • traverse: take an outgoing father-edge twice and then link to that vertex with a grandfather edge (and drop the father edges as they are no longer necessary)
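
The steps above can be mimicked on a single machine with a short Python sketch. This is only an illustration of the logic, not how Faunus implements it (Faunus compiles the expression to MapReduce jobs). The in-memory edge list and the function name `derive_grandfathers` are hypothetical.

```python
# Hypothetical in-memory stand-in for the graph: (out_vertex, label, in_vertex).
edges = [
    ("hercules", "father", "jupiter"),
    ("jupiter", "father", "saturn"),
    ("hercules", "mother", "alcmene"),
    ("jupiter", "brother", "neptune"),
]


def derive_grandfathers(edges):
    # Mimics edgeLabelFilter(KEEP, "father"): keep only the father edges.
    father = [(out_v, in_v) for out_v, label, in_v in edges if label == "father"]

    # Mimics traverse(OUT, "father", OUT, "father", "grandfather", ...):
    # join father edges head-to-tail to walk two steps out, then link the
    # start vertex to the end vertex with a new grandfather edge.
    derived = []
    for child, parent in father:
        for middle, grandparent in father:
            if parent == middle:
                derived.append((child, "grandfather", grandparent))

    # Mimics DROP: the father edges are discarded and only the derived
    # grandfather edges are returned.
    return derived


grandfather_edges = derive_grandfathers(edges)
print(grandfather_edges)  # [('hercules', 'grandfather', 'saturn')]
```

The nested loop is quadratic in the number of father edges, which is fine for a toy graph but impractical at scale.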

When the Faunus job completes (a single Faunus job may compile to many MapReduce jobs), the resultant derivation is available in output.txt (the file name is set in bin/faunus.properties). This file can be pulled onto local disk (from HDFS) and viewed. Hercules’ grandfather is Saturn (that is the only grandfather derivation available in The Graph of the Gods).

faunus$ hadoop fs -getmerge output.txt target/output.txt
faunus$ more target/output.txt 
{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"grandfather","_id":-1,"_outV":7}]}
{"name":"jupiter","type":"god","_id":1}
{"name":"neptune","type":"god","_id":2}
{"name":"pluto","type":"god","_id":3}
{"name":"sky","type":"location","_id":4}
{"name":"sea","type":"location","_id":5}
{"name":"tartarus","type":"location","_id":6}
{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"grandfather","_id":-1,"_inV":0}]}
{"name":"alcmene","type":"human","_id":8}
{"name":"nemean","type":"monster","_id":9}
{"name":"hydra","type":"monster","_id":10}
{"name":"cerberus","type":"monster","_id":11}