Getting Started
Faunus requires access to a Hadoop cluster. If a Hadoop cluster is readily available, then Faunus is easy to get up and running. If not, the provided Whirr recipe can be used to spawn a Hadoop cluster on Amazon EC2 (see the following tutorial), or a local instance of Hadoop can be run in pseudo-cluster mode (see the following tutorial). This section discusses the pseudo-cluster approach for users just getting started with Hadoop (and Faunus). More experienced Hadoop users can easily adapt the examples below to a non-local Hadoop cluster (e.g. by changing the Hadoop configuration in $HADOOP_CONF_DIR to point to the accessible cluster).
Faunus has been written and tested with Hadoop 1.0.3. To use Faunus locally on a single machine, a Hadoop pseudo-cluster can be used. Instructions to set up a pseudo-cluster are provided in the Hadoop documentation. Once the pseudo-cluster has been set up, it can be started by running the start script $HADOOP_HOME/bin/start-all.sh. Once running, Hadoop can be issued jobs (e.g. Faunus jobs).
~$ start-all.sh
starting namenode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-namenode-markolaptop.local.out
localhost: starting datanode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-datanode-markolaptop.local.out
localhost: starting secondarynamenode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-secondarynamenode-markolaptop.local.out
starting jobtracker, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-jobtracker-markolaptop.local.out
localhost: starting tasktracker, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-tasktracker-markolaptop.local.out
To ensure that the installation is running properly, do an ls on Hadoop's distributed file system, HDFS. This should not return a view of the local filesystem, but instead a view of HDFS with two directories.
faunus$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - marko supergroup 0 2012-07-26 11:55 /tmp
drwxr-xr-x - marko supergroup 0 2012-07-26 11:55 /user
Faunus can be downloaded from the downloads section of this project. Faunus provides a Gremlin implementation that can be used for convenient interactions with HDFS, for interactions with the supported graph sources (e.g. Titan), and for constructing Faunus jobs (a chain of 1 or more MapReduce jobs). The Gremlin REPL can be started with bin/gremlin.sh.
faunus$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(_)-oOOo-----
gremlin>
Faunus is distributed with a toy graph called The Graph of the Gods (represented in Faunus' GraphSON format). This graph has 12 vertices and 17 edges and denotes people, places, monsters and their various types of relationships to one another. A diagrammatic representation is provided above and the raw GraphSON representation is provided below.
{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}
{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
{"name":"pluto","type":"god","_id":3,"_outE":[{"_label":"pet","_id":23,"_inV":11},{"_label":"lives","_id":21,"_inV":6},{"_label":"brother","_id":17,"_inV":1},{"_label":"brother","_id":18,"_inV":2}],"_inE":[{"_label":"brother","_id":19,"_outV":2},{"_label":"brother","_id":16,"_outV":1}]}
{"name":"sky","type":"location","_id":4,"_inE":[{"_label":"lives","_id":13,"_outV":1}]}
{"name":"sea","type":"location","_id":5,"_inE":[{"_label":"lives","_id":20,"_outV":2}]}
{"name":"tartarus","type":"location","_id":6,"_inE":[{"_label":"lives","_id":21,"_outV":3},{"_label":"lives","_id":22,"_outV":11}]}
{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"mother","_id":25,"_inV":8},{"time":1,"_label":"battled","_id":26,"_inV":9},{"time":2,"_label":"battled","_id":27,"_inV":10},{"time":12,"_label":"battled","_id":28,"_inV":11},{"_label":"father","_id":24,"_inV":1}]}
{"name":"alcmene","type":"human","_id":8,"_inE":[{"_label":"mother","_id":25,"_outV":7}]}
{"name":"nemean","type":"monster","_id":9,"_inE":[{"time":1,"_label":"battled","_id":26,"_outV":7}]}
{"name":"hydra","type":"monster","_id":10,"_inE":[{"time":2,"_label":"battled","_id":27,"_outV":7}]}
{"name":"cerberus","type":"monster","_id":11,"_outE":[{"_label":"lives","_id":22,"_inV":6}],"_inE":[{"_label":"pet","_id":23,"_outV":3},{"time":12,"_label":"battled","_id":28,"_outV":7}]}
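Each line of the file describes one vertex: a JSON object with the vertex properties, an _id, and optional _outE and _inE adjacency lists. As a rough illustration of how to read that layout (plain Python, not part of Faunus or its API), the sketch below parses three of the lines above and lists the distinct edges they mention. The helper name edges_of is hypothetical.

```python
import json

# Three vertex lines copied from the listing above (one JSON object per line).
GRAPHSON = """\
{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}
{"name":"sky","type":"location","_id":4,"_inE":[{"_label":"lives","_id":13,"_outV":1}]}
{"name":"alcmene","type":"human","_id":8,"_inE":[{"_label":"mother","_id":25,"_outV":7}]}
"""

def edges_of(vertex):
    """Yield (edge_id, out_vertex_id, label, in_vertex_id) for one vertex record."""
    vid = vertex["_id"]
    for e in vertex.get("_outE", []):   # edges leaving this vertex
        yield (e["_id"], vid, e["_label"], e["_inV"])
    for e in vertex.get("_inE", []):    # edges arriving at this vertex
        yield (e["_id"], e["_outV"], e["_label"], vid)

vertices = [json.loads(line) for line in GRAPHSON.splitlines() if line.strip()]
# An edge can appear on both of its endpoints, so de-duplicate with a set.
edges = sorted({edge for v in vertices for edge in edges_of(v)})

for edge_id, out_v, label, in_v in edges:
    print(f"edge {edge_id}: {out_v} -[{label}]-> {in_v}")
```

In the full file an edge is recorded on both of its endpoints (once under _outE, once under _inE), which is why the sketch collects edges into a set before counting or printing them.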
To make use of this sample graph in the examples that follow, it must first be placed into HDFS. Note that not all graph sources originate from HDFS (e.g. Titan and Rexster/Blueprints graphs). However, file-based sources typically do. To store the GraphSON file in HDFS, either the Gremlin REPL or the standard Hadoop CLI can be used.
Gremlin REPL
gremlin> hdfs.copyFromLocal('data/graph-of-the-gods.json','graph-of-the-gods.json')
==>null
gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
gremlin>
Hadoop CLI
faunus$ hadoop fs -put data/graph-of-the-gods.json graph-of-the-gods.json
faunus$ hadoop fs -ls
Found 1 item
-rw-r--r-- 1 marko supergroup 2028 2012-07-26 11:55 /user/marko/graph-of-the-gods.json
With The Graph of the Gods stored in HDFS, a reference is made to it in bin/faunus.properties. Faunus already has all the properties configured to point to graph-of-the-gods.json, so it is possible to simply run the desired Gremlin traversals without first having to understand Faunus configuration.
gremlin> g = FaunusFactory.open('bin/faunus.properties')
==>faunusgraph[graphsoninputformat]
The next sections present some simple examples to demonstrate how Faunus, Gremlin, and Hadoop all interact with one another.
A frequency distribution is simply a count of the number of times a particular item appears in a set. If the set is defined as all the type property values of the vertices in The Graph of the Gods, then a distribution of those values is the number of times that monster, human, demigod, god, etc. appears. This can be easily computed with the following Gremlin traversal.
gremlin> g.V.type.groupCount
12/09/16 14:04:16 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
12/09/16 14:04:16 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.ValueGroupCountMapReduce.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.ValueGroupCountMapReduce.Reduce]
12/09/16 14:04:16 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
12/09/16 14:04:17 INFO input.FileInputFormat: Total input paths to process : 1
12/09/16 14:04:17 INFO mapred.JobClient: Running job: job_201209160849_0033
12/09/16 14:04:18 INFO mapred.JobClient: map 0% reduce 0%
...
==>demigod 1
==>god 3
==>human 1
==>location 3
==>monster 3
==>titan 1
Let's examine the traversal more closely.
g.V.type.groupCount
- g: the graph pointed to by bin/faunus.properties
- V: for all the vertices in the graph
- type: get the type property value of those vertices
- groupCount: count the number of times each unique type is seen
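The job log above names a Map and a Reduce class for the group count. As a conceptual sketch only (plain Python, not the Faunus implementation), the same computation can be split into a map phase that emits one (type, 1) pair per vertex and a reduce phase that sums the pairs per key. The function names map_phase and reduce_phase are hypothetical.

```python
from collections import defaultdict

# The type value of each of the 12 vertices, in _id order, taken from the GraphSON listing.
TYPES = ["titan", "god", "god", "god", "location", "location", "location",
         "demigod", "human", "monster", "monster", "monster"]

def map_phase(values):
    """Emit a (key, 1) pair for every value seen."""
    for value in values:
        yield value, 1

def reduce_phase(pairs):
    """Sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

group_count = reduce_phase(map_phase(TYPES))
for key in sorted(group_count):
    print(key, group_count[key])
```

Sorted by key, the printed counts match the distribution returned by the traversal: demigod 1, god 3, human 1, location 3, monster 3, titan 1.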
When the Faunus job completes (a single Faunus job may comprise many MapReduce jobs), the results are printed to the terminal and are also available in the output directory (set in bin/faunus.properties).
gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
==>rwxr-xr-x marko supergroup 0 (D) output
gremlin> hdfs.ls('output')
==>rwxr-xr-x marko supergroup 0 (D) job-0
gremlin> hdfs.ls('output/job-0')
==>rw-r--r-- marko supergroup 0 _SUCCESS
==>rwxr-xr-x marko supergroup 0 (D) _logs
==>rw-r--r-- marko supergroup 435 graph-m-00000.bz2
==>rw-r--r-- marko supergroup 80 sideeffect-m-00000.bz2
gremlin> hdfs.head('output/job-0/sideeffect*')
==>demigod 1
==>god 3
==>human 1
==>location 3
==>monster 3
==>titan 1
gremlin>
A derivation is some mutation of the graph whether that mutation is as simple as removing vertices/edges or as complex as inferring new edges from explicit edges in the graph. With The Graph of the Gods, grandfather edges can be derived from father edges. This type of derivation is known as an inference.
From the Gremlin REPL, enter the following traversal.
g.V.as('x').out('father').out('father').linkIn('x','grandfather')
- g: the graph pointed to by bin/faunus.properties
- V: for all the vertices in the graph
- as('x'): name the current elements "x"
- out('father'): traverse out over father edges
- out('father'): traverse out over father edges
- linkIn('x','grandfather'): create incoming grandfather edges from remaining vertices at step "x"
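The steps above amount to following two father edges in a row and linking the starting vertex to wherever the walk ends. A minimal sketch of that inference in plain Python (not Faunus code; the helper derive_two_hop is hypothetical), using the two father edges present in The Graph of the Gods:

```python
# The two father edges in The Graph of the Gods as (child, father) vertex ids:
# jupiter(1) -> saturn(0) and hercules(7) -> jupiter(1).
FATHER = [(1, 0), (7, 1)]

def derive_two_hop(edges):
    """Return (start, end) pairs reachable by following two edges in a row."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    return [(start, end)
            for start, mids in sorted(out.items())
            for mid in mids                  # first hop
            for end in out.get(mid, [])]     # second hop

grandfather = derive_two_hop(FATHER)
print(grandfather)  # [(7, 0)]: hercules -> saturn
```

Only hercules has a father who also has a father, so exactly one grandfather edge is derived.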
The derived graph can be pulled to local disk or analyzed in HDFS. Note that because the same output directory is used, the previous output was deleted. Again, this can all be configured in bin/faunus.properties.
gremlin> hdfs.head('output')
==>{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"grandfather","_id":-1,"_outV":7},{"_label":"father","_id":12,"_outV":1}]}
==>{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
==>{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
==>{"name":"pluto","type":"god","_id":3,"_outE":[{"_label":"pet","_id":23,"_inV":11},{"_label":"lives","_id":21,"_inV":6},{"_label":"brother","_id":17,"_inV":1},{"_label":"brother","_id":18,"_inV":2}],"_inE":[{"_label":"brother","_id":19,"_outV":2},{"_label":"brother","_id":16,"_outV":1}]}
==>{"name":"sky","type":"location","_id":4,"_inE":[{"_label":"lives","_id":13,"_outV":1}]}
==>{"name":"sea","type":"location","_id":5,"_inE":[{"_label":"lives","_id":20,"_outV":2}]}
==>{"name":"tartarus","type":"location","_id":6,"_inE":[{"_label":"lives","_id":21,"_outV":3},{"_label":"lives","_id":22,"_outV":11}]}
==>{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"mother","_id":25,"_inV":8},{"_label":"grandfather","_id":-1,"_inV":0},{"time":1,"_label":"battled","_id":26,"_inV":9},{"time":2,"_label":"battled","_id":27,"_inV":10},{"time":12,"_label":"battled","_id":28,"_inV":11},{"_label":"father","_id":24,"_inV":1}]}
==>{"name":"alcmene","type":"human","_id":8,"_inE":[{"_label":"mother","_id":25,"_outV":7}]}
==>{"name":"nemean","type":"monster","_id":9,"_inE":[{"time":1,"_label":"battled","_id":26,"_outV":7}]}
==>{"name":"hydra","type":"monster","_id":10,"_inE":[{"time":2,"_label":"battled","_id":27,"_outV":7}]}
==>{"name":"cerberus","type":"monster","_id":11,"_outE":[{"_label":"lives","_id":22,"_inV":6}],"_inE":[{"_label":"pet","_id":23,"_outV":3},{"time":12,"_label":"battled","_id":28,"_outV":7}]}
To conclude, the grandfather-derived graph can be processed further using g.getNextGraph(). This method returns a new graph that points to the output of the previous graph g (with inputs/outputs and HDFS directories handled accordingly).
gremlin> g
==>faunusgraph[graphsoninputformat]
gremlin> h = g.getNextGraph()
==>faunusgraph[graphsoninputformat]
gremlin> h.E.has('label','grandfather').keep.count()
12/09/16 14:24:42 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
12/09/16 14:24:42 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.EdgesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.CommitEdgesMap.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Reduce]
...
==>1
The traversal above removes all edges except grandfather edges and then counts the remaining edges. As demonstrated, there is only 1 grandfather edge (the grandfather of Hercules is Saturn).
h.E.has('label','grandfather').keep.count()
- h: the graph pointing to the output of g
- E: for all the edges in the graph
- has('label','grandfather'): filter to grandfather edges
- keep: keep the edges at the current step and delete all others
- count: count the number of elements at the current step
gremlin> hdfs.head('output_/*/graph*')
==>{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"grandfather","_id":-1,"_outV":7}]}
==>{"name":"jupiter","type":"god","_id":1}
==>{"name":"neptune","type":"god","_id":2}
==>{"name":"pluto","type":"god","_id":3}
==>{"name":"sky","type":"location","_id":4}
==>{"name":"sea","type":"location","_id":5}
==>{"name":"tartarus","type":"location","_id":6}
==>{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"grandfather","_id":-1,"_inV":0}]}
==>{"name":"alcmene","type":"human","_id":8}
==>{"name":"nemean","type":"monster","_id":9}
==>{"name":"hydra","type":"monster","_id":10}
==>{"name":"cerberus","type":"monster","_id":11}
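The listing above shows that keep leaves every vertex in place and only drops the edges that were filtered out. A small sketch of that behaviour in plain Python (an illustration, not how Faunus implements it; keep_edges is a hypothetical helper), using three vertices from the derived graph trimmed to a few edges each:

```python
import copy

# Three vertex records from the derived graph, trimmed to a few edges each.
VERTICES = [
    {"name": "saturn", "_id": 0,
     "_inE": [{"_label": "grandfather", "_id": -1, "_outV": 7},
              {"_label": "father", "_id": 12, "_outV": 1}]},
    {"name": "jupiter", "_id": 1,
     "_outE": [{"_label": "father", "_id": 12, "_inV": 0}],
     "_inE": [{"_label": "father", "_id": 24, "_outV": 7}]},
    {"name": "hercules", "_id": 7,
     "_outE": [{"_label": "grandfather", "_id": -1, "_inV": 0},
               {"_label": "father", "_id": 24, "_inV": 1}]},
]

def keep_edges(vertices, label):
    """Return copies of the vertices holding only edges with the given label."""
    kept = []
    for vertex in copy.deepcopy(vertices):
        for side in ("_outE", "_inE"):
            edges = [e for e in vertex.get(side, []) if e["_label"] == label]
            if edges:
                vertex[side] = edges
            else:
                vertex.pop(side, None)  # no matching edges: drop the empty list
        kept.append(vertex)
    return kept

result = keep_edges(VERTICES, "grandfather")
# Count each edge once, at the vertex it leaves from.
edge_count = sum(len(v.get("_outE", [])) for v in result)

for v in result:
    print(v)
print(edge_count)
```

As in the HDFS output, jupiter survives as a vertex with no edges, and the single remaining edge runs from hercules to saturn.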