Distributed Graph Computing with Gremlin

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the graph (or those which currently exist in Faunus’ computational pipeline). This simple idea has interesting ramifications for Gremlin-based distributed graph computing.

  • Global graph mutations: update a Titan cluster in parallel given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in the graph in order to compute some algorithm in a parallel fashion.

The script-step requires a Gremlin script that exists in HDFS and defines the following methods (a minimal skeleton follows the list):

  • setup(String... args)
  • map(FaunusVertex v, String... args)
  • cleanup(String... args)
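
A bare skeleton of such a script might look like the following: setup() runs before any vertices are processed, map() is invoked once for every vertex that reaches the step, and cleanup() runs afterwards. The file name and method bodies below are placeholders.

// Skeleton.groovy -- an illustrative outline of the three required methods

def setup(args) {
    // one-time initialization, e.g. opening a connection to an external store
}

def map(v, args) {
    // per-vertex logic; v is the current FaunusVertex and the returned
    // value is emitted as the step's result
    v.id
}

def cleanup(args) {
    // one-time teardown, e.g. closing whatever setup() opened
}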

Global Graph Mutations

One way to do global graph mutations is to use an InputFormat that reads the graph from a database (e.g. Titan and/or Rexster), mutate the Faunus representation of that graph in HDFS over various Gremlin/Faunus steps, and finally delete the original graph in the database and bulk load the new, mutated graph. The problem with this method is that it requires the graph database to be deleted and re-loaded, which, for production 24×7 systems, is not a reasonable requirement.

Another way is to use the script-step for real-time, parallel bulk updates of the original graph in the graph database itself. A simple example explains the idea. Assume the Graph of the Gods dataset is stored in Titan/Cassandra, with Cassandra sharing its machines with the Hadoop data nodes and task trackers.

Assume the following Gremlin/Groovy script, called FathersNames.groovy, which connects to Titan, looks up each vertex's father, and writes the father's name back onto the vertex as a fathersName property.

// Titan graph instance shared by setup(), map(), and cleanup()
def g

// open a connection to the Titan cluster; args[0] is the storage backend
// (e.g. 'cassandrathrift'), and the hostname is localhost because Cassandra
// runs on the same machines as the Hadoop task trackers
def setup(args) {
    conf = new org.apache.commons.configuration.BaseConfiguration()
    conf.setProperty('storage.backend', args[0])
    conf.setProperty('storage.hostname', 'localhost')
    g = com.thinkaurelius.titan.core.TitanFactory.open(conf)
}

// for each Faunus vertex, look up its counterpart in Titan, copy the father's
// name onto it as a fathersName property, and return a human-readable summary
def map(v, args) {
    u = g.v(v.id)                  // the Titan vertex with the same id
    pipe = u.out('father').name    // the father's name, if one exists
    if (pipe.hasNext()) u.fathersName = pipe.next()
    u.name + "'s father's name is " + u.fathersName
}

// close the connection to Titan once the map task is finished
def cleanup(args) {
    g.shutdown()
}
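
In the script above, the hostname is hard-coded to localhost, which relies on Cassandra running on the same machines as the Hadoop task trackers. Because the script-step hands its extra string arguments through to these methods, a variant of setup() that accepts the host as an optional second argument is a simple sketch (only setup() changes; the argument order is an illustrative choice, not part of Faunus' API).

// variant setup(): take the Cassandra host as an optional second argument
def setup(args) {
    conf = new org.apache.commons.configuration.BaseConfiguration()
    conf.setProperty('storage.backend', args[0])
    conf.setProperty('storage.hostname', args.size() > 1 ? args[1] : 'localhost')
    g = com.thinkaurelius.titan.core.TitanFactory.open(conf)
}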

Place this file into HDFS using Gremlin.

gremlin> hdfs.copyFromLocal('data/FathersNames.groovy', 'FathersNames.groovy')
==>null
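
To confirm that the file landed in HDFS, the hdfs helper in the Faunus Gremlin console can list the working directory (this assumes the hdfs.ls() helper described elsewhere in the Faunus documentation); FathersNames.groovy should appear among the entries.

gremlin> hdfs.ls()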

With this file in HDFS, it is possible to execute the following Gremlin/Faunus traversal. First, the Graph of the Gods in Titan serves as the input to the Faunus job. There is no need to write the graph back out, so NoOpOutputFormat is set. Finally, it is important to realize that each vertex reaching the script-step is passed to the FathersNames.map() method, and the String returned by map() is emitted as the step's result (the ==> lines below).
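
For reference, a bin/titan-cassandra-input.properties file of this kind would contain entries roughly along the following lines; the exact property keys and class names differ between Faunus versions, so treat this as an approximation rather than a definitive listing.

# approximate contents of bin/titan-cassandra-input.properties (illustrative)
faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.cassandra.TitanCassandraInputFormat
faunus.graph.input.titan.storage.backend=cassandra
faunus.graph.input.titan.storage.hostname=localhost
faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
faunus.output.location=output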

gremlin> g = FaunusFactory.open('bin/titan-cassandra-input.properties')                
==>faunusgraph[titancassandrainputformat->graphsonoutputformat]
gremlin> g.setGraphOutputFormat(NoOpOutputFormat.class)                                
==>null
gremlin> g.V.has('type','demigod','god').script('FathersNames.groovy','cassandrathrift')
13/03/06 18:21:43 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/03/06 18:21:43 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.util.ScriptMap.Map]
13/03/06 18:21:43 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
13/03/06 18:21:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/03/06 18:21:44 INFO mapred.JobClient: Running job: job_201303061253_0077
13/03/06 18:21:45 INFO mapred.JobClient:  map 0% reduce 0%
13/03/06 18:21:53 INFO mapred.JobClient:  map 50% reduce 0%
...
==>hercules's father's name is jupiter
==>pluto's father's name is null
==>jupiter's father's name is saturn
==>neptune's father's name is null

Looking at the original graph in Titan, those vertices that have fathers now carry a new fathersName property.

gremlin> g.V.transform("{it.name + ' ' + it.fathersName}")             
...
13/03/06 18:25:40 INFO mapred.JobClient: Running job: job_201303061253_0078
13/03/06 18:25:41 INFO mapred.JobClient:  map 0% reduce 0%
...
==>tartarus null
==>alcmene null
==>sea null
==>hydra null
==>hercules jupiter
==>cerberus null
==>pluto null
==>saturn null
==>sky null
==>jupiter saturn
==>neptune null
==>nemean null
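
The same check can be made directly against Titan from a plain Gremlin console, bypassing Faunus entirely. The following is a rough sketch that assumes a locally reachable Cassandra backend.

conf = new org.apache.commons.configuration.BaseConfiguration()
conf.setProperty('storage.backend', 'cassandrathrift')
conf.setProperty('storage.hostname', 'localhost')
t = com.thinkaurelius.titan.core.TitanFactory.open(conf)
t.V.has('name','hercules').fathersName   // should return jupiter
t.shutdown()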