accumulo-starter

A sample application that illustrates using Apache Accumulo to ingest and query the Enron Email Dataset.

Notes

This sample app was tested with the following:

Accumulo 1.4.2
Hadoop 0.20.2
ZooKeeper 3.3.5
Java 1.7.0_15
Maven 3.0.4

Prerequisites

Accumulo, Hadoop, and ZooKeeper must be installed and running.
To compile the query module, mango-core (currently 1.0.3-SNAPSHOT) must be available in your local Maven repository:

$ git clone git@github.com:calrissian/mango.git
$ cd mango/mango-core && mvn clean install

Download Enron Data and Load into HDFS

Download the Enron dataset and untar it.

Place into HDFS, e.g.:

$ hadoop fs -put enron_mail_20110402 /enron
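
The two steps above can be sketched as a single shell function. This is a minimal sketch under stated assumptions: the function name load_enron is hypothetical, the downloaded archive is assumed to be gzip-compressed and to unpack into enron_mail_20110402, and /enron is simply the target used in the example above.

```shell
# load_enron TARBALL [HDFS_DIR]: untar the archive locally, copy it into HDFS,
# then list the target so the upload can be checked by eye.
# Assumptions: gzip-compressed archive that unpacks to ./enron_mail_20110402.
load_enron() {
  tarball="$1"
  hdfs_dir="${2:-/enron}"

  # Fail early if the archive is missing, before touching HDFS.
  [ -f "$tarball" ] || { echo "archive not found: $tarball" >&2; return 1; }

  tar xzf "$tarball" || return 1
  hadoop fs -put enron_mail_20110402 "$hdfs_dir" || return 1
  hadoop fs -ls "$hdfs_dir"
}
```

Call it with the path to the downloaded archive as the first argument. The listing at the end should show the maildir directory if the copy succeeded.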

Compile Starter Project

Clone and compile:

$ git clone git@github.com:tequalsme/accumulo-starter.git
$ cd accumulo-starter
$ mvn clean package

Ingest the Data

Untar the compiled ingest assembly:

$ cd ingest/target/
$ tar xf accumulo-starter-ingest-*-dist.tar.gz
$ cd accumulo-starter-ingest-*/

Edit conf/ingest.xml to specify your Accumulo connection parameters.

Execute the bin/ingest.sh script, passing the HDFS path to load into Accumulo. When starting out, choose a small directory for testing, for example /enron/maildir/slinger-r:

$ ./bin/ingest.sh <path_to_ingest>
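
A mistyped path is easy to miss, so it can help to confirm that the path exists in HDFS before starting the ingest. The following is a minimal sketch, not part of the project: the function name ingest_if_present is hypothetical, and it assumes you are in the untarred ingest directory so that ./bin/ingest.sh resolves.

```shell
# ingest_if_present HDFS_PATH: run the ingest only if HDFS_PATH exists.
# "hadoop fs -test -e" exits with status 0 when the path exists.
ingest_if_present() {
  path="$1"
  if hadoop fs -test -e "$path"; then
    ./bin/ingest.sh "$path"
  else
    echo "HDFS path not found: $path" >&2
    return 1
  fi
}
```

For a first test run, pass the small directory mentioned above:

$ ingest_if_present /enron/maildir/slinger-r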

Query the Data

Create a profile in your local Maven settings.xml specifying your Accumulo connection parameters:

  <profiles>
    <profile>
      <id>test</id>
      <properties>
        <accumulo.instance>...</accumulo.instance>
        <accumulo.zookeepers>...</accumulo.zookeepers>
        <accumulo.username>...</accumulo.username>
        <accumulo.password>...</accumulo.password>
      </properties>
    </profile>
  </profiles>

Then launch the query webapp using the maven-jetty-plugin and your newly created Maven profile:

$ cd query-webapp/
$ mvn jetty:run -Ptest

This will start a Jetty webapp running on port 8080. Press Ctrl-C at any time to stop the web server.

Open the UI at http://localhost:8080/ and issue a query based on the data you have ingested.

Alternatively, you can issue queries via the REST URL: http://localhost:8080/accumulo-starter/query/query

For example (using curl):

$ curl "http://localhost:8080/accumulo-starter/query/query?term=enron&limit=100"
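
For repeated queries, the curl call above can be wrapped in small shell functions. This is a sketch under assumptions: the names query_url, run_query, and STARTER_QUERY_URL are hypothetical, and the default limit of 100 is borrowed from the example above rather than taken from the service. Only the endpoint and the term and limit parameters come from this README.

```shell
# Endpoint from the README; STARTER_QUERY_URL is a hypothetical override variable.
STARTER_QUERY_URL="${STARTER_QUERY_URL:-http://localhost:8080/accumulo-starter/query/query}"

# query_url TERM [LIMIT]: print the request URL.
# TERM is inserted as-is, so use this only for terms that need no URL-encoding.
query_url() {
  printf '%s?term=%s&limit=%s\n' "$STARTER_QUERY_URL" "$1" "${2:-100}"
}

# run_query TERM [LIMIT]: send the request with curl.
# -G appends the --data-urlencode values to the URL as a query string,
# so spaces or punctuation in TERM are encoded by curl.
# -f makes curl exit non-zero on HTTP errors; -sS hides progress but keeps errors.
run_query() {
  curl -fsS -G \
    --data-urlencode "term=$1" \
    --data-urlencode "limit=${2:-100}" \
    "$STARTER_QUERY_URL"
}
```

With the webapp running, a query might look like this:

$ run_query enron 10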
