This repository has been archived by the owner on Apr 9, 2020. It is now read-only.
accumulo-starter

A sample application that illustrates using Apache Accumulo to ingest and query the Enron Email Dataset.

Notes

This sample app was tested with the following:

  • Accumulo 1.4.2
  • Hadoop 0.20.2
  • ZooKeeper 3.3.5
  • Java 1.7.0_15
  • Maven 3.0.4

Prerequisites

  1. Accumulo, Hadoop, and ZooKeeper must be installed and running.
  2. To compile the query module, mango-core (currently 1.0.3-SNAPSHOT) must be available in your local Maven repository:
$ git clone git@github.com:calrissian/mango.git
$ cd mango/mango-core && mvn clean install

Download Enron Data and Load into HDFS

Download the Enron dataset and untar it.

Place into HDFS, e.g.:

$ hadoop fs -put enron_mail_20110402 /enron
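
The upload step can be wrapped in a small helper that checks the local directory first and lists the destination afterwards. This is only a sketch: the function name `stage_enron` and the `HADOOP` override variable are hypothetical and not part of this project; the only commands it runs are `hadoop fs -put` and `hadoop fs -ls`.

```shell
#!/bin/sh
# Hypothetical helper (not part of accumulo-starter).
# HADOOP defaults to the `hadoop` command on PATH; override it to use another binary.
HADOOP="${HADOOP:-hadoop}"

# stage_enron <local_dir> <hdfs_dest>
# Copies an untarred dataset directory into HDFS, then lists the destination.
stage_enron() {
    src="$1"    # local untarred directory, e.g. enron_mail_20110402
    dest="$2"   # HDFS destination, e.g. /enron
    if [ ! -d "$src" ]; then
        echo "not a directory: $src" >&2
        return 1
    fi
    "$HADOOP" fs -put "$src" "$dest" && "$HADOOP" fs -ls "$dest"
}

# Example:
#   stage_enron enron_mail_20110402 /enron
```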

Compile Starter Project

Clone and compile:

$ git clone git@github.com:tequalsme/accumulo-starter.git
$ cd accumulo-starter
$ mvn clean package

Ingest the Data

Untar the compiled ingest assembly:

$ cd ingest/target/
$ tar xf accumulo-starter-ingest-*-dist.tar.gz 
$ cd accumulo-starter-ingest-*

Edit conf/ingest.xml to specify your Accumulo connection parameters.

Execute the bin/ingest.sh script, specifying the HDFS path to be loaded into Accumulo. (When starting out, choose a small directory for testing purposes, for example "/enron/maildir/slinger-r".)

$ ./bin/ingest.sh <path_to_ingest>
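
Before a first test run it can help to confirm that the HDFS path exists. The sketch below assumes it is run from the untarred ingest directory; the function name `ingest_path` and the `HADOOP` and `INGEST` override variables are hypothetical. It relies on `hadoop fs -test -d`, which returns a non-zero status when the path is not a directory.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of accumulo-starter).
HADOOP="${HADOOP:-hadoop}"
INGEST="${INGEST:-./bin/ingest.sh}"

# ingest_path <hdfs_path>
# Runs the ingest script only if the HDFS directory exists.
ingest_path() {
    hdfs_path="$1"
    if ! "$HADOOP" fs -test -d "$hdfs_path"; then
        echo "HDFS directory not found: $hdfs_path" >&2
        return 1
    fi
    "$INGEST" "$hdfs_path"
}

# Example, using the small test directory suggested above:
#   ingest_path /enron/maildir/slinger-r
```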

Query the Data

Create a profile in your local Maven settings.xml specifying your Accumulo connection parameters:

  <profiles>
    <profile>
      <id>test</id>
      <properties>
        <accumulo.instance>...</accumulo.instance>
        <accumulo.zookeepers>...</accumulo.zookeepers>
        <accumulo.username>...</accumulo.username>
        <accumulo.password>...</accumulo.password>
      </properties>
    </profile>
  </profiles>

Then launch the query webapp using the maven-jetty-plugin and your newly created Maven profile:

$ cd query-webapp/
$ mvn jetty:run -Ptest

This will start a Jetty webapp running on port 8080. Press Ctrl-C at any time to stop the web server.

Open the UI at http://localhost:8080/ and issue a query based on the data you have ingested.

Alternatively, you can issue queries via the REST URL: http://localhost:8080/accumulo-starter/query/query

For example (using curl):

$ curl "http://localhost:8080/accumulo-starter/query/query?term=enron&limit=100"
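
Search terms containing spaces or other special characters need URL encoding. One way to handle that is to let curl build the query string with `-G` and `--data-urlencode`. This is a sketch only: the function name `query_enron` and the `CURL` and `QUERY_URL` override variables are hypothetical, and the default limit of 100 simply mirrors the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of accumulo-starter).
CURL="${CURL:-curl}"
QUERY_URL="${QUERY_URL:-http://localhost:8080/accumulo-starter/query/query}"

# query_enron <term> [limit]
# Sends a GET request with URL-encoded `term` and `limit` parameters.
query_enron() {
    term="$1"
    limit="${2:-100}"   # assumed default, taken from the curl example
    "$CURL" -sS -G "$QUERY_URL" \
        --data-urlencode "term=$term" \
        --data-urlencode "limit=$limit"
}

# Example:
#   query_enron "enron" 100
```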
