Apache Accumulo Wikipedia Search Example
----------------------------------------
This project contains a sample application for ingesting and querying Wikipedia data.
Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more Wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory.
   Grab the files whose link name ends in pages-articles.xml.bz2
3. Though not strictly required, the ingest will go more quickly if the files are decompressed:
$ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
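
If the target HDFS directory does not yet exist, create it first. The /wikipedia path here simply
matches the example command above; substitute the directory you intend to pass to the ingester:

   $ hadoop fs -mkdir /wikipedia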
INSTRUCTIONS
------------
Configuration and Build
-----------------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit its contents to point at your
   Accumulo instance (a sketch of this file follows this list)
   (For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and edit
   its contents as in step 1.
3. From the wikisearch directory, run mvn package
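
For reference, a minimal wikipedia.xml might look like the sketch below. The property names mirror
those in the shipped wikipedia.xml.example (defer to that file if they differ); the values are
placeholders for your own instance name, ZooKeeper hosts, credentials, and table:

   <configuration>
     <property>
       <name>wikipedia.accumulo.instance_name</name>
       <value>myinstance</value>
     </property>
     <property>
       <name>wikipedia.accumulo.zookeepers</name>
       <value>zkhost1:2181,zkhost2:2181</value>
     </property>
     <property>
       <name>wikipedia.accumulo.user</name>
       <value>root</value>
     </property>
     <property>
       <name>wikipedia.accumulo.password</name>
       <value>secret</value>
     </property>
     <property>
       <name>wikipedia.accumulo.table</name>
       <value>wikipedia</value>
     </property>
   </configuration>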
Ingest
------
1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it
2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext
3. Run bin/ingest.sh with one argument: the name of the HDFS directory where the Wikipedia XML
   files reside. This starts a MapReduce job that ingests the data into Accumulo (see the example
   session after this list)
(For parallel ingest, instead run ingest/bin/ingest_parallel.sh)
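
For reference, steps 1-3 might look like the session below; the untarred directory name and the
/wikipedia input path are assumptions carried over from the prerequisites:

   $ tar xzf wikisearch-ingest-*.tar.gz && cd wikisearch-ingest-*/
   $ cp lib/wikisearch-ingest-*.jar lib/protobuf-java-*.jar $ACCUMULO_HOME/lib/ext
   $ bin/ingest.sh /wikipedia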
Query
-----
1. Copy the following jars from the query/target/dependency directory to $ACCUMULO_HOME/lib/ext
   (a consolidated example session follows this list):
commons-jexl-*.jar
guava-*.jar
kryo-*.jar
minlog-*.jar
2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
3. Use the Accumulo shell to grant the user the scan authorizations for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
4. cd into webapp and run mvn jetty:run
5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/
   You can issue queries through this user interface or via the REST URL: <host>/accumulo-wikisearch/rest/query
6. Press Ctrl-C to stop the Jetty container
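
For reference, steps 1-4 might look like the session below, run from the wikisearch directory. The
shell's -e option runs a single command non-interactively (you can equally type setauths at the
shell prompt), and the authorization list should match the dumps you actually loaded:

   $ cp query/target/dependency/commons-jexl-*.jar query/target/dependency/guava-*.jar \
        query/target/dependency/kryo-*.jar query/target/dependency/minlog-*.jar \
        $ACCUMULO_HOME/lib/ext
   $ cp query/target/wikisearch-query-*.jar $ACCUMULO_HOME/lib/ext
   $ accumulo shell -u root -e "setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki"
   $ cd webapp && mvn jetty:run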