Apache Accumulo Wikipedia Search Example

This project is a fork of Apache/accumulo-wikisearch, intended to be simpler to set up and use. It contains a sample application for ingesting and querying Wikipedia data.
 
Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running.
2. One or more Wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS
   directory (a quick sanity check for this follows the list). Download the files whose link name
   ends in pages-articles.xml.bz2.
3. Though not strictly required, ingest will run faster if the files are decompressed first:

   $ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml

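As a sanity check that the dump file landed in HDFS (assuming /wikipedia as the target directory,
matching the example above):

   $ hadoop fs -ls /wikipedia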

INSTRUCTIONS
------------

    Configuration and Build
    -----------------------
    1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and change contents to specify Accumulo information
       (For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
    2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and change contents
       as done in step 1.
    3. From the wikisearch directory, run mvn package (the shell sketch below summarizes these steps)
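
    A minimal sketch of the configuration-and-build steps in shell form (fill in the copied files
    with your own Accumulo instance name, ZooKeeper hosts, and credentials):

        $ cp ingest/conf/wikipedia.xml.example ingest/conf/wikipedia.xml
        $ cp webapp/src/main/resources/app.properties.example webapp/src/main/resources/app.properties
        $ $EDITOR ingest/conf/wikipedia.xml webapp/src/main/resources/app.properties
        $ mvn package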
    
    Ingest
    ------
    1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it.
    2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext.
    3. Run bin/ingest.sh with one argument: the name of the directory in HDFS where the Wikipedia XML
       files reside. This starts a MapReduce job that ingests the data into Accumulo; a sketch of
       these steps follows.
       (For parallel ingest, instead run ingest/bin/ingest_parallel.sh)
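
    For example (the host name, the /wikipedia path, and the unpacked directory name below are
    placeholders; adjust for your own cluster):

        $ scp ingest/target/wikisearch-ingest-*.tar.gz user@cluster-node:
        $ ssh user@cluster-node
        $ tar xzf wikisearch-ingest-*.tar.gz
        $ cd wikisearch-ingest-*
        $ cp lib/wikisearch-ingest-*.jar lib/protobuf-java-*.jar $ACCUMULO_HOME/lib/ext
        $ bin/ingest.sh /wikipedia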
   
    Query
    -----
    1. Copy the following jars from the query/target/dependency directory to $ACCUMULO_HOME/lib/ext:
    
        commons-jexl-*.jar
        guava-*.jar
        kryo-*.jar
        minlog-*.jar
        
    2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
    3. Use the Accumulo shell to grant the user scan authorizations for the wikis that you loaded, for example:

            setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki

    4. cd into webapp and run mvn jetty:run (see the sketch after this list).
    5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/
       You can issue queries using this user interface or via the REST URL: <host>/accumulo-wikisearch/rest/query
    6. Press Ctrl-C to stop the jetty container.
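
    A sketch of the query-side setup (steps 1, 2, and 4 above), assuming the build from the
    previous section and a bash shell:

        $ cp query/target/dependency/{commons-jexl,guava,kryo,minlog}-*.jar $ACCUMULO_HOME/lib/ext
        $ cp query/target/wikisearch-query-*.jar $ACCUMULO_HOME/lib/ext
        $ (cd webapp && mvn jetty:run)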
