Apache Accumulo Wikipedia Search Example
----------------------------------------
This project contains a sample application for ingesting and querying Wikipedia data.
Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more Wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory.
   Grab the files whose link name ends in pages-articles.xml.bz2
3. Though not strictly required, the ingest will go more quickly if the files are decompressed:
$ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
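
If the target HDFS directory does not yet exist, create it first. The /wikipedia path here simply
matches the example command above; substitute the directory you intend to pass to the ingester:

   $ hadoop fs -mkdir /wikipedia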
INSTRUCTIONS
------------
Configuration and Build
-----------------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit its contents to point at your
   Accumulo instance (a sketch of this file follows this list)
   (For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and edit
   its contents as in step 1.
3. From the wikisearch directory, run mvn package
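
For reference, a minimal wikipedia.xml might look like the sketch below. The property names mirror
those in the shipped wikipedia.xml.example (defer to that file if they differ); the values are
placeholders for your own instance name, ZooKeeper hosts, credentials, and table:

   <configuration>
     <property>
       <name>wikipedia.accumulo.instance_name</name>
       <value>myinstance</value>
     </property>
     <property>
       <name>wikipedia.accumulo.zookeepers</name>
       <value>zkhost1:2181,zkhost2:2181</value>
     </property>
     <property>
       <name>wikipedia.accumulo.user</name>
       <value>root</value>
     </property>
     <property>
       <name>wikipedia.accumulo.password</name>
       <value>secret</value>
     </property>
     <property>
       <name>wikipedia.accumulo.table</name>
       <value>wikipedia</value>
     </property>
   </configuration>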
Ingest
------
1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it
2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext
3. Run bin/ingest.sh with one argument: the name of the HDFS directory where the Wikipedia XML
   files reside. This starts a MapReduce job that ingests the data into Accumulo (see the example
   session after this list)
(For parallel ingest, instead run ingest/bin/ingest_parallel.sh)
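
For reference, steps 1-3 might look like the session below; the untarred directory name and the
/wikipedia input path are assumptions carried over from the prerequisites:

   $ tar xzf wikisearch-ingest-*.tar.gz && cd wikisearch-ingest-*/
   $ cp lib/wikisearch-ingest-*.jar lib/protobuf-java-*.jar $ACCUMULO_HOME/lib/ext
   $ bin/ingest.sh /wikipedia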
Query
-----
1. Copy the following jars from the query/target/dependency directory to $ACCUMULO_HOME/lib/ext
   (a consolidated example session follows this list):
commons-jexl-*.jar
guava-*.jar
kryo-*.jar
minlog-*.jar
2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
3. Use the Accumulo shell to grant the user the scan authorizations for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
4. cd into webapp and run mvn jetty:run
5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/
   You can issue queries through this user interface or via the REST URL: <host>/accumulo-wikisearch/rest/query
6. Press Ctrl-C to stop the Jetty container
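
For reference, steps 1-4 might look like the session below, run from the wikisearch directory. The
shell's -e option runs a single command non-interactively (you can equally type setauths at the
shell prompt), and the authorization list should match the dumps you actually loaded:

   $ cp query/target/dependency/commons-jexl-*.jar query/target/dependency/guava-*.jar \
        query/target/dependency/kryo-*.jar query/target/dependency/minlog-*.jar \
        $ACCUMULO_HOME/lib/ext
   $ cp query/target/wikisearch-query-*.jar $ACCUMULO_HOME/lib/ext
   $ accumulo shell -u root -e "setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki"
   $ cd webapp && mvn jetty:run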