This repository has been archived by the owner on Apr 9, 2020. It is now read-only.
Apache Accumulo Wikipedia Search Example
This project contains a sample application for ingesting and querying Wikipedia data.
Ingest
------
Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more Wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory.
Download the files whose link name ends in pages-articles.xml.bz2.
3. Though not strictly required, the ingest will go more quickly if the files are decompressed:
$ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml
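If you have dumps for several wikis, the decompress-and-upload command above can be wrapped in a loop. The following is only a sketch: the function names hdfs_target and upload_dumps are made up for this example, and unlike the command above it keeps the full dump file name (including the date) in HDFS rather than shortening it.

```shell
#!/bin/sh
# Sketch only: hdfs_target and upload_dumps are hypothetical helper names,
# not part of the wikisearch distribution.

# hdfs_target FILE [HDFS_DIR]
# Prints the HDFS path for the decompressed copy of FILE:
# the file name with its .bz2 suffix removed, under HDFS_DIR
# (default /wikipedia, the directory used in the command above).
hdfs_target() {
    name=$(basename "$1" .bz2)
    printf '%s/%s\n' "${2:-/wikipedia}" "$name"
}

# upload_dumps HDFS_DIR FILE...
# Decompresses each dump on the fly and streams it into HDFS,
# so no decompressed copy is written to the local disk.
upload_dumps() {
    dir=$1
    shift
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "skipping missing file: $f" >&2
            continue
        fi
        bunzip2 < "$f" | hadoop fs -put - "$(hdfs_target "$f" "$dir")"
    done
}

# Example use, run from the directory holding the downloaded dumps:
#   upload_dumps /wikipedia *-pages-articles.xml.bz2
```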
INSTRUCTIONS
------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit it to specify the information for your Accumulo instance.
2. Copy the ingest/lib/wikisearch-*.jar and ingest/lib/protobuf*.jar to $ACCUMULO_HOME/lib/ext
3. Run ingest/bin/ingest.sh with one argument (the name of the directory in HDFS where the Wikipedia XML
files reside). This will kick off a MapReduce job to ingest the data into Accumulo.
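The three ingest steps can be scripted roughly as follows. This is a sketch under assumptions: the function names are invented for this example, SRC_ROOT stands for the directory that contains the ingest/ directory, and wikipedia.xml still has to be edited by hand after it is created.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical and are not shipped
# with the wikisearch example.

# install_ingest_conf SRC_ROOT
# Step 1: create wikipedia.xml from the example file, but never
# overwrite a copy that has already been edited.
install_ingest_conf() {
    conf_dir=$1/ingest/conf
    if [ -e "$conf_dir/wikipedia.xml" ]; then
        echo "keeping existing $conf_dir/wikipedia.xml" >&2
    else
        cp "$conf_dir/wikipedia.xml.example" "$conf_dir/wikipedia.xml"
    fi
}

# install_ingest_jars SRC_ROOT EXT_DIR
# Step 2: copy the wikisearch and protobuf jars into EXT_DIR
# (normally $ACCUMULO_HOME/lib/ext).
install_ingest_jars() {
    cp "$1"/ingest/lib/wikisearch-*.jar "$1"/ingest/lib/protobuf*.jar "$2"
}

# run_ingest SRC_ROOT HDFS_DIR
# Step 3: start the MapReduce ingest job for the dumps in HDFS_DIR.
run_ingest() {
    "$1/ingest/bin/ingest.sh" "$2"
}

# Example use from the wikisearch directory:
#   install_ingest_conf .        # then edit ingest/conf/wikipedia.xml
#   install_ingest_jars . "$ACCUMULO_HOME/lib/ext"
#   run_ingest . /wikipedia
```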
Query
-----
Prerequisites
-------------
1. The query software was tested using JBoss AS 6. Use this version unless you are prepared to adapt the installation yourself.
NOTE: We ran into a bug (https://issues.jboss.org/browse/RESTEASY-531) that did not allow an EJB 3.1 WAR file. The
workaround is to separate the RESTEasy servlet from the EJBs by creating an EJB jar and a WAR file.
INSTRUCTIONS
------------
1. Copy the query/src/main/resources/META-INF/ejb-jar.xml.example file to
query/src/main/resources/META-INF/ejb-jar.xml. Modify the file to contain the same
information that you put into the wikipedia.xml file from the Ingest step above.
2. Re-build the query distribution by running 'mvn package assembly:single' in the top-level directory.
3. Untar the resulting file in the $JBOSS_HOME/server/default directory.
$ cd $JBOSS_HOME/server/default
$ tar -xzf $ACCUMULO_HOME/src/examples/wikisearch/query/target/wikisearch-query*.tar.gz
This will place the dependent jars in the lib directory and the EJB jar into the deploy directory.
4. Next, copy the wikisearch*.war file in the query-war/target directory to $JBOSS_HOME/server/default/deploy.
5. Start JBoss ($JBOSS_HOME/bin/run.sh)
6. Use the Accumulo shell to grant the user authorizations for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki
7. Copy the following jars to the $ACCUMULO_HOME/lib/ext directory from the $JBOSS_HOME/server/default/lib directory:
commons-lang*.jar
kryo*.jar
minlog*.jar
commons-jexl*.jar
google-collections*.jar
8. Copy the $JBOSS_HOME/server/default/deploy/wikisearch-query*.jar to $ACCUMULO_HOME/lib/ext.
9. At this point you should be able to open a browser and view the page: http://localhost:8080/accumulo-wikisearch/ui/ui.jsp.
You can issue queries using this user interface or via the following REST URLs: <host>/accumulo-wikisearch/rest/Query/xml,
<host>/accumulo-wikisearch/rest/Query/html, <host>/accumulo-wikisearch/rest/Query/yaml, or <host>/accumulo-wikisearch/rest/Query/json.
The REST service takes two parameters, query and auths. The query parameter is the same string that you would type
into the search box at ui.jsp, and the auths parameter is a comma-separated list of the wikis that you want to search (e.g.
enwiki,frwiki,dewiki). You can also use all.
10. Optional: add the following line to the $ACCUMULO_HOME/conf/log4j.properties file to turn off debug messages in the specialized
iterators, which will dramatically increase performance:
log4j.logger.org.apache.accumulo.examples.wikisearch.iterator=INFO,A1
This change needs to be propagated to all the tablet server nodes, and Accumulo needs to be restarted.
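The REST URLs from step 9 can also be called from the command line. The sketch below assumes that query and auths are passed as URL query-string parameters on a GET request; the steps above name the two parameters but do not state the HTTP method, so adjust this if your deployment differs. The function names wikisearch_url and wikisearch_query are invented for this example.

```shell
#!/bin/sh
# Sketch only. Assumption: the service reads `query` and `auths` from the
# URL query string of a GET request. The helper names are hypothetical.

# wikisearch_url HOST FORMAT
# Prints the REST endpoint for one of the four output formats
# listed in step 9 (xml, html, yaml, json).
wikisearch_url() {
    host=$1
    format=$2
    case $format in
        xml|html|yaml|json) ;;
        *)
            echo "unsupported format: $format" >&2
            return 1
            ;;
    esac
    printf '%s/accumulo-wikisearch/rest/Query/%s\n' "$host" "$format"
}

# wikisearch_query HOST FORMAT QUERY AUTHS
# Runs one query with curl. -G turns the --data-urlencode values into a
# URL-encoded query string instead of a request body.
wikisearch_query() {
    url=$(wikisearch_url "$1" "$2") || return 1
    curl -sS -G "$url" \
        --data-urlencode "query=$3" \
        --data-urlencode "auths=$4"
}

# Example use, where $QUERY holds the same string you would type into
# the search box at ui.jsp:
#   wikisearch_query http://localhost:8080 json "$QUERY" enwiki,frwiki
```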