This is a simple script designed to read one or more Wiktionary XML dump files
and import definitions into Redis. It needs far more rigorous testing, but
the results seem good enough to experiment with.
Usage
=====
$ wiktionary_to_redis.pl <wiktionary_dump.xml> <2nd_wiktionary_dump.xml> ...
You can import one or more files. When a word appears in more than one file,
the definitions from the file listed first are used. This was designed to use
definitions from Simple Wiktionary if available, then English Wiktionary as a
fallback.
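The precedence rule can be pictured with a small Python sketch. This is not the
script's own code (the script is Perl and writes to Redis); it only models the
"first file wins" behaviour with plain dictionaries. The function name
merge_first_wins and the sample entries are made up for illustration.

```python
def merge_first_wins(sources):
    """Merge word -> entries mappings, given in command-line order.

    A word keeps the entries from the first source that defines it;
    later sources only contribute words not seen before.
    """
    merged = {}
    for source in sources:
        for word, entries in source.items():
            if word not in merged:
                merged[word] = list(entries)
    return merged


# Illustrative entries only, not real dump content.
simple = {"sink": ["noun:::A container for water to wash things in."]}
english = {
    "sink": ["noun:::A basin used for holding water for washing."],
    "sinkhole": ["noun:::A hole in the ground formed by water."],
}

merged = merge_first_wins([simple, english])
print(merged["sink"])      # the Simple Wiktionary entry is kept
print(merged["sinkhole"])  # only in the second source, so it is added
```

Listing the Simple Wiktionary data first is what makes English Wiktionary act
as the fallback in this model.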
Make sure Redis (http://redis.io) is running first.
Definitions are imported as a list with format <part of speech>:::<definition>.
After the import is finished, you can access definitions like this:
$ redis-cli
redis> LRANGE sink 0 -1
1. "noun:::A '''sink''' is a [[container]] for water to wash things in, usually with a [[drain]]."
2. "noun:::A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]]."
3. "verb:::{{ti verb}} If something '''sinks''', it goes down, usually into water."
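Each list item follows the <part of speech>:::<definition> format, so a client
can split it on the first ":::" separator. Here is a minimal Python sketch; the
helper name parse_entry is hypothetical, and the sample strings are the three
entries shown above.

```python
def parse_entry(entry):
    """Split a stored "<part of speech>:::<definition>" string in two."""
    part_of_speech, separator, definition = entry.partition(":::")
    if not separator:
        raise ValueError("entry has no ':::' separator: %r" % entry)
    return part_of_speech, definition


entries = [
    "noun:::A '''sink''' is a [[container]] for water to wash things in, usually with a [[drain]].",
    "noun:::A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]].",
    "verb:::{{ti verb}} If something '''sinks''', it goes down, usually into water.",
]

for entry in entries:
    part_of_speech, definition = parse_entry(entry)
    print(part_of_speech, "->", definition)
```

Splitting only on the first separator keeps a definition intact even if its
text happens to contain ":::" itself.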
Keys that contain spaces must be quoted in the Redis CLI:
redis> LRANGE "Pacific Ocean" 0 -1
1. "proper_noun:::The '''Pacific Ocean''' is a very large [[body]] of water east of [[Asia]] and west of the [[America]]s."
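As the output above shows, the stored definitions still contain wiki markup
('''bold''', [[links]] and {{templates}}). If you want plain text, something
like the following Python sketch may be enough for simple cases. It is an
assumption about what a client might do, not part of the script; the name
strip_wiki_markup is made up, and nested templates or other markup are not
handled.

```python
import re


def strip_wiki_markup(text):
    """Remove a small, common subset of wiki markup from a definition."""
    # Drop non-nested templates such as {{ti verb}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)
    # Collapse whitespace left behind by removed markup.
    return " ".join(text.split())


print(strip_wiki_markup(
    "A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]]."
))
# A sink is a device that removes heat or energy.

print(strip_wiki_markup(
    "{{ti verb}} If something '''sinks''', it goes down, usually into water."
))
# If something sinks, it goes down, usually into water.
```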
Getting the data
================
Dump files are available from:
http://dumps.wikimedia.org/
So far, it has only been tested with the Simple English and English
Wiktionary sites:
http://download.wikimedia.org/simplewiktionary/latest/simplewiktionary-latest-pages-articles.xml.bz2
http://download.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2
Wiki markup on other Wiktionary sites varies, so I can't guarantee it will work
for anything else. By default, it will assume the same markup as English
Wiktionary.
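The dump files linked above are bzip2-compressed, while the usage line shows
plain .xml arguments. Assuming the script expects uncompressed XML (this README
doesn't say either way), a downloaded dump could be unpacked with Python's
standard bz2 module as sketched below. The function name decompress_dump and
the file names in the comment are placeholders.

```python
import bz2
import shutil


def decompress_dump(src_path, dest_path):
    """Stream a .bz2 dump to an uncompressed file without loading it all
    into memory, and return the destination path."""
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


# Example call (placeholder file names):
# decompress_dump("simplewiktionary-latest-pages-articles.xml.bz2",
#                 "simplewiktionary-latest-pages-articles.xml")
```

The command-line bunzip2 tool does the same job; the streaming copy here just
avoids holding a large dump in memory.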
Updating the dictionary
=======================
If you run the script again with updated dump data, words that are already in
Redis keep their current definitions; only new words are added.
If you want to re-import a new dump into the dictionary, you'll need to clear
the data already in Redis to avoid duplicate definitions. Note that flushall
removes every key in every database on the Redis server:
$ redis-cli
redis> flushall
Dependencies
============
Redis
MediaWiki::DumpFile
Data::Dumper