chorsley/Wiktionary-dump-to-Redis
This is a simple script that reads one or more Wiktionary XML dump files and imports the definitions into Redis. It needs far more rigorous testing, but the results seem good enough to experiment with.

Usage
=====

    $ wiktionary_to_redis.pl <wiktionary_dump.xml> <2nd_wiktionary_dump.xml> ...

You can import one file or several. When a word appears in more than one file, the definitions from the file listed first are used. This was designed to use definitions from Simple Wiktionary if available, with English Wiktionary as a fallback.

Make sure Redis (http://redis.io) is running first.

Each word is imported as a Redis list whose entries have the format <part of speech>:::<definition>. After the import is finished, you can access definitions like this:

    $ redis-cli
    redis> LRANGE sink 0 -1
    1. "noun:::A '''sink''' is a [[container]] for water to wash things in, usually with a [[drain]]."
    2. "noun:::A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]]."
    3. "verb:::{{ti verb}} If something '''sinks''', it goes down, usually into water."

In the Redis CLI, keys that contain spaces must be quoted:

    redis> LRANGE "Pacific Ocean" 0 -1
    1. "proper_noun:::The '''Pacific Ocean''' is a very large [[body]] of water east of [[Asia]] and west of the [[America]]s."

Getting the data
================

Dump files are available from:

    http://dumps.wikimedia.org/

So far the script has only been tested with the Simple English and English sites:

    http://download.wikimedia.org/simplewiktionary/latest/simplewiktionary- \
        latest-pages-articles.xml.bz2
    http://download.wikimedia.org/enwiktionary/latest/enwiktionary-latest- \
        pages-articles.xml.bz2

Wiki markup varies between Wiktionary sites, so I can't guarantee it will work for anything else. By default, the script assumes the same markup as English Wiktionary.

Updating the dictionary
=======================

If you run the script with updated dump data, definitions for words already in Redis won't be updated, but new words will be added.
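
The behaviour described above (files listed first take priority, and words that already exist are left alone) can be sketched as follows. This is only an illustration, not the script's actual code: the function name `import_dumps`, the `(title, part_of_speech, definition)` tuples, and the plain dict standing in for Redis are all assumptions made for the example.

```python
def import_dumps(dumps, store=None):
    """Import dumps in priority order into a dict that stands in for Redis.

    `dumps` is a list of iterables of (title, part_of_speech, definition)
    tuples. All names here are hypothetical, chosen for illustration.
    """
    store = {} if store is None else store
    for dump in dumps:
        # Words that existed before this dump started are left untouched,
        # so earlier files (and earlier imports) win.
        already_defined = set(store)
        for title, pos, definition in dump:
            if title in already_defined:
                continue
            store.setdefault(title, []).append(f"{pos}:::{definition}")
    return store


simple = [("sink", "noun", "A container for water."),
          ("sink", "verb", "To go down.")]
english = [("sink", "noun", "A basin."),
           ("drain", "noun", "A pipe that carries water away.")]

store = import_dumps([simple, english])
print(store["sink"])   # only the entries from the first file
print(store["drain"])  # a word found only in the second file
```

Under these assumptions, running `import_dumps` again with a newer dump and the same `store` adds new words and leaves the existing entries as they are.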
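
If you want to re-import a new dump into the dictionary, you'll need to clear the data already in Redis first to avoid duplicate definitions. Note that flushall deletes every key in every Redis database, not only the imported definitions:

    $ redis-cli
    redis> flushall

Dependencies
============

The following Perl modules are required:

    Redis
    MediaWiki::DumpFile
    Data::Dumper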
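
Entries come back from Redis as raw strings in the <part of speech>:::<definition> format shown under Usage, with the wiki markup still in place. A minimal sketch of how a client project might split and clean them is given below. The helper names `parse_entry` and `strip_markup` are made up for this example, and the markup handling covers only the simple cases seen in the sample output (bold quotes, [[links]], and {{templates}}), not the full wiki syntax.

```python
import re

SEPARATOR = ":::"


def parse_entry(entry):
    """Split one stored entry into (part_of_speech, definition)."""
    pos, sep, definition = entry.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"entry has no {SEPARATOR!r} separator: {entry!r}")
    return pos, definition


def strip_markup(text):
    """Remove the simplest wiki markup; a rough approximation only."""
    # Drop non-nested templates such as {{ti verb}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target]] becomes target, [[target|label]] becomes label.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold (''') and italic ('') quote markers.
    text = text.replace("'''", "").replace("''", "")
    # Collapse whitespace left behind by removed templates.
    return " ".join(text.split())


# With a Redis client library the list would come from an LRANGE call on the
# word's key; here the sample strings from the Usage section are used directly.
entries = [
    "noun:::A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]].",
    "verb:::{{ti verb}} If something '''sinks''', it goes down, usually into water.",
]

for entry in entries:
    pos, definition = parse_entry(entry)
    print(f"{pos}: {strip_markup(definition)}")
```

Using `partition` splits on the first `:::` only, so a definition that happens to contain the separator later in its text is kept whole.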
About
Take a Wiktionary dump file and import it into Redis to look up definitions in your own projects.