Skip to content

Take a Wiktionary dump file, import it into Redis to lookup definitions in your own projects.

Notifications You must be signed in to change notification settings

chorsley/Wiktionary-dump-to-Redis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 

Repository files navigation

This is a simple script designed to read one or more Wiktionary XML dump files
and import definitions into Redis. It needs more far rigorous testing, but 
the results seem good enough to experiment with.

Usage
=====

$ wiktionary_to_redis.pl <wiktionary_dump.xml> <2nd_wiktionary_dump.xml> ...

You can import one or multiple files. Definitions in files listed first are
used. This was designed to use definitions from Simple Wiktionary if available,
then English Wiktionary as a fallback.

Make sure Redis (http://redis.io) is running first. 
Definitions are imported as a list with format <part of speech>:::<definition>.

After the import is finished, you can access definitions like this:

$ redis-cli
redis> LRANGE sink 0 -1
1. "noun:::A '''sink''' is a [[container]] for water to wash things in, usually with a [[drain]]."
2. "noun:::A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]]."
3. "verb:::{{ti verb}} If something '''sinks''', it goes down, usually into water."

Queries with spaces need to use quote marks via the Redis CLI:

redis> LRANGE "Pacific Ocean" 0 -1
1. "proper_noun:::The '''Pacific Ocean''' is a very large [[body]] of water east of [[Asia]] and west of the [[America]]s."

Getting the data
================

Dump files are available from:

http://dumps.wikimedia.org/

Currently, it's only been tested to work with the Simple English and English
sites:

http://download.wikimedia.org/simplewiktionary/latest/simplewiktionary- \
latest-pages-articles.xml.bz2
http://download.wikimedia.org/enwiktionary/latest/enwiktionary-latest- \
pages-articles.xml.bz2

Wiki markup on other Wiktionary sites varies, so I can't guarantee it will work
for anything else. By default, it will assume the same markup as English
Wiktionary.

Updating the dictionary
=======================

If you're running the script with updated dump data, current definitions won't
be updated, but new words will be added.

If you want to re-import a new dump into the dictionary, you'll need to clear
the data already in Redis to avoid duplicate definitions:

$ redis-cli
redis> flushall

Dependencies
============

Redis
MediaWiki::DumpFile
Data::Dumper

About

Take a Wiktionary dump file, import it into Redis to lookup definitions in your own projects.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages