This is an efficient search engine for Wikipedia pages with support for English and Hindi queries. The code for each language is stored in its respective directory. The two versions are mostly similar, with minor differences in the tokenizer, stop words, and page parser.
The first step is to index the Wikipedia data, stored as an XML dump, to make search easier and quicker:
```
bash index.sh <path_to_wiki_dump> <path_to_inverted_index> statistics_file.txt
```
This will parse the data and store the final inverted index in a new location. Note that it will also create an intermediate representation at `intermediate/`.
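As a rough illustration of what the indexing step produces (the function names below are hypothetical, not the project's actual API), an inverted index maps each term to the documents that contain it, along with per-document term frequencies:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on non-alphanumeric runs.
    # The real project uses language-specific tokenizers defined in lang.py.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(pages):
    """Map each term to {doc_id: term_frequency}.

    `pages` is an iterable of (doc_id, text) pairs; in the real pipeline
    the pages come from streaming the XML dump, not an in-memory list.
    """
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in pages:
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index

pages = [(1, "inverted index search"), (2, "search engine index")]
idx = build_inverted_index(pages)
# idx["index"] maps both documents to a frequency of 1
```

The real indexer additionally stems tokens, drops stop words, and periodically flushes partial indexes to disk so the whole dump never has to fit in memory.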
With this, we can search through the data:
```
bash search.sh <path_to_inverted_index> <path_to_file_with_queries>
```
- `index.py` is the script that indexes a given data dump. It iterates through each article in the XML, creates a page instance (defined in `page.py`), indexes it (logic in `indexer.py`), and periodically inverts the accumulated index and writes it to a file.
- `config.py` holds the settings and options the project uses, such as the number of files per intermediate index.
- `lang.py` defines the stemmer and tokenizer used throughout the project.
- `search.py` implements the search functionality. It reads a query file, builds a posting list for each query, computes tf-idf scores per document, and returns the titles of the top-ranked documents.
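To make the scoring step concrete, here is a minimal sketch of tf-idf ranking over an inverted index. The function and weighting scheme (log-scaled tf times idf) are illustrative assumptions, not necessarily what `search.py` implements:

```python
import math
from collections import defaultdict

def rank(query_terms, index, num_docs, top_k=10):
    """Score documents with tf-idf and return the top_k doc ids.

    `index` maps term -> {doc_id: term_frequency}, as produced by
    the indexing step; `num_docs` is the total collection size.
    """
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # term appears nowhere; contributes no score
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

index = {
    "wiki": {1: 3, 2: 1},
    "search": {2: 2},
}
print(rank(["wiki", "search"], index, num_docs=3))  # prints [2, 1]
```

Document 2 wins here because "search" is rare (high idf) and occurs only in it, outweighing document 1's higher frequency of the more common term "wiki". The final step would map the winning doc ids back to page titles.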