Reproduction of "WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval" by Daniel Cohen, Liu Yang, and W. Bruce Croft (SIGIR 2018)
- Download the WikiPassageQA data from here and move it to data/raw.
- Download the WebAP data from here and move it to data/raw.
Set up the dev environment by running:
virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt
Install the nltk corpora:
python -c"import nltk; nltk.download('stopwords')"
python -c"import nltk; nltk.download('wordnet')"
- Clone the repository:
git clone git@github.com:nikhilsaldanha/WikiPassageQA.git
- Set up the repository according to the instructions above.
- Pick an open issue from the list here or create your own and assign it to yourself.
- Create a branch to fix the issue:
git checkout -b <name-of-branch>
- Once you're done, commit and push.
- Go to the branch page and create a merge request. Ask a team member to approve changes.
Passage Data Extraction
Each passage is pre-processed with the following steps (a minimal sketch follows the list):
- Convert to lower case
- Remove punctuation
- Tokenize
- Remove stop words
- Lemmatize/Stem
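A minimal sketch of these steps, assuming nltk's English stop word list and WordNetLemmatizer (the actual extraction scripts may differ in details such as the tokenizer):

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # lower case and strip punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # tokenize on whitespace, drop stop words, lemmatize
    return [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOPWORDS]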
Query Data Extraction
Each query is pre-processed with the same steps, plus one extra transformation at the end:
- Convert to lower case
- Remove punctuation
- Tokenize
- Remove stop words
- Lemmatize/Stem
- Split each row with multiple comma-separated passage ids in the RelevantPassages column into multiple rows, each with a single passage id (see the pandas sketch after this list).
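A minimal pandas sketch of that split, using a toy DataFrame (the real query file and its other columns are loaded by the extraction script):

import pandas as pd

# toy example: query 1 is relevant to passages 3, 5 and 8 of its document
queries = pd.DataFrame({
    "QID": [1, 2],
    "RelevantPassages": ["3,5,8", "2"],
})

# one row per (query, passage id) pair
queries["RelevantPassages"] = queries["RelevantPassages"].str.split(",")
queries = queries.explode("RelevantPassages").reset_index(drop=True)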
How to Run:
I. WikiPassageQA:
- Execute:
python src/data_extraction/wikiqa_query_data_extraction.py
to extract query data.
- Execute:
python src/data_extraction/wikiqa_passage_data_extraction.py
to extract passage data.
II. WebAP:
- Execute:
python src/data_extraction/webap_data_extraction.py
to extract query and passage data from WebAP and store it in data/extracted/webap_passages.json and data/extracted/webap_queries.csv.
If pre-processing is not required, pass preprocess=False when calling WebAPDataExtraction.extract_data() (see the sketch below).
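A hypothetical call, for illustration only; the module path and constructor arguments are assumptions, only the class and method names come from this README:

from data_extraction.webap_data_extraction import WebAPDataExtraction  # assumed import path

extractor = WebAPDataExtraction()          # assumed constructor
extractor.extract_data(preprocess=False)   # skip the pre-processing steps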
Extracted data is stored in data/extracted. Queries and passages are converted to lists of lemmatized/stemmed tokens.
Feature Extraction
The following features are extracted:
- Document Term Frequency
- Collection Term Frequency
How to Run:
Execute:
python src/feature_extraction/feature_extraction.py (train|test|dev)
to extract train, test, or validation features.
Extracted features are stored in data/processed/train, data/processed/dev, and data/processed/test.
Structure of Collection Term Frequency (col_term_freq.json):
{
"term1": 23,
"term2": 31,
...
}
where each key is a unique term in the collection of documents and its value is the number of occurrences of that term across all documents in the collection.
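A minimal sketch of how such a file can be produced with collections.Counter, assuming documents maps doc ids to lists of pre-processed tokens:

import json
from collections import Counter

documents = {"doc1": ["cat", "sat"], "doc2": ["cat"]}  # toy collection

col_term_freq = Counter()
for tokens in documents.values():
    col_term_freq.update(tokens)

with open("col_term_freq.json", "w") as f:
    json.dump(dict(col_term_freq), f)  # {"cat": 2, "sat": 1}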
Structure of Document Term Frequency (doc_term_freq.json):
{
"doc_id1": {
"term1": 23,
"term2": 31,
...
},
"doc_id2": {
...
},
...
}
where each key is a unique id for a document in the collection and its value is a dictionary mapping each term in that document to its frequency in that document.
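The nested structure can be built the same way, one Counter per document (again a sketch over the same toy documents dict):

import json
from collections import Counter

documents = {"doc1": ["cat", "sat"], "doc2": ["cat"]}  # toy collection

doc_term_freq = {doc_id: dict(Counter(tokens)) for doc_id, tokens in documents.items()}

with open("doc_term_freq.json", "w") as f:
    json.dump(doc_term_freq, f)  # {"doc1": {"cat": 1, "sat": 1}, "doc2": {"cat": 1}}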
Repository Structure
data : all datasets (ignored from version control)
|__ processed : features and pre-processed data
| |__ features : separate extracted features
| |__ datasets : combined features into datasets (train/valid/test)
|__ raw : untouched data from the source
notebooks : all notebooks (for quick and dirty work)
|__ data_analysis : related to analysis and visualization of dataset
|__ feature_extraction : related to creating features
|__ models : related to testing out models
src : all clean python scripts
|__ feature_extraction : one script for each feature
|__ models : scripts for models
|__ experiments : scripts to run all experiments (training, tuning, testing)
documents : contains papers/reports required
Testing
Edit /src/TestBench.py to add a function to either:
- generate test results, or
- read test results from a pickle.
Then use the tester in the main function to run tests for the particular model.
Keep in mind that the output mapping is: QID -> (DocID + PassageID)
Therefore, keep your outputs in a data frame with the columns:
columns = ["QID", "DocID", "PassageID"]
test_result = pd.DataFrame(columns=columns)
Then you can populate test_result, one row per retrieved passage (see the sketch below).
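A minimal sketch with hypothetical ranking output; only the column names come from this README:

import pandas as pd

columns = ["QID", "DocID", "PassageID"]

# suppose `ranked` maps each query id to its ranked (doc id, passage id) pairs
ranked = {
    "q1": [("doc7", 3), ("doc7", 1)],
    "q2": [("doc2", 0)],
}

rows = [
    {"QID": qid, "DocID": doc_id, "PassageID": pid}
    for qid, pairs in ranked.items()
    for doc_id, pid in pairs
]
test_result = pd.DataFrame(rows, columns=columns)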