Reproduction of "WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval" by Daniel Cohen, Liu Yang, and W. Croft (SIGIR18)

WikiPassageQA

Reproduction of "WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval" by Daniel Cohen, Liu Yang, and W. Croft (SIGIR18)

Instructions for setting up

  1. Download the WikiPassageQA data from here and move it to data/raw.
  2. Download the WebAP data from here and move it to data/raw.

Set up the development environment by running:

  1. virtualenv -p python3 env
  2. source env/bin/activate
  3. pip install -r requirements.txt

Install the NLTK corpora:

  1. python -c "import nltk; nltk.download('stopwords')"
  2. python -c "import nltk; nltk.download('wordnet')"

Contribution Guidelines

  1. Clone the repository: git clone git@github.com:nikhilsaldanha/WikiPassageQA.git.
  2. Set up the repository according to the instructions above.
  3. Pick an open issue from the list here or create your own and assign it to yourself.
  4. Create a branch to fix the issue: git checkout -b <name-of-branch>.
  5. Once you're done, commit and push.
  6. Go to the branch page and create a merge request. Ask a team member to approve changes.

Data Extraction Pipeline

  1. Passage Data Extraction

    • Convert to lower case
    • Remove punctuation
    • Tokenize
    • Remove stop words
    • Lemmatize/Stem
  2. Query Data Extraction

    • Convert to lower case
    • Remove punctuation
    • Tokenize
    • Remove stop words
    • Lemmatize/Stem
    • Split each row that has multiple comma-separated passage ids in the RelevantPassages column into multiple rows, each with one passage id.
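
The steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's implementation: the function names, the tiny stop-word set, and the sample query are assumptions. The repository relies on the NLTK stopwords and wordnet corpora, so a lemmatizer or stemmer (for example NLTK's WordNetLemmatizer) would be passed in as the normalize argument, which defaults to a no-op here.

```python
import string

# Tiny illustrative stop-word set (assumption); the repository downloads
# NLTK's 'stopwords' corpus, which is much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "what", "was"}


def preprocess(text, stop_words=STOP_WORDS, normalize=lambda token: token):
    """Lower-case, strip punctuation, tokenize, drop stop words, normalize."""
    text = text.lower()
    # Replace each punctuation character with a space so words stay separated.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    # 'normalize' stands in for a lemmatizer or stemmer (hypothetical hook).
    return [normalize(tok) for tok in tokens if tok not in stop_words]


def split_relevant_passages(rows, column="RelevantPassages"):
    """Expand rows holding comma-separated passage ids into one row per id."""
    expanded = []
    for row in rows:
        for passage_id in str(row[column]).split(","):
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[column] = passage_id.strip()
            expanded.append(new_row)
    return expanded


# Hypothetical sample data, for illustration only.
tokens = preprocess("What is the history of the Roman Empire?")
rows = split_relevant_passages([{"QID": "q1", "RelevantPassages": "4,5,7"}])
```

Passing the normalizer in as a callable keeps the sketch runnable without any downloaded corpora while leaving room for either lemmatization or stemming.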

How to Run:

I. WikiPassageQA:

  1. Execute: python src/data_extraction/wikiqa_query_data_extraction.py to extract query data
  2. Execute: python src/data_extraction/wikiqa_passage_data_extraction.py to extract passage data

II. WebAP:

  1. Execute: python src/data_extraction/webap_data_extraction.py to extract query and passage data from WebAP and store it in data/extracted/webap_passages.json and data/extracted/webap_queries.csv

If pre-processing is not required, pass preprocess=False when calling WebAPDataExtraction.extract_data().

Extracted data is stored in data/extracted. Queries and passages are converted to lists of lemmatized/stemmed tokens.

Feature Extraction Pipeline

  1. Document Term Frequency
  2. Collection Term Frequency

How to Run: Execute python src/feature_extraction/feature_extraction.py (train|test|dev) to extract features for the train, test, or dev (validation) split.

Extracted features are stored in data/processed/train, data/processed/dev and data/processed/test.

Structure of Collection Term Frequency col_term_freq.json:

{
    "term1": 23,
    "term2": 31,
    ...
}

where each key is a unique term in the collection of documents and its value is the number of its occurrences across all documents in the collection.
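
A mapping of this shape could be built along the following lines. This is a sketch under assumptions rather than the repository's code: the function name collection_term_freq and the sample documents are hypothetical, and documents are assumed to be already tokenized by the data extraction step.

```python
import json
from collections import Counter


def collection_term_freq(docs):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for tokens in docs.values():
        counts.update(tokens)  # add this document's term occurrences
    return dict(counts)


# Hypothetical tokenized documents keyed by document id.
docs = {
    "doc_id1": ["roman", "empire", "roman"],
    "doc_id2": ["empire", "history"],
}

col_term_freq = collection_term_freq(docs)
serialized = json.dumps(col_term_freq, indent=4)  # same layout as col_term_freq.json
```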


Structure of Document Term Frequency doc_term_freq.json:

{
    "doc_id1": {
        "term1": 23,
        "term2": 31,
        ...
    },
    "doc_id2": {
        ...
    },
    ...
}

where each key is the unique id of a document in the collection and its value is a dictionary that maps each term in that document to its frequency in that document.
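
The nested structure can be illustrated with a short sketch. The function name document_term_freq and the sample documents are assumptions made for the example, not names taken from the repository. The last line shows how the per-document counts relate to the collection-level counts: summing a term's frequency over all documents gives its collection term frequency.

```python
from collections import Counter


def document_term_freq(docs):
    """Map each document id to a {term: count} dictionary for that document."""
    return {doc_id: dict(Counter(tokens)) for doc_id, tokens in docs.items()}


# Hypothetical tokenized documents keyed by document id.
docs = {
    "doc_id1": ["roman", "empire", "roman"],
    "doc_id2": ["empire", "history"],
}

doc_term_freq = document_term_freq(docs)

# Summing one term's count over every document yields its collection frequency.
empire_total = sum(freqs.get("empire", 0) for freqs in doc_term_freq.values())
```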

Model Creation Pipeline (TBD)

Project Structure

data                   : all datasets (ignored from version control)
|__ processed          : features and pre-processed data
|   |__ features       : separate extracted features
|   |__ datasets       : features combined into datasets (train/valid/test)
|__ raw                : untouched data from the source

notebooks              : all notebooks (for quick and dirty work)
|__ data_analysis      : related to analysis and visualization of dataset
|__ feature_extraction : related to creating features
|__ models             : related to testing out models

src                    : all clean python scripts
|__ feature_extraction : one script for each feature
|__ models             : scripts for models
|__ experiments        : scripts to run all experiments (training, tuning, testing)

documents              : contains papers/reports required

Obtaining Metrics

Edit /src/TestBench.py to add a function that either:

  1. generates test results, or
  2. reads test results from a pickle.

Then use the tester in the main function to run tests for the particular model.

Keep in mind that the output mapping is: QID -> (DocID + PassageID)

Therefore, keep your outputs in a data frame with the following columns:

columns = ["QID", "DocID", "PassageID"]
test_result = pd.DataFrame(columns=columns)

You can then populate test_result.
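
One way to fill such a frame is to collect the rows first and build the DataFrame in a single step, which is usually simpler than appending row by row. The sketch below is an assumption about usage, not code from TestBench.py: the ids are made up, and the row order per query is assumed to reflect the model's ranking.

```python
import pandas as pd

columns = ["QID", "DocID", "PassageID"]

# Hypothetical model output: (query id, document id, passage id) triples,
# listed in ranked order for each query.
ranked_output = [
    ("q1", "d10", "p3"),
    ("q1", "d10", "p7"),
    ("q2", "d42", "p1"),
]

# Build the frame in one step from the collected rows.
test_result = pd.DataFrame(ranked_output, columns=columns)

# Recover the QID -> [(DocID, PassageID), ...] mapping from the frame.
mapping = {
    qid: list(zip(group["DocID"], group["PassageID"]))
    for qid, group in test_result.groupby("QID", sort=False)
}
```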
