Reproduction of "WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval" by Daniel Cohen, Liu Yang, and W. Bruce Croft (SIGIR 2018)
- Download the WikiPassageQA data from here and move it to data/raw.
- Download the WebAP data from here and move it to data/raw.
Set up the dev environment by running:
virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt
Install the nltk corpora:
python -c"import nltk; nltk.download('stopwords')"
python -c"import nltk; nltk.download('wordnet')"
- Clone the repository:
git clone git@github.com:nikhilsaldanha/WikiPassageQA.git
- Set up the repository according to the instructions above.
- Pick an open issue from the list here or create your own and assign it to yourself.
- Create a branch to fix the issue:
git checkout -b <name-of-branch>
- Once you're done, commit and push.
- Go to the branch page and create a merge request. Ask a team member to approve changes.
Passage Data Extraction
Each passage is pre-processed with the following steps (a minimal sketch follows the list):
- Convert to lower case
- Remove punctuation
- Tokenize
- Remove stop words
- Lemmatize/Stem
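A minimal sketch of these steps, assuming nltk's English stop word list and WordNetLemmatizer (the actual extraction scripts may differ in details such as the tokenizer):

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # lower case and strip punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # tokenize on whitespace, drop stop words, lemmatize
    return [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOPWORDS]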
Query Data Extraction
Each query is pre-processed with the same steps, plus one extra transformation at the end:
- Convert to lower case
- Remove punctuation
- Tokenize
- Remove stop words
- Lemmatize/Stem
- Split each row with multiple comma-separated passage ids in the RelevantPassages column into multiple rows, each with a single passage id (see the pandas sketch after this list).
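A minimal pandas sketch of that split, using a toy DataFrame (the real query file and its other columns are loaded by the extraction script):

import pandas as pd

# toy example: query 1 is relevant to passages 3, 5 and 8 of its document
queries = pd.DataFrame({
    "QID": [1, 2],
    "RelevantPassages": ["3,5,8", "2"],
})

# one row per (query, passage id) pair
queries["RelevantPassages"] = queries["RelevantPassages"].str.split(",")
queries = queries.explode("RelevantPassages").reset_index(drop=True)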
How to Run:
I. WikiPassageQA:
- Execute:
python src/data_extraction/wikiqa_query_data_extraction.py
to extract query data.
- Execute:
python src/data_extraction/wikiqa_passage_data_extraction.py
to extract passage data.
II. WebAP:
- Execute:
python src/data_extraction/webap_data_extraction.py
to extract query and passage data from WebAP and store it in data/extracted/webap_passages.json and data/extracted/webap_queries.csv.
If pre-processing is not required, pass preprocess=False when calling WebAPDataExtraction.extract_data() (see the sketch below).
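A hypothetical call, for illustration only; the module path and constructor arguments are assumptions, only the class and method names come from this README:

from data_extraction.webap_data_extraction import WebAPDataExtraction  # assumed import path

extractor = WebAPDataExtraction()          # assumed constructor
extractor.extract_data(preprocess=False)   # skip the pre-processing steps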
Extracted data is stored in data/extracted. Queries and passages are converted to lists of lemmatized/stemmed tokens.
Feature Extraction
The following features are extracted:
- Document Term Frequency
- Collection Term Frequency
How to Run:
Execute:
python src/feature_extraction/feature_extraction.py (train|test|dev)
to extract train, test, or validation features.
Extracted features are stored in data/processed/train, data/processed/dev, and data/processed/test.
Structure of Collection Term Frequency (col_term_freq.json):
{
"term1": 23,
"term2": 31,
...
}
where each key is a unique term in the collection of documents and its value is the number of occurrences of that term across all documents in the collection.
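A minimal sketch of how such a file can be produced with collections.Counter, assuming documents maps doc ids to lists of pre-processed tokens:

import json
from collections import Counter

documents = {"doc1": ["cat", "sat"], "doc2": ["cat"]}  # toy collection

col_term_freq = Counter()
for tokens in documents.values():
    col_term_freq.update(tokens)

with open("col_term_freq.json", "w") as f:
    json.dump(dict(col_term_freq), f)  # {"cat": 2, "sat": 1}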
Structure of Document Term Frequency (doc_term_freq.json):
{
"doc_id1": {
"term1": 23,
"term2": 31,
...
},
"doc_id2": {
...
},
...
}
where each key is a unique id for a document in the collection and its value is a dictionary mapping each term in that document to its frequency in that document.
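The nested structure can be built the same way, one Counter per document (again a sketch over the same toy documents dict):

import json
from collections import Counter

documents = {"doc1": ["cat", "sat"], "doc2": ["cat"]}  # toy collection

doc_term_freq = {doc_id: dict(Counter(tokens)) for doc_id, tokens in documents.items()}

with open("doc_term_freq.json", "w") as f:
    json.dump(doc_term_freq, f)  # {"doc1": {"cat": 1, "sat": 1}, "doc2": {"cat": 1}}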
Repository Structure
data : all datasets (ignored from version control)
|__ processed : features and pre-processed data
| |__ features : separate extracted features
| |__ datasets : combined features into datasets (train/valid/test)
|__ raw : untouched data from the source
notebooks : all notebooks (for quick and dirty work)
|__ data_analysis : related to analysis and visualization of dataset
|__ feature_extraction : related to creating features
|__ models : related to testing out models
src : all clean python scripts
|__ feature_extraction : one script for each feature
|__ models : scripts for models
|__ experiments : scripts to run all experiments (training, tuning, testing)
documents : contains papers/reports required
Testing
Edit /src/TestBench.py to add a function to either:
- generate test results, or
- read test results from a pickle.
Then use the tester in the main function to run tests for the particular model.
Keep in mind that the output mapping is: QID -> (DocID + PassageID)
Therefore, keep your outputs in a data frame with the columns:
columns = ["QID", "DocID", "PassageID"]
test_result = pd.DataFrame(columns=columns)
Then you can populate test_result, one row per retrieved passage (see the sketch below).
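A minimal sketch with hypothetical ranking output; only the column names come from this README:

import pandas as pd

columns = ["QID", "DocID", "PassageID"]

# suppose `ranked` maps each query id to its ranked (doc id, passage id) pairs
ranked = {
    "q1": [("doc7", 3), ("doc7", 1)],
    "q2": [("doc2", 0)],
}

rows = [
    {"QID": qid, "DocID": doc_id, "PassageID": pid}
    for qid, pairs in ranked.items()
    for doc_id, pid in pairs
]
test_result = pd.DataFrame(rows, columns=columns)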