A Natural Language Processing project on multiple-choice question answering over the QASC dataset.
Each item consists of a question, eight answer choices (labeled 'A' through 'H'), and two facts that provide background for the question; the task is to select the correct answer. The multiple-choice task can therefore be framed as single-label, multi-class classification over the eight alternatives. The questions cover grade-school science; below is the word cloud generated from the training data:
To complete the project, I began with simple, foundational NLP methods, knowing they would not produce state-of-the-art results but would still be valuable for understanding the evolution of the field. I started with TF-IDF, a basic yet effective way to capture the importance of words in a document, and experimented with bigram/trigram language models.
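For illustration, here is a minimal sketch of TF-IDF answer selection; the choice to compare the question plus facts against each candidate, and the toy example, are my assumptions rather than the exact notebook code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def predict_tfidf(question, facts, choices):
    """Pick the choice whose TF-IDF vector is closest to the question + facts."""
    query = question + " " + " ".join(facts)
    # Fit the vectorizer on the query and all eight choices together
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([query] + choices)
    # Row 0 is the query; rows 1..8 are the choices
    sims = cosine_similarity(matrix[0], matrix[1:]).flatten()
    return sims.argmax()  # index of the best-scoring choice

choices = ["sunlight", "water", "rocks", "sand", "wind", "ice", "soil", "air"]
best = predict_tfidf("What do plants need to make food?",
                     ["Plants use sunlight to make food."], choices)
print(choices[best])
```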
Next, I explored neural models for generating word embeddings, such as Word2Vec, which is trained to predict the context of a given word. For sentence similarity, I primarily used cosine similarity between the vector representations (from both TF-IDF and word embeddings), selecting the alternative with the highest score.
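A minimal sketch of the embedding-based variant, assuming a gensim Word2Vec model and plain vector averaging (the toy corpus is illustrative only; the notebooks train on the QASC text):

```python
import numpy as np
from gensim.models import Word2Vec

def sentence_vector(model, sentence):
    """Average the Word2Vec vectors of the in-vocabulary words."""
    words = [w for w in sentence.lower().split() if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy corpus just so the example runs end to end
corpus = [["plants", "need", "sunlight", "to", "make", "food"],
          ["water", "evaporates", "in", "the", "sun"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

query = "plants need sunlight"
choices = ["sunlight", "water"]
scores = [cosine(sentence_vector(model, query), sentence_vector(model, c))
          for c in choices]
print(choices[int(np.argmax(scores))])
```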
Finally, I implemented BERT and large language models (LLMs), both built on the transformer architecture, which is currently the state of the art for sequence-to-sequence tasks. Specifically, I examined the encoder side (BERT-like models are encoder-only), which extracts meaning from input sequences, and the decoder side (LLMs are decoder-only), which generates new sequences conditioned on the input.
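As a sketch of the encoder route, Hugging Face's `AutoModelForMultipleChoice` can score all eight choices jointly; `bert-base-uncased` below is a stand-in checkpoint, so without fine-tuning the prediction is essentially random:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

question = "What do plants need to make food?"
choices = ["sunlight", "water", "rocks", "sand", "wind", "ice", "soil", "air"]

# One (question, choice) pair per candidate answer
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True)
# The multiple-choice head expects shape (batch, num_choices, seq_len)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 8)
print(choices[logits.argmax(-1).item()])
```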
The models can be downloaded here
Unzip the file and move the 'models' folder into the 'SelectWise' folder. In the `./models` folder I have saved the models trained in the notebooks; these models can be loaded and used for evaluation.
The notebooks can simply be run sequentially, cell by cell. The dataset is loaded with `load_dataset("allenai/qasc")` via the Hugging Face `datasets` library, so there is no need to download it manually.
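For reference, a single item can be loaded and inspected like this (field names follow the `allenai/qasc` dataset card):

```python
from datasets import load_dataset

dataset = load_dataset("allenai/qasc")
item = dataset["train"][0]
print(item["question"])             # the question text
print(item["choices"]["text"])      # the eight candidate answers
print(item["choices"]["label"])     # their labels, 'A' to 'H'
print(item["answerKey"])            # label of the correct choice
print(item["fact1"], item["fact2"]) # the two supporting facts
```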
In the notebooks the models are loaded with the following path: `'../models/model_name'`
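For example, a transformer checkpoint saved with `save_pretrained` could be reloaded as follows; the model class is an assumption and varies by notebook (gensim and Keras models use their own load functions):

```python
from transformers import AutoModelForMultipleChoice

# '../models/model_name' is the placeholder path from above
model = AutoModelForMultipleChoice.from_pretrained("../models/model_name")
```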
In the notebooks the images are inserted in this way: `![picture](../imgs/img_name)`
- Look at the Dataset
- Representing documents with TF-IDF weighting:
  - Plain TF-IDF
  - TF-IDF truncated with SVD
  - TF-IDF based retrieval from the train set
- n-gram LM based classification
- Representation by means of static word embeddings:
  - Word2Vec
  - GloVe
  - FastText
  - Doc2Vec
- Other ways of combining word embeddings:
  - Removing duplicated words
  - Word embeddings weighted by their IDF score
- Other ways of choosing the answer:
  - Siamese Neural Network
  - Feed-Forward Neural Network
- BERT:
  - Binary classification - NextSentencePrediction
  - Multi-class classification - MultipleChoice
- Different ways of tuning a pretrained model:
  - Linear probing
  - Mixed method
- Zero-Shot Prompting (see the prompting sketch after this list)
- Zero-Shot Chain-of-Thought Prompting
- Few-Shot Prompting
- RAG-inspired Few-Shot Prompting
- BERT - combined method
- DeciLM - few-shot prompting
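As referenced in the list above, a zero-shot prompt for this task might be built like this; the template and the `Deci/DeciLM-7B-instruct` checkpoint are illustrative choices, not necessarily the notebooks' exact setup:

```python
from transformers import pipeline

def build_prompt(question, choices):
    """Format a QASC item as a zero-shot multiple-choice prompt."""
    options = "\n".join(f"{letter}. {text}"
                        for letter, text in zip("ABCDEFGH", choices))
    return (f"Question: {question}\n{options}\n"
            "Answer with the letter of the correct option.\nAnswer:")

# The checkpoint is an assumption; any instruction-tuned causal LM would do
generator = pipeline("text-generation", model="Deci/DeciLM-7B-instruct",
                     trust_remote_code=True)

prompt = build_prompt("What do plants need to make food?",
                      ["sunlight", "water", "rocks", "sand",
                       "wind", "ice", "soil", "air"])
out = generator(prompt, max_new_tokens=5)[0]["generated_text"]
print(out[len(prompt):].strip())  # ideally a single letter such as 'A'
```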
- `requirements.txt` for pip
- `requirements.yml` for anaconda

Install the dependencies with `pip install -r requirements.txt` or `conda env create -f requirements.yml`.
@inproceedings{Khot2019QASC,
  title     = {{QASC}: A Dataset for Question Answering via Sentence Composition},
  author    = {Tushar Khot and Peter Clark and Michal Guerquin and Peter Alexander Jansen and Ashish Sabharwal},
  booktitle = {AAAI},
  year      = {2019},
}