A Natural Language Processing project on multiple-choice question answering over the QASC dataset.
Each item consists of a question, eight answer choices (labeled 'A' through 'H'), and two facts that provide background for the question; the task is to select the correct answer. The multiple-choice task can therefore be framed as single-label, multi-class classification over the eight alternatives. The questions cover grade-school science; below is the word cloud generated from the training data:
To complete the project, I began with simple, foundational NLP methods, knowing they would not produce state-of-the-art results but would still be valuable for understanding the evolution of the field. I started with TF-IDF, a basic yet effective way to capture the importance of words in a document, and experimented with bigram/trigram language models.
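For illustration, here is a minimal sketch of TF-IDF answer selection; the choice to compare the question plus facts against each candidate, and the toy example, are my assumptions rather than the exact notebook code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def predict_tfidf(question, facts, choices):
    """Pick the choice whose TF-IDF vector is closest to the question + facts."""
    query = question + " " + " ".join(facts)
    # Fit the vectorizer on the query and all eight choices together
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([query] + choices)
    # Row 0 is the query; rows 1..8 are the choices
    sims = cosine_similarity(matrix[0], matrix[1:]).flatten()
    return sims.argmax()  # index of the best-scoring choice

choices = ["sunlight", "water", "rocks", "sand", "wind", "ice", "soil", "air"]
best = predict_tfidf("What do plants need to make food?",
                     ["Plants use sunlight to make food."], choices)
print(choices[best])
```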
Next, I explored neural models for generating word embeddings, such as Word2Vec, which is trained to predict the context of a given word. For sentence similarity, I primarily used cosine similarity between the vector representations (from both TF-IDF and word embeddings), selecting the alternative with the highest score.
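A minimal sketch of the embedding-based variant, assuming a gensim Word2Vec model and plain vector averaging (the toy corpus is illustrative only; the notebooks train on the QASC text):

```python
import numpy as np
from gensim.models import Word2Vec

def sentence_vector(model, sentence):
    """Average the Word2Vec vectors of the in-vocabulary words."""
    words = [w for w in sentence.lower().split() if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy corpus just so the example runs end to end
corpus = [["plants", "need", "sunlight", "to", "make", "food"],
          ["water", "evaporates", "in", "the", "sun"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

query = "plants need sunlight"
choices = ["sunlight", "water"]
scores = [cosine(sentence_vector(model, query), sentence_vector(model, c))
          for c in choices]
print(choices[int(np.argmax(scores))])
```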
Finally, I implemented BERT and large language models (LLMs), both built on the transformer architecture, which is currently the state of the art for sequence-to-sequence tasks. Specifically, I examined the encoder side (BERT-like models are encoder-only), which extracts meaning from input sequences, and the decoder side (LLMs are decoder-only), which generates new sequences conditioned on the input.
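As a sketch of the encoder route, Hugging Face's `AutoModelForMultipleChoice` can score all eight choices jointly; `bert-base-uncased` below is a stand-in checkpoint, so without fine-tuning the prediction is essentially random:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

question = "What do plants need to make food?"
choices = ["sunlight", "water", "rocks", "sand", "wind", "ice", "soil", "air"]

# One (question, choice) pair per candidate answer
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True)
# The multiple-choice head expects shape (batch, num_choices, seq_len)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 8)
print(choices[logits.argmax(-1).item()])
```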
The models can be downloaded here
Unzip the file and move the 'models' folder into the 'SelectWise' folder. In the `./models` folder I have saved the models trained in the notebooks; these models can be loaded and used for evaluation.
The notebooks can simply be run sequentially, cell by cell. The dataset is loaded with `load_dataset("allenai/qasc")` via the Hugging Face `datasets` library, so there is no need to download it manually.
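For reference, a single item can be loaded and inspected like this (field names follow the `allenai/qasc` dataset card):

```python
from datasets import load_dataset

dataset = load_dataset("allenai/qasc")
item = dataset["train"][0]
print(item["question"])             # the question text
print(item["choices"]["text"])      # the eight candidate answers
print(item["choices"]["label"])     # their labels, 'A' to 'H'
print(item["answerKey"])            # label of the correct choice
print(item["fact1"], item["fact2"]) # the two supporting facts
```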
In the notebooks the models are loaded with the following path: `'../models/model_name'`
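For example, a transformer checkpoint saved with `save_pretrained` could be reloaded as follows; the model class is an assumption and varies by notebook (gensim and Keras models use their own load functions):

```python
from transformers import AutoModelForMultipleChoice

# '../models/model_name' is the placeholder path from above
model = AutoModelForMultipleChoice.from_pretrained("../models/model_name")
```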
In the notebooks the images are inserted in this way: `![picture](../imgs/img_name)`
- Look at the Dataset
- Representing documents with TF-IDF weighting:
  - Plain TF-IDF
  - TF-IDF truncated with SVD
  - TF-IDF based retrieval from the train set
- n-gram LM based classification
- Representation by means of static word embeddings:
  - Word2Vec
  - GloVe
  - FastText
  - Doc2Vec
- Other ways of combining word embeddings:
  - Removing duplicated words
  - Word embeddings weighted by their IDF score
- Other ways of choosing the answer:
  - Siamese Neural Network
  - Feed-Forward Neural Network
- BERT:
  - Binary classification - NextSentencePrediction
  - Multi-class classification - MultipleChoice
- Different ways of tuning a pretrained model:
  - Linear probing
  - Mixed method
- Zero-Shot Prompting (see the prompting sketch after this list)
- Zero-Shot Chain-of-Thought Prompting
- Few-Shot Prompting
- RAG-inspired Few-Shot Prompting
- BERT - combined method
- DeciLM - few-shot prompting
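As referenced in the list above, a zero-shot prompt for this task might be built like this; the template and the `Deci/DeciLM-7B-instruct` checkpoint are illustrative choices, not necessarily the notebooks' exact setup:

```python
from transformers import pipeline

def build_prompt(question, choices):
    """Format a QASC item as a zero-shot multiple-choice prompt."""
    options = "\n".join(f"{letter}. {text}"
                        for letter, text in zip("ABCDEFGH", choices))
    return (f"Question: {question}\n{options}\n"
            "Answer with the letter of the correct option.\nAnswer:")

# The checkpoint is an assumption; any instruction-tuned causal LM would do
generator = pipeline("text-generation", model="Deci/DeciLM-7B-instruct",
                     trust_remote_code=True)

prompt = build_prompt("What do plants need to make food?",
                      ["sunlight", "water", "rocks", "sand",
                       "wind", "ice", "soil", "air"])
out = generator(prompt, max_new_tokens=5)[0]["generated_text"]
print(out[len(prompt):].strip())  # ideally a single letter such as 'A'
```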
- `requirements.txt` for pip
- `requirements.yml` for anaconda

Install the dependencies with `pip install -r requirements.txt` or `conda env create -f requirements.yml`.
@inproceedings{Khot2019QASC,
  title     = {{QASC}: A Dataset for Question Answering via Sentence Composition},
  author    = {Tushar Khot and Peter Clark and Michal Guerquin and Peter Alexander Jansen and Ashish Sabharwal},
  booktitle = {AAAI},
  year      = {2019},
}