Problem: Given a topic, generate a question with multiple choices and an answer.
Briefly about the project:
- I generated a dataset of 40k samples
- I trained a model based on the GPT-2 architecture
- Inference: I used a number of different techniques to produce reasonable generated content
Topic: "chemistry"
Model output:
"""
question: what can be used to determine the age of an organism
variants: (a) cell division (b) survival (c) rapid expansion (d) the rapid growth of a species (e) it needs them (f) genetic
answer: f
context: genetic information is used for determining the ages of organisms
"""
I created the dataset from the QASC dataset. It can be downloaded here.
Actually, I attached the dataset so you don't need to download it.
A bit later I found out that this dataset also exists on HF.
Note: I used only the dev.jsonl and train.jsonl files, because test.jsonl doesn't have answers.
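As a sketch, the two usable splits can be read line by line, since each QASC line is a standalone JSON object (the helper name below is mine, not from the project):

```python
import json

def load_qasc_split(path):
    """Read one QASC split: one JSON object per line (.jsonl)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# test.jsonl is skipped because its records lack the answer field.
# samples = load_qasc_split("train.jsonl") + load_qasc_split("dev.jsonl")
```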
- Process the data and put it into a convenient format
- Use the GPT-2 API to create a topic for each question
- Add OPENAI_API_KEY (your OpenAI API key) to the environment
- Explode questions
- Some augmentation could also be done (I skipped it)
- Save dataset on HF
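The formatting step above could look roughly like this; the field names follow the public QASC schema, and the output layout mirrors the model-output example earlier (this is my sketch, not the project's exact code):

```python
def format_sample(record, topic):
    """Flatten one QASC record into the training-text layout shown above."""
    variants = " ".join(
        f"({choice['label'].lower()}) {choice['text']}"
        for choice in record["question"]["choices"]
    )
    return (
        f"topic: {topic}\n"
        f"question: {record['question']['stem']}\n"
        f"variants: {variants}\n"
        f"answer: {record['answerKey'].lower()}\n"
        f"context: {record.get('fact1', '')}"
    )
```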
The final size of the dataset is 40k samples.
dataset - 'under-tree/labeled-multiple-choice', the dataset I generated for this task
model - 'distilgpt2'
I tried different ways to accelerate training:
- PyTorch with devices
- PyTorch with accelerator
- Default Trainer with accelerator
- Default Trainer
- Trainer with change of batch_size
The last method (5) was the fastest.
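Method (5) amounts to running the stock Hugging Face Trainer with an adjusted per-device batch size; the concrete values below are illustrative assumptions, not the project's tuned settings:

```python
# Hyperparameters for method (5); the concrete numbers are my guesses.
training_config = {
    "output_dir": "qg-distilgpt2",
    "per_device_train_batch_size": 32,  # the knob that method (5) changes
    "num_train_epochs": 3,
}

# With transformers installed, this config plugs straight into Trainer:
# from transformers import Trainer, TrainingArguments
# trainer = Trainer(model=model,
#                   args=TrainingArguments(**training_config),
#                   train_dataset=tokenized_train)
# trainer.train()
```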
Perplexity after fine-tuning - 3.07
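For reference, perplexity is just the exponential of the mean cross-entropy loss, so the reported 3.07 corresponds to an eval loss of about 1.12:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(mean_nll)

# Recovering the eval loss implied by the reported perplexity of 3.07:
eval_loss = math.log(3.07)  # ≈ 1.12
```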
- I saved model on HF Hub
- I created an inference pipeline (please take a look at the inference code)
I ran inference on a GPU and tried different text-generation parameters:
- max_length
- num_beams
- temperature
- repetition_penalty
- do_sample
- top_k
- top_p
The final inference consists of several forward passes, truncation of the text, and prompt insertion. I think it works great!
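A hedged sketch of how the knobs listed above combine in a single `generate` call; the concrete values are illustrative assumptions, not the project's tuned settings:

```python
# Decoding settings covering every parameter listed above;
# the numbers are my assumptions, not the tuned values.
gen_kwargs = {
    "max_length": 128,
    "num_beams": 4,
    "temperature": 0.8,
    "repetition_penalty": 1.2,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
}

# With the fine-tuned model and tokenizer loaded, one forward pass is:
# output_ids = model.generate(input_ids, **gen_kwargs)
# text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```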