Multi-choice question generator

Problem: Given a topic, generate a question with multiple choices and an answer.

Briefly about the project:

  1. I generated a dataset of 40k samples
  2. I trained a model based on the GPT-2 architecture
  3. Inference: I used several different techniques to make the generated content reasonable

Example of model output

Topic: "chemistry"
Model output:
"""
question: what can be used to determine the age of an organism
variants: (a) cell division (b) survival (c) rapid expansion (d) the rapid growth of a species (e) it needs them (f) genetic
answer: f
context: genetic information is used for determining the ages of organisms
"""

Dataset creation

Here I created the dataset from the QASC dataset, which can be downloaded here.
I also included the dataset, so you don't need to download it yourself.

A bit later I found out that this dataset also exists on the Hugging Face Hub.

Note: I used only the dev.jsonl and train.jsonl files because test.jsonl doesn't have answers.
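
For reference, here is a minimal sketch of loading only the splits that have answers, assuming the Hugging Face mirror of QASC (the dataset id `allenai/qasc` is an assumption; the repo works with the raw jsonl files instead):

```python
from datasets import load_dataset

# Assumption: QASC is mirrored on the Hugging Face Hub as "allenai/qasc".
# The test split has no answer keys, so it is skipped (as noted above).
qasc = load_dataset("allenai/qasc")
usable = {name: split for name, split in qasc.items() if name != "test"}

print({name: len(split) for name, split in usable.items()})
```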

Creation

  1. Process the data into a convenient format
  2. Use the OpenAI API to create topics for each question (sketched after this list)
    • Add OPENAI_API_KEY to env - your OpenAI API key
  3. Explode questions
  4. Some augmentation could be done (I skipped it)
  5. Save the dataset on the Hugging Face Hub

The final dataset contains 40k samples.
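
A minimal sketch of the topic-labeling step (2) and the upload step (5), using the OpenAI Python client (>=1.0). The prompt wording and the `gpt-3.5-turbo` model choice are assumptions, not necessarily what the repo used:

```python
import os
from openai import OpenAI
from datasets import Dataset

# The client reads OPENAI_API_KEY from the environment, as step 2 requires.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def label_topic(question: str) -> str:
    """Ask the API for a one-word topic label; the prompt text is an assumption."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the README does not name one
        messages=[{
            "role": "user",
            "content": f"Give a one-word topic for this question: {question}",
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Toy example; the real pipeline runs over all processed QASC questions.
rows = [{"question": "what can be used to determine the age of an organism"}]
for row in rows:
    row["topic"] = label_topic(row["question"])

# Step 5: push the labeled data to the Hub (requires `huggingface-cli login`).
Dataset.from_list(rows).push_to_hub("under-tree/labeled-multiple-choice")
```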

Model training

Fine-tuning a pretrained model and inference

Download the pretrained model and dataset

dataset - 'under-tree/labeled-multiple-choice', the dataset I generated for this task

model - 'distilgpt2'
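
Loading both is a one-liner each with the `transformers` and `datasets` libraries (a minimal sketch; the two ids come straight from this README):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("under-tree/labeled-multiple-choice")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# GPT-2 has no pad token by default; reuse EOS so batching works.
tokenizer.pad_token = tokenizer.eos_token
```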

Fine-tuning the model

I tried different ways to accelerate the training:

  1. PyTorch with manual device placement
  2. PyTorch with accelerator
  3. Default Trainer with accelerator
  4. Default Trainer
  5. Trainer with a changed batch_size

The last method (5) is the fastest; a sketch follows below.

Perplexity after fine-tuning - 3.07
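
A minimal sketch of variant (5), a plain `Trainer` with a larger batch size. The hyperparameters and the `tokenized` splits are assumptions; perplexity is recovered as exp of the evaluation loss:

```python
import math
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Assumed hyperparameters; the README only says the batch size was changed.
args = TrainingArguments(
    output_dir="multichoice-distilgpt2",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],       # tokenized splits assumed prepared
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Perplexity is exp(cross-entropy loss); the README reports 3.07 after fine-tuning.
eval_loss = trainer.evaluate()["eval_loss"]
print(f"perplexity = {math.exp(eval_loss):.2f}")
```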

Inference

  1. I saved the model on the HF Hub
  2. I created an inference pipeline (please take a look at inference)

I ran inference on a GPU and tried different parameters for text generation:

  • max_length
  • num_beams
  • temperature
  • repetition_penalty
  • do_sample
  • top_k
  • top_p

The final inference consists of several forward passes, text truncation, and prompt prepending; a sketch follows below. I think it works great!
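
A minimal sketch of one generation pass using the parameters listed above. The prompt format mirrors the example output at the top, but the exact parameter values and the truncation rule are assumptions:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = 'topic: "chemistry"\nquestion:'  # format assumed from the example output
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,
    max_length=128,
    num_beams=4,
    temperature=0.9,
    repetition_penalty=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Assumed post-processing: truncate everything after the "context:" line,
# since the model may keep generating past the answer block.
print(text)
```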
