Problem: Given a topic, generate a question with multiple choices and an answer.
Briefly about the project:
- I generated a dataset of 40k samples
- I trained a model based on the GPT-2 architecture
- Inference: I used a number of different techniques to produce reasonable generated content
Topic: "chemistry"
Model output:
"""
question: what can be used to determine the age of an organism
variants: (a) cell division (b) survival (c) rapid expansion (d) the rapid growth of a species (e) it needs them (f) genetic
answer: f
context: genetic information is used for determining the ages of organisms
"""
I created the dataset from the QASC dataset. It can be downloaded here.
Actually, I attached the dataset so you don't need to download it.
A bit later I found out that this dataset also exists on HF.
Note: I used only the dev.jsonl and train.jsonl files, because test.jsonl doesn't have answers.
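As a sketch, the two usable splits can be read line by line, since each QASC line is a standalone JSON object (the helper name below is mine, not from the project):

```python
import json

def load_qasc_split(path):
    """Read one QASC split: one JSON object per line (.jsonl)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# test.jsonl is skipped because its records lack the answer field.
# samples = load_qasc_split("train.jsonl") + load_qasc_split("dev.jsonl")
```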
- Process the data and put it into a convenient format
- Use the GPT-2 API to create a topic for each question
- Add OPENAI_API_KEY (your OpenAI API key) to the environment
- Explode questions
- Some augmentation could also be done (I skipped it)
- Save dataset on HF
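The formatting step above could look roughly like this; the field names follow the public QASC schema, and the output layout mirrors the model-output example earlier (this is my sketch, not the project's exact code):

```python
def format_sample(record, topic):
    """Flatten one QASC record into the training-text layout shown above."""
    variants = " ".join(
        f"({choice['label'].lower()}) {choice['text']}"
        for choice in record["question"]["choices"]
    )
    return (
        f"topic: {topic}\n"
        f"question: {record['question']['stem']}\n"
        f"variants: {variants}\n"
        f"answer: {record['answerKey'].lower()}\n"
        f"context: {record.get('fact1', '')}"
    )
```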
The final size of the dataset is 40k samples.
dataset - 'under-tree/labeled-multiple-choice', the dataset I generated for this task
model - 'distilgpt2'
I tried different ways to accelerate training:
- PyTorch with devices
- PyTorch with accelerator
- Default Trainer with accelerator
- Default Trainer
- Trainer with change of batch_size
The last method (5) was the fastest.
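Method (5) amounts to running the stock Hugging Face Trainer with an adjusted per-device batch size; the concrete values below are illustrative assumptions, not the project's tuned settings:

```python
# Hyperparameters for method (5); the concrete numbers are my guesses.
training_config = {
    "output_dir": "qg-distilgpt2",
    "per_device_train_batch_size": 32,  # the knob that method (5) changes
    "num_train_epochs": 3,
}

# With transformers installed, this config plugs straight into Trainer:
# from transformers import Trainer, TrainingArguments
# trainer = Trainer(model=model,
#                   args=TrainingArguments(**training_config),
#                   train_dataset=tokenized_train)
# trainer.train()
```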
Perplexity after fine-tuning - 3.07
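For reference, perplexity is just the exponential of the mean cross-entropy loss, so the reported 3.07 corresponds to an eval loss of about 1.12:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(mean_nll)

# Recovering the eval loss implied by the reported perplexity of 3.07:
eval_loss = math.log(3.07)  # ≈ 1.12
```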
- I saved model on HF Hub
- I created an inference pipeline (please take a look at the inference code)
I ran inference on a GPU and tried different text-generation parameters:
- max_length
- num_beams
- temperature
- repetition_penalty
- do_sample
- top_k
- top_p
The final inference consists of several forward passes, truncation of the text, and prompt insertion. I think it works great!
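A hedged sketch of how the knobs listed above combine in a single `generate` call; the concrete values are illustrative assumptions, not the project's tuned settings:

```python
# Decoding settings covering every parameter listed above;
# the numbers are my assumptions, not the tuned values.
gen_kwargs = {
    "max_length": 128,
    "num_beams": 4,
    "temperature": 0.8,
    "repetition_penalty": 1.2,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
}

# With the fine-tuned model and tokenizer loaded, one forward pass is:
# output_ids = model.generate(input_ids, **gen_kwargs)
# text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```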