This is a simple GPT implementation in Python. It is based on a Russian version of GPT-2.
The final dataset consists of 0.8M samples.
We used two large Russian text corpora: Yandex QA and the Diasum dataset. We tried the following techniques to prepare the dataset for our model:
- Form dialogues from sentences
- Make each sample consist of three parts: context, prompt, and answer (see the example and the serialization sketch below)
Example:
history: "Привет, как дела?"
speaker1: "Привет, все хорошо, а у тебя?"
speaker2: "Все хорошо, спасибо!"
In English:
history: "Hi, how are you?"
speaker1: "Hi, I'm fine, and you?"
speaker2: "I'm fine, thanks!"
The training pipeline is the usual one: we used the HuggingFace Transformers library to train our model.
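Below is a sketch of such a pipeline with the Transformers `Trainer`. The checkpoint name, data file path, and hyperparameters are assumptions for illustration, not necessarily the ones used here:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Assumed Russian GPT-2 checkpoint; substitute the one used in this repo.
checkpoint = "sberbank-ai/rugpt3small_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# GPT-2 tokenizers have no pad token by default; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# Assumes the prepared samples are stored one per line in a text file.
dataset = load_dataset("text", data_files={"train": "train_samples.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Causal LM objective: the collator derives labels from input_ids (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```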