Question-Answer Generation Pipeline

This project generates questions and answers from PDF lecture slides using different AI models.

It is also a supplementary material for the following conference paper:

Ivo Lodovico Molina, Valdemar Švábenský, Tsubasa Minematsu, Li Chen, Fumiya Okubo, and Atsushi Shimada.
Comparison of Large Language Models for Generating Contextually Relevant Questions.
In 19th European Conference on Technology Enhanced Learning (ECTEL 2024).
https://link.springer.com/chapter/10.1007/978-3-031-72312-4_18

Preprint: https://arxiv.org/abs/2407.20578

How to cite

If you use or build upon the materials, please use the BibTeX entry below to cite the original paper (not only this web link).

@inproceedings{Molina2024comparison,
    author    = {Lodovico Molina, Ivo and \v{S}v\'{a}bensk\'{y}, Valdemar and Minematsu, Tsubasa and Chen, Li and Okubo, Fumiya and Shimada, Atsushi},
    title     = {{Comparison of Large Language Models for Generating Contextually Relevant Questions}},
    booktitle = {Proceedings of the 19th European Conference on Technology Enhanced Learning},
    series    = {ECTEL '24},
    editor    = {Mello, Rafael Ferreira and Rummel, Nikol and Jivet, Ioana and Pishtari, Gerti and Ruipérez Valiente, José A.},
    location  = {Krems, Austria},
    publisher = {Springer},
    address   = {Cham, Switzerland},
    year      = {2024},
    numpages  = {6},
    isbn      = {978-3-031-72312-4},
    url       = {https://doi.org/10.1007/978-3-031-72312-4_18},
    doi       = {10.1007/978-3-031-72312-4_18}
}

Overview

The workflow follows these steps:

Extract text from PDF slides
Generate answers from slide text using Llama 2
Generate questions for each answer using Llama 2
Generate additional questions using GPT-3.5 Turbo
Generate additional questions using Flan-T5-XXL
Save question-answer pairs to JSON file

The code provides examples for generating questions from a set of sample lecture slides on Pattern Recognition. (These slides are not included in this repository.)

Setup

Requirements

Python 3.10+
PyTorch
Transformers
OpenAI API key
Hugging Face auth token

Installation

Clone the repo:

git clone https://github.com/limu-research/qa-generator-pipeline-aied2024.git
cd qa-generator-pipeline-aied2024

Install requirements:

pip install -r requirements.txt

Usage

Run the sections of the notebook workflow.ipynb in order:

Llama 2 13B-Chat Generation - Extract text from PDFs, generates the list of answers and the first set of question generated by Llama 2 13B-Chat model, and it saves into a JSON file
GPT 3.5 Turbo Generation - Reads the answers from the JSON file, generates questions for each one using the GPT 3.5 Turbo model and adds the new questions into the JSON file
Flan T5 XXL Generation - Reads the answers from the JSON file, generates questions for each one using the Flan T5 XXL model and adds the new questions into the JSON file

Set the OPENAI_API_KEY and HF_AUTH_TOKEN environment variables before running.

Code Structure

pdf_slides/ - Folder containing sample PDF slides
workflow.ipynb - Notebook to extract text from PDFs, generate Q&A pairs with the LLMs
qa_pairs.json - JSON file containing generated question-answer pairs
requirements.txt - Python requirements

Models

Llama 2 13B

Llama 2 is used to generate answers and initial questions from the PDF slide text.

GPT-3.5 Turbo

GPT-3.5 Turbo is used to generate additional questions for each answer.

Flan-T5-XXL

Flan-T5-XXL also generates additional questions for each answer.

Results

The output is a JSON file qa_pairs.json containing the generated question-answer pairs.

See also the /figures folder for the analysis of the answer distributions for the paper.

Contributing

Pull requests are welcome! Feel free to open an issue for any enhancements or bugs.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
figures		figures
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
workflow.ipynb		workflow.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Question-Answer Generation Pipeline

How to cite

Contents

Overview

Setup

Requirements

Installation

Usage

Code Structure

Models

Llama 2 13B

GPT-3.5 Turbo

Flan-T5-XXL

Results

Contributing

About

Releases

Packages

Contributors 2

Languages

License

limu-research/2024-ectel-qg

Folders and files

Latest commit

History

Repository files navigation

Question-Answer Generation Pipeline

How to cite

Contents

Overview

Setup

Requirements

Installation

Usage

Code Structure

Models

Llama 2 13B

GPT-3.5 Turbo

Flan-T5-XXL

Results

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages