Skip to content

Latest commit

 

History

History
51 lines (35 loc) · 6.18 KB

File metadata and controls

51 lines (35 loc) · 6.18 KB

Automatic Quesion Answering of City Council Questions

The city council can ask the municipal executive questions to clarify certain subjects. Answering those questions is a time-consuming and complex process. This project aims to aid the task by experimenting with different NLP modeling approaches and comparing the results regarding accuracy and execution times.

Therefore, the project aims to answer the following research question: To what extent can state-of-the art NLP models for (i) information retrieval and (ii) text generation aid the process of answering Amsterdam city council questions?

The planned implementation consists of two main steps: (1) retrieving: rank the answer-relevant documents from the websites (2) generating: create an answer based on the most relevant documents


Published Long Form Question Answering (LFQA) Dataset in Dutch

You can find the published LFFQA dataset with questions, posed by the Amsterdam city council and answers by the Amsterdam municipal executive in the dates ranging from 2013 till 2022 at: questions

Project Folder Structure

This project has the following folder structure:

  1. notebooks: Folder containing the different stages of the pipeline:

    1. 0_exploratory_data_analysis: Notebooks containing graphs and descriptive statistics of the data

      • questions_answers_EDA: Explores the data in terms of the questions and the answers, their length, common words, common topics
      • URLs_EDA: Explores the most commonly referneced URLs in the municipal answers
    2. 1_supporting_documents_collection: Folder of notebooks containing supporting documents. You can collect the supporting documents from amsterdam.nl either all at once through 1_1_collect_ams_at_once or individually through 1_2_1_collect_ams_subpages_individually.ipynb

      • 1_1_collect_ams_at_once: Collects all subpages and all nested subpages under amsterdam.nl under the section 'onderwepen'
      • 1_2_1_collect_ams_subpages_individually Collects all subpages and all nested subpages under amsterdam.nl under the section 'onderwepen' individually (section by section)
      • 3_collect_update_URLs_from_references: This notebook needs to be ran -> it collects all URLs that are referenced in the municipal executive's answers but more importantly it updates the URLs, since some URLs have an outdated version in the answers. It saves the questions with updated URLs 'questions_updated_url.csv'. It also saves a file with questions containing only amsterdam.nl URLs which we use for ranking evaluation and a test set for text generation.
    3. 2_ranking_text_generation_pipeline: Notebooks containing different main steps and preprocessing steps of the pipeline

      • 0_get_amsterdamnl_only_questions: Extracts only the questions which in their answers refer to amsterdam.nl and saves them in the file amsterdam_questions.csv. This is going to be our test set. Also we will use it to evaluate the ranking step.
      • 1_ranking_comparison_tfidf_random_bm25: In this notebook we compare the three ranking strategies: TF-IDF, BM25 and Random sampling (TF-IDF resulting in being the best one).
      • 2_1_ranking_all_questions: Ranking all questions with the most accurate strategy (tfidf) and saving them in a .pickle file
      • 2_2_ranking_all_questions_nonfact_filtering: Filtering out non-factual questions and ranking all of them with the most accurate strategy (tfidf) and saving them in a .pickle file
      • 3_1_get_train_test_set: Filtering out amsterdam.nl references from the data, which results in our training sample
      • 3_2_get_train_test_set_nonfact_filtered: Filtering out amsterdam.nl references from the data - which is cleaned from non-factual questions, which results in our training sample
      • 4_train_mT5_models: We use this notebook to fine-tine mT5 to different tasks. We change the input format according to the task we fine-tune the model on
      • 5_compare_mT5_finetuned_models: We evaluate the results from all fine-tuned models
    4. 3_ideas_future_work_not_included_thesis: Notebooks containing different implementations and ideas that were not implemented in the final thesis, however, they might be useful for future work. (They are still work in progress.)

  2. src: Folder for source files specific to this project

  3. src_clean: Folder for source files specific to this project

  4. media: Folder containing media files (icons, video)