
BU Spark Fall 2023 project repo for integrating LLMs with MAPLE



maple-testimony/ml-maple-bill-summarization

 
 


MAPLE (Bill Summarization, Tagging, Explanation)

In this project, we generate summaries and category tags for Massachusetts bills for the MAPLE platform. The goal is to simplify the legal language and content so that it is comprehensible to a broader audience (9th-grade comprehension level), by exploring different ML and LLM services.

This repository contains the full pipeline: taking bills from the Massachusetts legislature, generating summaries and category tags with context from the relevant Massachusetts General Law sections, creating a dashboard to display and save the generated texts, and deploying and integrating the result into the MAPLE platform.

Roadmap of Repository Directories

  • Documentation:
    Research.md: our research on large language models and evaluation methods we planned to use for this project.
    Documentation MAPLE.pdf: details how our model operates, for future use and improvement.

  • EDA: the notebook eda.ipynb covers scraping bills from the MAPLE Swagger API, building a dataframe to clean and process the data, and creating visualizations to analyze the dataset and explore its characteristics.

  • demoapp:
    demo_app_function.py: contains the code for generating summaries, categories and tags for selected bills. Documents (bill title + bill text + MGL sections text + MGL names + committee names) larger than 120K tokens are split into chunks and embedded before injection into the prompt. Documents smaller than 120K tokens are inserted directly into the prompt in full. The summary, category and tags are all generated using a single prompt.
    run_demo_app_12bills.py: tests on the top 12 bills from the MAPLE website. We extract information from the Massachusetts General Laws (MGL) to add context to the summaries of these bills. MGL sections text was scraped using extract_mgl_sections.py. Also contains code to generate bill categories and tags.
    run_demo_app.py: contains the code for the LLM (OpenAI) service and the webapp built with Streamlit. The webapp lets users search across all bills. MGL sections text has been extracted for all but ~1300 bills and is available in the 'Combined_MGL' column of all_bills_with_mgl.pq (currently hosted on Google Drive due to its large size).

    We currently use a vectorstore: when the prompt length exceeds a hardcoded context window size, large documents are split into chunks, embedded and stored, and chunks are then retrieved for injection into the prompt. Note, however, that vectorstore operations are fuzzy because they rely on similarity search.

    We currently use 'gpt-4-1106-preview' to generate the summaries, categories and tags of the bills.

    Files used in demo_app_function.py, run_demo_app_12bills.py, and run_demo_app.py:
    12_bills_with_more_sections.pq: file containing information on 12 bills with MGL
    all_bills_with_mgl.pq: file containing information on all bills with MGL (currently hosted on Google Drive)
    chapter_section_names.pq: file containing MGL chapter and section names
    committee_info.pq: file containing committee names and descriptions

    Other files: helper files imported by the Python app files above.

  • Prompts Engineering: prompts.md stores all prompts that we tested.

  • Tagging: contains the list of categories and tags.

  • Deployment: contains the link to our deployed Streamlit webapp.
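
The routing described under demoapp (insert small documents into the prompt whole, chunk and retrieve for large ones) can be illustrated with a minimal sketch. This is not the repository's code: the function names, the whitespace-based token estimate, and the bag-of-words cosine similarity are all assumptions standing in for a real tokenizer and for the vectorstore embeddings the project uses. Only the 120K threshold comes from the description above.

```python
import math
import re
from collections import Counter

# Threshold quoted in the roadmap; everything else here is illustrative.
MAX_PROMPT_TOKENS = 120_000


def estimate_tokens(text: str) -> int:
    """Rough token estimate using whitespace-separated words.
    Assumption: a real tokenizer would return different counts."""
    return len(text.split())


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not just a repeated tail.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]


def bag_of_words(text: str) -> Counter:
    """Hypothetical stand-in for an embedding: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_context(document: str, query: str, max_tokens: int = MAX_PROMPT_TOKENS,
                   top_k: int = 3, chunk_size: int = 200, overlap: int = 20) -> str:
    """Return the whole document if it fits, else the top_k most similar chunks."""
    if estimate_tokens(document) <= max_tokens:
        return document
    query_vec = bag_of_words(query)
    chunks = split_into_chunks(document, chunk_size, overlap)
    ranked = sorted(chunks,
                    key=lambda c: cosine_similarity(bag_of_words(c), query_vec),
                    reverse=True)
    return "\n\n".join(ranked[:top_k])
```

Because the large-document path keeps only the chunks that score highest against the query, relevant passages that share little vocabulary with the query can be dropped. This is the fuzziness that the vectorstore note above warns about.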

Ethical Implications

The dataset used for this project is fully open source and can be accessed through the Mass General Laws API.

Our team and MAPLE agree on including a disclaimer that this text is AI-generated.

Although we use open-source transformers with Vectara to evaluate hallucination, expert and human evaluation are still needed to maintain a trustworthy LLM system.

Resources and Citation

Team Members

Vy Nguyen - Email: [email protected]
Andy Yang - Email: [email protected]
Gauri Bhandarwar - Email: [email protected]
Weining Mai - Email: [email protected]


Languages

  • Jupyter Notebook 84.7%
  • Python 15.3%