This project focuses on two main objectives:
- Fine-tuning a Large Language Model (LLM) for summarizing legal documents.
- Implementing a Retrieval-Augmented Generation (RAG) pipeline to automatically retrieve and summarize legal documents based on user-provided details.
The project leverages the Llama-2-7b model, fine-tuned using LoRA (Low-Rank Adaptation), and integrates a RAG pipeline for efficient document retrieval and summarization.
## Table of Contents

- Project Overview
- Setup Instructions
- Dataset Preparation
- Fine-Tuning the LLM
- RAG Pipeline
- Running the Project
- Future Work
- References
## Project Overview

- The project fine-tunes the Llama-2-7b model using LoRA to adapt it for summarizing legal documents.
- The model is trained on a dataset of legal judgments and their corresponding summaries.
- The training process uses the `SFTTrainer` from the `trl` library, which simplifies fine-tuning with LoRA.
- The RAG pipeline retrieves relevant legal documents based on user queries (e.g., case names or details).
- It uses FAISS for efficient similarity search and TF-IDF for document vectorization.
- The retrieved documents are then summarized using the fine-tuned LLM.
## Setup Instructions

Create a conda environment and install the base requirements:

```bash
conda create --name legal_assistant python=3.10
conda activate legal_assistant
pip install -r requirements.txt
```
- Visit the Llama 2 Hugging Face page and request access to the model.
- Once approved, log in to Hugging Face:

```bash
huggingface-cli login
```

- Install the remaining dependencies:

```bash
pip install sentencepiece datasets trl bitsandbytes faiss-cpu
```
## Dataset Preparation

- Download the dataset from Zenodo.
- Extract the dataset into the `legal-llm-project/datasets` directory.
Run the preprocessing script to prepare the dataset for training:

```bash
python src/data_preprocessing.py
```
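The script's internals are not shown here; the following is a minimal sketch of what the preprocessing step might look like, assuming the extracted dataset provides judgment/summary pairs as JSON files. The field names (`judgement`, `summary`) and paths are hypothetical, not the project's actual schema:

```python
# Hypothetical preprocessing sketch: pair each judgment with its summary and
# emit the instruction/input/response prompt format used for fine-tuning.
# Field names and paths below are assumptions, not the project's real schema.
import json
from pathlib import Path

PROMPT_TEMPLATE = (
    "### Instruction: Summarize the following legal text.\n"
    "### Input: {legal_text}\n"
    "### Response: {summary}"
)

def build_training_records(raw_dir: Path):
    """Yield formatted prompt records from judgment/summary JSON pairs."""
    for path in sorted(raw_dir.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        yield {"text": PROMPT_TEMPLATE.format(
            legal_text=record["judgement"], summary=record["summary"])}

if __name__ == "__main__":
    records = list(build_training_records(Path("datasets/raw")))
    Path("datasets/train.jsonl").write_text(
        "\n".join(json.dumps(r) for r in records), encoding="utf-8")
```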
## Fine-Tuning the LLM

- LoRA is used to fine-tune the Llama-2-7b model with a low-rank adaptation approach.
- The configuration includes parameters such as `lora_alpha`, `lora_dropout`, and `r` (rank).
- The `SFTTrainer` from the `trl` library is used for fine-tuning (a full sketch follows the prompt template below).
- The dataset is formatted with clear distinctions between instruction, input, and response:

```
### Instruction: Summarize the following legal text.
### Input: {legal_text}
### Response: {summary}
```
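The repository's actual hyperparameters live in `src/fine_tune.py`; the following is an illustrative sketch only. The `r`, `lora_alpha`, `lora_dropout`, and training values are assumptions, and the `SFTTrainer` keyword arguments (`dataset_text_field`, `max_seq_length`, `tokenizer`) follow older `trl` releases, so adjust them to match your installed version:

```python
# Illustrative fine-tuning sketch: 4-bit base model + LoRA + SFTTrainer.
# Hyperparameter values are assumptions, not the project's exact settings.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

base_model = "meta-llama/Llama-2-7b-hf"

# Load the base model in 4-bit (bitsandbytes) so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration: r (rank), lora_alpha, and lora_dropout as listed above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Assumes the preprocessed JSONL file with a "text" column holding the prompt.
dataset = load_dataset("json", data_files="datasets/train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
```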
After training, the fine-tuned model is saved for inference:

```python
model.save_pretrained("../fine_tuned_lora_model")
tokenizer.save_pretrained("../fine_tuned_lora_model")
```
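Because LoRA saves only the adapter weights, inference reloads the base model and attaches the adapter with `peft`. A minimal sketch, assuming the save path above (the prompt placeholder `{legal_text}` must be replaced with the document to summarize):

```python
# Minimal inference sketch (an assumed workflow, not the project's script):
# reload the base model, attach the saved LoRA adapter, and generate a summary.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "../fine_tuned_lora_model")
tokenizer = AutoTokenizer.from_pretrained("../fine_tuned_lora_model")

prompt = ("### Instruction: Summarize the following legal text.\n"
          "### Input: {legal_text}\n"   # placeholder: insert the document here
          "### Response:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```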
## RAG Pipeline

- The pipeline uses FAISS for efficient similarity search.
- Documents are vectorized using TF-IDF for retrieval (a retrieval sketch follows the prompt template below).
- Retrieved documents are summarized using the fine-tuned LLM.
- The prompt format ensures the model knows where to start the response:
```
### Instruction: Summarize the following legal text.
### Input: {retrieved_document}
### Response: {generated_summary}
```
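A minimal sketch of the TF-IDF + FAISS retrieval step, assuming documents are held in memory as plain strings; the project's actual implementation lives in `src/rag_pipeline.py` and may differ:

```python
# Sketch of TF-IDF vectorization + FAISS similarity search (assumed approach).
import faiss
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Judgment text of case A ...",
    "Judgment text of case B ...",
]

# TF-IDF vectors; FAISS requires dense float32 arrays.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents).toarray().astype("float32")
faiss.normalize_L2(doc_vectors)  # cosine similarity via inner product

index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    k = min(k, index.ntotal)
    q = vectorizer.transform([query]).toarray().astype("float32")
    faiss.normalize_L2(q)
    _, idx = index.search(q, k)
    return [documents[i] for i in idx[0]]

# Each retrieved document is then passed to the fine-tuned LLM using the
# prompt template shown above.
top_docs = retrieve("case A details")
```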
## Running the Project

Run the fine-tuning script:

```bash
python src/fine_tune.py
```

Then run the RAG pipeline for document retrieval and summarization:

```bash
python src/rag_pipeline.py
```
## Future Work

- Increase the token limit: The current model supports a context window of up to 4096 tokens. Future work could explore extending this limit to handle longer documents.
- Expand to a UK dataset: Adapt the model for summarizing UK legal documents, which are typically longer and more complex.
- Optimize retrieval: Improve the RAG pipeline for faster and more accurate document retrieval.
## References

- Llama 2 Documentation
- LoRA Fine-Tuning with AMD ROCm
- SFTTrainer Documentation
- 4-bit Quantization with Bitsandbytes
- Fine-Tuning LLMs with Domain Knowledge
This project provides a robust framework for fine-tuning LLMs for legal document summarization and integrating them into a RAG pipeline for efficient retrieval and generation.