- Requirements
- Configuration
- Data Structure
- Extraction Pipeline
- Evaluation
- Python 3.11
- Poetry
- Git
- Clone the repository

  Use the following command to clone the repository:

  ```bash
  git clone git@github.com:worldbank/impactAI-extraction-paper.git
  ```
- Install Poetry

  Poetry is a tool for dependency management in Python. Install it with the following command:

  ```bash
  curl -sSL https://install.python-poetry.org | python -
  ```
- Install dependencies

  Use Poetry to install the project dependencies:

  ```bash
  poetry install
  ```
- Configure pre-commit

  Pre-commit is a tool that runs checks on your code before you commit it. It is configured in the `.pre-commit-config.yaml` file. To install it, run the following command:

  ```bash
  poetry run pre-commit install
  ```
- Set up environment variables

  Create a `.env` file at the root of the project and add the following environment variables:
  - `OPENAI_API_KEY`
  - `GOOGLE_API_KEY`

  Then activate the variables with:

  ```bash
  source .env
  ```
Most scripts in this repository are written to perform extraction and subsequent transforms on a batch of article PDFs.
As a suggestion, before running the extraction pipeline, place the PDF files for the scientific articles as a batch in a new subfolder of the `data/raw_pdfs` folder, e.g. `data/raw_pdfs/batch_01/`. The `data` folder should also contain an extraction subfolder, `data/extraction`, where the output of extraction scripts will be stored as batch subfolders, e.g. `data/extraction/tables/`. The last suggestion is to also include annotations in a subfolder such as `data/annotations/batch_01/`.
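As a minimal illustration (the folder names simply mirror the examples above), the suggested layout can be created with a few lines of Python:

```python
# Create the suggested batch folders; names mirror the examples above.
from pathlib import Path

for folder in [
    "data/raw_pdfs/batch_01",     # input PDFs for the batch
    "data/extraction/tables",     # extraction outputs
    "data/annotations/batch_01",  # annotations
]:
    Path(folder).mkdir(parents=True, exist_ok=True)
```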
The `parse_pdf.py` script implements a robust PDF processing pipeline that converts academic papers into clean markdown format. This pipeline has been extensively tested against various approaches and leverages parallel processing for optimal performance.
```bash
# Basic usage
python src/parse/parse_pdf.py

# Process with verbose logging
python src/parse/parse_pdf.py --verbose

# Process specific number of samples
python src/parse/parse_pdf.py --n_samples 5
```
Configure the pipeline in `src/parse/settings.py`:

```python
# Models
model_text = "gpt-4o-mini"  # For text processing
model_tables = "gpt-4o"  # For table processing

# Local paths
path_input = Path("/tmp/raw_pdfs")
path_output = Path("/tmp/processed_files")

# GCP bucket paths
raw_bucket = get_secret("raw-bucket")
processed_bucket = get_secret("processed-bucket")
```
- Document Loading
  - Parallel scanning of the input directory for PDF files
  - Configurable concurrency limits for optimal resource usage
  - Handles nested directory structures efficiently
- Parallel Docling Ingestion
  - Concurrent PDF processing using Docling's pipeline
  - Multi-threaded extraction of text, tables, and structural elements
  - Parallel generation of high-quality page images for table processing
- Content Separation
  - Divides content into two streams:
    - Main text content
    - Tables and their associated captions
  - Preserves document structure and relationships
- LLM Post-processing
  - Text Processing:
    - Uses GPT-4o-mini for an optimal performance/cost ratio
    - Cleans formatting, removes noise (headers, footers, page numbers)
    - Preserves academic structure and references
  - Table Processing:
    - Uses GPT-4o with vision capabilities
    - Verifies and corrects table structure
    - Ensures accuracy of numerical data
    - Preserves notes and statistical indicators
- Output Generation
  - Generates clean markdown files
  - Creates separate files for main text and tables
  - Saves processing metrics for analysis
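As a rough, hypothetical sketch of the concurrency pattern described above (not the repository's actual implementation), document loading with a configurable concurrency limit might look like this:

```python
# Hypothetical sketch: scan a directory tree for PDFs and process them
# concurrently under a configurable concurrency limit.
import asyncio
from pathlib import Path

async def process_pdf(path: Path) -> None:
    # Placeholder for Docling ingestion + LLM post-processing.
    print(f"processing {path}")

async def run(input_dir: Path, max_concurrency: int = 4) -> None:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: Path) -> None:
        async with semaphore:
            await process_pdf(path)

    # rglob handles nested directory structures.
    pdfs = sorted(input_dir.rglob("*.pdf"))
    await asyncio.gather(*(bounded(p) for p in pdfs))

if __name__ == "__main__":
    asyncio.run(run(Path("data/raw_pdfs")))
```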
To use this pipeline locally without Google Cloud Platform:
- Local Setup

  ```bash
  # Clone the repository
  git clone https://github.com/yourusername/impactAI-extraction-paper.git
  cd impactAI-extraction-paper

  # Install dependencies using Poetry
  poetry install
  ```
- Configure Local Settings

  Create a `.env` file in the project root:

  ```bash
  # .env
  OPENAI_API_KEY=your-openai-key  # If using OpenAI models
  GOOGLE_API_KEY=your-google-key  # If using Google models
  ```
- Run Locally

  ```bash
  # Process PDFs in local directory
  poetry run python src/parse/parse_pdf.py --verbose

  # Process specific number of PDFs
  poetry run python src/parse/parse_pdf.py --n_samples 5
  ```
- VM Configuration

  ```bash
  # Create VM with T4 GPU
  gcloud compute instances create pdf-parser-vm \
      --project=impactai-430615 \
      --zone=us-central1-a \
      --machine-type=n1-highcpu-16 \
      --accelerator="type=nvidia-tesla-t4,count=1" \
      --maintenance-policy=TERMINATE \
      --boot-disk-size=200GB \
      --image-family=pytorch-2-4-cu124-ubuntu-2204 \
      --image-project=deeplearning-platform-release \
      --metadata-from-file startup-script=src/parse/startup-script.sh
  ```
- Install Dependencies on the VM

  NB: this part can be tricky to automate, as the VM is created from a custom image. Be sure to install the dependencies on the VM before running the startup script.
  ```bash
  # Install system dependencies
  sudo apt-get update
  sudo apt-get install -y python3-pip python3-venv

  # Install Poetry
  curl -sSL https://install.python-poetry.org | python3 -

  # Copy startup scripts
  sudo cp process_and_shutdown.sh /home/agomberto/
  sudo chmod +x /home/agomberto/process_and_shutdown.sh
  ```
Overview

We have implemented an automated system to manage our PDF parsing VM efficiently using three components:
- Cloud Function (to check VM activity)
- Service Account (to handle permissions)
- Scheduled Jobs (to manage VM lifecycle)
1. Cloud Function Setup
This function automatically checks if the VM is being used and shuts it down if inactive.
```bash
# Deploy the monitoring function
gcloud functions deploy check-inactive-vm \
    --gen2 \
    --runtime python39 \
    --trigger-http \
    --entry-point check_inactive_vm \
    --source src/parse/cloud_functions \
    --region us-central1 \
    --service-account="scheduler-vm-manager@impactai-430615.iam.gserviceaccount.com" \
    --allow-unauthenticated
```
The function code (`src/parse/cloud_functions/main.py`) checks VM status and activity:

```python
import functions_framework

@functions_framework.http
def check_inactive_vm(request):
    """Monitors VM activity and manages shutdown if needed."""
    # Checks if VM is running
    # Monitors activity levels
    # Shuts down if inactive
```
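As an illustration only, a fleshed-out version of this skeleton might use the `google-cloud-compute` client; the `is_inactive` helper below is an assumption standing in for the real activity check, not the repository's actual code:

```python
# Hypothetical sketch using google-cloud-compute; is_inactive() stands in
# for whatever activity check the real function performs.
import functions_framework
from google.cloud import compute_v1

PROJECT, ZONE, VM_NAME = "impactai-430615", "us-central1-a", "pdf-parser-vm"

def is_inactive(instance: compute_v1.Instance) -> bool:
    # Placeholder activity check (e.g. based on CPU metrics or job logs).
    return True

@functions_framework.http
def check_inactive_vm(request):
    """Monitors VM activity and shuts the VM down if it looks idle."""
    client = compute_v1.InstancesClient()
    instance = client.get(project=PROJECT, zone=ZONE, instance=VM_NAME)
    if instance.status == "RUNNING" and is_inactive(instance):
        client.stop(project=PROJECT, zone=ZONE, instance=VM_NAME)
        return "stopped"
    return "no action"
```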
To view what the function is doing:

```bash
# Check the most recent activity logs
gcloud functions logs read check-inactive-vm --gen2 --region=us-central1 --limit=5
```
2. Service Account Setup
First, we create a special account that has permission to manage the VM:
```bash
# Create management account
gcloud iam service-accounts create scheduler-vm-manager
```
Then, give it the necessary permissions:
```bash
# Grant VM control permissions
gcloud projects add-iam-policy-binding impactai-430615 \
    --member="serviceAccount:scheduler-vm-manager@impactai-430615.iam.gserviceaccount.com" \
    --role="roles/compute.instanceAdmin.v1"
```
3. Automated Schedule Setup
We've created two automated schedules to manage the VM efficiently:
a. VM Startup Schedule (Every 30 Minutes)
```bash
gcloud scheduler jobs create http start-parsing-vm-30minutes \
    --schedule="*/30 * * * *" \
    --location=us-central1 \
    --http-method=POST \
    --uri="https://www.googleapis.com/compute/v1/projects/impactai-430615/zones/us-central1-a/instances/pdf-parser-vm/start" \
    --oauth-service-account-email="scheduler-vm-manager@impactai-430615.iam.gserviceaccount.com"
```
b. Activity Check Schedule (Every Hour)
```bash
gcloud scheduler jobs create http check-inactive-vm-hourly \
    --schedule="0 * * * *" \
    --uri="https://us-central1-impactai-430615.cloudfunctions.net/check-inactive-vm" \
    --http-method=POST \
    --location=us-central1 \
    --oidc-service-account-email="scheduler-vm-manager@impactai-430615.iam.gserviceaccount.com"
```
Managing Your Schedules
View all current schedules:

```bash
gcloud scheduler jobs list --location=us-central1
```

Update a schedule's timing:

```bash
gcloud scheduler jobs update http JOB_NAME \
    --schedule="NEW_SCHEDULE" \
    --location=us-central1
```

Remove a schedule:

```bash
gcloud scheduler jobs delete JOB_NAME --location=us-central1
```
How Everything Works Together
Our automated system works like a smart building manager:
- Every 30 minutes, it tries to start the VM (like turning on the lights)
- Every hour, it checks if anyone's using the VM (like checking if rooms are occupied)
- If someone needs the VM, the 30-minute start job will turn it back on
- If no one's using it, either the hourly check or the 2-hour shutdown will turn it off
This ensures we're only running (and paying for) the VM when it's actually needed, while making sure it's always available when someone needs to process PDFs.
The deployment uses:
- n1-highcpu-16 (~$0.424/hour)
- T4 GPU (~$0.35/hour)
- Total: ~$0.774/hour
Scheduled stops ensure cost-effective usage by running only when needed.
Our testing revealed that the optimal configuration uses:

- `gpt-4o-mini` for text post-processing
- `gpt-4o` for table post-processing

This combination provides the best balance of accuracy and cost-effectiveness.
We evaluated several alternative approaches:

- Zerox PDF processing
- OCR on PDF images (from scratch)
- PyMuPDF (with and without LLM)
- Adding YOLOv10 to extract tables from PDF images
Docling + LLM post-processing consistently outperformed these methods in terms of:
- Text extraction accuracy
- Table structure preservation
- Document formatting retention
- Processing speed
Current limitations that will be addressed in future iterations:

- Document Completeness
  - Occasional missing pages in Docling output
  - Solution: implement a page verification and recovery system
- Table Accuracy
  - Some instances of missing rows in complex tables
  - Improvement planned: enhanced table structure validation
- Numerical Sequences
  - Reduced accuracy with very long number sequences
  - Future enhancement: process the rows step by step in the prompt
- Processing Speed
  - Currently sequential processing of tables
  - Planned: improved parallelization for table processing
Despite these limitations, the current pipeline produces high-quality results suitable for most academic papers.
Use the following command to run the `main_table.py` script:

```bash
python src/extraction/get_tables/main_table.py --pdf_path <path_to_your_pdf> --output_folder <output_folder_path>
```

Replace `<path_to_your_pdf>` with the path to your input PDF file, and `<output_folder_path>` with the desired output folder where images and CSV files will be saved.
Example Command:
```bash
python src/extraction/get_tables/main_table.py --pdf_path data/pdf/A1.pdf --output_folder data/TableExtraction
```
Script Workflow

1. The script will process the specified PDF file and extract tables.
2. Extracted data will be saved as images (metadata) and in `.pkl` files in the specified output folder.
3. The output folder will be created automatically by the script if it does not exist.
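To sanity-check the output, you can unpickle one of the generated files; the internal schema of the `.pkl` files is an assumption here, so adapt this once you see the actual contents:

```python
# Hedged sketch: inspect the .pkl outputs produced by main_table.py.
# What each pickle contains is an assumption; print the type to find out.
import pickle
from pathlib import Path

output_folder = Path("data/TableExtraction")  # from the example command above
for pkl_file in sorted(output_folder.glob("*.pkl")):
    with open(pkl_file, "rb") as f:
        table = pickle.load(f)
    print(pkl_file.name, type(table))
```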
This script aims to get information about interventions and outcomes (their names and descriptions) from extracted tables.
To run the parsing with Google Gemini, run:

```bash
python src/extraction/get_io/get_io_tables.py --tables_folder <path_to_tables> --out_folder <path_to_extraction> --batch <name_of_batch>
```

With:

- `tables_folder`: the path to the input tables for the systematic reviews considered.
- `out_folder`: the path to the output folder where extractions will be saved as a CSV file.
- `batch`: the batch of PDFs being processed.

For example:

```bash
python src/extraction/get_io/extract_from_tables.py --tables_folder data/extraction/tables/batch_01/ --out_folder data/extraction/io_tables/ --batch batch_01
```

The output folder will contain CSV files corresponding to the input PDF files, each containing the interventions, outcomes, and descriptions for a given systematic review.
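As an illustrative check (the exact column names are an assumption, not confirmed by the repository), the resulting CSV files can be inspected with pandas:

```python
# Hedged sketch: peek at the extracted interventions/outcomes CSVs.
# Column names are an assumption; print them to see the real schema.
from pathlib import Path

import pandas as pd

out_folder = Path("data/extraction/io_tables")
for csv_file in sorted(out_folder.glob("*.csv")):
    df = pd.read_csv(csv_file)
    print(csv_file.name, list(df.columns), len(df))
```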
This script extracts key metadata from academic papers in PDF format, including:
- Title
- Year of Publication
- Authors
- Abstract
- Keywords
Run the metadata extraction script using:

```bash
python src/extract_metadata/extract_metadata.py
```

The script will:

- Process all PDF files in the `data/raw` directory in parallel
- Extract metadata using GPT-4
- Save results to `processed/metadata.json`
You can customize the extraction by modifying `src/extract_metadata/settings.py`:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Settings:
    path_folder: Path = Path("data/raw")  # Input PDF folder
    path_prompt: Path = Path("config/prompts/metadata-extraction.prompt")  # Prompt template
    path_output: Path = Path("processed/metadata.json")  # Output file
    temperature: float = 0.0  # Model temperature
    model: str = "gpt-4"  # OpenAI model
    max_tokens: int = 4096  # Max response tokens
    batch_size: int = 10  # Parallel processing batch size
```
The script generates a JSON file with the following structure:

```json
{
    "path/to/paper.pdf": {
        "filename": "paper.pdf",
        "metadata": {
            "title": "Paper Title",
            "year": "2023",
            "authors": "Author 1, Author 2",
            "abstract": "Paper abstract...",
            "keywords": "keyword1, keyword2, keyword3"
        }
    }
}
```
If a PDF cannot be processed, the output will include an error message:

```json
{
    "path/to/paper.pdf": {
        "filename": "paper.pdf",
        "error": "Error message details"
    }
}
```
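A small, hedged example of consuming this output: load the JSON and separate successful extractions from failures (paths mirror the defaults above):

```python
# Split the metadata extraction output into successes and failures.
import json

with open("processed/metadata.json") as f:
    results = json.load(f)

ok = {path: r for path, r in results.items() if "metadata" in r}
failed = {path: r for path, r in results.items() if "error" in r}
print(f"{len(ok)} papers parsed, {len(failed)} failed")
```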
Make sure you have:

- Set up your OpenAI API key in `.env`
- Installed all dependencies using Poetry
- PDF files in the input directory
This script uses GPT-4o to classify research papers as Randomized Controlled Trials (RCTs) or not using zero-shot learning.
After extracting metadata, run the classification script:

```bash
python src/rct_clf/zsl_classify.py
```

The script will:

- Load metadata from `data/processed/metadata.json`
- Process each paper using GPT-4o for RCT classification
- Save results to `data/processed/metadata_rct_classified.json`
Customize the classification by modifying `src/rct_clf/settings.py`:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ZSLSettings:
    path_prompt: Path = Path("config/prompts/RCT_ZSL.prompt")
    path_input: Path = Path("data/processed/metadata.json")
    path_output: Path = Path("data/processed/metadata_rct_classified.json")
    system_content: str = "You are an expert in economic research."
    temperature: float = 1.0
    model: str = "gpt-4o"
    max_tokens: int = 1024
    batch_size: int = 10
```
The script generates a JSON file that includes the original metadata plus the RCT classification:

```json
{
    "path/to/paper.pdf": {
        "filename": "paper.pdf",
        "metadata": {
            "title": "Paper Title",
            "abstract": "Paper abstract...",
            "keywords": "keyword1, keyword2"
        },
        "rct": "True"  // or "False"
    }
}
```
If classification fails, the output will include an error message:

```json
{
    "path/to/paper.pdf": {
        "filename": "paper.pdf",
        "metadata": {...},
        "error": "Error message details"
    }
}
```
This script uses GPT-4o to extract metadata and classify research papers as Randomized Controlled Trials (RCTs) or not using zero-shot learning.
Run the metadata extraction and RCT classification script using:

```bash
python src/rct_clf/zsl_from_pdf.py
```

The script will:

- Process all PDF files in the `data/raw` directory in parallel
- Extract metadata using GPT-4o
- Save results to `metadata_pdf_rct_classified.json`
You can customize the extraction by modifying `src/rct_clf/settings.py`:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class PDFZSLSettings:
    path_folder: Path = Path("data/raw")  # Input PDF folder
    path_prompt: Path = Path("config/prompts/RCT_metadata-extraction_ZSL.prompt")  # Prompt template
    path_output: Path = Path("data/processed/metadata_pdf_rct_classified.json")  # Output file
    system_content: str = "You are an expert that extracts metadata and classify whether the study is Randomized Controlled Trial (RCT) or not from academic papers."  # System message
    temperature: float = 0.0  # Model temperature
    model: str = "gpt-4o"  # OpenAI model
    max_tokens: int = 1024  # Max response tokens
    batch_size: int = 10  # Parallel processing batch size
```
The script generates a JSON file with the following structure:

```json
{
    "path/to/paper.pdf": {
        "filename": "paper.pdf",
        "metadata": {
            "title": "Paper Title",
            "year": "2023",
            "authors": "Author 1, Author 2",
            "abstract": "Paper abstract...",
            "keywords": "keyword1, keyword2, keyword3"
        },
        "rct": "True",  // or "False"
        "explanation": "text"
    }
}
```
If a PDF cannot be processed, the output will include an error message:

```json
{
    "path/to/paper.pdf": {
        "filename": "paper.pdf",
        "error": "Error message details"
    }
}
```
To evaluate the RCT classification, run the following command:

```bash
python src/rct_clf/evaluate.py
```

The script will:

- Load predictions from `data/processed/metadata_rct_classified.json`
- Load ground truth from `data/raw/RCT_GT.csv`
- Compute metrics
- Save results to `data/processed/ZSL_two_steps_metrics.json`
You can customize the evaluation by modifying `src/rct_clf/settings.py`:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EvaluationParams:
    path_preds: Path = Path("data/processed/metadata_rct_classified.json")
    path_true: Path = Path("data/raw/RCT_GT.csv")
    path_output: Path = Path("data/processed/ZSL_two_steps_metrics.json")
```
The script generates a JSON file with the following structure:

```json
{
    "accuracy": 92.72727272727272,
    "precision": 100.0,
    "recall": 89.74358974358975,
    "f1": 94.5945945945946
}
```
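For reference, here is a hedged sketch of how such metrics could be computed with scikit-learn; the ground-truth column names are assumptions, not the repository's actual schema:

```python
# Hypothetical sketch: compare predictions against the ground-truth CSV.
# Assumed CSV columns ("filename", "rct") are illustrative only.
import json

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

with open("data/processed/metadata_rct_classified.json") as f:
    preds = json.load(f)

truth = pd.read_csv("data/raw/RCT_GT.csv")
y_true = truth.set_index("filename")["rct"].astype(bool)
y_pred = pd.Series(
    {r["filename"]: r.get("rct") == "True" for r in preds.values()}
).reindex(y_true.index)

print({
    "accuracy": 100 * accuracy_score(y_true, y_pred),
    "precision": 100 * precision_score(y_true, y_pred),
    "recall": 100 * recall_score(y_true, y_pred),
    "f1": 100 * f1_score(y_true, y_pred),
})
```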
This script evaluates the extracted interventions and outcomes, for both mentions from the text and from tables, prior to merging. It should be run as follows:

```bash
python src/evaluation/evaluate_io.py --batch first_batch --separator \t --eval_type sim
```

The output folder will by default be `data/evaluation/scores/` and will contain CSV files with the evaluation scores, as well as pairwise similarities if the `save_similarities` argument was set to `True`.