Feat: Add pdfminer parameters configuration #3918

Merged: 12 commits, Feb 17, 2025
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,12 @@
## 0.16.21-dev4
## 0.16.21-dev5

### Enhancements
- **Use password** to load PDF with all modes

- **Use vectorized logic to merge inferred and extracted layouts**. Uses the new `LayoutElements` data structure and NumPy to refactor the layout merging logic, improving compute performance and making the logic clearer.

- **Add pdfminer configuration**. pdfminer can now be configured via the `pdfminer_line_overlap`, `pdfminer_word_margin`, `pdfminer_line_margin`, and `pdfminer_char_margin` parameters of the partition method.

### Features

### Fixes
20 changes: 20 additions & 0 deletions test_unstructured/partition/pdf_image/test_pdfminer_processing.py
@@ -1,5 +1,8 @@
from unittest.mock import patch

import numpy as np
import pytest
from pdfminer.layout import LAParams
from PIL import Image
from unstructured_inference.constants import Source as InferenceSource
from unstructured_inference.inference.elements import (
@@ -11,6 +14,7 @@
from unstructured_inference.inference.layout import DocumentLayout, LayoutElement, PageLayout

from test_unstructured.unit_utils import example_doc_path
from unstructured.partition.auto import partition
from unstructured.partition.pdf_image.pdfminer_processing import (
    _validate_bbox,
    aggregate_embedded_text_by_block,
@@ -242,3 +246,19 @@ def test_process_file_with_pdfminer():
    assert len(layout)
    assert "LayoutParser: A Unified Toolkit for Deep\n" in layout[0].texts
    assert links[0][0]["url"] == "https://layout-parser.github.io"


@patch("unstructured.partition.pdf_image.pdfminer_utils.LAParams", return_value=LAParams())
def test_laprams_are_passed_from_partition_to_pdfminer(pdfminer_mock):
    partition(
        filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"),
        pdfminer_line_margin=1.123,
        pdfminer_char_margin=None,
        pdfminer_line_overlap=0.0123,
        pdfminer_word_margin=3.21,
    )
    # pdfminer_char_margin=None is expected to be dropped rather than forwarded to LAParams.
    assert pdfminer_mock.call_args.kwargs == {
        "line_margin": 1.123,
        "line_overlap": 0.0123,
        "word_margin": 3.21,
    }
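
The test above implies that any `pdfminer_*` parameter left as `None` is not forwarded to `LAParams`, so pdfminer's own default would apply for that setting. A minimal sketch of that filtering step follows. The helper name `build_laparams_kwargs` and the prefix-stripping approach are assumptions made for illustration; they are not taken from this PR's implementation.

```python
from typing import Any, Dict

# Prefix used by the partition-level parameters named in the CHANGELOG entry.
PDFMINER_PREFIX = "pdfminer_"


def build_laparams_kwargs(**partition_kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: map pdfminer_* partition kwargs to LAParams kwargs.

    Keys without the pdfminer_ prefix are ignored, and values that are None
    are dropped so the receiving constructor can fall back to its defaults.
    """
    laparams_kwargs: Dict[str, Any] = {}
    for name, value in partition_kwargs.items():
        if not name.startswith(PDFMINER_PREFIX):
            continue  # not a pdfminer setting
        if value is None:
            continue  # unset: do not forward
        laparams_kwargs[name[len(PDFMINER_PREFIX):]] = value
    return laparams_kwargs


if __name__ == "__main__":
    print(
        build_laparams_kwargs(
            pdfminer_line_margin=1.123,
            pdfminer_char_margin=None,
            pdfminer_line_overlap=0.0123,
            pdfminer_word_margin=3.21,
        )
    )
```

Dropping `None` values instead of forwarding them keeps the caller from overriding a library default with an explicit `None`, which matches the kwargs the mock is expected to receive in the test.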
@@ -513,7 +513,7 @@
"type": "Title"
},
{
"element_id": "be270e13c935334fa3b17b13066d639b",
"element_id": "9764a7d0d48e56e28ae267d6fe521036",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -522,7 +522,7 @@
],
"page_number": 2
},
"text": "The results of the experiment are presented in this session. The results obtained from weight loss method for stainless steel Type 316 immersed in 0.5 M H2SO4 solution in the absence and presence of different concentrations of egg shell powder (ES) are presented in Figs. 1–3 respectively. It can be seen clearly from these Figures that the efficiency of egg shell powder increase with the inhibitor con- centration, The increase in its efficiency could be as a result of increase in the constituent molecule",
"text": "The results of the experiment are presented in this session. The results obtained from weight loss method for stainless steel Type 316 immersed in 0.5 M H2SO4 solution in the absence and presence of different concentrations of egg shell powder (ES) are presented in Figs.1–3 respectively. It can be seen clearly from these Figures that the efficiency of egg shell powder increase with the inhibitor con- centration, The increase in its efficiency could be as a result of increase in the constituent molecule",
"type": "NarrativeText"
},
{
@@ -465,7 +465,7 @@
"type": "Title"
},
{
"element_id": "0cc9334df550d1730f2d468941a38225",
"element_id": "02c4df0e110486afd2bd74245e7d93d9",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -474,14 +474,14 @@
],
"links": [
{
"start_index": 386,
"start_index": 383,
"text": "https :// orlib . uqcloud . net /",
"url": "https://orlib.uqcloud.net/"
}
],
"page_number": 2
},
"text": "Subject area Operations research More specific subject area Vehicle scheduling Type of data Tables, text files How data were acquired Artificially generated by a C þ þ program on Intels Xeons CPU E5– 2670 v2 with Linux operating system. Data format Raw Experimental factors Sixty randomly generated instances of the MDVSP with the number of depots in (8, 12, 16) and the number of trips in (1500, 2000, 2500, 3000) Experimental features Randomly generated instances Data source location IITB-Monash Research Academy, IIT Bombay, Powai, Mumbai, India. Data accessibility Data can be downloaded from https://orlib.uqcloud.net/ Related research article Kulkarni, S., Krishnamoorthy, M., Ranade, A., Ernst, A.T. and Patil, R., 2018. A new formulation and a column generation-based heuristic for the multiple depot vehicle scheduling problem. Transportation Research Part B: Methodological, 118, pp. 457–487 [3].",
"text": "Subject area Operations research More specific subject area Vehicle scheduling Type of data Tables, text files How data were acquired Artificially generated by a þ program on Intels Xeons CPU E5– 2670 v2 with Linux operating system. Data format Raw Experimental factors Sixty randomly generated instances of the MDVSP with the number of depots in (8,12,16) and the number of trips in (1500, 2000, 2500, 3000) Experimental features Randomly generated instances Data source location IITB-Monash Research Academy, IIT Bombay, Powai, Mumbai, India. Data accessibility Data can be downloaded from https://orlib.uqcloud.net/ Related research article Kulkarni, S., Krishnamoorthy, M., Ranade, A., Ernst, A.T. and Patil, R., 2018. A new formulation and a column generation-based heuristic for the multiple depot vehicle scheduling problem. Transportation Research Part B: Methodological, 118, pp. 457–487 [3].",
"type": "Table"
},
{
@@ -576,7 +576,7 @@
"type": "Title"
},
{
"element_id": "683993fc4592941bf8b06173870aa63c",
"element_id": "1f3d79f338b86fbfcfa7054f11de28f0",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -585,14 +585,14 @@
],
"links": [
{
"start_index": 611,
"start_index": 609,
"text": "https :// orlib . uqcloud . net",
"url": "https://orlib.uqcloud.net"
}
],
"page_number": 2
},
"text": "The dataset contains 60 different problem instances of the multiple depot vehicle scheduling pro- blem (MDVSP). Each problem instance is provided in a separate file. Each file is named as ‘RN-m-n-k.dat’, where ‘m’, ‘n’, and ‘k’ denote the number of depots, the number of trips, and the instance number for the size, ‘ðm; nÞ’, respectively. For example, the problem instance, ‘RN-8–1500-01.dat’, is the first problem instance with 8 depots and 1500 trips. For the number of depots, m, we used three values, 8, 12, and 16. The four values for the number of trips, n, are 1500, 2000, 2500, and 3000. For each size, ðm; nÞ, five instances are provided. The dataset can be downloaded from https://orlib.uqcloud.net. For each problem instance, the following information is provided:",
"text": "The dataset contains 60 different problem instances of the multiple depot vehicle scheduling pro- blem (MDVSP). Each problem instance is provided in a separate file. Each file is named as ‘RN-m-n-k.dat’, where ‘m’, ‘n’, and ‘k’ denote the number of depots, the number of trips, and the instance number for the size, ‘ðm;nÞ’, respectively. For example, the problem instance, ‘RN-8–1500-01.dat’, is the first problem instance with 8 depots and 1500 trips. For the number of depots, m, we used three values, 8,12, and 16. The four values for the number of trips, n, are 1500, 2000, 2500, and 3000. For each size, ðm;nÞ, five instances are provided. The dataset can be downloaded from https://orlib.uqcloud.net. For each problem instance, the following information is provided:",
"type": "NarrativeText"
},
{
@@ -661,7 +661,7 @@
"type": "UncategorizedText"
},
{
"element_id": "96ca028aef61c1fd98c9f0232a833498",
"element_id": "39943e8e76f7ddd879284cf782cac2f4",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -670,7 +670,7 @@
],
"page_number": 2
},
"text": "For each trip i A 1; 2; …; n, a start time, ts i , an end time, te i , a start location, ls i , and an end location, le i , and",
"text": "For each trip iA1;2;…;n, a start time, ts i, an end time, te i , a start location, ls i, and an end location, le i , and",
"type": "NarrativeText"
},
{
@@ -726,7 +726,7 @@
"type": "NarrativeText"
},
{
"element_id": "2bd550b209c7c06c42966aad21822ea5",
"element_id": "9698643b7f3d779d8a5fdb13dffef106",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -735,7 +735,7 @@
],
"page_number": 3
},
"text": "and end location of the trip. A long trip is about 3–5 h in duration and has the same start and end location. For all instances, m r l and the locations 1; …; m correspond to depots, while the remaining locations only appear as trip start and end locations.",
"text": "and end location of the trip. A long trip is about 3–5 h in duration and has the same start and end location. For all instances, mrl and the locations 1;…;m correspond to depots, while the remaining locations only appear as trip start and end locations.",
"type": "NarrativeText"
},
{
@@ -804,7 +804,7 @@
"type": "NarrativeText"
},
{
"element_id": "9d3f44c51fe13ebdf6b9511859e4f1b7",
"element_id": "02146cfa4d68e86d868e99acab4f7c42",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -813,7 +813,7 @@
],
"page_number": 3
},
"text": "For each instance size ðm; nÞ, Table 1 provides the average of the number of locations, the number of times, the number of vehicles, and the number of possible empty travels, over five instances. The number of locations includes m distinct locations for depots and the number of locations at which various trips start or end. The number of times includes the start and the end time of the planning horizon and the start/end times for the trips. The number of vehicles is the total number of vehicles from all the depots. The number of possible empty travels is the number of possible connections between trips that require a vehicle travelling empty between two consecutive trips in a schedule.",
"text": "For each instance size ðm;nÞ, Table 1 provides the average of the number of locations, the number of times, the number of vehicles, and the number of possible empty travels, over five instances. The number of locations includes m distinct locations for depots and the number of locations at which various trips start or end. The number of times includes the start and the end time of the planning horizon and the start/end times for the trips. The number of vehicles is the total number of vehicles from all the depots. The number of possible empty travels is the number of possible connections between trips that require a vehicle travelling empty between two consecutive trips in a schedule.",
"type": "NarrativeText"
},
{
@@ -830,7 +830,7 @@
"type": "NarrativeText"
},
{
"element_id": "d9904b5393369c5204af83b64035802a",
"element_id": "fc4b1e0c5bb8b330e2160f6615975401",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -839,7 +839,7 @@
],
"page_number": 3
},
"text": "The dataset also includes a program ‘GenerateInstance.cpp’ that can be used to generate new instances. The program takes three inputs, the number of depots ðmÞ, the number of trips ðnÞ, and the number of instances for each size ðm; nÞ.",
"text": "The dataset also includes a program ‘GenerateInstance.cpp’ that can be used to generate new instances. The program takes three inputs, the number of depots ðmÞ, the number of trips ðnÞ, and the number of instances for each size ðm;nÞ.",
"type": "NarrativeText"
},
{
@@ -934,7 +934,7 @@
"type": "NarrativeText"
},
{
"element_id": "17e17590003c0f514220c453f88da6b7",
"element_id": "86e18db80eab89d0556c22321732e4e7",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
@@ -943,7 +943,7 @@
],
"page_number": 4
},
"text": "Number of Number of columns in Description lines each line 1 3 The number of depots, the number of trips, and the number of locations. 1 m The number of vehicles rd at each depot d. n 4 One line for each trip, i ¼ 1; 2; …; n. Each line provides the start location ls i , the start i , the end location le time ts i and the end time te i for the corresponding trip. l l Each element, δij; where i; j A 1; 2; …; l, refers to the travel time between location i and location j.",
"text": "Number of Number of columns in Description lines each line 1 3 The number of depots, the number of trips, and the number of locations. 1 m The number of vehicles rd at each depot d. n 4 One line for each trip, i ¼ 1;2;…;n. Each line provides the start location ls i, the start i, the end location le time ts i and the end time te i for the corresponding trip. l l Each element, δij; where i;jA1;2;…;l, refers to the travel time between location i and location j.",
"type": "Table"
},
{
@@ -1799,8 +1799,8 @@
},
{
"type": "Image",
"element_id": "1b93c33208a85ba6d2a69d23babd6def",
"text": "25 24.6 20 18.4 e 15 10 5 4.6 2.8 0 C oal Oil Bio m ass N atural gas 0.07 Wind 0.04 H ydropo w er 0.02 S olar 0.01 N uclear ",
"element_id": "c0a86e51afb417a3b057d7cf101bbed6",
"text": "25 24.6 20 18.4 e 15 10 5 4.6 2.8 0 Coal Oil Bio m ass Natural gas 0.07 Wind 0.04 Hydropower 0.02 Solar 0.01 Nuclear ",
"metadata": {
"filetype": "application/pdf",
"languages": [