Azure Doc Intelligence 0.2 - support paragraphs and tables for multiple models #10431

Closed
wants to merge 5 commits into from
Binary file added libs/langchain/colored-1.4.4-py3-none-any.whl
Binary file not shown.
96 changes: 83 additions & 13 deletions libs/langchain/langchain/document_loaders/parsers/pdf.py
@@ -263,24 +263,94 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:

 class DocumentIntelligenceParser(BaseBlobParser):
     """Loads a PDF with Azure Document Intelligence
-    (formerly Forms Recognizer) and chunks at character level."""
+    (formerly Forms Recognizer). Returns Document with
+    pages or paragraphs, table headers, and rows."""

-    def __init__(self, client: Any, model: str):
+    def __init__(self, client: Any, model: str, split_mode: str):

@baskaryan (Collaborator), Sep 14, 2023:

Could we give this a default value, probably "page"? That way this isn't a breaking change and the default behavior doesn't change too much.

@annjawn (Author), Sep 14, 2023:

Collaborator:

But can we have a default here as well, in case this object is instantiated directly by a user?

Author:

Yes, we can default to "page" here as well, @baskaryan.
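
A minimal sketch of what the agreed default could look like, assuming the parser validates the value itself so that direct instantiation is covered too. `ParserSketch` and `VALID_SPLIT_MODES` are hypothetical names used only for illustration; they are not part of this PR or of LangChain.

```python
from typing import Any

# Hypothetical constant; the PR itself inlines ["page", "paragraph"] in the loader.
VALID_SPLIT_MODES = ("page", "paragraph")


class ParserSketch:
    """Illustrative stand-in for DocumentIntelligenceParser, not the PR's class."""

    def __init__(self, client: Any, model: str, split_mode: str = "page"):
        # Validate here as well, so users who bypass the loader get the same check.
        if split_mode not in VALID_SPLIT_MODES:
            raise ValueError(
                f"Invalid split option {split_mode}, "
                "valid values are `page` or `paragraph`."
            )
        self.client = client
        self.model = model
        self.split_mode = split_mode


# Existing two-argument call sites keep working and fall back to "page".
legacy = ParserSketch(client=None, model="prebuilt-document")
print(legacy.split_mode)
```

With a default in the signature, callers written against the old two-argument constructor would not need to change.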

         self.client = client
         self.model = model
+        self.split_mode = split_mode

     def _generate_docs(self, blob: Blob, result: Any) -> Iterator[Document]:
-        for p in result.pages:

Collaborator:

If split mode is page, should we just keep the existing logic? Is there value in parsing by paragraph and reassembling pages?

@annjawn (Author), Sep 14, 2023:

@baskaryan The idea of offering paragraphs as an option is to chunk (split) the text as supported by Azure AI's layout capabilities, rather than having to chunk again with, say, a Text Splitter. This helps when generating embeddings, because the chunks (paragraphs) retain the semantic consistency of the text. If paragraph is used, we won't reassemble the paragraphs back into pages; we keep them the way Doc Intelligence's layout extracts them. If the user specifies page explicitly, or doesn't pass the parameter at initialization, the mode defaults to page and the entire page text is generated per page. Hope this makes sense.

Collaborator:

What I mean is, why not do something like

if self.split_mode == "page":
    for p in result.pages:
        ...
elif self.split_mode == "paragraph":
    for p in result.paragraphs:
        ...

to save us having to write logic for reassembling paragraphs into pages when split mode is page?

@annjawn (Author), Sep 16, 2023:

@baskaryan Right, I am actually doing this here. The result object doesn't hold each page's full text individually in the pages attribute, as it may seem; we construct pages by concatenating paragraphs. The highest-level grouping Doc Intelligence provides is the entire document (all text from all pages concatenated into one), then per-page paragraphs, then lines, then words. The content attribute of the result combines all text from all pages, so it's easier to assemble each page from its paragraphs than to split content into individual pages. That assembly of paragraphs only happens if self.split_mode == "page". Here's the structure, for a better explanation.

Screenshot 2023-09-15 at 8 51 57 PM

Author:

Attaching a sample JSON output from a 2-page document extracted via the prebuilt-read model.

output.json.zip
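
The page assembly described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `Region`, `Paragraph`, and `Result` are hypothetical stand-ins that mirror only the attributes used in the diff (`paragraphs`, `bounding_regions[0].page_number`, `content`), and the sample text is made up. It joins paragraphs with a blank line instead of appending `"\n\n"` and stripping afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Region:
    page_number: int


@dataclass
class Paragraph:
    content: str
    bounding_regions: List[Region]


@dataclass
class Result:
    content: str  # all text from all pages, concatenated
    paragraphs: List[Paragraph] = field(default_factory=list)


def pages_from_paragraphs(result: Result) -> Dict[int, str]:
    """Rebuild per-page text by grouping paragraphs on their first region's page."""
    parts: Dict[int, List[str]] = {}
    for paragraph in result.paragraphs:
        page_number = paragraph.bounding_regions[0].page_number
        parts.setdefault(page_number, []).append(paragraph.content)
    # Join each page's paragraphs with a blank line between them.
    return {number: "\n\n".join(texts) for number, texts in parts.items()}


result = Result(
    content="Intro. Details. Summary.",
    paragraphs=[
        Paragraph("Intro.", [Region(1)]),
        Paragraph("Details.", [Region(1)]),
        Paragraph("Summary.", [Region(2)]),
    ],
)
pages = pages_from_paragraphs(result)
```

Grouping on the first bounding region means a paragraph that spans two pages would be attributed to the page where it starts.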

-            content = " ".join([line.content for line in p.lines])
-
-            d = Document(
-                page_content=content,
-                metadata={
-                    "source": blob.source,
-                    "page": p.page_number,
-                },
-            )
-            yield d
+        page_content_dict = dict()
+
+        for paragraph in result.paragraphs:
+            page_number = paragraph.bounding_regions[0].page_number
+
+            if self.split_mode == "page":
+                if page_number not in page_content_dict:
+                    page_content_dict[page_number] = str()
+
+                page_content_dict[page_number] += paragraph.content + "\n\n"
+            elif self.split_mode == "paragraph":
+                d = Document(
+                    page_content=paragraph.content,
+                    metadata={
+                        "source": blob.source,
+                        "page": page_number,
+                        "type": "PARAGRAPH",
+                    },
+                )
+                yield d
+
+        if self.split_mode == "page":
+            for page, content in page_content_dict.items():
+                d = Document(
+                    page_content=content.strip(),
+                    metadata={
+                        "source": blob.source,
+                        "page": page,
+                        "type": "PAGE",
+                    },
+                )
+                yield d

Author:

@baskaryan Here's the page vs. paragraph logic. If page is used, we collate the paragraphs into each page's full text and set "type": "PAGE" in the Document. If paragraph is used, we keep the paragraphs as they are and simply yield each one as a Document with "type": "PARAGRAPH".
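
As an illustration of how a consumer might use the "type" metadata described above, here is a small sketch that buckets documents by type. `Doc` and `group_by_type` are hypothetical names, `Doc` is a plain stand-in rather than LangChain's `Document`, and the sample values are invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_by_type(docs: List[Doc]) -> Dict[str, List[Doc]]:
    """Bucket documents by their "type" metadata value."""
    grouped: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        # Documents without a "type" key fall into an "UNKNOWN" bucket.
        grouped[doc.metadata.get("type", "UNKNOWN")].append(doc)
    return dict(grouped)


docs = [
    Doc("First paragraph.", {"page": 1, "type": "PARAGRAPH"}),
    Doc("Item,Amount", {"page": 1, "type": "TABLE_HEADER", "table_index": 0}),
    Doc('Widget,"1,200"', {"page": 1, "type": "TABLE_ROW", "table_index": 0}),
    Doc("Second paragraph.", {"page": 2, "type": "PARAGRAPH"}),
]
grouped = group_by_type(docs)
paragraphs_only = [d.page_content for d in grouped["PARAGRAPH"]]
```

A split like this would let paragraph text go to an embedding step while table rows are handled separately.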


+        if self.model in ["prebuilt-document", "prebuilt-layout", "prebuilt-invoice"]:
+            import csv  # noqa: F401
+            from io import StringIO  # noqa: F401
+
+            for table_idx, table in enumerate(result.tables):
+                page_num = table.bounding_regions[0].page_number
+                headers: list[str] = list()
+                rows: dict[int, list[str]] = dict()
+
+                for cell in table.cells:
+                    if cell.kind == "columnHeader":
+                        headers.append(cell.content)
+                    elif cell.kind == "content":
+                        if cell.row_index not in rows:
+                            rows[cell.row_index] = list()
+                        rows[cell.row_index].append(cell.content)
+
+                if headers:
+                    h_op = StringIO()
+                    csv.writer(h_op, quoting=csv.QUOTE_MINIMAL).writerow(headers)
+                    header_string = h_op.getvalue().strip()
+                    hd = Document(
+                        page_content=header_string,
+                        metadata={
+                            "source": blob.source,
+                            "page": page_num,
+                            "type": "TABLE_HEADER",
+                            "table_index": table_idx,
+                        },
+                    )
+                    yield hd
+
+                for _, row_cells in sorted(rows.items()):
+                    r_op = StringIO()
+                    csv.writer(r_op, quoting=csv.QUOTE_MINIMAL).writerow(row_cells)
+                    row_string = r_op.getvalue().strip()
+                    rd = Document(
+                        page_content=row_string,
+                        metadata={
+                            "source": blob.source,
+                            "page": page_num,
+                            "type": "TABLE_ROW",
+                            "table_index": table_idx,
+                        },
+                    )
+                    yield rd

     def lazy_parse(self, blob: Blob) -> Iterator[Document]:
         """Lazily parse the blob."""
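
The table handling in this hunk serializes header cells and each row of content cells as one CSV line. The sketch below shows that idea in isolation, under assumptions: `Cell` is a hypothetical stand-in carrying only the attributes the diff reads plus `column_index`, and the explicit sort by `(row_index, column_index)` is an addition of this sketch; the PR relies on the order in which cells arrive.

```python
import csv
from dataclasses import dataclass
from io import StringIO
from typing import Dict, List, Tuple


@dataclass
class Cell:
    kind: str  # "columnHeader" or "content", as checked in the diff
    row_index: int
    column_index: int  # assumed attribute, used here only for ordering
    content: str


def to_csv_line(values: List[str]) -> str:
    """Serialize one list of cell values as a single CSV line."""
    buffer = StringIO()
    csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerow(values)
    return buffer.getvalue().strip()


def table_to_csv(cells: List[Cell]) -> Tuple[str, List[str]]:
    """Return (header line, row lines) for one table's cells."""
    # Sort so columns come out left to right regardless of input order.
    ordered = sorted(cells, key=lambda c: (c.row_index, c.column_index))
    headers = [c.content for c in ordered if c.kind == "columnHeader"]
    rows: Dict[int, List[str]] = {}
    for cell in ordered:
        if cell.kind == "content":
            rows.setdefault(cell.row_index, []).append(cell.content)
    header_line = to_csv_line(headers) if headers else ""
    return header_line, [to_csv_line(rows[index]) for index in sorted(rows)]


cells = [
    Cell("content", 1, 1, "1,200"),
    Cell("columnHeader", 0, 1, "Amount"),
    Cell("content", 1, 0, "Widget"),
    Cell("columnHeader", 0, 0, "Item"),
]
header_line, row_lines = table_to_csv(cells)
```

`csv.QUOTE_MINIMAL` quotes only the values that need it, so a cell such as `1,200` stays a single field. As in the diff, cells of any other kind are skipped.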
18 changes: 14 additions & 4 deletions libs/langchain/langchain/document_loaders/pdf.py
@@ -620,7 +620,7 @@ def __init__(
         file_path: str,
         client: Any,
         model: str = "prebuilt-document",
-        headers: Optional[Dict] = None,
+        split_mode: str = "page",

@annjawn (Author), Sep 14, 2023:

@baskaryan Here's where it defaults to page, so it won't introduce any breaking change.

     ) -> None:
         """
         Initialize the object for file processing with Azure Document Intelligence
@@ -639,18 +639,28 @@ def __init__(
             A DocumentAnalysisClient to perform the analysis of the blob
         model : str
             The model name or ID to be used for form recognition in Azure.
+        split_mode : str
+            Whether to split by `paragraph` or `page`. Defaults to `page`.

         Examples:
         ---------
         >>> obj = DocumentIntelligenceLoader(
         ...     file_path="path/to/file",
         ...     client=client,
         ...     model="prebuilt-document"
+        ...     split_mode="page | paragraph"
         ... )
         """
-
-        self.parser = DocumentIntelligenceParser(client=client, model=model)
-        super().__init__(file_path, headers=headers)
+
+        super().__init__(file_path)
+        if split_mode not in ["page", "paragraph"]:
+            raise ValueError(
+                f"Invalid split option {split_mode}, "
+                "valid values are `page` or `paragraph`."
+            )
+        self.parser = DocumentIntelligenceParser(
+            client=client, model=model, split_mode=split_mode
+        )

     def load(self) -> List[Document]:
         """Load given path as pages."""