
How to fetch images along with text and tables in a Word document using the python-docx library. #1485


Closed
Darshankb07 opened this issue Apr 22, 2025 · 1 comment


@Darshankb07

I need to fetch the text, tables, and image addresses from a Word document. Looping over doc.element.body, I can't detect the images in the document, and looping over doc.part.rels.values() I can only get the images. How can I get text and tables along with images, in the same order as they appear in the source Word document?

I used a single list variable to store the results. The problem is that I can't detect where an image occurs in the document using the element.body loop, so I can't tell when to run the doc.part.rels.values() loop.

@scanny
Contributor

scanny commented Apr 22, 2025

There are two questions here:

  1. How to get both paragraphs and tables from the document body in document/reading order.
  2. How to get images in reading order.

Paragraphs and tables are both block items, meaning they take up a whole vertical segment of the document and extend between the margins, like "blocks" stacked on top of each other.

To get all of these in document order you use document.iter_inner_content() -> Iterator[Paragraph | Table]. There is an example of using that here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L358-L385 (although it uses sections instead of the whole document).

Images are inline elements, meaning they occur inside block elements (inside Paragraph.runs specifically) and a given paragraph can contain more than one. The closest python-docx can get you to those currently is with run.iter_inner_content() -> Iterator[str | Drawing | RenderedPageBreak], in which images appear as Drawing items.

Only the XML is available on a Drawing element, on drawing._drawing. Using XPath on that XML, the drawing contains pictures at either ./wp:inline/a:graphic/a:graphicData/pic:pic (an "inline" picture) or ./wp:anchor/a:graphic/a:graphicData/pic:pic (a "placed" or so-called "floating" picture). So you'd have to dig into that XML to get the rId of the picture element and match it up with the corresponding ImagePart if you wanted to get those in document order.

@scanny scanny closed this as completed Apr 22, 2025