How to fetch images along with text and tables in a Word document using the python-docx library. #1485

Darshankb07 · 2025-04-22T16:05:57Z

I need to fetch the text, tables, and image addresses from a Word document. Looping over doc.element.body, I can't detect or recognise the images in the document, and looping over doc.part.rels.values() I can only get the images. How can I get the text and tables along with the images, all in the same order as in the source Word document?

I used a single list variable to store the results. The problem is that I can't detect where an image occurs in the document while looping over element.body, so I don't know when to run the doc.part.rels.values() loop.

scanny · 2025-04-22T16:39:45Z

There are two questions here:

How to get both paragraphs and tables from the document body in document/reading order.
How to get images in reading order.

Paragraphs and tables are both block items, meaning they take up a whole vertical segment of the document and extend between the margins, like "blocks" stacked on top of each other.

To get all of these in document order you use document.iter_inner_content() -> Iterator[Paragraph | Table]. There is an example of using that here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L358-L385 (although it uses sections instead of the whole document).

Images are inline elements, meaning they occur inside block elements (inside Paragraph.runs specifically), and a given paragraph can contain more than one. The closest python-docx can currently get you to those is run.iter_inner_content() -> Iterator[str | Drawing | RenderedPageBreak], where images appear as Drawing elements.

Only the XML is available on a Drawing element, on drawing._drawing. Using XPath on that XML, the drawing contains pictures at either ./wp:inline/a:graphic/a:graphicData/pic:pic (an "inline" picture) or ./wp:anchor/a:graphic/a:graphicData/pic:pic (a "placed" or so-called "floating" picture). So you'd have to dig into that XML to get the rId of the picture element and match it up with the corresponding ImagePart if you wanted to get those in document order.

scanny closed this as completed Apr 22, 2025
