Paragraphs/Numbering in Table of Contents Document Regions #1408

Closed
polanddm opened this issue Jun 18, 2024 · 3 comments

@polanddm

Great work on this library, very useful.

My work has entailed parsing large MS Word documents organized hierarchically (Heading 1, 2, 3, etc.) and extracting text for further processing, tagging, and use in AI scenarios.

I have run across client documents where (somehow) the Word document has been saved in such a manner that named styles have been lost. This is a surmountable problem by itself, but there is one nasty little impact I have discovered. This problem is not present if "named styles" are available.

As a result, I cannot (easily) detect the difference between a "content" paragraph ("1.1 Blah, lots more words...") and the "Table of Contents" entry for this paragraph ("1.1 Blah words.............. 42"). I have an acceptable workaround for this by testing if the paragraph text has a run of period characters of unusual length ("......."). However, I have run into another interesting problem in the process...
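The period-run workaround described above could be sketched roughly as follows. The function name `looks_like_toc_entry` and the threshold of five periods are assumptions made for illustration, not anything taken from the original code. It also assumes the leader shows up as literal period characters in the paragraph text, as described here; a leader drawn by a tab stop would not match.

```python
import re


def looks_like_toc_entry(text: str, min_run: int = 5) -> bool:
    """Return True if `text` contains a run of at least `min_run` periods.

    `min_run` is an assumed threshold: long enough to ignore an ordinary
    ellipsis ("..."), short enough to catch a dot leader.
    """
    leader_run = re.compile(r"\.{%d,}" % min_run)
    return leader_run.search(text) is not None
```

Applied to the examples above, "1.1 Blah words.............. 42" would be flagged and "1.1 Blah, lots more words..." would not. A fragment such as "1.1 " on its own contains no periods, so this check alone would not catch it.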

In the "Table of Contents" entry above ("1.1 Blah words.............. 42"), there appears to be an unusual character between the paragraph number ("1.1 ", including the trailing space) and the text that follows ("Blah words...."). In Word with show/hide turned ON, the character is like a paragraph marker but different: it looks like a backward capital "P". As near as I can figure out from Google, it is (sometimes) called a "pilcrow" and, apparently, is used in Microsoft Word to mark an "indent" that isn't a tab.

The problem, of course, is that when iterating through the doc.paragraphs list, in the "Table of Contents" region, the text "1.1 Blah words.............. 42" is interpreted as two (2) paragraphs: "1.1 ", then "Blah words.............. 42".

This pilcrow/indent character cannot be selected and removed via search and replace as it results in undesirable changes elsewhere to the Word document. I am not at all sure if this weird pilcrow/indent character can be detected and removed so that the visual appearance of the text in the ToC is interpreted in a "natural way" via python-docx, but I wanted to raise the issue.

@scanny
Contributor

scanny commented Jun 18, 2024

python-docx does not recognize paragraph boundaries by the presence of a particular character. The XML tells python-docx where those boundaries are. If you're seeing a pilcrow character there it's because that is indeed a separate paragraph. If you inspect the XML I expect you'll see this.

I'm not clear on exactly what you're seeing; a small screenshot would probably help.

If I wanted to skip the TOC if present I would look for the field markers in the XML and remove whatever was in between before iterating the paragraphs in the document.
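One way to sketch the field-marker idea: walk the paragraph elements in document order, track which complex fields are currently open (`w:fldChar` with `w:fldCharType` of "begin" or "end"), and flag every paragraph that overlaps a field whose instruction text starts with "TOC". The function names below are hypothetical, and the sketch flags TOC paragraphs rather than removing them. It also assumes the TOC is stored as a complex field; a TOC stored some other way would need different handling.

```python
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FLD_CHAR = "{%s}fldChar" % W
FLD_CHAR_TYPE = "{%s}fldCharType" % W
INSTR_TEXT = "{%s}instrText" % W


def _toc_is_open(open_fields):
    """True if any currently open field has an instruction starting with TOC."""
    return any(instr.strip().upper().startswith("TOC") for instr in open_fields)


def flag_toc_paragraphs(p_elements):
    """Return one bool per w:p element: True if it overlaps a TOC field.

    `p_elements` is an iterable of w:p elements in document order. Any
    ElementTree-style element (with .iter(), .tag, .get(), .text) should work.
    """
    open_fields = []  # instruction text of each open field, outermost first
    flags = []
    for p in p_elements:
        in_toc = _toc_is_open(open_fields)  # a TOC opened in an earlier paragraph
        for el in p.iter():
            if el.tag == FLD_CHAR:
                kind = el.get(FLD_CHAR_TYPE)
                if kind == "begin":
                    open_fields.append("")
                elif kind == "end" and open_fields:
                    open_fields.pop()
            elif el.tag == INSTR_TEXT and open_fields:
                # Instruction text can be split across several runs.
                open_fields[-1] += el.text or ""
            in_toc = in_toc or _toc_is_open(open_fields)
        flags.append(in_toc)
    return flags
```

With python-docx this could presumably be called as `flag_toc_paragraphs(p._p for p in doc.paragraphs)`, where `_p` is the paragraph's underlying XML element (a private attribute, so treat that as an assumption), and paragraphs whose flag is True would then be skipped. Nested fields inside the TOC, such as PAGEREF, are pushed and popped without closing the outer TOC field.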

@polanddm
Author

Thank you for the quick feedback. Yes, I agree, your recommendation to look at XML field markers to "skip" the TOC is probably the best approach.

Regarding the screenshot, see below. It is kind of weird as the "symbol" is not strictly consistent (which was the somewhat maddening part).

image

At any rate, the problems are surmountable, but they make my Python code a little messy and less generic. The client documents I am processing are very... shall we say... "diverse" in their mature use of Word styling features.

Frankly (and I know how this sounds), I have wondered if just traversing a "doc.Characters" list would be easier sometimes.

Thanks again!

@scanny
Contributor

scanny commented Jun 20, 2024

@polanddm Yeah, that is a little weird looking, but I think that's just a rendering artifact: the text (e.g. "Program Introduction ...") is overlapping a little and obscuring the right-hand side of the regular paragraph character. If you widen that tab setting a little I think that will reveal the rest of it.

The actual content of the TOC is in the document, by the way. The TOC feature generates that content and inserts it when the field is refreshed. So you should be able to see what's actually in there by inspecting the XML for it. You can find it by unzipping the .docx file (a DOCX file is a Zip archive) and then inspecting the word/document.xml file within it.
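For a quick look without unzipping by hand, something like the following could list the text of each `w:p` element straight from the archive, which would show whether "1.1 " and the rest of a TOC entry really are separate paragraphs in the XML. The helper name `paragraph_texts` is made up for this sketch. It uses only the standard library and only collects `w:t` text and `w:tab` characters found directly inside runs, so other run content is ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P, R, T, TAB = ("{%s}%s" % (W, name) for name in ("p", "r", "t", "tab"))


def paragraph_texts(docx_file):
    """Return the text of every w:p element in word/document.xml.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    texts = []
    for p in root.iter(P):
        parts = []
        # Look only at direct children of runs, so that tab-stop definitions
        # under w:pPr/w:tabs (also named w:tab) are not mistaken for tab characters.
        for run in p.iter(R):
            for child in run:
                if child.tag == T:
                    parts.append(child.text or "")
                elif child.tag == TAB:
                    parts.append("\t")
        texts.append("".join(parts))
    return texts
```

Printing `repr()` of each returned string makes trailing spaces and tab characters visible, which is where a split entry would show up.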

Closing for now as not actionable, but feel free to ask more questions in this issue if you need to.

scanny closed this as completed Jun 20, 2024