Paragraphs/Numbering in Table of Contents Document Regions #1408
Comments
I'm not clear on exactly what you're seeing; a small screenshot would probably help. If I wanted to skip the TOC when present, I would look for the field markers in the XML and remove whatever is in between before iterating over the paragraphs in the document.
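A minimal sketch of the field-marker idea above, using only the standard library on the raw WordprocessingML rather than python-docx itself. It assumes the TOC is a complex field (`w:fldChar` begin/end markers with a `TOC` instruction in `w:instrText`) and that paragraphs are not nested inside one another (no text boxes). The names `paragraphs_outside_toc` and `toc_free_text` are made up for illustration. Rather than removing the TOC, it skips every paragraph that touches it.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_outside_toc(root):
    """Yield the text of each w:p under `root` that is not part of a TOC field.

    Assumption: paragraphs are not nested (e.g. no text boxes).
    """
    depth = 0         # how many fields are currently open
    toc_depth = None  # depth at which the TOC field was opened, if any
    for p in root.iter(f"{{{W}}}p"):
        touches_toc = toc_depth is not None
        for el in p.iter():
            if el.tag == f"{{{W}}}fldChar":
                kind = el.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    depth += 1
                elif kind == "end":
                    if toc_depth == depth:
                        toc_depth = None  # this "end" closes the TOC field
                    depth -= 1
            elif el.tag == f"{{{W}}}instrText" and toc_depth is None:
                words = (el.text or "").split()
                if words and words[0].upper() == "TOC":
                    toc_depth = depth
                    touches_toc = True
        if not touches_toc:
            yield "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def toc_free_text(document_xml):
    """Parse a document.xml string and list paragraph text outside the TOC."""
    return list(paragraphs_outside_toc(ET.fromstring(document_xml)))
```

The depth counter is there because TOC entries usually contain nested fields of their own (page references, hyperlinks), so the first "end" marker after the TOC begins is not necessarily the end of the TOC. The same loop may also work on the lxml element that python-docx exposes as `document.element.body`, but that is untested here.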
Thank you for the quick feedback. Yes, I agree, your recommendation to look at XML field markers to "skip" the TOC is probably the best approach. Regarding the screenshot, see below. It is kind of weird, as the "symbol" is not strictly consistent (which was the somewhat maddening part). At any rate, the problems are surmountable, but they make my Python code a little messy and less generic. The client documents I am processing are very... shall we say... "diverse" in their mature use of Word styling features. Frankly (and I know how this sounds), I have wondered if just traversing a "doc.Characters" list would be easier sometimes. Thanks again!
@polanddm Yeah, that is a little weird looking, but I think that's just a rendering artifact where the e.g. "Program Introduction ..." text is overlapping a little and obscuring the right-hand side of the regular paragraph character. If you widen that tab setting a little I think that will reveal the rest of it. The actual content of the TOC is in the document btw. The TOC feature generates that content and inserts it when the field is refreshed. So you should be able to see what's actually in there by inspecting the XML for it. You can find it by unzipping the
Closing for now as not actionable, but feel free to ask more questions in this issue if you need to.
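For anyone wanting to inspect the XML as suggested above, here is a small sketch. It relies on the fact that a .docx file is a zip archive and assumes the main document part sits at the conventional `word/document.xml` path. The function name `read_document_xml` is made up for illustration.

```python
import zipfile


def read_document_xml(docx_path):
    """Return the main document part of a .docx file as a string.

    Assumes the part lives at the usual "word/document.xml" location.
    `docx_path` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

Searching the returned string for `fldChar` and `instrText` should show where the TOC field starts and ends.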
Great work on this library, very useful.
My work has entailed parsing large MS Word documents organized hierarchically (Heading 1, 2, 3, etc.) and extracting text for further processing, tagging, and use in AI scenarios.
I have run across client documents where (somehow) the Word document has been saved in such a manner that named styles have been lost. This is a surmountable problem by itself, but there is one nasty little impact I have discovered. This problem is not present if "named styles" are available.
As a result, I cannot (easily) detect the difference between a "content" paragraph ("1.1 Blah, lots more words...") and the "Table of Contents" entry for this paragraph ("1.1 Blah words.............. 42"). I have an acceptable workaround for this by testing if the paragraph text has a run of period characters of unusual length ("......."). However, I have run into another interesting problem in the process...
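A minimal sketch of that dot-leader heuristic. The threshold of five periods and the requirement of a trailing page number are assumptions chosen for illustration, and `looks_like_toc_entry` is a made-up name.

```python
import re

# Assumed pattern: five or more consecutive periods, optional whitespace,
# then a page number at the very end of the paragraph text.
_TOC_LEADER = re.compile(r"\.{5,}\s*\d+\s*$")


def looks_like_toc_entry(text):
    """Return True if `text` ends like a TOC line ("Title....... 42")."""
    return _TOC_LEADER.search(text) is not None
```

For example, `looks_like_toc_entry("1.1 Blah words.............. 42")` is true, while an ordinary sentence ending in an ellipsis is not matched. Note that Word often draws the leader dots from a tab stop rather than storing literal periods, in which case the paragraph text would contain a tab and this pattern would need adjusting.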
In the "Table of Contents" entry above ("1.1 Blah words.............. 42"), there appears to be an unusual character between the paragraph number ("1.1 ", including the trailing space) and the text that follows ("Blah words...."). In Word with show/hide turned ON, the character resembles a paragraph marker but is different: it looks like a backward capital "P". As near as I can figure out from Google, it is (sometimes) called a "pilcrow" and, apparently, is used in Microsoft Word to mark an "indent" that isn't a tab.
The problem, of course, is that when iterating through the doc.paragraphs list in the "Table of Contents" region, the text "1.1 Blah words.............. 42" is interpreted as two (2) paragraphs: "1.1 ", then "Blah words.............. 42".
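One possible workaround, sketched under the assumption that a split entry always shows up as a number-only paragraph immediately followed by its remainder. It operates on plain strings (for example, the `.text` of each paragraph) and `merge_split_entries` is a made-up name.

```python
import re

# Assumed shape of a bare section number: "1", "1.1", "2.3.4.", etc.,
# optionally followed by whitespace and nothing else.
_NUMBER_ONLY = re.compile(r"^\d+(?:\.\d+)*\.?\s*$")


def merge_split_entries(texts):
    """Join each number-only string with the string that follows it."""
    merged = []
    pending = None  # a section number waiting for its remainder
    for text in texts:
        if pending is not None:
            merged.append(pending + text)
            pending = None
        elif _NUMBER_ONLY.match(text):
            pending = text
        else:
            merged.append(text)
    if pending is not None:
        merged.append(pending)  # trailing number with nothing after it
    return merged
```

This would also glue a stand-alone numeric paragraph (say, a bare page number) to whatever follows it, so it is probably best applied only to the paragraphs already identified as belonging to the TOC region.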
This pilcrow/indent character cannot be selected and removed via search and replace as it results in undesirable changes elsewhere to the Word document. I am not at all sure if this weird pilcrow/indent character can be detected and removed so that the visual appearance of the text in the ToC is interpreted in a "natural way" via python-docx, but I wanted to raise the issue.