Bug: Tables are parsed unexpectedly with misaligned column-separating lines #1470
Comments
You have not described or demonstrated the unexpected behavior.
Hi Steve, thanks for the fast reply! Okay, to be precise here: I would expect an output table with four columns A, B, C, D. But the package extracts 5 columns: C appears twice in the output. This is shown in the two files attached.
@dennisblaufuss-ci If you inspect the XML you'll see that each row actually does have 5 grid-columns. This is how Word deals with the "misaligned" column boundaries; it inserts another grid column and uses cell-merging to produce the "visual" layout of a cell boundary not aligned with those in rows above and below. Something like this will give you the result you're after, I think. Note there is no guarantee that each row will have the same number of cells:

    from typing import Iterator

    from docx.table import _Cell, _Row, Table


    def iter_row_cells(row: _Row, table: Table) -> Iterator[_Cell]:
        tr = row._tr
        for tc in tr.tc_lst:
            # -- vMerge="continue" indicates a vertically spanned cell, skip that so it doesn't repeat --
            if tc.vMerge == "continue":
                continue
            # -- generate horizontal merges as a single cell --
            yield _Cell(tc, table)


    # -- `table` is a Table object, e.g. one item of document.tables --
    for row in table.rows:
        print(list(iter_row_cells(row, table)))
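To make the grid-column mechanism described above concrete, here is a minimal sketch that uses only the standard library (not python-docx) on a hand-written, hypothetical table fragment. The widths, cell texts, and the helper names `tc_xml` and `row_spans` are assumptions for illustration; a real document's XML will contain more elements. It shows a table that looks like four columns but has five `w:gridCol` entries, with a `w:gridSpan` of 2 absorbing the narrow extra column in each row:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def tc_xml(text, span=1):
    """Build one <w:tc> element as a string (hypothetical helper for this demo)."""
    props = f'<w:tcPr><w:gridSpan w:val="{span}"/></w:tcPr>' if span > 1 else ""
    return f"<w:tc>{props}<w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Hand-written example: the 4th grid column is a narrow sliver created by a
# C/D boundary that sits slightly further left in the second row.
TABLE_XML = (
    f'<w:tbl xmlns:w="{W}"><w:tblGrid>'
    + "".join(f'<w:gridCol w:w="{w}"/>' for w in (2000, 2000, 1900, 100, 2000))
    + "</w:tblGrid>"
    + "<w:tr>" + tc_xml("A") + tc_xml("B") + tc_xml("C", 2) + tc_xml("D") + "</w:tr>"
    + "<w:tr>" + tc_xml("a") + tc_xml("b") + tc_xml("c") + tc_xml("d", 2) + "</w:tr>"
    + "</w:tbl>"
)


def row_spans(tbl):
    """Return, per row, a list of (cell text, number of grid columns spanned)."""
    rows = []
    for tr in tbl.findall("w:tr", NS):
        cells = []
        for tc in tr.findall("w:tc", NS):
            span_el = tc.find("w:tcPr/w:gridSpan", NS)
            span = int(span_el.get(f"{{{W}}}val")) if span_el is not None else 1
            text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
            cells.append((text, span))
        rows.append(cells)
    return rows


tbl = ET.fromstring(TABLE_XML)
for cells in row_spans(tbl):
    print(cells)
```

Each row has four cells, but the spans in every row add up to five grid columns, which is why expanding each merged cell across its grid columns repeats a value.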
@scanny Thanks for the hint! I did indeed try to solve this by looking purely at the XML as well and found a solution that gives almost the same results. However, cells that are merged on purpose are now not accounted for correctly. Is there any solution that would differentiate between cells that are really merged (visually looking at them) and cells that are just misaligned and thus treated as merged (like the cells in my example)? My optimal output would treat "really merged" ones just as docx does it vanilla, copying the content into both cells, while "misaligned merged" ones are treated the way the XML/your solution does it. Would you have any thoughts on that? Thanks ahead for your support!
I can't say I've given it much thought, but my initial impression would be you'd have to rely on heuristics of some sort, likely involving the effective width of a grid column. Some of these characteristics might be useful:
Another approach would be to use an (AI) model of some sort, perhaps in conjunction with a rendering of the table; doesn't sound fast though. In the end I don't believe there is a true answer to the user's intent, just a spectrum of probabilities that the answer to "did the author intend this to be a distinct column" is "yes".
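One possible shape for the width-based heuristic mentioned above, as a rough sketch rather than a tested solution: treat any grid column narrower than some threshold as an alignment artifact, and repeat a merged cell's text only once per "real" (wide) grid column it covers. The function name `expand_row`, the `min_width` value of 300, and the sample widths are all assumptions; the threshold would need tuning per document, and vertical merges are not handled here.

```python
def expand_row(cells, grid_widths, min_width=300):
    """Expand one row into per-column values, ignoring narrow grid columns.

    cells:       list of (text, grid_span) in row order
    grid_widths: width of each grid column (e.g. the w:gridCol values)
    min_width:   assumed cutoff below which a grid column counts as a
                 misalignment artifact rather than an intended column
    """
    values = []
    col = 0
    for text, span in cells:
        covered = grid_widths[col:col + span]
        real_columns = sum(1 for width in covered if width >= min_width)
        # A cell always yields at least one value, even if it only covers
        # narrow grid columns.
        values.extend([text] * max(real_columns, 1))
        col += span
    return values


grid = [2000, 2000, 1900, 100, 2000]  # hypothetical widths; 4th column is a sliver

# Misaligned boundary: the span of 2 covers one real column plus the sliver.
print(expand_row([("A", 1), ("B", 1), ("C", 2), ("D", 1)], grid))
# Intentional merge: the first cell covers two real columns, so it is repeated.
print(expand_row([("AB", 2), ("C", 2), ("D", 1)], grid))
```

With these sample widths both rows come out four values long, so the "really merged" cell is duplicated while the misaligned one is not.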
Okay perfect, I tried a couple of ML/AI based approaches but none of them seemed to solve it 100%. Had a short look into the column width and I think this should be the best approach for my problem. Thank you so much for your help!
Hello there,
Note beforehand: This is my first time contributing, so if something is wrong/off please be kind :)
I've encountered something I would call a bug while parsing tables from Word into Python. The simple_example.docx is parsed into Python and saved as tables_test_doc.xlsx for better presentation using the following snippet:
What causes this "bug" is the misalignment of the columns: the column-separating line between C and D is slightly shifted in the second row.
Is there any workaround? Am I missing something?
Again, first timer here, if anything is off feel free to correct me :) If this is not seen as a bug feel free to correct as well!
Thanks a ton ahead for any hints
Dennis