Bug: Tables are parsed unexpectedly with misaligned column separating lines #1470

Closed
dennisblaufuss-ci opened this issue Feb 21, 2025 · 6 comments

@dennisblaufuss-ci

dennisblaufuss-ci commented Feb 21, 2025

Hello there,

Note beforehand: This is my first time contributing, so if something is wrong/off please be kind :)

I've encountered something I would call a bug while parsing tables from Word into Python. The simple_example.docx is parsed in Python and saved as tables_test_doc.xlsx for better presentation using the following snippet:

import pandas as pd
from docx import Document

doc = Document("simple_example.docx")
data = [[cell.text for cell in row.cells] for row in doc.tables[0].rows]
df = pd.DataFrame(data)

df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
df.to_excel("tables_test_doc.xlsx", index=False)

What causes this "bug" is the misalignment of the columns: the column-separating line between C and D is slightly shifted in the second row.

Is there any workaround? Am I missing something?

Again, first timer here, if anything is off feel free to correct me :) If this is not seen as a bug feel free to correct as well!

Thanks a ton ahead for any hints
Dennis

@scanny
Contributor

scanny commented Feb 22, 2025

You have not described or demonstrated the unexpected behavior.

@dennisblaufuss-ci
Author

Hi Steve, thanks for the fast reply! Okay, to be precise here: I would expect an output table with four columns A, B, C, D. But the package extracts five columns: C appears twice in the output. This is shown in the two files attached.

@scanny
Contributor

scanny commented Feb 24, 2025

@dennisblaufuss-ci If you inspect the XML you'll see that each row actually does have 5 grid-columns.

This is how Word deals with the "misaligned" column boundaries; it inserts another grid column and uses cell-merging to produce the "visual" layout of a cell boundary not aligned with those in rows above and below.
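
A minimal sketch of what that looks like in the XML, using only the standard library. The `TABLE_XML` fragment is hand-written for illustration and is not taken from simple_example.docx; `row_spans` is a hypothetical helper name. It counts the `w:gridCol` elements and lists how many grid columns each cell spans (`w:gridSpan`):

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _tc(text, span=1):
    """Build a w:tc element string with optional w:gridSpan (illustration only)."""
    tc_pr = f'<w:tcPr><w:gridSpan w:val="{span}"/></w:tcPr>' if span > 1 else ""
    return f"<w:tc>{tc_pr}<w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Hand-written fragment: 5 grid columns, the 4th one very narrow.
TABLE_XML = (
    f'<w:tbl xmlns:w="{W}"><w:tblGrid>'
    '<w:gridCol w:w="2000"/><w:gridCol w:w="2000"/><w:gridCol w:w="2000"/>'
    '<w:gridCol w:w="150"/><w:gridCol w:w="1850"/></w:tblGrid>'
    f'<w:tr>{_tc("A")}{_tc("B")}{_tc("C", 2)}{_tc("D")}</w:tr>'
    f'<w:tr>{_tc("1")}{_tc("2")}{_tc("3")}{_tc("4", 2)}</w:tr>'
    "</w:tbl>"
)


def row_spans(tbl):
    """Return, per row, a list of (text, grid_span) pairs, one per w:tc element."""
    rows = []
    for tr in tbl.findall("w:tr", NS):
        cells = []
        for tc in tr.findall("w:tc", NS):
            span_el = tc.find("w:tcPr/w:gridSpan", NS)
            # A cell without w:gridSpan occupies exactly one grid column.
            span = int(span_el.get(f"{{{W}}}val")) if span_el is not None else 1
            text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
            cells.append((text, span))
        rows.append(cells)
    return rows


tbl = ET.fromstring(TABLE_XML)
grid_cols = tbl.findall("w:tblGrid/w:gridCol", NS)
print(len(grid_cols))  # number of grid columns declared for the table
for cells in row_spans(tbl):
    print(cells)
```

In this fragment each row has four visible cells, but the table declares five grid columns, and a different cell absorbs the narrow one in each row.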

Something like this will give you the result you're after I think. Note there is no guarantee that each row will have the same number of cells:

from typing import Iterator

from docx.table import _Cell, _Row, Table


def iter_row_cells(row: _Row, table: Table) -> Iterator[_Cell]:
    """Generate one cell per `w:tc` element in `row`, skipping vertical-merge continuations."""
    tr = row._tr
    for tc in tr.tc_lst:
        # -- vMerge="continue" indicates a vertically spanned cell, skip that so it doesn't repeat --
        if tc.vMerge == "continue":
            continue
        # -- generate horizontal merges as a single cell --
        yield _Cell(tc, table)


# -- `table` is the table being read, e.g. `document.tables[0]` --
for row in table.rows:
    print(list(iter_row_cells(row, table)))

@scanny scanny closed this as completed Feb 24, 2025
@dennisblaufuss-ci
Author

@scanny Thanks for the hint! I did try to solve this by looking purely at the XML as well and found a solution that gives almost the same results. However, cells that are merged on purpose are now not accounted for correctly. Is there any solution that would differentiate between cells that are really merged (visually, when looking at them) and cells that are just misaligned and thus treated as merged (like the cells in my example)?

My optimal output would treat "really merged" cells just as vanilla docx does, copying the content into both cells, while "misaligned merged" cells would be treated the way the XML/your solution does it.

Would you have any thoughts on that? Thanks ahead for your support!

@dennisblaufuss-ci dennisblaufuss-ci changed the title from "Bug: Tables are parsed faulty with (slightly) shifted column separating lines" to "Bug: Tables are parsed unexpectedly with misaligned column separating lines" Feb 25, 2025
@scanny
Contributor

scanny commented Feb 25, 2025

I can't say I've given it much thought, but my initial impression would be you'd have to rely on heuristics of some sort, likely involving the effective width of a grid column.

Some of these characteristics might be useful:

  • A misalignment-induced column is likely to be quite narrow.
  • It may be the case that the grid column in question almost always appears as part of a merged cell, never as an isolated cell, throughout the rows of the table.
  • In your example, the column is a leftmost cell in a merge in exactly one row. In the others it is always a rightmost cell in a merge. There might be something there.
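
The first two characteristics could be sketched roughly as follows. This is only an illustration under stated assumptions: `suspect_columns` is a hypothetical name, the `narrow_ratio` threshold of 0.25 is an arbitrary guess, the widths and spans are assumed to have been extracted from the table already, and rows are assumed to start at the first grid column (no `w:gridBefore`).

```python
from statistics import median


def suspect_columns(widths, rows, narrow_ratio=0.25):
    """Return indices of grid columns that look misalignment-induced.

    widths: width of each grid column, in any consistent unit.
    rows:   one list per row holding the grid span of each cell, left to right.
    """
    isolated = set()  # grid columns that appear as a stand-alone cell in some row
    for spans in rows:
        col = 0
        for span in spans:
            if span == 1:
                isolated.add(col)
            col += span

    # "Narrow" here means well below the median column width (assumed threshold).
    cutoff = narrow_ratio * median(widths)
    return [i for i, w in enumerate(widths) if w < cutoff and i not in isolated]


# Illustrative data: five grid columns, the fourth narrow and always merged.
widths = [2000, 2000, 2000, 150, 1850]
rows = [[1, 1, 2, 1], [1, 1, 1, 2]]
print(suspect_columns(widths, rows))  # [3]
```

A grid column is flagged only when both conditions hold, so a wide column that is merged on purpose in every row is not reported.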

Another approach would be to use an (AI) model of some sort, perhaps in conjunction with a rendering of the table; doesn't sound fast though.

In the end I don't believe there is a true answer to the user's intent, just a spectrum of probabilities that the answer to "did the author intend this to be a distinct column" is "yes".

@dennisblaufuss-ci
Author

Okay perfect, I tried a couple of ML/AI-based approaches but none of them seemed to solve it 100%. I had a short look at the column widths and I think this should be the best approach for my problem.

Thank you so much for your help!
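
For reference, the output asked for earlier in the thread (repeat the text of genuinely merged cells, but do not repeat it for a misalignment-induced grid column) might be sketched like this. `expand_row` is a hypothetical helper, and the set of grid columns to ignore is assumed to come from whatever width-based check is chosen:

```python
def expand_row(cells, drop):
    """Expand one row into one text value per kept grid column.

    cells: list of (text, grid_span) pairs, left to right.
    drop:  set of grid-column indices judged to be misalignment-induced.
    """
    out = []
    col = 0
    for text, span in cells:
        for i in range(col, col + span):
            # Repeat the text for every grid column the cell covers,
            # except columns that are being ignored.
            if i not in drop:
                out.append(text)
        col += span
    return out


# Illustrative rows on a 5-column grid where column index 3 is ignored.
header = [("A", 1), ("B", 1), ("C", 2), ("D", 1)]
merged = [("X", 2), ("Y", 1), ("Z", 2)]
print(expand_row(header, {3}))  # ['A', 'B', 'C', 'D']
print(expand_row(merged, {3}))  # ['X', 'X', 'Y', 'Z']
```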
