Bug: Tables a parsed unexpectedly with misaligned column separating lines #1470

dennisblaufuss-ci · 2025-02-21T17:55:51Z

Hello there,

Note beforehand: This is my first time contributing, so if something is wrong/off please be kind :)

I've encountered something I would call a bug while parsing tables from Word into Python. The simple_example.docx is parsed into Python and saved as tables_test_doc.xlsx for better presentation using the following snippet:

import pandas as pd
from docx import Document

doc = Document("simple_example.docx")
data = [[cell.text for cell in row.cells] for row in doc.tables[0].rows]
df = pd.DataFrame(data)

df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
df.to_excel("tables_test_doc.xlsx", index=False)

What causes this "bug" is the misalignment of the columns: the column-separating line between C and D is slightly shifted in the second row.

Is there any workaround? Am I missing something?

Again, first timer here, if anything is off feel free to correct me :) If this is not seen as a bug feel free to correct as well!

Thanks a ton ahead for any hints
Dennis


scanny · 2025-02-22T18:32:56Z

You have not described or demonstrated the unexpected behavior.

dennisblaufuss-ci · 2025-02-24T08:14:42Z

Hi Steve, thanks for the fast reply! Okay, to be precise here: I would expect an output table with four columns A, B, C, D. But the package extracts 5 columns: C appears twice in the output. This is shown in the two files attached.

scanny · 2025-02-24T16:14:17Z

@dennisblaufuss-ci If you inspect the XML you'll see that each row actually does have 5 grid-columns.

This is how Word deals with the "misaligned" column boundaries; it inserts another grid column and uses cell-merging to produce the "visual" layout of a cell boundary not aligned with those in rows above and below.
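One way to see those grid columns without opening the file by hand is to read the table XML directly. The following is a minimal sketch using only the standard library; the function name `table_grid` and the sample XML (including its widths) are made up for illustration and are not taken from the attached document. It ignores `w:gridBefore`/`w:gridAfter` and nested tables.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def table_grid(tbl: ET.Element) -> tuple[list[int], list[list[int]]]:
    """Return (grid-column widths, per-row list of gridSpan values) for a <w:tbl>."""
    widths = [
        int(gc.get(f"{{{W}}}w", "0"))
        for gc in tbl.findall("w:tblGrid/w:gridCol", NS)
    ]
    rows = []
    for tr in tbl.findall("w:tr", NS):
        spans = []
        for tc in tr.findall("w:tc", NS):
            span = tc.find("w:tcPr/w:gridSpan", NS)
            # a cell without <w:gridSpan> occupies exactly one grid column
            spans.append(int(span.get(f"{{{W}}}val")) if span is not None else 1)
        rows.append(spans)
    return widths, rows


# Hypothetical table: four visual columns, five grid columns (index 3 is narrow).
# Cell contents are omitted to keep the sample short.
SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tblGrid>
    <w:gridCol w:w="2000"/><w:gridCol w:w="2000"/><w:gridCol w:w="2000"/>
    <w:gridCol w:w="120"/><w:gridCol w:w="2000"/>
  </w:tblGrid>
  <w:tr>
    <w:tc/><w:tc/>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc>
    <w:tc/>
  </w:tr>
  <w:tr>
    <w:tc/><w:tc/><w:tc/>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc>
  </w:tr>
</w:tbl>
""".strip()

widths, rows = table_grid(ET.fromstring(SAMPLE))
print(widths)  # [2000, 2000, 2000, 120, 2000]
print(rows)    # [[1, 1, 2, 1], [1, 1, 1, 2]]
```

For a real file, the same function could be applied to each `w:tbl` element found in `word/document.xml`, which can be read from the .docx with `zipfile`. In every row the spans add up to the number of grid columns, which is why a per-grid-column view of the row shows five cells.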

Something like this will give you the result you're after I think. Note there is no guarantee that each row will have the same number of cells:

from typing import Iterator

from docx import Document
from docx.table import _Cell, _Row, Table


def iter_row_cells(row: _Row, table: Table) -> Iterator[_Cell]:
    tr = row._tr
    for tc in tr.tc_lst:
        # -- vMerge="continue" indicates a vertically spanned cell, skip that so it doesn't repeat --
        if tc.vMerge == "continue":
            continue
        # -- generate horizontal merges as a single cell --
        yield _Cell(tc, table)


# -- the function must be defined before this loop runs --
table = Document("simple_example.docx").tables[0]
for row in table.rows:
    print(list(iter_row_cells(row, table)))

dennisblaufuss-ci · 2025-02-25T08:34:05Z

@scanny Thanks for the hint! I did indeed try to solve this by looking purely at the XML as well and found a solution that gives almost the same results. However, cells that are merged on purpose are now not accounted for correctly. Is there any solution that would differentiate between cells that are really merged (visually, looking at them) and cells that are just misaligned and thus treated as merged (like the cells in my example)?

My optimal output would treat "really merged" cells just as docx does out of the box, copying the content into both cells, while "misaligned merged" cells are treated the way the XML/your solution does it.

Would you have any thoughts on that? Thanks ahead for your support!
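The output described above can be sketched as follows, assuming the set of misalignment-induced grid columns is already known (deciding that set is the hard part and is not solved here). The name `expand_row` and the sample values are hypothetical; the spans would come from each cell's `w:gridSpan` value, and vertical merges are left out.

```python
from __future__ import annotations


def expand_row(texts: list[str], spans: list[int], spurious: set[int]) -> list[str]:
    """Repeat each cell's text once per grid column it covers, skipping spurious columns.

    texts    -- text of each physical cell in the row, left to right
    spans    -- gridSpan of each of those cells (1 when not merged)
    spurious -- indices of grid columns assumed to exist only because of misalignment
    """
    out: list[str] = []
    col = 0
    for text, span in zip(texts, spans):
        kept = sum(1 for c in range(col, col + span) if c not in spurious)
        # never drop a cell's text entirely, even if it only covers spurious columns
        out.extend([text] * max(kept, 1))
        col += span
    return out


# Grid column 3 is assumed to be the narrow, misalignment-induced one.
print(expand_row(["A", "B", "C", "D"], [1, 1, 2, 1], {3}))  # ['A', 'B', 'C', 'D']
print(expand_row(["1", "2", "3", "4"], [1, 1, 1, 2], {3}))  # ['1', '2', '3', '4']
# A deliberate merge across grid columns 0 and 1 is still duplicated.
print(expand_row(["x", "y", "z"], [2, 2, 1], {3}))          # ['x', 'x', 'y', 'z']
```

Each row comes out with one entry per remaining grid column, so the rows line up as a rectangular table as long as no spurious column ever appears as a standalone cell.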

scanny · 2025-02-25T22:10:27Z

I can't say I've given it much thought, but my initial impression would be you'd have to rely on heuristics of some sort, likely involving the effective width of a grid column.

Some of these characteristics might be useful:

A misalignment-induced column is likely to be quite narrow.
It may be the case that the grid column in question almost always appears as part of a merged cell, never as an isolated cell, throughout the rows of the table.
In your example, the column is a leftmost cell in a merge in exactly one row. In the others it is always a rightmost cell in a merge. There might be something there.
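The first two characteristics could be combined into a rough filter along these lines. This is only a sketch of one possible heuristic: the function name `suspect_grid_columns`, the `narrow_ratio` threshold of 0.25, and the sample numbers are all assumptions chosen for illustration, and any real table may need different values.

```python
from __future__ import annotations

from statistics import median


def suspect_grid_columns(
    widths: list[int], rows: list[list[int]], narrow_ratio: float = 0.25
) -> list[int]:
    """Return indices of grid columns that look misalignment-induced.

    A column is flagged when it is narrower than `narrow_ratio` times the median
    grid-column width AND no row contains it as a standalone (gridSpan == 1) cell.

    widths -- width of each grid column, in any consistent unit
    rows   -- for each row, the gridSpan of each cell, left to right
    """
    limit = narrow_ratio * median(widths)
    always_merged = [True] * len(widths)
    for spans in rows:
        col = 0
        for span in spans:
            if span == 1 and col < len(widths):
                always_merged[col] = False  # seen as an isolated cell in this row
            col += span
    return [i for i, w in enumerate(widths) if w < limit and always_merged[i]]


# Hypothetical numbers: five grid columns, the fourth one very narrow.
widths = [2000, 2000, 2000, 120, 2000]
rows = [[1, 1, 2, 1], [1, 1, 1, 2], [1, 1, 2, 1]]
print(suspect_grid_columns(widths, rows))  # [3]
```

A wide column that happens to be merged in every row is not flagged, and neither is a narrow column that stands alone in at least one row, so both conditions have to agree before a column is treated as an artifact.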

Another approach would be to use an (AI) model of some sort, perhaps in conjunction with a rendering of the table; doesn't sound fast though.

In the end I don't believe there is a true answer to the user's intent, just a spectrum of probabilities that the answer to "did the author intend this to be a distinct column" is "yes".

dennisblaufuss-ci · 2025-02-26T11:50:43Z

Okay perfect, I tried a couple of ML/AI based approaches but none of them seemed to solve it 100%. Had a short look into the column width and I think this should be the best approach for my problem.

Thank you so much for your help!

scanny closed this as completed Feb 24, 2025

dennisblaufuss-ci changed the title ~~Bug: Tables a parsed faulty with (slightly) shifted column separating lines~~ Bug: Tables a parsed unexpectedly with misaligned column separating lines Feb 25, 2025
