Incorrect parsing of list items - missing tabs/spaces. #202

Open
botzill opened this issue Feb 1, 2016 · 9 comments

@botzill
Contributor

botzill commented Feb 1, 2016

Hi.

I have this file:

subsections_format.docx

After converting it to HTML, we get:

2ea64158-ae83-11e5-8ed4-d56bff32bef1

As you can see, the subsections are not formatted properly. If you can point me to where this is handled in the code, I can submit a pull request to fix it.

Thanks a lot.

@botzill changed the title from "Format section/subsection - missing tabs/spaces." to "Incorrect parsing of list items - missing tabs/spaces." on Feb 1, 2016
@jhubert
Contributor

jhubert commented Mar 13, 2016

Before:

image

After:

image

@winhamwr
Contributor

It looks like there is one definite bug, along with some confusion about the styling.

"Gather Items for Re-pricing" should definitely be in the same list as "Prepare your markdown gun" and it's not obvious to me why it isn't.

The first step is to add a fixture test case: a .docx file and a matching .html file in the fixtures directory. That lets us define the input and the expected output.

If anyone could help with that part, it would be appreciated. From there, someone will need to dive into the OOXML in the .docx to figure out why we're parsing it as separate lists instead of one list.
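
A minimal sketch of what such a fixture check could look like, for illustration only. It is not pydocx's actual test harness: `fixture_pairs`, `normalize`, `check_fixture`, and the `convert` callable are hypothetical names, and the demo uses a placeholder file and a fake converter in place of a real .docx conversion.

```python
import tempfile
from pathlib import Path


def fixture_pairs(fixtures_dir):
    """Yield (docx_path, html_path) for every .docx that has a sibling .html."""
    for docx in sorted(Path(fixtures_dir).glob("*.docx")):
        html = docx.with_suffix(".html")
        if html.exists():
            yield docx, html


def normalize(html):
    """Strip per-line indentation and line breaks so layout differences are ignored."""
    return "".join(line.strip() for line in html.splitlines())


def check_fixture(docx_path, html_path, convert):
    """Compare the expected HTML with whatever `convert` (hypothetical) returns."""
    expected = normalize(html_path.read_text(encoding="utf-8"))
    actual = normalize(convert(docx_path))
    return expected == actual


# Demo with a placeholder fixture and a fake converter (both assumptions).
with tempfile.TemporaryDirectory() as tmp:
    docx = Path(tmp, "nested_multitype_lists.docx")
    docx.write_bytes(b"placeholder, not a real docx")
    docx.with_suffix(".html").write_text(
        "<ol>\n  <li>one</li>\n</ol>\n", encoding="utf-8"
    )

    def fake_convert(path):
        return "<ol><li>one</li></ol>"

    results = [check_fixture(d, h, fake_convert) for d, h in fixture_pairs(tmp)]

print(results)
```

In a real test, `fake_convert` would be replaced by the project's own conversion entry point, and the expected .html would be written by hand from the desired output.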

jhubert added a commit to jhubert/pydocx that referenced this issue Mar 15, 2016
@jhubert
Contributor

jhubert commented Mar 15, 2016

I dove in and took a look at the OOXML for this. I've added the fixtures as well.

It looks like the numbered list is being treated as three different lists because the bulleted list breaks it up.

image

Here is the simplified relevant document.xml OOXML:

<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>one</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>two</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>three</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>AAA</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>BBB</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>CCC</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="2"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>alpha</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>four</w:t></w:r></w:p>
<w:p/>
<w:p/>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr></w:pPr><w:r><w:t>xxx</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="6"/></w:numPr></w:pPr><w:r><w:t>yyy</w:t></w:r></w:p>
<w:p/>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr></w:pPr><w:r><w:t>www</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="7"/></w:numPr></w:pPr><w:r><w:t>zzz</w:t></w:r></w:p>

The full document.xml OOXML is beautified in this gist: https://gist.github.com/jhubert/29f7899073b765e74297
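
To make the split easier to see, here is a rough sketch that reads a trimmed copy of the first eight paragraphs above and groups consecutive paragraphs by `w:numId`. This is only an illustration of naive grouping, not pydocx's actual list-building logic. The `w:body` wrapper with its namespace declaration is added so the snippet parses on its own, and `list_items` and `runs_by_num_id` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed copy of the first list (rsid and tab attributes dropped),
# wrapped in a root element so it is well-formed XML.
BODY = f"""<w:body xmlns:w="{W}">
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>one</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>two</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>three</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr></w:pPr><w:r><w:t>AAA</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr></w:pPr><w:r><w:t>BBB</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr></w:pPr><w:r><w:t>CCC</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="2"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>alpha</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>four</w:t></w:r></w:p>
</w:body>"""


def list_items(xml_text):
    """Return (numId, ilvl, text) for every paragraph that has a w:numPr."""
    root = ET.fromstring(xml_text)
    items = []
    for p in root.iter(f"{{{W}}}p"):
        num_pr = p.find(f"{{{W}}}pPr/{{{W}}}numPr")
        if num_pr is None:
            continue  # plain paragraph, not a list item
        num_id = num_pr.find(f"{{{W}}}numId").get(f"{{{W}}}val")
        ilvl = int(num_pr.find(f"{{{W}}}ilvl").get(f"{{{W}}}val"))
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        items.append((num_id, ilvl, text))
    return items


def runs_by_num_id(items):
    """Group consecutive items that share a numId into one run each."""
    return [
        (num_id, [text for _, _, text in group])
        for num_id, group in groupby(items, key=lambda item: item[0])
    ]


runs = runs_by_num_id(list_items(BODY))
for num_id, texts in runs:
    print(num_id, texts)
```

Grouping only consecutive paragraphs with the same `numId` yields three runs (`1`, `5`, `1`), so "four" ends up detached from "one" through "three" even though both use `numId` 1.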

@kylegibson
Contributor

Can you include numbering.xml and styles.xml as well?

@jhubert
Contributor

jhubert commented Mar 15, 2016

Of course. Gist updated: https://gist.github.com/jhubert/29f7899073b765e74297

Also, here is the docx file:
nested_multitype_lists.docx

@jhubert
Contributor

jhubert commented Apr 12, 2016

@kylegibson I'm about to work on this issue. Have you already started?

@kylegibson
Contributor

Hi Jeremy. None of us have started work on this issue. I expect it will be a while before we have time to dedicate to fixing this. We'll be happy to review any PRs that you submit!

@jhubert
Contributor

jhubert commented Apr 12, 2016

Awesome. Good to know. We'll get a PR together. :)

@jhubert
Contributor

jhubert commented Feb 9, 2017

The fix that @botzill put together in #225 is now live in production. No issues so far. 💯


4 participants