Constructor throws "File is not a zip file" on file created using Word #1452

Closed
swarmttied opened this issue Dec 3, 2024 · 5 comments

Comments

@swarmttied

Hi. I'm trying to parse a docx using the BytesIO overload of the Document constructor. When I parse any docx created using python-docx, the code snippet below works just fine. However, with files saved from Word, I get the "File is not a zip file" error. I notice that the Word-created files are larger than those created using python-docx, which makes me suspect they are not zip files. Is there a way to parse a non-zipped stream in this case? I'm on Python 3.11.

image

@scanny
Contributor

scanny commented Dec 3, 2024

I'm not seeing any doc.save() call. What's the traceback?

I don't know what the type of stream is, but it will need to be IO[bytes], meaning either an open file or an instance of io.BytesIO, not just bytes.
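To illustrate the distinction, here is a minimal sketch using only the standard library. It builds a tiny zip archive in memory as a stand-in for a .docx payload (the archive contents are made up for illustration), then wraps the raw bytes in io.BytesIO so a zip reader can treat them as a file. The same wrapped stream is what would be handed to Document.

```python
import io
import zipfile

# Build a small zip archive in memory as a stand-in for a .docx payload.
# (A real .docx has many more parts; this one is only illustrative.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
raw = buf.getvalue()  # plain bytes, e.g. what a download might return

# Plain bytes have no read()/seek(), so they are not a file-like object.
bytes_is_file_like = hasattr(raw, "read")

# Wrapping the bytes in BytesIO gives a seekable binary stream.
stream = io.BytesIO(raw)
with zipfile.ZipFile(stream) as zf:
    names = zf.namelist()

stream.seek(0)  # rewind before passing the stream to another reader
```

After the rewind, `stream` is in the state a zip-based reader expects: binary, seekable, and positioned at the start.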

@swarmttied
Author

@scanny, the exception is thrown in the constructor, so I never get the chance to save the file. Just curious: if I'm only parsing, why would I call save()?

The other files, created using python-docx, work. Only those created in Microsoft Word fail. I got the stream using this function:

image

It is based on Microsoft's suggestion here:

image

And lastly, the traceback:

image

I hope these make things clear. Thanks for looking into this.

@scanny
Contributor

scanny commented Dec 3, 2024

It looks like the problem is getting from Azure to BytesIO. I expect that if you just save the BytesIO contents to a file, Word will not open it, and probably a zip tool won't either. Maybe the .readinto() mechanism defaults to text (str) rather than bytes; I'm not familiar with it. But the problem occurs before you get to python-docx.
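One quick way to test that theory before involving python-docx is to look at the first bytes of the stream. A non-empty zip archive normally starts with the local file header signature `PK\x03\x04`. The sketch below is only an assumption about how such a check might look; `looks_like_zip` is a hypothetical helper, not part of python-docx or the Azure SDK.

```python
import io

# Local file header signature that begins a non-empty zip archive.
ZIP_MAGIC = b"PK\x03\x04"

def looks_like_zip(stream) -> bool:
    """Return True if the stream starts with the zip signature.

    Hypothetical helper: reads four bytes and restores the position,
    so the stream can still be passed to another reader afterwards.
    """
    pos = stream.tell()
    head = stream.read(4)
    stream.seek(pos)
    return head == ZIP_MAGIC

good = looks_like_zip(io.BytesIO(b"PK\x03\x04...rest of archive"))
bad = looks_like_zip(io.BytesIO(b"&.\x03\xc2\x9c...mangled bytes"))
```

If the check fails on the downloaded stream, the bytes were altered somewhere between storage and BytesIO, which is consistent with the error being raised in the constructor.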

@swarmttied
Author

swarmttied commented Dec 3, 2024

They do open after saving to Azure. I'm ruling out BytesIO as the culprit because docx files created using python-docx just work.
I suspect it's an effect of saving through MS Word. I'll try experimenting with other versions of the library and Python for the time being.

Thank you for your time @scanny.

@swarmttied
Author

My colleague figured this out. It was the cp037 encoding used by Word.
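The thread does not say where cp037 entered the pipeline, so the following is only an illustration of why a text codec applied to binary data can produce this error. cp037 is a single-byte EBCDIC code page, so decoding arbitrary bytes with it succeeds silently, and re-encoding the resulting text with a different codec yields bytes that no longer start with the zip signature.

```python
# Local file header signature that begins a non-empty zip archive.
ZIP_MAGIC = b"PK\x03\x04"

# Decoding binary data as cp037 succeeds because every byte value
# maps to some character in that code page.
as_text = ZIP_MAGIC.decode("cp037")

# Re-encoding with a different codec changes the bytes, so the
# result no longer begins with the zip signature.
re_encoded = as_text.encode("utf-8")

# Encoding with the same codec recovers the original bytes.
restored = as_text.encode("cp037")

signature_survives = re_encoded.startswith(ZIP_MAGIC)
round_trip_ok = restored == ZIP_MAGIC
```

Here `signature_survives` is False and `round_trip_ok` is True, which suggests that keeping the stream in binary form end to end, or applying the matching codec consistently, avoids the corruption.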
