Error Loading Word Documents - File is not a zip file #8212

hodgesz · 2023-07-25T01:59:35Z


Hello,

I am running Python 3.11, and the following is my sample code for indexing some Word documents for retrieval augmentation.

loader = UnstructuredWordDocumentLoader("/Users/blah/LangChain/data/blah/load/**/*.docx")
docs = loader.load()

I first tried a wildcard over a directory of .docx files, but got the 'File is not a zip file' error. I then tried just one Word file.

loader = UnstructuredWordDocumentLoader("/Users/blah/LangChain/data/blah/load/sem-1001-study_guide.docx")
docs = loader.load()

However, I get the same error and here is the full trace:

Traceback (most recent call last):
  File "/Users/blah/PycharmProjects/blah/blah_ai_workshop/code/indexer.py", line 135, in <module>
    main()
  File "/Users/blah/PycharmProjects/blah/blah_ai_workshop/code/indexer.py", line 131, in main
    create_index()
  File "/Users/blah/PycharmProjects/blah/blah_ai_workshop/code/indexer.py", line 90, in create_index
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/langchain/document_loaders/word_document.py", line 121, in _get_elements
    from unstructured.partition.docx import partition_docx
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/unstructured/partition/docx.py", line 33, in <module>
    from unstructured.partition.text_type import (
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/zipfile.py", line 1302, in __init__
    self._RealGetContents()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/langchain_dev/lib/python3.11/zipfile.py", line 1369, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Process finished with exit code 1

Any thoughts on what I am doing wrong? I have typically used the PyPDFLoader for PDFs and this is my first attempt with Word documents, so it could certainly be my setup.

Answered by dosubot[bot]

Jul 25, 2023


hodgesz · 2023-07-25T04:33:36Z

dosubot[bot]
bot Jul 25, 2023

Answer generated by a 🤖

Answer

From your description, you're encountering an error when loading Word documents with the UnstructuredWordDocumentLoader in LangChain. The traceback shows that the error does not come from reading the Word document itself: it is raised while importing unstructured, when NLTK's nltk.find tries to open one of its data packages as a zip archive and fails. This is most likely due to a corrupted or incompletely downloaded NLTK data package.

To address this issue, I suggest manually downloading the NLTK packages. You can do this by running the following code:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

These packages ('punkt' and 'averaged_perceptron_tagger') provide tokenization and part-of-speech tagging. The traceback shows unstructured importing pos_tag, sent_tokenize, and word_tokenize from unstructured.nlp.tokenize, which is where the NLTK lookup fails.

After downloading these packages, please try running your code again. If the error persists, it might be due to a different NLTK package. In that case, you might need to download the entire NLTK package collection using nltk.download('all'). Please be aware that downloading all NLTK packages will require a significant amount of disk space.
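If re-downloading does not help, it may be worth finding out which archive is actually broken. The following is a minimal sketch using only the standard library, not part of NLTK or LangChain: `find_corrupt_zips` is a hypothetical helper, and the directories you pass to it are an assumption about where your NLTK data lives. It applies the same validity check that produces "File is not a zip file" in the traceback.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def find_corrupt_zips(data_dirs: Iterable[str]) -> List[Path]:
    """Return .zip files under the given directories that are not valid zip archives.

    Hypothetical helper for illustration; it is not an NLTK or LangChain API.
    """
    corrupt = []
    for data_dir in data_dirs:
        root = Path(data_dir).expanduser()
        if not root.is_dir():
            # Search paths often list directories that do not exist; skip them.
            continue
        for candidate in sorted(root.rglob("*.zip")):
            # is_zipfile looks for the zip end-of-central-directory record,
            # which is what zipfile.ZipFile fails to find in the traceback.
            if not zipfile.is_zipfile(candidate):
                corrupt.append(candidate)
    return corrupt
```

One way to use it is `find_corrupt_zips(nltk.data.path)`, since `nltk.data.path` is the list of directories NLTK searches. Deleting any file it reports and then running `nltk.download(...)` again should fetch a fresh copy.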

It's also worth noting that the UnstructuredWordDocumentLoader class in LangChain supports both .doc and .docx files. However, partitioning .doc files is only supported in unstructured>=0.4.11. If you're using .doc files and your unstructured version is less than 0.4.11, you'll need to upgrade the unstructured package.

Alternatively, you can consider using the Docx2txtLoader class for loading .docx files. This class uses the docx2txt package, which you can install using pip install docx2txt. Here's how you can use it:

from langchain.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("example_data/fake.docx")
data = loader.load()
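Regarding the wildcard path in your first attempt: as far as I can tell, these loaders take the path of a single file and do not expand patterns such as `**/*.docx`. A minimal sketch, assuming you want every .docx file under one directory, is to expand the pattern yourself and load the files one at a time. `find_docx_files` and `load_all` are hypothetical helpers, and `load_one` stands in for whichever loader you choose.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def find_docx_files(root: str) -> List[Path]:
    """Return all .docx files under root, recursively, in sorted order."""
    return sorted(
        p
        for p in Path(root).rglob("*.docx")
        # Skip Word's temporary lock files (names starting with "~$"),
        # which are not real documents and are not valid zip archives.
        if p.is_file() and not p.name.startswith("~$")
    )


def load_all(root: str, load_one: Callable[[str], Iterable]) -> list:
    """Call load_one on each .docx path under root and collect the results."""
    docs = []
    for path in find_docx_files(root):
        docs.extend(load_one(str(path)))
    return docs
```

For example, `load_all("/Users/blah/LangChain/data/blah/load", lambda p: Docx2txtLoader(p).load())` would collect the documents from every file found. LangChain's DirectoryLoader, with its glob and loader_cls arguments, may be a more direct option if it fits your setup.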

This is one potential solution to your problem. If you find that this solution works and you believe it's a bug that could impact other users, I encourage you to make a pull request. If this solution doesn't align with your experience, I recommend upgrading to the latest version of LangChain.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

1 reply

hodgesz Jul 25, 2023
Author

Really appreciate your reply. Downloading the NLTK packages fixed my issue.
