Hello, I am running Python 3.11, and the following is my sample code for indexing some Word documents for retrieval augmentation.
I first tried a wildcard over a directory of .docx files, but got the 'File is not a zip file' error. I then tried a single Word file.
However, I get the same error, and here is the full trace:
Any thoughts on what I am doing wrong? I have typically used PyPDFLoader for PDFs, and this is my first attempt with Word documents, so it could certainly be my setup.
Answer generated by a 🤖
Answer
From your description, you are hitting an error when loading Word documents with the UnstructuredWordDocumentLoader in LangChain. The error message suggests that the problem is not in the loading of the Word document itself, but in the NLTK library, which tries to find a package and fails to open it as a zip file. This could be due to a corrupted NLTK data package.
To address this, I suggest manually downloading the NLTK packages by running the following code:
    import nltk
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
These packages ('punkt' and 'averaged_perceptron_tagger') are commonly used for tokenization and part-of-speech tagging, which may be used in the document loading process. After downloading them, try running your code again. If the error persists, a different NLTK package may be at fault, in which case you might need to download the entire NLTK package collection.
Note also that the UnstructuredWordDocumentLoader class in LangChain supports both .doc and .docx files. However, partitioning .doc files is only supported in unstructured>=0.4.11. If you are using .doc files and your unstructured version is older than 0.4.11, you will need to upgrade the unstructured package.
Alternatively, consider the Docx2txtLoader class for loading .docx files. This class uses the docx2txt package, which must be installed separately. For example:
    from langchain.document_loaders import Docx2txtLoader
    loader = Docx2txtLoader("example_data/fake.docx")
    data = loader.load()
This is one potential solution to your problem. If it works and you believe the underlying issue is a bug that could affect other users, I encourage you to make a pull request. If it does not match what you are seeing, I recommend upgrading to the latest version of LangChain.
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
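One way to test the corrupted-package theory is to scan for archives that Python cannot open as ZIP files. The following is only a minimal sketch: the helper name find_bad_archives and its parameters are hypothetical, it relies solely on the standard-library zipfile module, and it assumes the failing file is either an NLTK data .zip or the .docx itself (a .docx file is a ZIP container, so a damaged one fails the same way).

```python
import zipfile
from pathlib import Path


def find_bad_archives(search_dirs, suffixes=(".zip",)):
    """Return files under search_dirs that carry one of `suffixes`
    but cannot be read as ZIP archives (hypothetical helper)."""
    bad = []
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # candidate directories may simply not exist
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue
            try:
                with zipfile.ZipFile(path) as archive:
                    # testzip() returns the first member with a bad CRC, or None
                    corrupt_member = archive.testzip()
            except zipfile.BadZipFile:
                # this is the exception behind "File is not a zip file"
                bad.append(path)
            else:
                if corrupt_member is not None:
                    bad.append(path)
    return bad
```

To check NLTK data, pass nltk.data.path (the list of directories NLTK searches) as search_dirs. To check the documents instead, pass the folder of Word files with suffixes=(".docx",). Any path it reports is a candidate to delete and download again, or to re-save from Word.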