megaparse.load() returns an empty string instead of parsed content #223

Closed
nataliastrawa opened this issue Jan 30, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@nataliastrawa

I wanted to test the performance of the parser on a sample file. I used the snippet from README.

from megaparse import MegaParse
from langchain_openai import ChatOpenAI

megaparse = MegaParse()
response = megaparse.load("/content/sample.pdf")
print(response)

I don't get any error. Only a message is displayed: "Switching to Unstructured Parser". response is an empty string.

Environment:

Platform: Google Colab
OS: Ubuntu 22.04.4 LTS
Python: 3.11.11
megaparse: 0.0.53


dosubot bot commented Jan 30, 2025

Hey there, @nataliastrawa! 👋 I'm here to help you out with any bugs, questions, or contribution insights while we wait for a human maintainer.

The issue of megaparse.load() returning an empty string could be related to the conditions under which the Unstructured Parser is used. MegaParse switches to the Unstructured Parser if the file extension is not PDF or if the strategy is set to StrategyEnum.FAST. Additionally, if the global strategy is not StrategyEnum.HI_RES, the Unstructured Parser is used [1].

The Unstructured Parser partitions the file into elements and converts them into blocks, but it has limitations, such as incomplete handling of list items and reliance on metadata that may not always be available [2]. These limitations might lead to an empty response if the file content cannot be properly parsed.

To troubleshoot, consider the following steps:

  1. Ensure that the sample PDF file is correctly formatted and contains parseable content.
  2. Check if the file has any special formatting or elements that might not be supported by the Unstructured Parser.
  3. Try using a different strategy, if possible, to see if it affects the output.

If the issue persists, further investigation into the specific content of the PDF and the parser's handling of it might be necessary.


@dosubot dosubot bot added the bug Something isn't working label Jan 30, 2025
@nataliastrawa
Author

@dosu For the test I used a basic sample PDF from the first page of Google results. I tried other strategies, but each time the result was the same: an empty string.


dosubot bot commented Jan 30, 2025

If you're consistently getting an empty string with different strategies, it might be worth checking a few additional things:

  1. File Content: Ensure that the PDF file actually contains text content and is not just an image or scanned document. The Unstructured Parser might not handle image-based PDFs well without OCR capabilities.

  2. Parser Configuration: Double-check the parser configuration to ensure that it's set up correctly. You might want to explicitly set the strategy to StrategyEnum.HI_RES if you haven't tried that yet, as it uses the DoctrParser for high-resolution text recognition, which might handle certain PDFs better.

  3. PDF Structure: Some PDFs have complex structures or use fonts and encodings that can confuse parsers. Try using a different PDF with a simpler structure to see if the issue persists.

  4. Library Version: Make sure that all dependencies are up to date. Sometimes, issues are resolved in newer versions of the library or its dependencies.

If none of these steps resolve the issue, it might be helpful to look into the specific content and structure of the PDF to identify any elements that could be causing the parser to fail.
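
One way to rule out the image-only case mentioned above is to look for a text layer before handing the file to any parser. The sketch below assumes the third-party pypdf package is installed (it is not mentioned anywhere in this thread); page_texts and has_text_layer are hypothetical helper names, and the 20-character threshold is an arbitrary choice.

```python
from typing import Iterable, List


def page_texts(pdf_path: str) -> List[str]:
    """Extract the raw text of every page using pypdf (assumed to be installed)."""
    # Imported lazily so has_text_layer() stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True if the pages hold at least `min_chars` non-whitespace characters."""
    total = sum(len("".join(text.split())) for text in texts)
    return total >= min_chars
```

Calling has_text_layer(page_texts("/content/sample.pdf")) on the file from the report would indicate whether OCR is needed at all. If it returns True and the parser still returns an empty string, the PDF itself is probably not the cause.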


Contributor

Hey! Thanks for the report. We will investigate this issue shortly 🙂

@rhlarora84

Similar issue. Calling the partition_pdf function from unstructured directly returns elements, but with megaparse the elements come back empty.

I tried different strategies and still got the same problem. The document contains text, tables and a few images.
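
The comparison described above can be sketched roughly as follows. This assumes unstructured is installed with its PDF extras; summarize_elements and partition_directly are hypothetical helper names, and "fast" is just one of the strategies unstructured accepts.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_elements(elements: Iterable[Any]) -> Dict[str, int]:
    """Count elements by category, skipping elements whose text is empty."""
    counts: Counter = Counter()
    for element in elements:
        text = (getattr(element, "text", "") or "").strip()
        if not text:
            continue
        # unstructured elements expose a `category`; fall back to the class name.
        category = getattr(element, "category", None) or type(element).__name__
        counts[category] += 1
    return dict(counts)


def partition_directly(pdf_path: str, strategy: str = "fast") -> Dict[str, int]:
    """Call unstructured without megaparse in between (requires unstructured)."""
    # Imported lazily: unstructured is heavy and may not be installed.
    from unstructured.partition.pdf import partition_pdf

    return summarize_elements(partition_pdf(filename=pdf_path, strategy=strategy))
```

A non-empty summary from partition_directly combined with an empty megaparse result would point at the integration or the environment rather than at the PDF.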

Contributor

@amine can you check it this week?

Contributor

We have been having a lot of issues with unstructured over the last week. Is it a more complex issue?

@rhlarora84

I did a bit of debugging, and it has to do with the underlying dependency on nltk. Previous versions of unstructured relied on the URL below to download the nltk packages, and it now returns a 400. This has been fixed in the latest version of unstructured. There is also a flag to disable the default download behavior.

https://utic-public-cf.s3.amazonaws.com/nltk_data.tgz


Hope this helps.
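
For anyone who wants to confirm the status of that URL from their own environment, a minimal standard-library sketch follows. probe_status is a hypothetical helper name, the opener parameter exists only so the function can be exercised without network access, and using a HEAD request is an assumption made to avoid downloading the archive.

```python
import urllib.error
import urllib.request
from typing import Callable

NLTK_DATA_URL = "https://utic-public-cf.s3.amazonaws.com/nltk_data.tgz"


def probe_status(url: str, opener: Callable = urllib.request.urlopen, timeout: float = 10.0) -> int:
    """Return the HTTP status code for `url`, including error codes such as 400."""
    # A HEAD request asks for the headers only, not the archive body.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still the answer we want.
        return err.code
```

Running print(probe_status(NLTK_DATA_URL)) shows whether the download source is reachable at that moment. Network failures other than HTTP errors (DNS, timeouts) are deliberately left to propagate.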

@telboth

telboth commented Feb 4, 2025

I did the following (py 3.11):

pip install megaparse==0.0.53 (the latest version)
then
pip uninstall unstructured
then
pip install unstructured (which upgraded it from 0.15 to 0.16 and produced the error below:)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
megaparse 0.0.53 requires unstructured[all-docs]==0.15.0, but you have unstructured 0.16.17 which is incompatible.

Finally I was able to run the basic example.

This is clearly not a perfect fix, given the error above, but it got the program working...
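
To see which versions ended up installed after a workaround like this, the standard library's importlib.metadata can report them along with the requirement strings a package declares. This is a sketch only; installed_version, pinned_requirements and report are hypothetical helper names.

```python
from importlib import metadata
from typing import Dict, List, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pinned_requirements(package: str, dependency: str) -> List[str]:
    """List the requirement strings of `package` that start with `dependency`."""
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []
    return [req for req in requirements if req.lower().startswith(dependency.lower())]


def report(packages: Sequence[str] = ("megaparse", "unstructured", "nltk")) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None when absent)."""
    return {name: installed_version(name) for name in packages}
```

Printing report() and pinned_requirements("megaparse", "unstructured") side by side makes a mismatch like the one in the pip error visible without re-running the installer.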

@AmineDiro
Collaborator

The issue was due to the NLTK models dependency in unstructured. The URL seems to be up now. Running the following should fix the empty response:

python3 -m nltk.downloader all
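
Before or after running the downloader, it can be useful to check whether the NLTK data is actually present. The sketch below is hedged: the two resource paths are assumptions (the thread does not say which NLTK packages unstructured needs), and missing_resources is a hypothetical helper. It relies on nltk.data.find, which raises LookupError when a resource is absent.

```python
from typing import Callable, Iterable, List, Optional

# Assumed resource paths; adjust to whatever your unstructured version requires.
ASSUMED_RESOURCES = ("tokenizers/punkt", "taggers/averaged_perceptron_tagger")


def missing_resources(
    resources: Iterable[str] = ASSUMED_RESOURCES,
    find: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Return the resources that the lookup function cannot locate."""
    if find is None:
        # Imported lazily so the function can be tested without nltk installed.
        import nltk

        find = nltk.data.find
    missing = []
    for resource in resources:
        try:
            find(resource)
        except LookupError:
            # nltk.data.find raises LookupError for resources that are not downloaded.
            missing.append(resource)
    return missing
```

An empty list from missing_resources() suggests the data is in place; anything else names the resources that are still missing.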
