megaparse.load() returns an empty string instead of parsed content #223
Comments
Hey there, @nataliastrawa! 👋 I'm here to help you out with any bugs, questions, or contribution insights while we wait for a human maintainer. The Unstructured Parser partitions the file into elements and converts them into blocks, but it has limitations, such as incomplete handling of list items and reliance on metadata that may not always be available [2]. These limitations might lead to an empty response if the file content cannot be properly parsed. To troubleshoot, consider the following steps:
If the issue persists, further investigation into the specific content of the PDF and the parser's handling of it might be necessary.
@dosu For the test I used a basic sample PDF from the first page of Google results. I tried using other strategies, but each time the result was the same: an empty string.
If you're consistently getting an empty string with different strategies, it might be worth checking a few additional things:
If none of these steps resolve the issue, it might be helpful to look into the specific content and structure of the PDF to identify any elements that could be causing the parser to fail.
Hey! Thanks for the report. We will investigate this issue shortly 🙂
Similar issue. Calling the partition_pdf function from unstructured directly returns elements, but with megaparse the elements come back empty. I tried different strategies and still hit the same problem. The document contains text, tables, and a few images.
@amine can you check it this week?
We have been having a lot of issues with unstructured in the last week. Is it a more complex issue?
I did a bit of debugging, and the problem comes from the underlying dependency on nltk. Previous versions of unstructured relied on this URL to download the nltk packages, and it now returns a 400: https://utic-public-cf.s3.amazonaws.com/nltk_data.tgz This has been fixed in the latest version of unstructured. There is also a flag to disable the default download behavior. Hope this helps.
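For anyone who wants to confirm the failing download independently, here is a minimal sketch that reports the HTTP status returned for that URL. The helper name `probe_status` and its `opener` parameter are made up for illustration and are not part of unstructured or megaparse; only the standard library is used, and the response body is never read.

```python
import urllib.error
import urllib.request

# URL quoted in the comment above; per that comment, older unstructured
# releases fetched the NLTK packages from it.
NLTK_DATA_URL = "https://utic-public-cf.s3.amazonaws.com/nltk_data.tgz"


def probe_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code the server sends for `url`.

    `opener` is a hypothetical injection point so the helper can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    request = urllib.request.Request(url)
    try:
        # The body is not read; the connection is closed on exit.
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx; the code is what we want.
        return err.code
```

Calling `probe_status(NLTK_DATA_URL)` from a shell session should show whether the server still rejects the request; the exact code may differ from the 400 reported here if the bucket has changed since.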
I did the following (py 3.11): pip install megaparse==0.0.53 (the latest version). Finally I was able to run the basic example. This is clearly not a perfect fix, ref the error, but it got the program working...
The issue was due to the NLTK models dependency; the models can be downloaded with: python3 -m nltk.downloader all
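Before downloading everything, it can help to check which NLTK resources are actually missing locally. The sketch below is a hedged example: the helper name `missing_nltk_resources` is hypothetical, and the resource paths in the usage note are an assumption about what unstructured needs, not something confirmed in this thread. It relies on `nltk.data.find`, which raises `LookupError` when a resource is not installed.

```python
def missing_nltk_resources(resource_paths, finder=None):
    """Return the entries of `resource_paths` that NLTK cannot find locally.

    `finder` is a hypothetical injection point for testing; by default
    nltk.data.find is used, which raises LookupError for absent resources.
    """
    if finder is None:
        # Imported lazily so the helper can be imported without NLTK installed.
        from nltk.data import find as finder

    missing = []
    for path in resource_paths:
        try:
            finder(path)
        except LookupError:
            missing.append(path)
    return missing
```

For example, `missing_nltk_resources(["tokenizers/punkt", "taggers/averaged_perceptron_tagger"])` returns whichever of those two paths is not installed (both paths are assumed here). An empty list would suggest the empty output has a different cause.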
I wanted to test the performance of the parser on a sample file. I used the snippet from the README.
I don't get any error. Only a message is displayed: "Switching to Unstructured Parser".
response is an empty string.
Environment: