The crawler silently falls back to markdown when the requested content_format is not supported. This was very confusing as a user; it should probably throw an error, or at least output a warning.
It should also support cleaned_html, which is often 10x smaller and better suited than markdown for element extraction using LLMs, layout interpretation, etc.
I think all the changes needed are here:
content = {
    "markdown": markdown,
    "html": html,
    "cleaned_html": cleaned_html,  # New
    "fit_markdown": markdown_result.raw_markdown,
}.get(content_format, markdown)

# Use IdentityChunking for HTML input, otherwise use provided chunking strategy
chunking = (
    IdentityChunking()
    if content_format in ("html", "cleaned_html")  # New
    else config.chunking_strategy
)
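The snippet above adds cleaned_html but would still fall back silently for any other unknown value. A minimal sketch of how the lookup could warn or raise instead is shown below. The helper name select_content, its strict flag, and the sample strings are hypothetical and not part of crawl4ai; only the format names come from the snippet above.

```python
import warnings


def select_content(content_format, available, strict=False):
    """Hypothetical helper (not crawl4ai API): pick content by format name.

    Warns, or raises when strict=True, instead of silently using markdown.
    """
    if content_format in available:
        return available[content_format]
    message = (
        f"Unsupported content_format {content_format!r}; "
        f"expected one of {sorted(available)}"
    )
    if strict:
        raise ValueError(message)
    # Keep the existing fallback, but make it visible to the user.
    warnings.warn(message + ". Falling back to 'markdown'.", stacklevel=2)
    return available["markdown"]


# Stand-in values; in the crawler these would come from the crawl result.
available = {
    "markdown": "# Title",
    "html": "<html><body><h1>Title</h1></body></html>",
    "cleaned_html": "<h1>Title</h1>",
    "fit_markdown": "# Title",
}

print(select_content("cleaned_html", available))
```

Whether the default should be a warning or an error is a design choice for the maintainers; the warning keeps existing callers working, while strict=True surfaces typos immediately.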
Awesome package BTW!
Current Behavior
See above
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
All
Python version
All
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
crawl4ai version
0.4.248
Expected Behavior
Currently, the async_webcrawler prepares content for the extraction strategy here:
crawl4ai/crawl4ai/async_webcrawler.py
Line 656 in 3b1025a
This has two problems:
1. It silently falls back to markdown when the requested content_format is not supported.
2. It does not support cleaned_html, which is often 10x smaller and well suited for element extraction using LLMs, layout interpretation, etc. over markdown.
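The silent fallback can be illustrated without the library at all. The sketch below assumes the current lookup is a plain dict.get with markdown as the default, which is what the behavior described in this report suggests; the function name pick and the sample strings are made up for the illustration.

```python
# Stand-in strings; in the crawler these come from the crawl result.
markdown = "# Title"
html = "<html><body><h1>Title</h1></body></html>"


def pick(content_format):
    # dict.get returns the default for any unknown key, with no error or warning.
    return {
        "markdown": markdown,
        "html": html,
    }.get(content_format, markdown)


print(pick("html") == html)              # the requested format is returned
print(pick("cleaned_html") == markdown)  # unknown format quietly becomes markdown
```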