feat: change HTML conversion backend from boilerpy3 to Trafilatura #7705
@@ -57,7 +57,7 @@ dependencies = [
   "more-itertools", # TextDocumentSplitter
   "networkx", # Pipeline graphs
   "typing_extensions>=4.7", # typing support for Python 3.8
-  "boilerpy3", # Fulltext extraction from HTML pages
+  "trafilatura", # Fulltext extraction from HTML pages
This is my main concern: boilerpy3 is 163 kB, while trafilatura is 1390 kB.
If you think it's better, I can make trafilatura an optional dependency and wrap it in a lazy import block.
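For reference, the lazy-import alternative mentioned above could look roughly like the sketch below. It uses only the standard library; `optional_import` and `require` are hypothetical helper names made up for this illustration, not the project's own lazy-import utility.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require(name: str, install_hint: str) -> ModuleType:
    """Import a module at the point of use and fail with an actionable message."""
    module = optional_import(name)
    if module is None:
        # Raised only when the feature is actually used, not at package import time.
        raise ImportError(f"'{name}' is required for this feature. {install_hint}")
    return module
```

With this pattern, a converter would call something like `require("trafilatura", "Run 'pip install trafilatura'.")` inside its constructor, so users who never convert HTML would not need the package installed.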
1.4 MB is still OK, let's keep it.
Pull Request Test Coverage Report for Build 9116483899 (Coveralls)
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Amazing! Code reduction, simplification, better features - a heaven on earth!
Related Issues
Proposed Changes:
As discussed offline, we want to replace boilerpy3 with Trafilatura, which is robust and well-maintained.
During my recent work on AutoQuizzer, I battle-tested this library, which worked well for a diverse range of HTML pages.
I'm trying not to break the existing API. The implementation is simpler.
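As a rough illustration of swapping the extraction backend without changing what callers see, here is a minimal sketch. It is not the code from this PR: `html_to_text` and `_default_extractor` are hypothetical names, and the only Trafilatura call assumed is `trafilatura.extract`, which returns the extracted text or `None`.

```python
from typing import Callable, Optional


def _default_extractor(html: str) -> Optional[str]:
    # Imported inside the function so this sketch loads even where
    # trafilatura is not installed.
    import trafilatura

    return trafilatura.extract(html)


def html_to_text(
    html: str,
    extractor: Callable[[str], Optional[str]] = _default_extractor,
) -> str:
    """Convert an HTML string to plain text using the given extraction backend."""
    text = extractor(html)
    # Normalize "nothing extracted" to an empty string so callers
    # always receive a str, whichever backend is plugged in.
    return text or ""
```

Callers only depend on `html_to_text`, so replacing the backend (here via the `extractor` argument) leaves their code untouched, and a stub extractor can be passed in tests.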
How did you test it?
CI, new tests.
Checklist
The PR title uses one of the conventional commit types: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.