Hi, in the Dragnet post-processed dataset, 399 files contain the string "!@#$%^&*() COMMENTS" followed by the comment section of the page. It's the only dataset with this kind of information. Since most extractors ignore comments (with the notable exception of trafilatura, which produces an optional comments body), I think the benchmark could be slightly improved in this regard.
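To illustrate one possible approach, here is a minimal sketch of how the gold-standard text could be split at that marker so the main body and the comments can be scored separately. The function name `split_gold_standard` is hypothetical and not part of the benchmark, and I'm assuming the marker appears at most once per file.

```python
# Marker string found in the Dragnet post-processed files (taken from the issue text).
COMMENTS_MARKER = "!@#$%^&*() COMMENTS"


def split_gold_standard(text: str) -> tuple[str, str]:
    """Split a gold-standard text into (main_body, comments).

    Hypothetical helper: assumes the marker occurs at most once.
    If the marker is absent, the comments part is an empty string.
    """
    # str.partition splits on the first occurrence of the marker only.
    body, _marker, comments = text.partition(COMMENTS_MARKER)
    return body.strip(), comments.strip()


if __name__ == "__main__":
    sample = "Article text.\n!@#$%^&*() COMMENTS\nFirst comment.\n"
    main_body, comments = split_gold_standard(sample)
    print(repr(main_body))
    print(repr(comments))
```

With a split like this, extractors that ignore comments could be evaluated against the main body only, while trafilatura's optional comments output could be compared against the comments part.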