O(n!) processing in tag name/path for Paragraph in dedupe code #27
Comments
Many thanks, Tom! Ideally, it should be tested on the benchmark data for boilerplate removal to make sure it delivers the same results.
The fix needs improvement: although it resolves the processing-time issue, it can still exhaust the heap in a memory-constrained environment such as a Hadoop cluster. I'm testing a revised version that doesn't keep the entire string of tag names, since it doesn't appear to be used anywhere. I don't see any tests in the …
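To illustrate the idea of not retaining the full tag-name string, here is a minimal sketch, not the revised code itself. It assumes the only things needed downstream are the current nesting depth and whether a given tag is currently open; the class `TagStackSketch` and its methods are hypothetical names introduced for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: track open tags on a stack instead of storing a
// full "html.body.div..." string for every paragraph.
class TagStackSketch {
    // Names of the currently open elements, outermost first.
    private final Deque<String> open = new ArrayDeque<>();

    // Called when an element starts: constant-time push.
    void startTag(String name) {
        open.addLast(name);
    }

    // Called when an element ends: constant-time pop (ignored if nothing is open).
    void endTag() {
        if (!open.isEmpty()) {
            open.removeLast();
        }
    }

    // Current nesting depth.
    int depth() {
        return open.size();
    }

    // True if an element with this name is currently open (linear in depth).
    boolean isInside(String name) {
        return open.contains(name);
    }

    public static void main(String[] args) {
        TagStackSketch tags = new TagStackSketch();
        tags.startTag("html");
        tags.startTag("body");
        tags.startTag("h1");
        System.out.println(tags.depth() + " " + tags.isInside("h1"));
        tags.endTag();
        System.out.println(tags.depth() + " " + tags.isInside("h1"));
    }
}
```

With this shape, memory grows with the nesting depth of the document rather than with the combined length of one path string per paragraph.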
Also moves all initialization into constructor and simplifies it.
Attempts to process this segment:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz
stalls somewhere between 7k and 8k records, when it encounters a deeply nested tag structure that triggers the O(n!) complexity, relative to tree depth, of Paragraph.getPath(Node).
The document is pathological in that it is nested many thousands of levels deep, but it causes the entire segment to fail when the mapper gets killed.
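For comparison, a path of tag names can be built in a single pass up the parent chain. The sketch below is an assumption-laden illustration, not the project's `Paragraph.getPath(Node)`: it assumes `Node` is `org.w3c.dom.Node`, uses `.` as a separator, and introduces the hypothetical helpers `pathOf` and `nestedDivs`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class PathSketch {

    // Hypothetical helper: walks from the node to the root once, collecting
    // element names, then joins them in one step. No recursion and no
    // repeated string concatenation, so the cost is linear in path length.
    static String pathOf(Node node) {
        Deque<String> names = new ArrayDeque<>();
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.addFirst(n.getNodeName());
            }
        }
        return String.join(".", names);
    }

    // Builds a chain of `depth` nested <div> elements (depth >= 1) and
    // returns the innermost one, to mimic a pathologically deep document.
    static Node nestedDivs(int depth) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element current = doc.createElement("div");
        doc.appendChild(current);
        for (int i = 1; i < depth; i++) {
            Element child = doc.createElement("div");
            current.appendChild(child);
            current = child;
        }
        return current;
    }

    public static void main(String[] args) throws Exception {
        Node leaf = nestedDivs(5000);
        // Prints the length of the path string for a 5000-level document.
        System.out.println(pathOf(leaf).length());
    }
}
```

Even with linear-time construction, storing one such string per paragraph still costs memory proportional to the depth each time, which is why keeping the full path may not be worthwhile if nothing reads it.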