O(n!) processing in tag name/path for Paragraph in dedupe code #27

tfmorris · 2016-04-03T21:02:43Z

Attempts to process this segment:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz

stalls somewhere between 7k and 8k records, when it encounters a deeply nested tag structure that triggers the O(n!) complexity in the tree depth processing of Paragraph.getPath(Node).

The document is pathological in that it's nested many thousands of levels deep, but it causes the entire segment to fail when the mapper gets killed.
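For illustration, here is a minimal sketch of computing a tag path with a single iterative walk from a node up to the root, so the cost per call is linear in the depth and deep nesting cannot overflow the call stack. `SimpleNode`, `pathOf`, and `chain` are hypothetical names used only for this example. They are not the project's `Paragraph.getPath(Node)` or jsoup's `Node`, and the `/` separator is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PathSketch {
    /** Hypothetical stand-in for a DOM node; not the project's or jsoup's class. */
    static final class SimpleNode {
        final String name;
        final SimpleNode parent;

        SimpleNode(String name, SimpleNode parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Walks from the node up to the root in a loop and joins the tag names with '/'. */
    static String pathOf(SimpleNode node) {
        Deque<String> names = new ArrayDeque<>();
        for (SimpleNode n = node; n != null; n = n.parent) {
            names.addFirst(n.name); // root ends up first
        }
        return String.join("/", names); // one join, no repeated string copying
    }

    /** Builds a single chain of identical tags and returns the deepest node. */
    static SimpleNode chain(String name, int depth) {
        SimpleNode current = null;
        for (int i = 0; i < depth; i++) {
            current = new SimpleNode(name, current);
        }
        return current;
    }

    public static void main(String[] args) {
        SimpleNode html = new SimpleNode("html", null);
        SimpleNode body = new SimpleNode("body", html);
        SimpleNode p = new SimpleNode("p", body);
        System.out.println(pathOf(p)); // html/body/p

        // A pathological chain, 50,000 levels deep, still completes quickly.
        System.out.println(pathOf(chain("div", 50_000)).length());
    }
}
```

Note that calling something like `pathOf` once per node still materializes one path string per node, which is the memory concern raised later in the thread.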


habernal · 2016-04-04T07:11:21Z

Many thanks, Tom!

Ideally, it should be tested on the benchmark data for boilerplate removal to make sure it delivers the same results.

tfmorris · 2016-04-04T14:06:36Z

The fix needs improvement because, although it fixes the processing time issue, it can still exhaust heap in a constrained environment like a Hadoop cluster. I'm testing a revised version which doesn't keep the entire string of tag names, since it doesn't appear to be used anywhere.

I don't see any tests in the dkpro-c4corpus-boilerplate sub-project. How does one run the tests you are describing?
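As a rough sketch of the idea of not keeping the whole string of tag names, the traversal below records only an integer depth per node and uses an explicit stack instead of recursion. It assumes that nothing downstream needs the full path text, which is only what the comment above suspects. `TreeNode`, `Frame`, and `maxDepth` are hypothetical names for this example and do not come from the project.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class DepthOnlySketch {
    /** Hypothetical tree node; a stand-in, not the project's Paragraph or jsoup's Node. */
    static final class TreeNode {
        final String tag;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String tag) {
            this.tag = tag;
        }

        /** Appends a child and returns it, so chains are easy to build. */
        TreeNode add(TreeNode child) {
            children.add(child);
            return child;
        }
    }

    /** Pairs a node with its depth so the traversal needs no recursion. */
    private static final class Frame {
        final TreeNode node;
        final int depth;

        Frame(TreeNode node, int depth) {
            this.node = node;
            this.depth = depth;
        }
    }

    /** Returns the maximum nesting depth (root = 1) without building any path strings. */
    static int maxDepth(TreeNode root) {
        int max = 0;
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, 1));
        while (!stack.isEmpty()) {
            Frame frame = stack.pop();
            max = Math.max(max, frame.depth);
            for (TreeNode child : frame.node.children) {
                stack.push(new Frame(child, frame.depth + 1));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("html");
        TreeNode cursor = root;
        for (int i = 0; i < 100_000; i++) {
            cursor = cursor.add(new TreeNode("div"));
        }
        System.out.println(maxDepth(root)); // 100001
    }
}
```

Storing one `int` per node keeps memory proportional to the node count, whereas one full path string per node grows with the node count times the depth.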


tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 3, 2016

Fix O(n!) tag name processing. Fixes dkpro#27. Also simplify constructor.

5871403

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 3, 2016

Fix O(n!) tag name processing. Fixes dkpro#27. Also simplify constructor.

a9d101d

tfmorris mentioned this issue Apr 3, 2016

Fix O(n!) in tag depth issue #28

Open

habernal added the enhancement label Apr 4, 2016

habernal added this to the 1.0.1 milestone Apr 4, 2016

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

bcf1bad

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

62a3a63

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 10, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

775abef

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 13, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

3d8518c

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

5eadee8

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

5dca41f

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 28, 2016

Fix O(n!) tag name processing. Fixes dkpro#27.

033e95c

Also moves all initialization into constructor and simplifies it.

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Jun 12, 2020

Fix O(n!) tag name processing. Fixes dkpro#27.

d1cdd2e

Also moves all initialization into constructor and simplifies it.
