Crash in ChunkDocument.tryParseStream while attempting to read filtered version of GX229 #29

Open
MikeHopcroft opened this issue May 3, 2017 · 1 comment

Comments

@MikeHopcroft
Contributor

This crash happens in ChunkDocument.tryParseStream() when the statement buffer[writeCursor++] = (byte)c; attempts to write past the end of buffer. The buffer was sized at 256 KB, based on the assertion that gov2 documents are truncated at 256 KB.

The document that causes the crash has length 357895. This document was encountered while processing GX229-1000-1500.chunk, which was a version of GX229.chunk that was filtered by BitFunnel to contain documents with unique posting counts from 1000 to 1500.
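
One possible mitigation, sketched below, is to grow the buffer on demand instead of relying on the 256 KB limit. This is a minimal illustration and not the actual ChunkDocument code: the class GrowableBufferSketch and its methods append, readAll, and capacity are hypothetical names, and only the write statement and the two sizes (256 KB and 357895) come from the report above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical stand-in for the buffer handling described in this issue.
final class GrowableBufferSketch {
    // Initial capacity mirrors the 256 KB assumption about gov2 documents.
    static final int INITIAL_CAPACITY = 256 * 1024;

    private byte[] buffer = new byte[INITIAL_CAPACITY];
    private int writeCursor = 0;

    // Same write as in the report, but the buffer doubles when it is full
    // instead of throwing ArrayIndexOutOfBoundsException.
    void append(int c) {
        if (writeCursor == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        buffer[writeCursor++] = (byte) c;
    }

    // Copies the whole stream into the buffer and returns the byte count.
    int readAll(InputStream in) throws IOException {
        writeCursor = 0;
        int c;
        while ((c = in.read()) != -1) {
            append(c);
        }
        return writeCursor;
    }

    int capacity() {
        return buffer.length;
    }

    public static void main(String[] args) throws IOException {
        // A synthetic document with the length reported in this issue.
        byte[] doc = new byte[357895];
        GrowableBufferSketch sketch = new GrowableBufferSketch();
        int n = sketch.readAll(new ByteArrayInputStream(doc));
        System.out.println(n + " bytes read, capacity " + sketch.capacity());
    }
}
```

An alternative under the same assumptions would be to keep the fixed buffer and reject or truncate oversized documents with an explicit error. Which behavior is right depends on whether documents longer than 256 KB are expected to be valid input.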

Some observations:

  • BitFunnel was able to read GX229.chunk in order to generate the filtered chunk GX229-1000-1500.chunk. This suggests that GX229.chunk, which was created by mg4j, was well formed, even if it contained long documents.
  • In the chunk processing pipeline, mg4j generated GX229.chunk, but never attempted to read it.

My leading theory is that the original gov2 GX229 directory contains a bundle (.txt file) with a document that Tika represents as longer than 256 KB.

@MikeHopcroft
Contributor Author

I have no evidence that this crash is related to BitFunnel issue 387, but I mention it here just in case.
