
[PATCH] Filtering tokens for position and term vector storage [LUCENE-602] #1680

Open
asfimport opened this issue Jun 16, 2006 · 4 comments

This patch provides a new TokenSelector mechanism to select tokens of interest and creates two new IndexWriter configuration parameters: termVectorTokenSelector and positionsTokenSelector.

termVectorTokenSelector, if non-null, selects which indexed tokens are stored in term vectors. If positionsTokenSelector is non-null, then any tokens it rejects have only their first position in each document stored (at least one position must be stored to keep the doc freq correct and to prevent the term from being garbage collected during merges).

This mechanism provides a simple solution to the problem of minimizing the index size overhead caused by storing extra tokens that facilitate queries, in those cases where the mere existence of the extra tokens is sufficient. For example, in my test data using reverse tokens to speed up prefix wildcard matching, I obtained the following index overheads:

  1. With no TokenSelectors: 60% larger with reverse tokens than without
  2. With termVectorTokenSelector rejecting reverse tokens: 36% larger
  3. With both positionsTokenSelector and termVectorTokenSelector rejecting reverse tokens: 25% larger

It is possible to obtain the same effect by using a separate field that holds one occurrence of each reverse token and no term vectors, but this can be hard or impossible to do, and it is a performance problem because it requires either rereading the content or storing all the tokens for subsequent processing.

The solution with TokenSelectors is very easy to use and fast.
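
The concrete API lives in the attached patches rather than in released Lucene, but based on the description above a TokenSelector is presumably just a predicate over tokens that the writer consults per field. A minimal sketch under that assumption (the interface shape, the setter names, and the reverse-token marker are all guesses, not committed Lucene API):

```java
import org.apache.lucene.analysis.Token;

/** Assumed shape of the patch's TokenSelector: decides whether a token is "of interest". */
public interface TokenSelector {
    boolean accept(String fieldName, Token token);
}

/** Example selector that rejects artificial reverse tokens, here assumed to carry a marker prefix. */
class RejectReverseTokens implements TokenSelector {
    private static final String REVERSE_MARKER = "\u0001"; // hypothetical marker prepended to reversed tokens

    public boolean accept(String fieldName, Token token) {
        return !token.termText().startsWith(REVERSE_MARKER);
    }
}

// Assumed configuration, mirroring the two IndexWriter parameters described above:
//   writer.setTermVectorTokenSelector(new RejectReverseTokens()); // keep reverse tokens out of term vectors
//   writer.setPositionsTokenSelector(new RejectReverseTokens());  // store only the first position of rejected tokens
```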

Otis, thanks for leaving a comment in QueryParser.jj with the correct production to enable prefix wildcards! With this, it is a straightforward matter to override the wildcard query factory method and use reverse tokens effectively.
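
For concreteness, a rough sketch of what overriding that factory method might look like, assuming the QueryParser.jj change mentioned above so that leading wildcards reach getWildcardQuery, and assuming the reversed tokens were indexed with a marker prefix (the marker and the class are illustrative only):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

/** Illustrative parser that answers leading-wildcard queries using reversed tokens. */
class ReverseAwareQueryParser extends QueryParser {
    private static final String REVERSE_MARKER = "\u0001"; // must match the marker used at index time

    ReverseAwareQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        // Rewrite "*foo" as a prefix query on the reversed form, e.g. marker + "oof".
        if (termStr.startsWith("*") && !termStr.endsWith("*")) {
            String reversed = new StringBuffer(termStr.substring(1)).reverse().toString();
            return new PrefixQuery(new Term(field, REVERSE_MARKER + reversed));
        }
        return super.getWildcardQuery(field, termStr);
    }
}
```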


Migrated from LUCENE-602 by Chuck Williams, updated Feb 28 2013
Attachments: TokenSelectorAllWithParallelWriter.patch, TokenSelectorSoloAll.patch

Chuck Williams (migrated from JIRA)

TokenSelectorSoloAll.patch applies against today's svn head. It only requires Java 1.4.

asfimport commented Jun 16, 2006

Chuck Williams (migrated from JIRA)

TokenSelectorAllWithParallelWriter.patch also contains ParallelWriter (#1678), since it is affected as well.

Grant Ingersoll (@gsingers) (migrated from JIRA)

I think, if I understand the problem correctly, that the new TeeTokenFilter and SinkTokenizer could also solve this problem, right Chuck?
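
A rough sketch of that alternative, for comparison: the tee copies the original tokens into a sink during a single analysis pass, and the sink feeds a separate reversed field indexed without term vectors. This is only a sketch assuming the Field constructors that take a TokenStream; ReverseTokenFilter and the field names are illustrative, not Lucene classes, and unlike the patch this approach still stores full positions for the reverse field.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.SinkTokenizer;
import org.apache.lucene.analysis.TeeTokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** Illustrative filter (not part of Lucene) that emits a reversed copy of each token. */
class ReverseTokenFilter extends TokenFilter {
    ReverseTokenFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String reversed = new StringBuffer(t.termText()).reverse().toString();
        return new Token(reversed, t.startOffset(), t.endOffset());
    }
}

class TeeSinkExample {
    static Document buildDocument(String text) {
        // Tee the original tokens into a sink while they feed the primary field.
        SinkTokenizer sink = new SinkTokenizer();
        TokenStream main = new TeeTokenFilter(new WhitespaceTokenizer(new StringReader(text)), sink);

        Document doc = new Document();
        // Primary field: normal positions and term vectors.
        doc.add(new Field("contents", main, Field.TermVector.YES));
        // Secondary field: the same tokens reversed, with term vectors disabled.
        // The writer consumes "contents" first, filling the sink before "contents_rev" is read.
        doc.add(new Field("contents_rev", new ReverseTokenFilter(sink), Field.TermVector.NO));
        return doc;
    }
}
```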

Jan Høydahl (@janhoy) (migrated from JIRA)

This issue has been inactive for more than 4 years. Please close if it's no longer relevant/needed, or bring it up to date if you intend to work on it. SPRING_CLEANING_2013
