-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make tokenizer emit at least one empty token on empty strings #5532
base: main
Are you sure you want to change the base?
Conversation
@@ -316,6 +316,7 @@ mod tests { | |||
use crate::{create_default_quickwit_tokenizer_manager, BooleanOperand}; | |||
|
|||
#[test] | |||
#[ignore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the behavior this test tries for changed, I don't think it make sense to keep it. Note that it doesn't test for a
query, but a body:""
, so it makes sense it no longer is a MatchAll.
emits no FullTextQuery, and is still a MatchAll
Why do we need this? |
@@ -128,6 +132,69 @@ pub fn get_quickwit_fastfield_normalizer_manager() -> &'static TokenizerManager | |||
&QUICKWIT_FAST_FIELD_NORMALIZER_MANAGER | |||
} | |||
|
|||
#[derive(Debug, Clone)] | |||
struct EmitEmptyTokenizer<T>(T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a comment to express the intent on that kind of tokenizer. It is very difficult for a future readers to infer the point of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tokenizer emits the empty token whenever the underlying tokenizer emits no token.
I am not sure this is the behavior we want.
For instance, a stop word would result in the emission of the empty token.
A stricter behavior, emitting an empty token iff the text is effectively empty might actually make more sense. I suspect the code would be simpler too.
|
||
fn token(&self) -> &Token { | ||
match self.state { | ||
EmitEmptyState::Start => unreachable!(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is reachable.
EmitEmptyState::Start => unreachable!(), | |
EmitEmptyState::Start => panic!("token() should not be called before a first call to advance"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to explain the intent in the commit message and in the comment of the tokenizer.
A unit test checking for the behavior end to end (indexing + query side) should be added too.
We also need to think about what the behavior should be for a multivalued field
["hello", ""]
(does it match or not), and add a unit test to validate this behavior.
Today |
Tokenization is already quite expensive during indexing, I think this change may add some non-negligable overhead there. |
Description
always emit at least one token when indexing/querying with an empty field
How was this PR tested?
tested manually. todo: add integration test