Query parts ending with a colon are handled badly [LUCENE-373] #1451

Open
asfimport opened this issue Apr 15, 2005 · 5 comments

@asfimport

I'm using Lucene 1.4.3, running
Query query = QueryParser.parse(queryString, "contents", new StandardAnalyzer());

If queryString is "search title:" i.e. specifying a field name without a
corresponding value, I get a parsing exception:

Encountered "<EOF>" at line 1, column 8.
Was expecting one of:
"(" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...

If queryString is "title: search", there's no exception. However, the parsed
query which is returned is "title:search". If queryString is "title: contents:
text", the parsed query is "title:contents" and the "text" part is ignored
completely. When queryString is "title: text contents:" the above exception is
produced again.

This seems inconsistent. Given that it's pointless searching for an empty
string (since it has no tokens), I'd expect both "search title:" & "title:
search" to be parsed as "search" (or, given the default field I specified,
"contents:search"), and "title: contents: text" & "title: text contents:" to
parse as "text" ("contents:text") i.e. parts which have no term are ignored. At
worst I'd expect them all to throw a ParseException rather than just the ones
with the colon at the end of the string.
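
The lenient behaviour asked for above (parts which have no term are ignored) could be approximated by cleaning the query string before it reaches the parser. This is only a minimal sketch under stated assumptions: `EmptyFieldClauseStripper` and `stripEmptyFieldClauses` are hypothetical names, not part of Lucene, and the helper splits on whitespace only, so it knows nothing about quoted phrases, ranges or grouping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing helper; not part of Lucene.
final class EmptyFieldClauseStripper {

    /** Drops whitespace-delimited tokens that are a bare field prefix such as "title:". */
    static String stripEmptyFieldClauses(String queryString) {
        List<String> kept = new ArrayList<>();
        for (String token : queryString.trim().split("\\s+")) {
            // A token ending in an unescaped ':' names a field but carries no term.
            boolean bareField = token.endsWith(":") && !token.endsWith("\\:");
            if (!token.isEmpty() && !bareField) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(stripEmptyFieldClauses("title: text contents:"));
    }
}
```

With this sketch, "search title:" and "title: search" both reduce to "search", and "title: contents: text" and "title: text contents:" both reduce to "text", which would then be parsed against the default field.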


Migrated from LUCENE-373 by Andrew Stevens, 1 vote, updated Jul 28 2015
Environment:

Operating System: Windows 2000
Platform: PC
@asfimport

Shai Erera (@shaie) (migrated from JIRA)

This is still a problem with QP. I think we should make the behavior consistent. If "xyz field:" throws ParseException, so should "field: xyz". In fact, I think the latter is a bug - the query will search 'xyz' under "field", even if the user didn't intend to do so.

QP already takes Version, so we can fix this bug safely.
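
Until the parser itself is consistent, the strict behaviour suggested here (reject "field: xyz" as well as "xyz field:") could be applied by the caller before parsing. A minimal sketch, assuming a hypothetical `DanglingFieldCheck` helper that is not part of Lucene; it flags any unescaped colon followed by whitespace or end of input, and it does not understand quoted phrases, so a colon-then-space inside quotes would also be flagged.

```java
import java.util.regex.Pattern;

// Hypothetical caller-side check; not part of Lucene.
final class DanglingFieldCheck {

    // An unescaped ':' followed by whitespace or end of input,
    // i.e. a field name with no term directly attached.
    private static final Pattern DANGLING = Pattern.compile("(?<!\\\\):(?=\\s|$)");

    static boolean hasDanglingField(String queryString) {
        return DANGLING.matcher(queryString).find();
    }

    /** Throws for both "xyz field:" and "field: xyz", so the two cases behave alike. */
    static void requireNoDanglingField(String queryString) {
        if (hasDanglingField(queryString)) {
            throw new IllegalArgumentException(
                "Field name without an attached term in: " + queryString);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasDanglingField("field: xyz"));
    }
}
```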

@asfimport

Jan Fruehwacht (migrated from JIRA)

Do I understand correctly that you consider it a bug because there is also a space before the xyz? So should it search for ' xyz' under field? Am I right? Or how do you expect that to work?
I totally understand the inconsistency described by Andrew.

@asfimport

Erick Erickson (@ErickErickson) (migrated from JIRA)

2013 Old JIRA cleanup

@asfimport

Harish Kayarohanam (migrated from JIRA)

Here is my understanding of the above issue, with an analysis of whether it really needs a fix (and if so, where) or whether it is an enhancement.

section 1:

>>> If queryString is "title: search", there's no exception. However, the parsed
>>> query which is returned is "title:search".
This is as expected.

section 2:

>>> If queryString is "title: contents: text",
>>> the parsed query is "title:contents" and the "text" part is ignored completely.
This needs a revisit. Maybe we should bring in something like chained assignment:
a = b = 2 in Java, Python, JavaScript or Ruby means 2 is assigned to both a and b.
A similar approach could be followed here. This is discussed in detail later in my answer (see sections 5 & 7).

section 3:

>>> When queryString is "title: text contents:" the above exception is
>>> produced again.
This is also expected. It breaks the syntax.
Why? And why may this not be considered a bug?
We should accept that the Lucene query language is a language of its own, with its own syntax, so we should obey that syntax.
And I would say that it is a meaningful syntax. It is not weird.
Why do I say that?
Let us see what happens in other programming languages (say Python, Java, JavaScript or Ruby).
Say
a = ; (or just a =)
is an error (unexpected end of input).
Similarly
= 2;
is an error.
So this is common to almost all languages, and expected.
Why is this the most expected behaviour?
The idea is:

  1. if you assign something to nothing it is a bug. = 2
  2. if you assign nothing to something it is a bug. a =

Now let's come to the Lucene context:
= something
Then comes the question "what should we search something against: the default field or something else?". This is meaningless, so I think the Lucene developers made the best choice in treating it as an error and throwing a ParseException.
something =
What should we search for in the field something? We should not infer any value unless told explicitly, so here too I think the Lucene developers made the best choice in treating it as an error and throwing a ParseException. I personally like the decision made.

section 4:

>>> This seems inconsistent. Given that it's pointless searching for an empty
>>> string (since it has no tokens), I'd expect both "search title:" & "title:
>>> search" to be parsed as "search" (or, given the default field I specified,
>>> "contents:search"),
search title:
is as explained above. I like the present syntax, since it is best for a syntax not to assume anything unless
told explicitly, as in the cases
= 2
a =
where we cannot assume either the field or the term. So it should be a ParseException, and that is what we get now.

"title: search" overrides the default field and searches in the title field. This is by design; it cannot just search for "search" in the default field, as that would break the original design. Please refer to the Fields section in http://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description.

section 5:

>>> and "title: contents: text"
This seems meaningful, at least to me. But I would not say it is right or wrong; it is about what we want,
what most people want, and what seems meaningful.
If we want, we can bring in a syntax for this. Again, I would like to look at how other programming languages handle a similar syntax.
In Java, Python, JavaScript and Ruby
a = b = c = 2
is allowed,
and it assigns the value 2 to a, b and c.
So here too we could have a syntax that makes the term text be searched in both the title and contents fields. This is a choice
we can make if the present state is confusing.
I feel, as the person who reported this issue says, that silently ignoring something the user gave seems
unfair. This is just my point of view.
If the community takes the stand that this breaks the syntax and we don't want this new syntax, we should at least throw an exception.

section 6:

>>> "title: text contents:" to
>>> parse as "text" ("contents:text") i.e. parts which have no term are ignored. At
>>> worst I'd expect them all to throw a ParseException rather than just the ones
>>> with the colon at the end of the string.
Please see my explanation above. By my reasoning, this need not be considered a bug.

Note: I am looking at the syntax of other programming languages just to see which design has stood the test of time, so that I can infer what most people expect and find least confusing. These programming languages have evolved over time, so we can take their
syntax as a reference for what is most expected. I personally would like to go by the most common
expectations. Please correct me if I am wrong.

section 7:

Further discussion on section 5:
Let's see whether the new syntax could work in the Lucene query language, and how it can work without ambiguity.
a : b : hello world h: when
hello will be searched in the fields named a and b
world will be searched in the default field
when will be searched in the field named h.

Whenever and wherever there are statements like the following

  1. field names but no terms – a:
  2. terms with the intention to assign (with :) but no field name – : hello

they should be flagged as errors.
(The query parser already does this. That is, it does not just look for : at the beginning or end and flag the
error, which is good: even statements within brackets such as (fieldname:) or (:termvalue) are flagged as errors.)
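
The rules above could be sketched roughly as follows. This is only an illustration of the proposal, not how the Lucene query parser works: `ChainedFieldSketch` and `parse` are hypothetical names, the sketch splits on whitespace and colons only, and it ignores quoting, escaping, brackets and boolean operators.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the proposed chained-field syntax; not part of Lucene.
final class ChainedFieldSketch {

    /** Maps each field to the terms searched in it; a term after "a : b :" goes to both a and b. */
    static Map<String, List<String>> parse(String query, String defaultField) {
        // Make every ':' its own token so "h:" and "a : b :" are handled alike.
        String[] tokens = query.replace(":", " : ").trim().split("\\s+");
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> pendingFields = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.isEmpty()) {
                continue; // only happens for an empty query
            }
            if (token.equals(":")) {
                // Colons after a field name are consumed below, so this one has no field name.
                throw new IllegalArgumentException("':' without a field name at token " + i);
            }
            boolean isFieldName = i + 1 < tokens.length && tokens[i + 1].equals(":");
            if (isFieldName) {
                pendingFields.add(token);
                i++; // consume the colon
            } else {
                // A term goes to every pending field, or to the default field if there is none.
                List<String> targets = pendingFields.isEmpty()
                        ? Collections.singletonList(defaultField)
                        : pendingFields;
                for (String field : targets) {
                    result.computeIfAbsent(field, k -> new ArrayList<>()).add(token);
                }
                pendingFields = new ArrayList<>();
            }
        }
        if (!pendingFields.isEmpty()) {
            throw new IllegalArgumentException("Field(s) without a term: " + pendingFields);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=[hello], b=[hello], contents=[world], h=[when]}
        System.out.println(parse("a : b : hello world h: when", "contents"));
    }
}
```

In this sketch both "a:" and ": hello" raise an exception, matching the two error cases listed above.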

The above in sections 5 & 7 is just a proposal. Please give your comments, and feel free to point out mistakes.
If this syntax is expected to have a bad impact on performance, then it need not go in.

I referred to http://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description for a better understanding.

@asfimport

Harish Kayarohanam (migrated from JIRA)

So as far as my analysis goes, there is no must-fix bug here. But there is a pending decision on whether the proposal above should be brought
in, in order to address the confusing, silent dropping of terms from the search. If the community decides that it is worth doing, then maybe I can give it a try and implement this feature.
On the other hand, if the community decides that it is not worth doing, then we can leave it.
