Issue 481: Use case insensitive filter and add case insensitive string type. #496

Merged
kaladay merged 2 commits into staging from 481-case_sensitive
Jan 12, 2023

Conversation

kaladay
Contributor

@kaladay kaladay commented Jan 11, 2023

Description

Cannot add a filter to the solr.StrField.
According to the Solr documentation, a filter can only be added to a tokenized field, and a solr.StrField does not allow tokenization. This uses a solr.TextField instead.
Several fields need to have case-insensitive searches. Two new types that use the KeywordTokenizer are added, called string_ci and strings_ci. The KeywordTokenizer is essentially a pretend tokenizer. It treats the whole string as a single token, which is effectively the same as not having a tokenizer. The documentation even references the KeywordTokenizer as the method of disabling the tokenizer.

Fields that should be case insensitive are moved from string to string_ci and strings to strings_ci respectively.
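
As a rough illustration of the description above, the new types could look something like the sketch below. This is not copied from the PR diff: the attribute choices and the `subject` field name are assumptions made for the example, and `solr.LowerCaseFilterFactory` is assumed to be the filter that provides the case insensitivity. An `<analyzer>` element with no `type` attribute applies at both index time and query time.

```xml
<!-- Hypothetical sketch of the case-insensitive types; attributes are assumptions. -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Emits the entire field value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercases that single token. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Multi-valued counterpart with the same analysis chain. -->
<fieldType name="strings_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true" multiValued="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field moved from "string" to "string_ci". -->
<field name="subject" type="string_ci" indexed="true" stored="true"/>
```

Because the analysis chain changes what is written to the index, existing documents would need to be reindexed after a field is moved to one of these types.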

There are potential performance concerns with using solr.TextField rather than solr.StrField due to the loss of the docvalues optimization feature.

This change requires a change to the Solr core data structure.
I consider this a breaking change.

see: https://solr.apache.org/guide/7_7/field-types-included-with-solr.html#field-types-included-with-solr
see: https://solr.apache.org/guide/7_7/field-type-definitions-and-properties.html#field-type-definitions-and-properties
see: https://solr.apache.org/guide/7_7/field-properties-by-use-case.html#field-properties-by-use-case
see: https://solr.apache.org/guide/7_7/tokenizers.html#keyword-tokenizer
see: https://solr.apache.org/guide/7_7/docvalues.html

Fixes #481

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

  • Manually

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

…g type.

Cannot add a filter to the `solr.StrField`.
According to the Solr documentation, a filter can only be added to a tokenized field, and a `solr.StrField` does not allow tokenization.
This uses a `solr.TextField` instead.
Several fields need to have case-insensitive searches.
Two new types that use the `KeywordTokenizer` are added, called `string_ci` and `strings_ci`.
The `KeywordTokenizer` is essentially a pretend tokenizer.
It treats the whole string as a single token, which is effectively the same as not having a tokenizer.
The documentation even references the `KeywordTokenizer` as the method of disabling the tokenizer.

Fields that should be case insensitive are moved from `string` to `string_ci` and `strings` to `strings_ci` respectively.

There are potential performance concerns with using `solr.TextField` rather than `solr.StrField` due to the loss of the docvalues optimization feature.

see: https://solr.apache.org/guide/7_7/field-types-included-with-solr.html#field-types-included-with-solr
see: https://solr.apache.org/guide/7_7/field-type-definitions-and-properties.html#field-type-definitions-and-properties
see: https://solr.apache.org/guide/7_7/field-properties-by-use-case.html#field-properties-by-use-case
see: https://solr.apache.org/guide/7_7/tokenizers.html#keyword-tokenizer
see: https://solr.apache.org/guide/7_7/docvalues.html
@kaladay kaladay requested review from jeremythuff, a user and rmathew1011 January 11, 2023 16:17
@coveralls

coveralls commented Jan 11, 2023

Coverage Status

Coverage: 45.24% (+0.03%) from 45.215% when pulling 753b83a on 481-case_sensitive into 850bc32 on staging.

@ghost

ghost commented Jan 11, 2023

#481

Suggested approach was a TextField using KeywordTokenizerFactory.

Additionally, it was suggested to separate index time from query time by using two analyzers.

Such as

    <fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

This would simply change all fields of type whole_strings to match case-insensitively by lowercasing at query time. For exact, case-sensitive searches, it is common to expect the search term to be wrapped in double quotes, enforcing an exact match on the search field.

The question is: what undesired changes in search behavior would result from letting all fields search case-insensitively? A minimal-change approach could mean minimal changes to the versioned schema rather than minimal changes to search behavior. Adding field types is (obviously) not a minimal change to the versioned schema, and the changes to search behavior may still be minimal in terms of anticipated or expected search terms.
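
One caveat worth checking against a running core: in the snippet above only the query analyzer lowercases, so indexed terms keep their original case and a lowercased query term would match only values that were already stored in lowercase. A variant that lowercases on both sides might look like the following sketch. It is an assumption about the intent, not the schema that was merged.

```xml
<!-- Sketch only: same whole_strings type, with lowercasing added at index time. -->
<fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
  <analyzer type="index">
    <!-- Whole value becomes one token, then is lowercased before it is indexed. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- The query term goes through the same chain, so both sides compare in lowercase. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With identical chains on both sides, the two `<analyzer>` elements could also be collapsed into a single untyped `<analyzer>`.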

@ghost

ghost commented Jan 11, 2023

Not sure we need the additional field types. What behavior changes are there without the additional field types?

solr/config/managed-schema (review thread: outdated, resolved)
…x date_created.

The `strings_ci` is close enough to `whole_strings`, just use `whole_strings`.
There is no `whole_string`.
Rename `string_ci` to `whole_string`.
To better prevent future problems, document these custom field types.

The date_created is not multi-valued so use `whole_string`.
@kaladay kaladay requested a review from a user January 11, 2023 21:09
@kaladay kaladay merged commit 2fae79e into staging Jan 12, 2023
@kaladay kaladay deleted the 481-case_sensitive branch January 27, 2023 14:52
Development

Successfully merging this pull request may close these issues.

Searches are case-sensitive for special searches, like Subject search.
2 participants