Wildcard search does not work properly #3938

andre-hohmann · 2020-08-13T11:07:04Z

Problem

The wildcard search is only possible in specific parts of the process title:

"7746" has the following results:

Aben_774618868-1873032402_02-a
Aben_774618868-1873
...

"1873" has the following results:

Aben_774618868-1873032402_02-a
Aben_774618868-1873
...

"Aben_774618868-1873" has no results

It seems as if "-" blocks the wildcard search.

Solution

It should be possible to search for "Aben_774618868-1873032" and get the following results:

Aben_774618868-18730321
Aben_774618868-18730322

matthias-ronge · 2020-11-03T15:01:45Z

This has to do with the tokenization settings of the search engine.

Tokenization is a necessary step in search engine indexing. Each text is broken down into a number of normalized tokens. For each token, the index stores which data records contain it. That is an essential part of how a search engine works. When searching, the search query is tokenized as well, and its tokens are then combined into an AND search.

From the fact that you can search for partial number sequences, it can be deduced that numbers are tokenized as individual digits (otherwise you would not find Aben_774618868-1873032402_02-a with 7746). Example: The title Aben_774618868-1873032402_02-a is tokenized as (aben, 7, 4, 6, 1, 8, 3, 0, 2, 4, a).

The search query 7746 is tokenized as (7, 4, 6) and then searched as '7' AND '4' AND '6'. The search query Aben_774618868-1873032 is presumably tokenized as (aben, 7, 4, 6, 1, 8, -1, 3, 0, 2) and searched for as 'aben' AND '7' AND '4' AND '6' AND '1' AND '8' AND (NOT '1') AND '3' AND '0' AND '2'. This cannot deliver a result, because '1' AND (NOT '1') always results in an empty hit list.
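The contradiction described above can be reproduced with a toy model. The following is only a sketch of the presumed behaviour, not the actual search engine configuration: the class `ContradictoryQueryDemo` and its methods are hypothetical, and it assumes that every digit becomes its own token and that a query is a set of required tokens plus a set of excluded tokens.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class ContradictoryQueryDemo {

    /** Toy tokenizer (assumption): every digit of the text becomes a separate token. */
    static Set<String> digitTokens(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (Character.isDigit(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    /** A document matches if it has all required tokens and none of the excluded ones. */
    static boolean matches(Set<String> documentTokens, Set<String> required, Set<String> excluded) {
        return documentTokens.containsAll(required)
                && excluded.stream().noneMatch(documentTokens::contains);
    }

    public static void main(String[] args) {
        Set<String> document = digitTokens("Aben_774618868-1873032402_02-a");

        // "7746" becomes '7' AND '4' AND '6'
        System.out.println(matches(document, digitTokens("7746"), Set.of())); // true

        // "774618868-1873032" with "-1" read as NOT '1':
        // '1' is required (from 774618868) and excluded at the same time
        Set<String> required = digitTokens("774618868873032");
        System.out.println(matches(document, required, Set.of("1"))); // false
    }
}
```

In this model no document can ever match the second query, because any document that satisfies the required '1' is rejected by the excluded '1'.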

The design of a search engine index must always begin with the question of what should be found with which search query. The tokenization must take place accordingly. Specifically: Would you like to find all process titles with a partial number sequence that contain this partial number sequence? Yes, but only in this order? Even if there is a hyphen in between? Should the process title Aben_774618868-1873032402_02-a also be found when searching for 6818? Or just when searching for 68-18? Should it also be found when searching for 68_18 or 68 18? Or 18 68? Or should it only be found when searching for 7746 because it is at the beginning of a sequence of numbers?

Depending on all of these considerations, the tokenization of the documents to be indexed and the search queries must be implemented. The minimum requirement here seems to be that a hyphen that is not preceded by a space should not be interpreted as a negation sign in query tokenization.

andre-hohmann · 2020-11-04T06:41:19Z

Thanks a lot for the extensive explanation.

I would aim for the behaviour of Kitodo.Production 2.x to avoid future questions and complaints. In the Kitodo-Wiki, you can find some information:

Regarding your questions:

It should be possible to search for a term containing a hyphen that is not preceded by a space, as in Aben_774618868-1873032402_02-a, to find a specific process by its process title.
If a hyphen is preceded by a space, the term that follows it should be excluded, for example to exclude the processes of a specific year: Aben_774618868 -1873
It should be possible to search for parts of the process title, for example 774618868 1873 or Aben 1873, to find all processes of the year 1873.
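The three rules above can be sketched as a small query parser. This is an illustration under assumptions, not the Kitodo.Production implementation: the class `ProcessTitleQuery` is hypothetical, and plain case-insensitive substring matching stands in for the string comparison of the 2.x database search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ProcessTitleQuery {
    final List<String> required = new ArrayList<>();
    final List<String> excluded = new ArrayList<>();

    /** Splits the query on whitespace; only a hyphen at the start of a term negates it. */
    static ProcessTitleQuery parse(String query) {
        ProcessTitleQuery parsed = new ProcessTitleQuery();
        for (String term : query.trim().split("\\s+")) {
            String lower = term.toLowerCase(Locale.ROOT);
            if (lower.length() > 1 && lower.startsWith("-")) {
                // hyphen preceded by a space (or at the start of the query): exclude the term
                parsed.excluded.add(lower.substring(1));
            } else if (!lower.isEmpty()) {
                // hyphens inside a term stay part of the term
                parsed.required.add(lower);
            }
        }
        return parsed;
    }

    /** All required terms must occur in the title, and no excluded term may occur. */
    boolean matches(String processTitle) {
        String title = processTitle.toLowerCase(Locale.ROOT);
        return required.stream().allMatch(title::contains)
                && excluded.stream().noneMatch(title::contains);
    }

    public static void main(String[] args) {
        String title = "Aben_774618868-1873032402_02-a";
        System.out.println(parse("Aben_774618868-1873032402_02-a").matches(title)); // true
        System.out.println(parse("Aben_774618868 -1873").matches(title));           // false
        System.out.println(parse("Aben 1873").matches(title));                      // true
    }
}
```

Substring matching is the simplest model of the requested behaviour; an index-based search would need a tokenization that produces the same hits.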

matthias-ronge · 2020-11-04T07:42:23Z

When changing the search from a database-based search (string comparison) to an index search, it makes sense to reconsider the requirements rather than simply reproduce the old behaviour. Reproducing it is possible, but it leads to an extremely large search engine index (a lot of hard drive space and indexing time; search response time is not affected).

I notice that the search should not be for individual digits, but that the numbers should be found in the given order (1873 should not find 348716). This means that, during indexing, sequences of numbers must be tokenized into all possible partial sequences, but in search queries, sequences of numbers must be treated as one term.

Do I see it correctly that one is actually only looking for the incipits of sequences of numbers? (A query for 1873 would not need to match 1873 anywhere within a digit sequence; matching it like 1873*, at the start of a sequence, would be sufficient.) That would greatly reduce the number of terms to be indexed:

Example of all unique partial sequence tokens of 1873032402: 1, 18, 187, 1873, 18730, 187303, 1873032, 18730324, 187303240, 1873032402, 8, 87, 873, 8730, 87303, 873032, 8730324, 87303240, 873032402, 7, 73, 730, 7303, 73032, 730324, 7303240, 73032402, 3, 30, 303, 3032, 30324, 303240, 3032402, 0, 03, 032, 0324, 03240, 032402, 32, 324, 3240, 32402, 2, 24, 240, 2402, 4, 40, 402, 02. (52 index entries)

Example of only initial partial sequence tokens of 1873032402: 1, 18, 187, 1873, 18730, 187303, 1873032, 18730324, 187303240, 1873032402. (10 index entries)
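The two counts can be checked with a few lines of code. This is a minimal sketch; the class `PartialSequenceTokens` and its method names are made up for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class PartialSequenceTokens {

    /** Every unique contiguous partial sequence, starting at any position. */
    static Set<String> allPartialSequences(String sequence) {
        Set<String> tokens = new LinkedHashSet<>();
        for (int start = 0; start < sequence.length(); start++) {
            for (int end = start + 1; end <= sequence.length(); end++) {
                tokens.add(sequence.substring(start, end));
            }
        }
        return tokens;
    }

    /** Only the partial sequences that begin at the first character. */
    static Set<String> initialPartialSequences(String sequence) {
        Set<String> tokens = new LinkedHashSet<>();
        for (int end = 1; end <= sequence.length(); end++) {
            tokens.add(sequence.substring(0, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(allPartialSequences("1873032402").size());     // 52
        System.out.println(initialPartialSequences("1873032402").size()); // 10
    }
}
```

For a sequence of n characters, the first variant grows roughly with n squared, the second only with n.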

Side note: In our field there are many sequences of numbers that end with a check digit that is calculated according to modulo 11 (letter X as the last number). These Xes at the end, immediately preceded by at least one digit, should be seen as part of the sequence of numbers and not as a single letter, right?

matthias-ronge · 2020-11-04T08:33:17Z

If my assumption is right, this should do the job for process title indexing tokenization:

import java.text.*;
import java.util.*;
import java.util.regex.*;

static Pattern GROUPS_OF_ALPHANUMERIC_CHARACTERS = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

static Set<String> tokenizeProcessTitle(String processTitle) {
    Set<String> tokens = new HashSet<>();
    Matcher matcher = GROUPS_OF_ALPHANUMERIC_CHARACTERS.matcher(processTitle);
    while (matcher.find()) {
        String normalized = normalize(matcher.group());
        int length = normalized.length();
        for (int end = 1; end <= length; end++) {
            tokens.add(normalized.substring(0, end));
        }
    }
    return tokens;
}

static String normalize(String input) {
    StringBuilder umlautsReplaced = replaceUmlauts(input);
    String noDiacritics = Normalizer.normalize(umlautsReplaced, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    // Locale.ROOT keeps lowercasing independent of the JVM's default locale
    String lowerCase = noDiacritics.toLowerCase(Locale.ROOT);
    return lowerCase;
}

static StringBuilder replaceUmlauts(String input) {
    StringBuilder buffer = new StringBuilder(64);
    final int length = input.length();
    for (int offset = 0; offset < length;) {
        int codepoint = input.codePointAt(offset);
        if (codepoint == 'Ä' || codepoint == 'ä') {
            buffer.append("ae");
        } else if (codepoint == 'Ö' || codepoint == 'ö') {
            buffer.append("oe");
        } else if (codepoint == 'Ü' || codepoint == 'ü') {
            buffer.append("ue");
        } else if (codepoint == 0x1E9E || codepoint == 'ß') { // U+1E9E is the capital sharp s
            buffer.append("ss");
        } else {
            buffer.appendCodePoint(codepoint);
        }
        offset += Character.charCount(codepoint);
    }
    return buffer;
}

"PineSeve_313539383" would be searchable with these input strings: p, pi, pin, pine, pines, pinese, pinesev, pineseve, 3, 31, 313, 3135, 31353, 313539, 3135393, 31353938, 313539383. (17 index records)

I have absolutely no idea where to put this. It may need to be implemented within Elasticsearch.
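How such prefix tokens could be used at search time can be sketched with an in-memory index. This is only an illustration under assumptions: the class `PrefixIndexDemo` is hypothetical, it uses plain lowercasing instead of the full normalization above, it ignores negation, and it says nothing about how this would be configured in the search engine. Each alphanumeric group of the query is looked up as one whole term, and the hit lists are combined with AND.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixIndexDemo {
    static final Pattern GROUPS = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

    /** Maps each prefix token to the process titles that contain it. */
    final Map<String, Set<String>> index = new HashMap<>();

    /** Indexes every prefix of every alphanumeric group of the title. */
    void add(String processTitle) {
        Matcher matcher = GROUPS.matcher(processTitle);
        while (matcher.find()) {
            String group = matcher.group().toLowerCase(Locale.ROOT);
            for (int end = 1; end <= group.length(); end++) {
                index.computeIfAbsent(group.substring(0, end), key -> new TreeSet<>())
                        .add(processTitle);
            }
        }
    }

    /** Looks up each group of the query as a single term and intersects the hits. */
    Set<String> search(String query) {
        Set<String> hits = null;
        Matcher matcher = GROUPS.matcher(query);
        while (matcher.find()) {
            String term = matcher.group().toLowerCase(Locale.ROOT);
            Set<String> termHits = index.getOrDefault(term, Set.of());
            if (hits == null) {
                hits = new TreeSet<>(termHits);
            } else {
                hits.retainAll(termHits);
            }
        }
        return hits == null ? Set.of() : hits;
    }

    public static void main(String[] args) {
        PrefixIndexDemo demo = new PrefixIndexDemo();
        demo.add("Aben_774618868-18730321");
        demo.add("Aben_774618868-18730322");
        demo.add("Aben_399196951-1818");
        // prints [Aben_774618868-18730321, Aben_774618868-18730322]
        System.out.println(demo.search("Aben_774618868-1873032"));
        // prints [] because 8730 is not at the start of any group
        System.out.println(demo.search("8730"));
    }
}
```

Because the hyphen is not part of any group, it merely separates two terms here and cannot act as a negation sign.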

andre-hohmann · 2020-11-10T12:12:25Z

@matthias-ronge: Thanks for the examination!

You are right, it is always good to think about the opportunities and to improve the current state. As a user, the result is more important to me than the technical basis. However, I am sure we will find a solution.

> Do I see it correctly that one is actually only looking for the incipits of sequences of numbers? (A query for 1873 would not need to match 1873 anywhere within a digit sequence; matching it like 1873*, at the start of a sequence, would be sufficient.) That would greatly reduce the number of terms to be indexed.

I can only describe my requirements for the search. For newspaper processes, it is extremely helpful, for example for administrative exports, to be able to search for processes by year, month, ...

Aben_399196951-1818
Aben_399196951-181801
Aben_399196951-181802
...
399196951-1818
399196951-181801
399196951-181802
...

Thus, from my point of view, a search for "1873*" would be sufficient.

> Side note: In our field there are many sequences of numbers that end with a check digit that is calculated according to modulo 11 (letter X as the last number). These Xes at the end, immediately preceded by at least one digit, should be seen as part of the sequence of numbers and not as a single letter, right?

Yes, from my point of view, the X as for example in the following process title is part of the sequence just like 1, 7, 2, 7, ....

AdleaDM_172788177X
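As a quick check, the grouping pattern from the tokenization proposal earlier in this thread already keeps such a trailing X together with the digits, because letters and digits fall into the same character class. The class `CheckDigitGroupDemo` and its `groups` helper are made up for this sketch; only the pattern is taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CheckDigitGroupDemo {
    // same pattern as in the tokenization proposal above
    static final Pattern GROUPS_OF_ALPHANUMERIC_CHARACTERS =
            Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

    /** Returns the alphanumeric groups of a process title, in order of appearance. */
    static List<String> groups(String processTitle) {
        List<String> result = new ArrayList<>();
        Matcher matcher = GROUPS_OF_ALPHANUMERIC_CHARACTERS.matcher(processTitle);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [AdleaDM, 172788177X]
        System.out.println(groups("AdleaDM_172788177X"));
    }
}
```

A tokenizer that split letters from digits would need an explicit rule for the check digit; with this pattern, no special case is required.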

andre-hohmann added the 3.x label Aug 13, 2020

matthias-ronge added the bug label Nov 3, 2020

Kathrin-Huber mentioned this issue Mar 9, 2021

use keyword for wildcardsearch #4263

Merged

Kathrin-Huber closed this as completed in #4263 Apr 13, 2021

andre-hohmann mentioned this issue Feb 11, 2023

Creation of new labels to classify the issues #4985

Closed

matthias-ronge added the search search, filter label Mar 20, 2023
