Wildcard search does not work properly #3938
Comments
This has to do with the tokenization settings of the search engine. Tokenization is a necessary step in search engine indexing: each text is broken down into a number of normalized tokens, and for each token the index stores which data records contain it. That is an essential part of how a search engine works. When searching, the search query is tokenized as well, and the resulting tokens are combined as an AND search.

From the fact that you can search for partial number sequences, it can be deduced that numbers are tokenized as individual digits (otherwise you would not find Aben_774618868-1873032402_02-a with The search query

The design of a search engine index must always begin with the question of what should be found with which search query. The tokenization must be set up accordingly. Specifically: Would you like to find all process titles with a partial number sequence that contain this partial number sequence? Yes, but only in this order? Even if there is a hyphen in between? Should the process title Aben_774618868-1873032402_02-a also be found when searching for

Depending on all of these considerations, the tokenization of the documents to be indexed and of the search queries must be implemented. The minimum requirement here seems to be that a hyphen that is not preceded by a space should not be interpreted as a negation sign in query tokenization.
|
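The minimum requirement from the comment above (a hyphen counts as a negation sign only when it is not glued to the preceding characters) could be sketched roughly as follows. This is only an illustration under assumptions: the class `QueryHyphenSketch`, the `Term` type and the `parse` method are hypothetical names, and this is not how Kitodo.Production or the search engine actually parses queries.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryHyphenSketch {

    /** Hypothetical holder for one query term and whether it is negated. */
    public static final class Term {
        public final String text;
        public final boolean negated;

        Term(String text, boolean negated) {
            this.text = text;
            this.negated = negated;
        }

        @Override
        public String toString() {
            return (negated ? "NOT " : "") + text;
        }
    }

    /**
     * Splits the query on whitespace. A leading hyphen on a word is, by
     * construction, either at the start of the query or preceded by a space,
     * so only that hyphen is read as negation. Hyphens inside a word stay
     * part of the term.
     */
    public static List<Term> parse(String query) {
        List<Term> terms = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty query
            }
            boolean negated = word.length() > 1 && word.charAt(0) == '-';
            terms.add(new Term(negated ? word.substring(1) : word, negated));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(parse("Aben_774618868-1873032402"));
        System.out.println(parse("Aben -1873"));
    }
}
```

With this rule, `Aben_774618868-1873032402` stays a single positive term, while `Aben -1873` yields the positive term `Aben` and the negated term `1873`. Further tokenization of each term would happen afterwards.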
Thanks a lot for the extensive explanation. I would aim for the same behaviour as in Kitodo.Production 2.x to avoid future questions and complaints. In the Kitodo-Wiki, you can find some information:
Regarding your questions:
|
When changing the search from a database-based search (string comparison) to an index search, it makes sense to reconsider the behaviour rather than simply reproduce everything as it was. Reproducing it is possible, but it leads to an extremely large search engine index (a lot of hard drive space and indexing time; search response time is not affected).

I notice that the search should not be for individual digits, but that the numbers should be found in the given order (1873 should not find 348716). This means that, during indexing, sequences of numbers must be tokenized into all possible partial sequences, but in search queries, sequences of numbers must be treated as one term.

Do I see it correctly that one is actually only looking for the incipits of sequences of numbers? (1873 does not need to search like *1873* within digit sequences; it is sufficient to search like 1873*.) That would greatly reduce the number of terms to be indexed:

Example of all unique partial sequence tokens of 1873032402: 1, 18, 187, 1873, 18730, 187303, 1873032, 18730324, 187303240, 1873032402, 8, 87, 873, 8730, 87303, 873032, 8730324, 87303240, 873032402, 7, 73, 730, 7303, 73032, 730324, 7303240, 73032402, 3, 30, 303, 3032, 30324, 303240, 3032402, 0, 03, 032, 0324, 03240, 032402, 32, 324, 3240, 32402, 2, 24, 240, 2402, 4, 40, 402, 02. (52 index entries)

Example of only initial partial sequence tokens of 1873032402: 1, 18, 187, 1873, 18730, 187303, 1873032, 18730324, 187303240, 1873032402. (10 index entries)

Side note: In our field there are many sequences of numbers that end with a check digit calculated modulo 11 (with the letter X as a possible last character). Such an X at the end, immediately preceded by at least one digit, should be seen as part of the sequence of numbers and not as a single letter, right?
|
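The two counts in the comment above can be reproduced with a few lines of Java. This is just a checking aid; the class and method names (`PartialSequenceCount`, `allSubstrings`, `prefixes`) are made up for this sketch.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class PartialSequenceCount {

    /** Every unique contiguous partial sequence of the input. */
    public static Set<String> allSubstrings(String digits) {
        Set<String> tokens = new LinkedHashSet<>();
        for (int start = 0; start < digits.length(); start++) {
            for (int end = start + 1; end <= digits.length(); end++) {
                tokens.add(digits.substring(start, end));
            }
        }
        return tokens;
    }

    /** Only the partial sequences that begin at the first character. */
    public static Set<String> prefixes(String digits) {
        Set<String> tokens = new LinkedHashSet<>();
        for (int end = 1; end <= digits.length(); end++) {
            tokens.add(digits.substring(0, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String number = "1873032402";
        System.out.println(allSubstrings(number).size()); // 52
        System.out.println(prefixes(number).size());      // 10
    }
}
```

A ten-digit number has 55 contiguous partial sequences; here the single digits 3, 0 and 2 each occur twice, which leaves 52 unique tokens. In general the full variant grows quadratically with the length of the number, the prefix variant only linearly.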
If my assumption is right, this should do the job for process title indexing tokenization:

import java.text.*;
import java.util.*;
import java.util.regex.*;

// The wrapper class is only there so that the snippet compiles; its name is a placeholder.
public class ProcessTitleTokenizer {

    // Maximal runs of letters and ASCII digits. Everything else (underscore,
    // hyphen, space, ...) separates groups. A trailing check-digit X stays in
    // the same group as the digits before it.
    static final Pattern GROUPS_OF_ALPHANUMERIC_CHARACTERS = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

    // Returns, for every alphanumeric group in the title, all of its prefixes.
    static Set<String> tokenizeProcessTitle(String processTitle) {
        Set<String> tokens = new HashSet<>();
        Matcher matcher = GROUPS_OF_ALPHANUMERIC_CHARACTERS.matcher(processTitle);
        while (matcher.find()) {
            String normalized = normalize(matcher.group());
            int length = normalized.length();
            for (int end = 1; end <= length; end++) {
                tokens.add(normalized.substring(0, end));
            }
        }
        return tokens;
    }

    // Expands German umlauts, strips remaining diacritics, and lower-cases.
    static String normalize(String input) {
        StringBuilder umlautsReplaced = replaceUmlauts(input);
        // NFD splits base letters from combining marks, which \p{M} then removes.
        String noDiacritics = Normalizer.normalize(umlautsReplaced, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        // Locale.ROOT keeps the result independent of the JVM's default locale.
        return noDiacritics.toLowerCase(Locale.ROOT);
    }

    // Replaces ä/ö/ü/ß (both cases) with their two-letter transcriptions.
    static StringBuilder replaceUmlauts(String input) {
        StringBuilder buffer = new StringBuilder(64);
        final int length = input.length();
        for (int offset = 0; offset < length;) {
            int codepoint = input.codePointAt(offset);
            if (codepoint == 'Ä' || codepoint == 'ä') {
                buffer.append("ae");
            } else if (codepoint == 'Ö' || codepoint == 'ö') {
                buffer.append("oe");
            } else if (codepoint == 'Ü' || codepoint == 'ü') {
                buffer.append("ue");
            } else if (codepoint == 7838 || codepoint == 'ß') { // 7838 = U+1E9E, capital sharp s
                buffer.append("ss");
            } else {
                buffer.appendCodePoint(codepoint);
            }
            offset += Character.charCount(codepoint);
        }
        return buffer;
    }
}

"PineSeve_313539383" would be searchable with these input strings: p, pi, pin, pine, pines, pinese, pinesev, pineseve, 3, 31, 313, 3135, 31353, 313539, 3135393, 31353938, 313539383. (17 index records)

I have absolutely no idea where to put that in. This may need to be implemented within Elasticsearch.
|
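To make the intended interplay between indexing and querying concrete, here is a simplified, self-contained sketch. It assumes that index tokens are the prefixes of each alphanumeric group, that each group in the query is kept as one whole term, and that all query terms must be present (AND search). The names `PrefixSearchSketch`, `indexTokens`, `queryTerms` and `matches` are hypothetical, umlaut handling is left out, and a real implementation would live in the search engine's analyzer configuration rather than in application code.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixSearchSketch {

    private static final Pattern ALNUM_GROUPS = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

    /** Index side: every prefix of every alphanumeric group becomes a token. */
    public static Set<String> indexTokens(String title) {
        Set<String> tokens = new HashSet<>();
        Matcher m = ALNUM_GROUPS.matcher(title);
        while (m.find()) {
            String group = m.group().toLowerCase(Locale.ROOT);
            for (int end = 1; end <= group.length(); end++) {
                tokens.add(group.substring(0, end));
            }
        }
        return tokens;
    }

    /** Query side: each alphanumeric group is kept whole as one term. */
    public static Set<String> queryTerms(String query) {
        Set<String> terms = new HashSet<>();
        Matcher m = ALNUM_GROUPS.matcher(query);
        while (m.find()) {
            terms.add(m.group().toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    /** AND search: the title matches if it contains every query term as a token. */
    public static boolean matches(String title, String query) {
        return indexTokens(title).containsAll(queryTerms(query));
    }

    public static void main(String[] args) {
        String title = "Aben_774618868-1873032402_02-a";
        System.out.println(matches(title, "Aben_774618868-1873032")); // true
        System.out.println(matches(title, "1873"));                   // true
        System.out.println(matches(title, "8730"));                   // false
        System.out.println(matches("Test_348716", "1873"));           // false
    }
}
```

The third call shows the trade-off of the prefix-only index: a number can be found by its beginning, but not by a sequence from its middle. In this sketch the hyphen is simply a separator and is never read as negation.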
@matthias-ronge: Thanks for the examination! You are right, it is always good to think about the opportunities and to improve the current state. As a user, the result is more important to me than the technical basis. However, I am sure we will find a solution.
I can only describe my demands for the search. For newspaper processes, it is extremely helpful for administrative exports, ... to be able to search for processes by year, month, ...
Thus, from my point of view, a search for "1873*" would be sufficient.
Yes, from my point of view, the X as for example in the following process title is part of the sequence just like 1, 7, 2, 7, ....
|
Problem
The wildcard search is only possible in specific parts of the process title:
Solution
It should be possible to search for "Aben_774618868-1873032" and get the following results: