Add BM25 and TFIDF Scoring to the text index #1688

Flixtastic · 2024-12-17T00:52:28Z

When building the text index, one can now choose which scoring metric is used. The chosen metric determines how the scores are calculated during index building. At retrieval time, the calculated scores are shown and can be used to rank the documents containing the search words by relevance.
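For orientation, here is a minimal sketch of the two metrics in their common textbook form. It is not taken from this PR: the function names, the `Bm25Params` struct, and the plain `log(N / df)` variant of the inverse document frequency are assumptions for illustration. Only the default values `b = 0.75` and `k = 1.75` come from the diff under review.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative parameter bundle; the names are hypothetical, not QLever's API.
struct Bm25Params {
  double b = 0.75;  // document-length normalization, expected in [0, 1]
  double k = 1.75;  // term-frequency saturation
};

// Textbook inverse document frequency: log(N / df). Other variants exist.
inline double inverseDocumentFrequency(std::size_t numDocs,
                                       std::size_t docFrequency) {
  assert(docFrequency > 0 && docFrequency <= numDocs);
  return std::log(static_cast<double>(numDocs) /
                  static_cast<double>(docFrequency));
}

// TF-IDF: raw term frequency times inverse document frequency.
inline double tfIdfScore(std::size_t termFrequency, std::size_t numDocs,
                         std::size_t docFrequency) {
  return static_cast<double>(termFrequency) *
         inverseDocumentFrequency(numDocs, docFrequency);
}

// BM25: the term frequency saturates with k and is normalized by the
// document length relative to the average document length, weighted by b.
inline double bm25Score(std::size_t termFrequency, std::size_t numDocs,
                        std::size_t docFrequency, double docLength,
                        double averageDocLength, Bm25Params params = {}) {
  assert(averageDocLength > 0.0);
  const double tf = static_cast<double>(termFrequency);
  const double lengthNorm =
      1.0 - params.b + params.b * (docLength / averageDocLength);
  return inverseDocumentFrequency(numDocs, docFrequency) * tf *
         (params.k + 1.0) / (tf + params.k * lengthNorm);
}
```

With `b = 0` the document length has no influence; with `b = 1` the term frequency is fully normalized by the relative document length.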

sparql-conformance · 2025-02-24T07:14:51Z

Conformance check passed ✅

No test result changes.

Details: https://qlever.cs.uni-freiburg.de/sparql-conformance-ui?cur=ce236e6da50b8b4c588834ec1ac38f6789776b8a&prev=8fe06428ee1dddbb3ebcb41a1d93525075571bc1

sonarqubecloud · 2025-02-24T08:17:10Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

joka921

A thorough pass on everything but the tests.

joka921 · 2025-02-24T15:06:16Z

src/index/IndexBuilderMain.cpp

@@ -214,6 +217,16 @@ int main(int argc, char** argv) {
  add("add-text-index,A", po::bool_switch(&onlyAddTextIndex),
      "Only build the text index. Assumes that a knowledge graph index with "
      "the same `index-basename` already exists.");
+  add("set-bm25-b-param", po::value(&bScoringParam),
+      "Sets the b param in the BM25 scoring metric. This has to be between "
+      "(including) 0 and 1. The default is 0.75.");


Doesn't boost::program_options automatically show the default?

joka921 · 2025-02-24T15:07:39Z

src/index/IndexBuilderMain.cpp

+  add("set-bm25-b-param", po::value(&bScoringParam),
+      "Sets the b param in the BM25 scoring metric. This has to be between "
+      "(including) 0 and 1. The default is 0.75.");
+  add("set-bm25-k-param", po::value(&kScoringParam),


Why not just --bm25-k? The rest of the name is redundant (same for the others: --bm25-b, --text-score-metric).

But in the comments, always add "for the fulltext index" to make clear where this applies.

joka921 · 2025-02-24T15:08:40Z

src/index/IndexImpl.Text.cpp

@@ -63,6 +63,7 @@ cppcoro::generator<WordsFileLine> IndexImpl::wordsInTextRecords(
 }

 // _____________________________________________________________________________
+


Suggested change: remove the added blank line.

joka921 · 2025-02-24T15:11:47Z

src/index/IndexImpl.Text.cpp

  // or both (but at least one of them, otherwise this function is not called).
-  if (!contextFile.empty()) {
-    LOG(INFO) << "Reading words from \"" << contextFile << "\"" << std::endl;
+  if (!(wordsAndDocsFile.first.empty() && wordsAndDocsFile.second.empty())) {


What happens if we have only one of the two files specified?

Also, std::optional is better suited to model "a filename or nothing".
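A minimal sketch of what this could look like, assuming a hypothetical `TextIndexInputFiles` struct and `classifyInput` helper (neither is part of the PR). The point is that each of the four cases, including "only one file given", becomes an explicit, named state instead of a combination of empty strings.

```cpp
#include <optional>
#include <string>

// Hypothetical replacement for the pair of possibly-empty strings.
struct TextIndexInputFiles {
  std::optional<std::string> wordsFile;
  std::optional<std::string> docsFile;
};

enum class TextIndexInputMode { None, WordsOnly, DocsOnly, WordsAndDocs };

// Compute each condition once and name it, so that no caller has to repeat
// the `empty() && empty()` check.
inline TextIndexInputMode classifyInput(const TextIndexInputFiles& files) {
  const bool hasWords = files.wordsFile.has_value();
  const bool hasDocs = files.docsFile.has_value();
  if (hasWords && hasDocs) return TextIndexInputMode::WordsAndDocs;
  if (hasWords) return TextIndexInputMode::WordsOnly;
  if (hasDocs) return TextIndexInputMode::DocsOnly;
  return TextIndexInputMode::None;
}
```

Callers can then `switch` over the mode and decide deliberately what a single file means, rather than inheriting whatever the string checks happen to do.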

joka921 · 2025-02-24T15:12:47Z

src/index/IndexImpl.Text.cpp

+    LOG(INFO) << ((wordsAndDocsFile.first.empty() &&
+                   wordsAndDocsFile.second.empty())


That condition is repeated, store it in a variable.

joka921 · 2025-02-24T15:51:36Z

src/index/TextScoring.cpp

+  if (docIdSet_.contains(convertedContextId)) {
+    docId = DocumentIndex::make(contextId.get());
+  } else {
+    auto it = docIdSet_.upper_bound(convertedContextId);
+    if (it == docIdSet_.end()) {
+      if (docIdSet_.empty()) {
+        AD_THROW("docIdSet is empty and shouldn't be");
+      }
+      LOG(DEBUG) << "Requesting a contextId that is bigger than the largest "
+                    "docId. contextId: "
+                 << contextId.get() << " Largest docId: " << *docIdSet_.rbegin()
+                 << std::endl;


I don't quite understand this difference between docId and contextId, but I trust you there. We should get rid of this mechanism.
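As a sketch only: if the intended mapping is "the smallest docId that is greater than or equal to the contextId" (an assumption read off the `contains` plus `upper_bound` pattern in the diff, not confirmed by the PR), both branches collapse into a single `lower_bound` call. The function name and the plain integer ID type are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <set>

// Returns the smallest docId >= contextId, or std::nullopt if contextId is
// larger than every known docId (this also covers an empty set).
inline std::optional<std::uint64_t> docIdForContext(
    const std::set<std::uint64_t>& docIds, std::uint64_t contextId) {
  // lower_bound yields the first element that is not less than contextId,
  // which covers both the exact match and the "next larger docId" case.
  const auto it = docIds.lower_bound(contextId);
  if (it == docIds.end()) {
    return std::nullopt;
  }
  return *it;
}
```

Returning `std::optional` pushes the "no matching document" case to the caller, where it can be treated as an error or as a score of zero.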

joka921 · 2025-02-24T15:52:44Z

src/index/TextScoring.cpp

+  auto ret1 = innerMap.find(docId);
+  if (ret1 == innerMap.end()) {
+    LOG(DEBUG) << "The calculated docId doesn't exist in the inner Map. docId: "
+               << docId << std::endl;
+    return 0;


Again, is that an error, or can this happen?

joka921 · 2025-02-24T15:53:36Z

src/index/TextScoring.cpp

+  if (ret2 == docLengthMap_.end()) {
+    LOG(DEBUG)
+        << "The calculated docId doesn't exist in the dochLengthMap. docId: "
+        << docId << std::endl;
+    return 0;


Same question.
Is this okay, a programming bug, or a bug in the input data?
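One way to make the answer visible in the code is to offer two lookups with different contracts. This is a hedged sketch with hypothetical names (`findDocLength`, `requireDocLength`) and a plain `std::unordered_map`; the actual types in `TextScoring.cpp` may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

using DocLengthMap = std::unordered_map<std::uint64_t, std::size_t>;

// Use when a missing entry is legitimate: the caller decides what to do,
// for example to score the document with zero.
inline std::optional<std::size_t> findDocLength(const DocLengthMap& lengths,
                                                std::uint64_t docId) {
  const auto it = lengths.find(docId);
  if (it == lengths.end()) {
    return std::nullopt;
  }
  return it->second;
}

// Use when a missing entry means broken input data or a programming bug:
// fail loudly instead of silently returning a score of zero.
inline std::size_t requireDocLength(const DocLengthMap& lengths,
                                    std::uint64_t docId) {
  const auto length = findDocLength(lengths, docId);
  if (!length.has_value()) {
    throw std::out_of_range("No document length stored for docId " +
                            std::to_string(docId));
  }
  return *length;
}
```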

joka921 · 2025-02-24T15:53:55Z

src/index/TextScoring.h

+
+#include "index/Index.h"
+#include "parser/WordsAndDocsFileParser.h"
+


Comments please.

joka921 · 2025-02-24T15:54:26Z

src/index/TextScoring.h

+      : scoringMetric_(TextScoringMetric::COUNT),
+        b_(0.75),
+        k_(1.75),


Use default member initializers for the class members; I am sure SonarCloud also told you this.
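A minimal sketch of the requested style, using a stand-in class. The class name `ScoreSettings`, the accessors, and the enumerators `TFIDF` and `BM25` are assumptions; only `TextScoringMetric::COUNT` and the values 0.75 and 1.75 appear in the diff.

```cpp
// Stand-in for the real enum; TFIDF and BM25 are assumed enumerator names.
enum class TextScoringMetric { COUNT, TFIDF, BM25 };

class ScoreSettings {
 public:
  // The defaults live next to the member declarations, so the default
  // constructor needs no initializer list.
  ScoreSettings() = default;

  // Other constructors override only what they are given.
  ScoreSettings(TextScoringMetric metric, float b, float k)
      : scoringMetric_{metric}, b_{b}, k_{k} {}

  TextScoringMetric scoringMetric() const { return scoringMetric_; }
  float b() const { return b_; }
  float k() const { return k_; }

 private:
  TextScoringMetric scoringMetric_ = TextScoringMetric::COUNT;
  float b_ = 0.75f;
  float k_ = 1.75f;
};
```

This keeps every default in exactly one place, so a newly added constructor cannot forget to initialize a member.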

Flixtastic and others added 30 commits July 12, 2024 03:12

ql:contains-word now can show the respective word-score.

ea9d39c

Fixed tests and formatted files.

30736ef

New formatting for Word Score Variables. Changed where necessary and …

e752db8

…adapted unit tests. Missing e2e tests.

Merge branch 'ad-freiburg:master' into master

4ef4d93

Merge branch 'ad-freiburg:master' into master

d52063f

Merge branch 'master' of github.com:Flixtastic/qlever.

c6fe0c6

Commit doesn't contain all changes necessary for pull request yet.

Added getWordSCoreVariable for std::string_view

d0b9ee8

Merge branch 'ad-freiburg:master' into master

2eade97

Merge branch 'ad-freiburg:master' into master

595cb57

Merge branch 'ad-freiburg:master' into master

b4c8c3b

Merge branch 'ad-freiburg:master' into master

72e5d64

Merge branch 'ad-freiburg:master' into master

d8f9df4

Made it possible to construct query execution contexts with text inde…

29511c6

…x. This is done through passing the words and docsfile as string, and then building the text index as normal. Basic Test is existent (TODO make more edge case tests) and e2e testing is fixed.

Merge branch 'ad-freiburg:master' into master

3855978

Reduced usage of column copying in TextIndexScanForWord.cpp

6021401

Merge branch 'ad-freiburg:master' into master

d9701ae

Merge branch 'ad-freiburg:master' into master

5f0ce01

Merge branch 'ad-freiburg:master' into master

e2c47cf

Merge branch 'ad-freiburg:master' into master

e6a0cf7

Changed the counting of nofNonLiterals to nofLiterals. Some methods a…

ed9fbda

…re still unstable because of the way nofContexts are counted. Implemented new more refined tests.

Merge branch 'ad-freiburg:master' into master

5ad3d8f

Merge branch 'ad-freiburg:master' into master

af6bd64

Cleaned up the filtering in TextIndexScanForWord::computeResult and c…

56ea531

…ommented it

renamed nofLiterals to nofLiteralsInTextIndex

e1e12e9

Removed redundant method getWordScoreVariable

017588c

added method appendEscapedWord to escape special chars in Variables

46666d0

Added two function in the TextIndexScanTestHelpers.h to add content t…

f36f189

…o the wordsFileContent and docsFileContent strings. Now you can clearly see what lines are added and can writing tests is cleaner

Added tests for Scores. Also commented tests and refined them

c62a7e6

Changed the getQec function and the respective makeTestIndex to take …

89f0b27

…in the wordsFileContent and docsFileContent as pair contentsOfWordsFileAndDocsFile

Merge branch 'ad-freiburg:master' into master

058e8ed

Flixtastic and others added 27 commits January 4, 2025 23:45

Removed unnecessary function getRawId

3d02d84

Merge branch 'ad-freiburg:master' into master

dad2d35

Merge branch 'ad-freiburg:master' into words-and-docs-file-parsing

5f28add

Added comments and necessary tests to WordsAndDocsFileParser

f129ecd

Merge branch 'ad-freiburg:master' into master

2f8ed2d

Merge branch 'ad-freiburg:master' into words-and-docs-file-parsing

b699551

Merge branch 'ad-freiburg:master' into master

c1d763d

Merge branch 'ad-freiburg:master' into words-and-docs-file-parsing

1642175

Added comments to WordsAndDcosFileParser.h. Improved useability of te…

8c8a1a1

…sts in WordsAndDocsFileParserTest.cpp. Renamed methods in WordsAndDocsFileLineCreator.h to reduce ambiguity. Incorporated requested small changes of PR.

Rewrite the tokenizer as a view.

0369de6

Signed-off-by: Johannes Kalmbach <[email protected]>

Improved comment, addressed small requested changes

c412983

Addressed sonar issues

46fbb98

Removed the temporary localeManagers in WordsAndDocsFileParserTest.cpp

1e0fc14

Addressed more SonarQube problems

9f9738c

For now excluding helper functions from code coverage since they coul…

a55f2be

…d be outsourced in further refactorings

Reverting last commit

bea5936

Small improvement

349be6d

Merge branch 'ad-freiburg:master' into words-and-docs-file-parsing

1068746

Revert "Merge branch 'words-and-docs-file-parsing'"

c7d348d

This reverts commit dfff837, reversing changes made to a4e9509.

Merge branch 'words-and-docs-file-parsing'

ff720a6

Merge remote-tracking branch 'upstream/master'

62ee6c7

Fixed merging errors

c15cf14

Merge branch 'ad-freiburg:master' into master

21ef985

Ensured backward compatability with old scoring. Score is set as floa…

bc32268

…t but if the old scoring is used the scores are written to file as uint16 and read as uint16 even though they are floats internally.

Trying to fix compiling issues with gcc11

4849151

Merge branch 'ad-freiburg:master' into master

b95a503

Merge branch 'master' into master

ce236e6

joka921 requested changes Feb 24, 2025
