Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BM25 and TFIDF Scoring to the text index #1688

Open
wants to merge 78 commits into
base: master
Choose a base branch
from

Conversation

Flixtastic
Copy link
Contributor

While building the textindex one can define the scoring metrics used. Then during index building the scoring metric chosen defines how the score is calculated. In the retrieval the calculated scores are then shown and can be used to sort the relevancy of documents containing searchwords.

Flixtastic and others added 30 commits July 12, 2024 03:12
Commit doesn't contain all changes necessary for pull request yet.
…x. This is done through passing the words and docsfile as string, and then building the text index as normal. Basic Test is existent (TODO make more edge case tests) and e2e testing is fixed.
…re still unstable because of the way nofContexts are counted. Implemented new more refined tests.
…o the wordsFileContent and docsFileContent strings. Now you can clearly see what lines are added and can writing tests is cleaner
…in the wordsFileContent and docsFileContent as pair contentsOfWordsFileAndDocsFile
Flixtastic and others added 27 commits January 4, 2025 23:45
…sts in WordsAndDocsFileParserTest.cpp. Renamed methods in WordsAndDocsFileLineCreator.h to reduce ambiguity. Incorporated requested small changes of PR.
Signed-off-by: Johannes Kalmbach <[email protected]>
This reverts commit dfff837, reversing
changes made to a4e9509.
…t but if the old scoring is used the scores are written to file as uint16 and read as uint16 even though they are floats internally.
@sparql-conformance
Copy link

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thorough pass on everything but the tests.

@@ -214,6 +217,16 @@ int main(int argc, char** argv) {
add("add-text-index,A", po::bool_switch(&onlyAddTextIndex),
"Only build the text index. Assumes that a knowledge graph index with "
"the same `index-basename` already exists.");
add("set-bm25-b-param", po::value(&bScoringParam),
"Sets the b param in the BM25 scoring metric. This has to be between "
"(including) 0 and 1. The default is 0.75.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't boost::program_options automatically show the default?

add("set-bm25-b-param", po::value(&bScoringParam),
"Sets the b param in the BM25 scoring metric. This has to be between "
"(including) 0 and 1. The default is 0.75.");
add("set-bm25-k-param", po::value(&kScoringParam),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not only --bm25-k, the rest of te name is redndante (same for the others, --bm-25-b, --text-score-metric.

But in the commets always add for the fulltext index, to make sure where this applies.

@@ -63,6 +63,7 @@ cppcoro::generator<WordsFileLine> IndexImpl::wordsInTextRecords(
}

// _____________________________________________________________________________

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

// or both (but at least one of them, otherwise this function is not called).
if (!contextFile.empty()) {
LOG(INFO) << "Reading words from \"" << contextFile << "\"" << std::endl;
if (!(wordsAndDocsFile.first.empty() && wordsAndDocsFile.second.empty())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we have only one of the two files specified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and std::optional is better to model " a filename or nothing"

Comment on lines +137 to +138
LOG(INFO) << ((wordsAndDocsFile.first.empty() &&
wordsAndDocsFile.second.empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That condition is repeated, store it in a variable.

Comment on lines +109 to +120
if (docIdSet_.contains(convertedContextId)) {
docId = DocumentIndex::make(contextId.get());
} else {
auto it = docIdSet_.upper_bound(convertedContextId);
if (it == docIdSet_.end()) {
if (docIdSet_.empty()) {
AD_THROW("docIdSet is empty and shouldn't be");
}
LOG(DEBUG) << "Requesting a contextId that is bigger than the largest "
"docId. contextId: "
<< contextId.get() << " Largest docId: " << *docIdSet_.rbegin()
<< std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this difference between docId and contextId, but I trust you there. We should have to get rid of this mechanism.

Comment on lines +126 to +130
auto ret1 = innerMap.find(docId);
if (ret1 == innerMap.end()) {
LOG(DEBUG) << "The calculated docId doesn't exist in the inner Map. docId: "
<< docId << std::endl;
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, is that an error, or can thi happen?

Comment on lines +139 to +143
if (ret2 == docLengthMap_.end()) {
LOG(DEBUG)
<< "The calculated docId doesn't exist in the dochLengthMap. docId: "
<< docId << std::endl;
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question.
Is this okay, a programming bug, or a bug in the input data?


#include "index/Index.h"
#include "parser/WordsAndDocsFileParser.h"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commets please.

Comment on lines +15 to +17
: scoringMetric_(TextScoringMetric::COUNT),
b_(0.75),
k_(1.75),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use default initializers for the class members I am sure sonarcloud also told you this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants