-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BM25 and TFIDF Scoring to the text index #1688
base: master
Are you sure you want to change the base?
Conversation
…adapted unit tests. Missing e2e tests.
Commit doesn't contain all changes necessary for pull request yet.
…x. This is done through passing the words and docsfile as string, and then building the text index as normal. Basic Test is existent (TODO make more edge case tests) and e2e testing is fixed.
…re still unstable because of the way nofContexts are counted. Implemented new more refined tests.
…o the wordsFileContent and docsFileContent strings. Now you can clearly see what lines are added and can writing tests is cleaner
…in the wordsFileContent and docsFileContent as pair contentsOfWordsFileAndDocsFile
…sts in WordsAndDocsFileParserTest.cpp. Renamed methods in WordsAndDocsFileLineCreator.h to reduce ambiguity. Incorporated requested small changes of PR.
Signed-off-by: Johannes Kalmbach <[email protected]>
…d be outsourced in further refactorings
…t but if the old scoring is used the scores are written to file as uint16 and read as uint16 even though they are floats internally.
Conformance check passed ✅No test result changes. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thorough pass on everything but the tests.
@@ -214,6 +217,16 @@ int main(int argc, char** argv) { | |||
add("add-text-index,A", po::bool_switch(&onlyAddTextIndex), | |||
"Only build the text index. Assumes that a knowledge graph index with " | |||
"the same `index-basename` already exists."); | |||
add("set-bm25-b-param", po::value(&bScoringParam), | |||
"Sets the b param in the BM25 scoring metric. This has to be between " | |||
"(including) 0 and 1. The default is 0.75."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't boost::program_options automatically show the default?
add("set-bm25-b-param", po::value(&bScoringParam), | ||
"Sets the b param in the BM25 scoring metric. This has to be between " | ||
"(including) 0 and 1. The default is 0.75."); | ||
add("set-bm25-k-param", po::value(&kScoringParam), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not only --bm25-k
, the rest of te name is redndante (same for the others, --bm-25-b
, --text-score-metric
.
But in the commets always add for the fulltext index
, to make sure where this applies.
@@ -63,6 +63,7 @@ cppcoro::generator<WordsFileLine> IndexImpl::wordsInTextRecords( | |||
} | |||
|
|||
// _____________________________________________________________________________ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// or both (but at least one of them, otherwise this function is not called). | ||
if (!contextFile.empty()) { | ||
LOG(INFO) << "Reading words from \"" << contextFile << "\"" << std::endl; | ||
if (!(wordsAndDocsFile.first.empty() && wordsAndDocsFile.second.empty())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we have only one of the two files specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and std::optional
is better to model " a filename or nothing"
LOG(INFO) << ((wordsAndDocsFile.first.empty() && | ||
wordsAndDocsFile.second.empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That condition is repeated, store it in a variable.
if (docIdSet_.contains(convertedContextId)) { | ||
docId = DocumentIndex::make(contextId.get()); | ||
} else { | ||
auto it = docIdSet_.upper_bound(convertedContextId); | ||
if (it == docIdSet_.end()) { | ||
if (docIdSet_.empty()) { | ||
AD_THROW("docIdSet is empty and shouldn't be"); | ||
} | ||
LOG(DEBUG) << "Requesting a contextId that is bigger than the largest " | ||
"docId. contextId: " | ||
<< contextId.get() << " Largest docId: " << *docIdSet_.rbegin() | ||
<< std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this difference between docId and contextId, but I trust you there. We should have to get rid of this mechanism.
auto ret1 = innerMap.find(docId); | ||
if (ret1 == innerMap.end()) { | ||
LOG(DEBUG) << "The calculated docId doesn't exist in the inner Map. docId: " | ||
<< docId << std::endl; | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, is that an error, or can thi happen?
if (ret2 == docLengthMap_.end()) { | ||
LOG(DEBUG) | ||
<< "The calculated docId doesn't exist in the dochLengthMap. docId: " | ||
<< docId << std::endl; | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question.
Is this okay, a programming bug, or a bug in the input data?
|
||
#include "index/Index.h" | ||
#include "parser/WordsAndDocsFileParser.h" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commets please.
: scoringMetric_(TextScoringMetric::COUNT), | ||
b_(0.75), | ||
k_(1.75), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use default initializers for the class members I am sure sonarcloud also told you this.
While building the textindex one can define the scoring metrics used. Then during index building the scoring metric chosen defines how the score is calculated. In the retrieval the calculated scores are then shown and can be used to sort the relevancy of documents containing searchwords.