-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write external vocabulary directly to .vocabulary.external
#1253
Write external vocabulary directly to .vocabulary.external
#1253
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1253 +/- ##
==========================================
+ Coverage 87.52% 87.60% +0.08%
==========================================
Files 308 308
Lines 27909 27922 +13
Branches 3122 3124 +2
==========================================
+ Hits 24426 24461 +35
+ Misses 2346 2324 -22
Partials 1137 1137 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thorough 1-1 with Johannes, cool and interesting stuff. Reviews everything except Consumerator.h
(which should be renamed to Consumer.h
) and the tests.
if (!_noIdMapsAndIgnoreExternalVocab) { | ||
outfileExternal_ = | ||
ad_utility::makeOfstream(basename + EXTERNAL_LITS_TEXT_FILE_NAME); | ||
} | ||
|
||
// Open and prepare all infiles and mmap output vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the words "infiles" and "mmap output vectors" still appropriate?
src/index/VocabularyOnDisk.cpp
Outdated
@@ -48,6 +48,34 @@ std::optional<string> VocabularyOnDisk::operator[](uint64_t idx) const { | |||
return result; | |||
} | |||
|
|||
// _____________________________________________________________________________ | |||
ad_utility::ConsumeratorImpl<std::string_view> | |||
VocabularyOnDisk::getWordWriterImpl(std::string outFileName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The get
is confusing, just wordWriterImpl
would be better.
src/index/VocabularyOnDisk.cpp
Outdated
while (co_await ad_utility::valueWasPushedTag) { | ||
const auto& word = co_await ad_utility::nextValueTag; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Find better names for the tags. And do we really need two co_await
s here? Most natural would be something like for (const auto& word = co_await nextWordTag) { ... }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions, I still have to look at the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also looked at the tests now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor changes left
src/index/VocabularyGenerator.h
Outdated
// helper struct used in the priority queue for merging. | ||
// represents tokens/words in a certain partial vocabulary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// helper struct used in the priority queue for merging. | |
// represents tokens/words in a certain partial vocabulary | |
// Helper `struct` for a word from a partial vocabulary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Will test this again with DBLP. The prevous version worked fine with UniProt.
.vocabulary.external
|
So far, the code that merges the partial vocabularies (class
VocabularyMerger
) first wrote the external vocabulary to a temporary file which was afterward copied to the final. vocabulary.external
file. This wasted space during the index build and on HDD also cost considerable time. Now the result of the merge is written directly to. vocabulary.external
.