Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write external vocabulary directly to .vocabulary.external #1253

Merged
merged 22 commits into from
Feb 7, 2024

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Jan 31, 2024

So far, the code that merges the partial vocabularies (class VocabularyMerger) first wrote the external vocabulary to a temporary file which was afterward copied to the final . vocabulary.external file. This wasted space during the index build and on HDD also cost considerable time. Now the result of the merge is written directly to . vocabulary.external.

Copy link

codecov bot commented Jan 31, 2024

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (ac9db29) 87.52% compared to head (ee3f340) 87.60%.
Report is 5 commits behind head on master.

Files Patch % Lines
src/index/VocabularyMergerImpl.h 83.92% 1 Missing and 8 partials ⚠️
src/index/VocabularyOnDisk.cpp 90.90% 0 Missing and 4 partials ⚠️
src/util/ExceptionHandling.h 96.15% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1253      +/-   ##
==========================================
+ Coverage   87.52%   87.60%   +0.08%     
==========================================
  Files         308      308              
  Lines       27909    27922      +13     
  Branches     3122     3124       +2     
==========================================
+ Hits        24426    24461      +35     
+ Misses       2346     2324      -22     
  Partials     1137     1137              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough 1-1 with Johannes, cool and interesting stuff. Reviews everything except Consumerator.h (which should be renamed to Consumer.h) and the tests.

src/index/VocabularyGenerator.h Outdated Show resolved Hide resolved
src/index/VocabularyGenerator.h Outdated Show resolved Hide resolved
if (!_noIdMapsAndIgnoreExternalVocab) {
outfileExternal_ =
ad_utility::makeOfstream(basename + EXTERNAL_LITS_TEXT_FILE_NAME);
}

// Open and prepare all infiles and mmap output vectors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the words "infiles" and "mmap output vectors" still appropriate?

src/index/VocabularyGeneratorImpl.h Outdated Show resolved Hide resolved
src/index/VocabularyGeneratorImpl.h Outdated Show resolved Hide resolved
src/index/VocabularyGeneratorImpl.h Outdated Show resolved Hide resolved
src/index/VocabularyGeneratorImpl.h Outdated Show resolved Hide resolved
@@ -48,6 +48,34 @@ std::optional<string> VocabularyOnDisk::operator[](uint64_t idx) const {
return result;
}

// _____________________________________________________________________________
ad_utility::ConsumeratorImpl<std::string_view>
VocabularyOnDisk::getWordWriterImpl(std::string outFileName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get is confusing, just wordWriterImpl would be better.

Comment on lines 60 to 61
while (co_await ad_utility::valueWasPushedTag) {
const auto& word = co_await ad_utility::nextValueTag;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find better names for the tags. And do we really need two co_awaits here? Most natural would be something like for (const auto& word = co_await nextWordTag) { ... }.

src/index/VocabularyOnDisk.h Outdated Show resolved Hide resolved
Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions, I still have to look at the tests

src/index/VocabularyOnDisk.h Outdated Show resolved Hide resolved
src/index/VocabularyOnDisk.cpp Outdated Show resolved Hide resolved
src/util/ExceptionHandling.h Outdated Show resolved Hide resolved
src/index/ConstantsIndexBuilding.h Outdated Show resolved Hide resolved
Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looked at the tests now

@hannahbast hannahbast marked this pull request as ready for review February 5, 2024 18:57
Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor changes left

Comment on lines 148 to 149
// helper struct used in the priority queue for merging.
// represents tokens/words in a certain partial vocabulary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// helper struct used in the priority queue for merging.
// represents tokens/words in a certain partial vocabulary
// Helper `struct` for a word from a partial vocabulary.

Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Will test this again with DBLP. The prevous version worked fine with UniProt.

@hannahbast hannahbast changed the title Immediately write the vocabulary to a binary file. Write external vocabulary directly to .vocabulary.external Feb 5, 2024
Copy link

sonarqubecloud bot commented Feb 6, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

10 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@joka921 joka921 merged commit fcd824b into ad-freiburg:master Feb 7, 2024
18 checks passed
@joka921 joka921 deleted the one-phase-external-vocabulary branch February 7, 2024 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants