Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow warming partially an sstable for an automaton #2559

Merged
merged 15 commits into from
Jan 8, 2025

Conversation

trinity-1686a
Copy link
Contributor

No description provided.

sstable/src/streamer.rs Outdated Show resolved Hide resolved
// TODO this operation is often cheap for "friendly" automatons, but can be very costly for
// "unfriendly" ones such as ".*a{50}" (very few terms if any match this pattern, but we
// can't know early). In this case, we decompress and iterate over the entire sstable, while
// still being in async context. Ideally we should spawn this on a threadpool.
let range_to_load = term_info
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, it is difficult for readers to know what is the type of range_to_load.

// TODO this operation is often cheap for "friendly" automatons, but can be very costly for
// "unfriendly" ones such as ".*a{50}" (very few terms if any match this pattern, but we
// can't know early). In this case, we decompress and iterate over the entire sstable, while
// still being in async context. Ideally we should spawn this on a threadpool.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So if I understand correctly, the trouble is that the computation is done within an iterator.

We eventually yield, but that's only when we effectively get an element.

Is decompressing a single block on the runtime is fast enough, we could probably write ugly code and have our iterator return Option<Range<usize>> to offer the runtime to yield more often.

If we do it this way, we might also want to actively yield.
I believe tokio only force yield on await, after await has been reach N times. (to be checked). That N might be way too high for us.

The second is probably not of a problem

.get_term_range_async(.., automaton.clone(), None, merge_holes_under_bytes)
.await?;

// we build a 2nd iterator, this one with no holes, so we don't go through blocks we can't
// match, and just download them to reduce our query count. This makes the assumption
// there is a caching layer below, which might not always be true, but is in Quickwit.
let term_info = self.get_term_range_async(.., automaton, None, 0).await?;
let term_infos = self.get_term_range_async(.., automaton, None, 0).await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let term_infos = self.get_term_range_async(.., automaton, None, 0).await?;
let term_info_stream = self.get_term_range_async(.., automaton, None, 0).await?;

@fulmicoton
Copy link
Collaborator

(approved but we need to fix the runtime footgun you identified of course)

let merge_holes_under_bytes = (80 * 1024 * 1024 * 50) / 1000;
// we build a first iterator to download everything. Simply calling the function already
// loads everything, but doesn't start iterating over the sstable.
let mut _term_infos = self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut _term_infos = self
let _term_infos = self

src/termdict/mod.rs Outdated Show resolved Hide resolved
// we could do without an iterator, but this allows us access to coalesce which simplify
// things
let posting_ranges_iter =
std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we break it into several statement.

For instance:

  • Create the iterator of postings ranges.
  • doe the coalesce thing after.

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments

@trinity-1686a trinity-1686a merged commit d281ca3 into main Jan 8, 2025
4 checks passed
@trinity-1686a trinity-1686a deleted the trinity/sstable-partial-automaton branch January 8, 2025 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants