Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize topn requests #5075

Merged
merged 3 commits into from
Jul 5, 2024
Merged

optimize topn requests #5075

merged 3 commits into from
Jul 5, 2024

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Jun 4, 2024

add logic to detect which splits will deliver the top n results for
requests. This is only supported for match_all requests, with optional
sort_by on timestamp sorting.

The change extends the python tests to distribute ndjson to random splits

start_timestamp, end_timestamp as well as a filter on the timestamp field
is not supported currently but could be.
search_after is also not supported currently

https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1861,1862&search_metric=engine_duration

Compare S3 Fetch Requests:
https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1799,1864&search_metric=object_storage_fetch_requests

Addresses #5032

add logic to detect which splits will deliver the top n results for
requests. This is only supported for match_all requests, with optional
sort_by on timestamp sorting.

start_timestamp, end_timestamp as well as a filter on the timestamp field
is not supported currently but could be.
// we want to detect cases where we can convert some split queries to count only queries
let num_requested_docs = request.start_offset + request.max_hits;
match self {
CanSplitDoBetter::SplitIdHigher(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we apply the logic on CanSplitDoBetter instead of the actual sort order here?
Can't we have CanSplitDoBetter set to informative even though docs are requested to be ordered by timestamp?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah answering my own question. if sorted by timestmap, but we don't have a bound, then CanSplitDoBetter is set to Higher or Lower but with None

//
// Calculate the number of splits which are guaranteed to deliver enough documents.
let num_splits = count_required_splits(&split_with_req, num_requested_docs);
assert!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the justification of this assert? Why is this true? Where is the code that enforces it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was there to safeguard the code below for changes later, but I changed the algorithm to handle num_splits=0

.unwrap();
for (split, ref mut request) in split_with_req.iter_mut().skip(num_splits) {
if split.timestamp_end() < smallest_start_timestamp {
disable_search_request_hits(request);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't request count, the resulting disable_search_request does nothing.

Do we have a pass after that to entirely remove this split / request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strangely currently we don't have a no count parameter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have an CountHits enum. We could add that or use the Option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't request count, the resulting disable_search_request does nothing.

No, it disables the returning of hits from that split. That leaves only the count, and in this simple query case the count can be served from metadata and is basically free.
Initially we don't know which split may contain the best splits so we return hits from all splits.

You already have an CountHits enum. We could add that or use the Option?

Yes, but that would be relevant for cases where we can't serve counts from metadata. But in that case we have already the run_all_splits optimization, which means we probably run num_cpu splits before early exiting.

"We should always have at least one split to search"
);
//
// If we know that some splits will deliver enough documents, we can convert the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great explanation

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

Copy link

github-actions bot commented Jun 6, 2024

On SSD:

ERROR: Not the same queries, cannot compare, difference: {'big_term_query_count_only', 'match_all_count_only', 'last_6_hours_sort_timestamp_2', 'last_6_days_sort_timestamp', 'last_6_hours_sort_timestamp'}

On GCS:

ERROR: Not the same queries, cannot compare, difference: {'match_all_count_only', 'last_6_hours_sort_timestamp', 'last_6_days_sort_timestamp', 'big_term_query_count_only', 'last_6_hours_sort_timestamp_2'}

@PSeitz PSeitz merged commit 7845137 into main Jul 5, 2024
4 of 5 checks passed
@PSeitz PSeitz deleted the count_opt branch July 5, 2024 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants