optimize topn requests #5075

PSeitz · 2024-06-04T01:34:41Z

Add logic to detect which splits will deliver the top n results for a
request. This is only supported for match_all requests, optionally with
sort_by on the timestamp field.

The change extends the Python tests to distribute the ndjson documents across random splits.

start_timestamp and end_timestamp, as well as a filter on the timestamp field,
are not supported currently but could be.
search_after is also not supported currently.

https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1861,1862&search_metric=engine_duration

Compare S3 Fetch Requests:
https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1799,1864&search_metric=object_storage_fetch_requests

Addresses #5032

fulmicoton · 2024-06-05T06:09:52Z

quickwit/quickwit-search/src/leaf.rs

+        // we want to detect cases where we can convert some split queries to count only queries
+        let num_requested_docs = request.start_offset + request.max_hits;
+        match self {
+            CanSplitDoBetter::SplitIdHigher(_) => {


why do we apply the logic on CanSplitDoBetter instead of the actual sort order here?
Can't we have CanSplitDoBetter set to informative even though docs are requested to be ordered by timestamp?

Ah, answering my own question: if sorted by timestamp but we don't have a bound, then CanSplitDoBetter is set to Higher or Lower, but with None.
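
To illustrate the answer above, here is a minimal sketch of a tracker that already carries the sort direction while its bound is still unknown. This is a simplified stand-in, not the definition in leaf.rs: the variant names other than `SplitIdHigher`, the `SortBy` type, the `initial_tracker` function, and the mapping from sort order to variant are all assumptions made for illustration.

```rust
/// Simplified stand-in for the tracker discussed in the review.
/// Only `SplitIdHigher` is named in the diff; the other variant
/// names are assumptions.
#[derive(Debug, PartialEq)]
enum CanSplitDoBetter {
    /// No pruning information can be derived for this request.
    Uninformative,
    SplitIdHigher(Option<String>),
    SplitTimestampHigher(Option<i64>),
    SplitTimestampLower(Option<i64>),
}

/// Hypothetical description of the requested sort order.
#[derive(Clone, Copy)]
enum SortBy {
    TimestampDesc,
    TimestampAsc,
    DocId,
    Other,
}

/// Picks the tracker variant before any split has been searched.
/// The bound is `None` because no hit has been collected yet, but the
/// variant itself already records that the request sorts by timestamp.
fn initial_tracker(sort_by: SortBy) -> CanSplitDoBetter {
    match sort_by {
        // Which direction maps to which variant is an assumption.
        SortBy::TimestampDesc => CanSplitDoBetter::SplitTimestampHigher(None),
        SortBy::TimestampAsc => CanSplitDoBetter::SplitTimestampLower(None),
        SortBy::DocId => CanSplitDoBetter::SplitIdHigher(None),
        SortBy::Other => CanSplitDoBetter::Uninformative,
    }
}
```

Under these assumptions, matching on the variant is enough to recover the sort order even when the inner `Option` is `None`, which is why the logic can be applied on the tracker itself.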

fulmicoton · 2024-06-05T06:18:44Z

quickwit/quickwit-search/src/leaf.rs

+                //
+                // Calculate the number of splits which are guaranteed to deliver enough documents.
+                let num_splits = count_required_splits(&split_with_req, num_requested_docs);
+                assert!(


what is the justification of this assert? Why is this true? Where is the code that enforces it?

It was there to safeguard the code below against later changes, but I changed the algorithm to handle num_splits=0.
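
For readers following along, a rough sketch of what a `count_required_splits`-style helper could look like, including the num_splits=0 case mentioned above. The `SplitMeta` type and the exact return semantics are assumptions; the real function in leaf.rs takes `split_with_req` and may behave differently.

```rust
/// Hypothetical split metadata; only the document count matters here.
struct SplitMeta {
    num_docs: u64,
}

/// Returns how many leading splits are needed so that their combined
/// document count reaches `num_requested_docs`.
/// Returns 0 when the splits together hold fewer documents than
/// requested, or when there are no splits at all.
fn count_required_splits(splits: &[SplitMeta], num_requested_docs: u64) -> usize {
    let mut docs_so_far: u64 = 0;
    for (idx, split) in splits.iter().enumerate() {
        docs_so_far = docs_so_far.saturating_add(split.num_docs);
        if docs_so_far >= num_requested_docs {
            // The first `idx + 1` splits are guaranteed to hold enough docs.
            return idx + 1;
        }
    }
    0
}
```

With a return value of 0 meaning "no prefix of splits is guaranteed to suffice", the caller can skip the pruning step instead of asserting that at least one split exists.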

fulmicoton · 2024-06-05T06:37:05Z

quickwit/quickwit-search/src/leaf.rs

+                    .unwrap();
+                for (split, ref mut request) in split_with_req.iter_mut().skip(num_splits) {
+                    if split.timestamp_end() < smallest_start_timestamp {
+                        disable_search_request_hits(request);


if we don't request count, the resulting disable_search_request does nothing.

Do we have a pass after that to entirely remove this split / request?

Strangely, we currently don't have a no-count parameter.

You already have a CountHits enum. We could add that or use the Option?

> if we don't request count, the resulting disable_search_request does nothing.

No, it disables the returning of hits from that split. That leaves only the count, and in this simple query case the count can be served from metadata and is basically free.
Initially we don't know which splits may contain the best hits, so we return hits from all splits.

> You already have a CountHits enum. We could add that or use the Option?

Yes, but that would only be relevant for cases where we can't serve counts from metadata. In that case we already have the run_all_splits optimization, which means we probably run num_cpu splits before exiting early.
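
The pruning step discussed in this thread can be sketched roughly as follows, assuming a descending sort on timestamp so that the first `num_splits` entries are the ones guaranteed to fill the top n. The `Split` and `Request` types, their fields, and the `disable_hits_for_older_splits` name are hypothetical; only `disable_search_request_hits`, `split_with_req`, `num_splits`, and `smallest_start_timestamp` appear in the diff, and the body of `disable_search_request_hits` here is a guess.

```rust
/// Hypothetical, reduced view of a leaf search request.
#[derive(Debug, Clone, PartialEq)]
struct Request {
    start_offset: u64,
    max_hits: u64,
}

/// Hypothetical split metadata with its timestamp range.
struct Split {
    timestamp_start: i64,
    timestamp_end: i64,
}

/// Stops the request from returning hits; the count is still computed.
fn disable_search_request_hits(request: &mut Request) {
    request.max_hits = 0;
    request.start_offset = 0;
}

/// Turns requests into count-only requests for every split that ends
/// strictly before the oldest document of the first `num_splits` splits.
/// Such a split cannot contribute to the top n of a newest-first sort.
fn disable_hits_for_older_splits(split_with_req: &mut [(Split, Request)], num_splits: usize) {
    let Some(smallest_start_timestamp) = split_with_req
        .iter()
        .take(num_splits)
        .map(|(split, _)| split.timestamp_start)
        .min()
    else {
        // num_splits == 0 (or no splits): nothing is guaranteed, keep all hits.
        return;
    };
    for (split, request) in split_with_req.iter_mut().skip(num_splits) {
        if split.timestamp_end < smallest_start_timestamp {
            disable_search_request_hits(request);
        }
    }
}
```

Splits whose time range overlaps the guaranteed ones keep their hits, since they may still hold documents that belong in the top n.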

fulmicoton · 2024-06-05T06:38:47Z

quickwit/quickwit-search/src/leaf.rs

+                    "We should always have at least one split to search"
+                );
+                //
+                // If we know that some splits will deliver enough documents, we can convert the


great explanation

fulmicoton

see comments

github-actions · 2024-06-06T06:19:10Z

On SSD:

ERROR: Not the same queries, cannot compare, difference: {'big_term_query_count_only', 'match_all_count_only', 'last_6_hours_sort_timestamp_2', 'last_6_days_sort_timestamp', 'last_6_hours_sort_timestamp'}

On GCS:

ERROR: Not the same queries, cannot compare, difference: {'match_all_count_only', 'last_6_hours_sort_timestamp', 'last_6_days_sort_timestamp', 'big_term_query_count_only', 'last_6_hours_sort_timestamp_2'}

PSeitz force-pushed the count_opt branch from 6cae67f to 14af027 Compare June 4, 2024 01:58

fulmicoton reviewed Jun 4, 2024

PSeitz requested a review from trinity-1686a June 4, 2024 05:51

PSeitz mentioned this pull request Jun 4, 2024

Add optimization for pure count and count aggregation #5032

Open

fulmicoton reviewed Jun 5, 2024

fulmicoton reviewed Jun 5, 2024

fulmicoton requested changes Jun 5, 2024

PSeitz force-pushed the count_opt branch from d83b044 to 41805a0 Compare June 6, 2024 05:48

PSeitz force-pushed the count_opt branch from 41805a0 to e8057e5 Compare June 6, 2024 06:24

move to function, refactor

c454944

PSeitz force-pushed the count_opt branch from e8057e5 to c454944 Compare June 6, 2024 07:00

PSeitz requested a review from fulmicoton June 7, 2024 00:44

fulmicoton approved these changes Jul 5, 2024

Merge branch 'main' into count_opt

813d7a1

PSeitz merged commit 7845137 into main Jul 5, 2024
4 of 5 checks passed

PSeitz deleted the count_opt branch July 5, 2024 08:24
