
Add optimization for pure count and count aggregation #5032

Open
fulmicoton opened this issue May 25, 2024 · 4 comments
Assignees: PSeitz
Labels: enhancement, performance

Comments

@fulmicoton
Contributor

As observed on airmail.

We are slower than OpenSearch on:

  • _count on an AllQuery
  • _count queries on term queries
  • aggregations on an AllQuery, like the following
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": { "field": "_host", "size": 50 }
    },
    "apps": {
      "terms": { "field": "_app", "size": 50 }
    },
    "ingesters": {
      "terms": { "field": "_ingester", "size": 50 }
    },
    "tags": {
      "terms": { "field": "_tag", "size": 50 }
    }
  }
}

All of the above can be addressed by looking only at the split metadata and/or the term dictionary.
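
For the terms-aggregation case, here is a minimal sketch of how the term dictionary alone could answer it, assuming a single-valued field indexed with the raw tokenizer and no deleted documents; it sums per-term doc frequencies using tantivy's TermDictionary / TermStreamer API (exact signatures should be checked against the tantivy version in use).

use std::collections::HashMap;

use tantivy::schema::Field;
use tantivy::Searcher;

fn terms_agg_from_term_dict(
    searcher: &Searcher,
    field: Field,
    size: usize,
) -> tantivy::Result<Vec<(String, u64)>> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for segment_reader in searcher.segment_readers() {
        let inverted_index = segment_reader.inverted_index(field)?;
        let mut stream = inverted_index.terms().stream()?;
        while stream.advance() {
            let term = String::from_utf8_lossy(stream.key()).into_owned();
            // doc_freq counts documents, not value occurrences, so this is
            // only exact for single-valued fields without deleted documents.
            *counts.entry(term).or_insert(0) += stream.value().doc_freq as u64;
        }
    }
    let mut buckets: Vec<(String, u64)> = counts.into_iter().collect();
    buckets.sort_by(|a, b| b.1.cmp(&a.1));
    buckets.truncate(size);
    Ok(buckets)
}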

@fulmicoton fulmicoton added the bug Something isn't working label May 25, 2024
@fulmicoton fulmicoton self-assigned this May 26, 2024
fulmicoton added a commit that referenced this issue May 26, 2024
fulmicoton added a commit that referenced this issue May 26, 2024
@fulmicoton
Contributor Author

@PSeitz can you take over this issue?

You can ditch or use #5034 if you want.
For the last part, 'term aggregations on an all query', it would be nice to detect that this can be addressed via the term dictionary alone.

Problem 1)
It will require creative software engineering between tantivy and quickwit, as the warm-up is totally different.

Problem 2)
Multiplicity of the same value will cause different results if we have multivalued columns (the same value can appear more than once per document).

Problem 3)
Technically, going through the column is faster if the term dictionary has a lot of distinct values and the size of the aggregation is limited. In practice, we can probably just apply a simple rule on the size of the term dictionary, as sketched below
(e.g. use the term dictionary if it is < 500KB or smaller than the column).
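
A minimal sketch of such a rule; the threshold and the byte-size inputs are placeholders, not existing quickwit or tantivy APIs.

fn should_use_term_dictionary(term_dict_num_bytes: u64, column_num_bytes: u64) -> bool {
    // Placeholder heuristic: prefer the term dictionary path only when the
    // dictionary is small in absolute terms or smaller than the column.
    const TERM_DICT_LIMIT_BYTES: u64 = 500 * 1024; // ~500KB
    term_dict_num_bytes < TERM_DICT_LIMIT_BYTES || term_dict_num_bytes < column_num_bytes
}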

========

Other optimization opportunity
It would be nice to have an optimization for aggregations over columns on all queries.

We could operate directly over the values, instead of filling a doc id buffer and then fetching the values for those doc ids.
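
A minimal sketch of such a value-direct pass for a sum-style aggregation, assuming a tantivy_columnar Column whose values_for_doc yields every value of a (possibly multi-valued) document; this is illustrative, not the aggregation framework's actual code path.

use tantivy_columnar::Column;

fn sum_all_docs(column: &Column<u64>, num_docs: u32) -> u64 {
    let mut sum = 0u64;
    // For an AllQuery every doc id matches, so we can walk the column
    // directly instead of materializing doc-id buffers first.
    for doc in 0..num_docs {
        for value in column.values_for_doc(doc) {
            sum += value;
        }
    }
    sum
}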

@fulmicoton fulmicoton assigned PSeitz and unassigned fulmicoton May 26, 2024
@PSeitz
Contributor

PSeitz commented May 27, 2024

Count All

_count with an AllQuery already works; it uses only the split metadata.

Implemented here:
#4410

Count All + Top N

_count with an AllQuery + top_n currently does a full search.

To support _count + top n, we could go for a hybrid approach (sketched below): get the counts from the split metadata and the top n from a selected split.
The question is which top n we should return for an AllQuery (currently it is sorted by the highest split id).

Implemented here:
#5075
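
A rough sketch of that hybrid, where SplitMetadata, num_docs and search_top_n_in_split are illustrative placeholders rather than quickwit's real types.

struct SplitMetadata {
    split_id: String,
    num_docs: u64,
}

// Placeholder for the per-split leaf search returning doc addresses.
fn search_top_n_in_split(_split_id: &str, _top_n: usize) -> Vec<String> {
    Vec::new()
}

fn count_all_with_top_n(splits: &[SplitMetadata], top_n: usize) -> (u64, Vec<String>) {
    // The total count comes from metadata only; no split is opened for it.
    let total_count: u64 = splits.iter().map(|s| s.num_docs).sum();
    // Serve the hits from a single selected split; picking the highest
    // split id mirrors the ordering mentioned above.
    let hits = splits
        .iter()
        .max_by(|a, b| a.split_id.cmp(&b.split_id))
        .map(|split| search_top_n_in_split(&split.split_id, top_n))
        .unwrap_or_default();
    (total_count, hits)
}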

Count All + Top N + Sort By Date

This optimization could also be extended to _count with an AllQuery + top_n, sorted by timestamp.

Identify splits which can't contain data from the top n using the time_range metadata:

/// If a timestamp field is available, the min / max timestamp in
/// the split, expressed in seconds.
pub time_range: Option<RangeInclusive<i64>>,

Implemented here:
#5075 (detect which splits will deliver the results)
#5048 (passing threshold between splits during search)
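
One conservative way the pruning could look; SplitInfo is an illustrative placeholder combining the time_range and doc count from the split metadata, and a descending sort by timestamp is assumed.

use std::ops::RangeInclusive;

struct SplitInfo {
    time_range: Option<RangeInclusive<i64>>,
    num_docs: u64,
}

/// Returns the indices of splits that cannot contribute to the top n and
/// can therefore be answered by a pure count.
fn prunable_splits_desc(splits: &[SplitInfo], top_n: u64) -> Vec<usize> {
    // Greedily take splits by max timestamp (newest first) until they are
    // guaranteed to hold at least top_n documents.
    let mut order: Vec<usize> = (0..splits.len()).collect();
    order.sort_by_key(|&i| {
        std::cmp::Reverse(
            splits[i].time_range.as_ref().map(|r| *r.end()).unwrap_or(i64::MAX),
        )
    });
    let mut docs_so_far = 0u64;
    let mut threshold: Option<i64> = None;
    for &i in &order {
        // Splits without a time range are never pruned and do not count
        // toward the threshold.
        let Some(range) = &splits[i].time_range else { continue };
        docs_so_far += splits[i].num_docs;
        threshold = Some(threshold.map_or(*range.start(), |t| t.min(*range.start())));
        if docs_so_far >= top_n {
            break;
        }
    }
    // Any split whose newest document is strictly older than the threshold
    // cannot appear in the top n.
    match threshold {
        Some(t) if docs_so_far >= top_n => (0..splits.len())
            .filter(|&i| {
                splits[i]
                    .time_range
                    .as_ref()
                    .map(|r| *r.end() < t)
                    .unwrap_or(false)
            })
            .collect(),
        _ => Vec::new(),
    }
}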

Count All + Top N + (Sort By Date) + Time Range Query

#5048 implements this, but we could also do some early detection of which splits will deliver enough results by extending #5075.

Implemented here:
#5048 (passing threshold between splits during search)

Count All + Top N + Sort By Other Field

We don't have metadata for other fields in the split metadata, so there are two options:

  • Similar to tags, we could flag these fields so that we pull out the min/max values and store them in the split metadata. They would then be used the same way as time_range. This could also be automatic for all defined numeric fields.
  • We already have min/max values in the fast field metadata; in combination with CanSplitBeBetter or similar, we could do some early pruning (sketched after this list).

Not implemented
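
A minimal sketch of the second option for a descending sort, with the per-split max value and the current threshold passed in as plain values; it does not mirror quickwit's actual CanSplitBeBetter logic.

/// `current_worst_hit` is the sort value of the n-th hit found so far.
fn can_split_be_better_desc(split_max_value: Option<u64>, current_worst_hit: Option<u64>) -> bool {
    match (split_max_value, current_worst_hit) {
        // No metadata for the split: we have to search it.
        (None, _) => true,
        // We don't have top_n hits yet: every split may contribute.
        (_, None) => true,
        // Sorting descending: the split is only useful if its max value can
        // beat the worst hit currently in the top n.
        (Some(split_max), Some(worst)) => split_max > worst,
    }
}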

@fulmicoton
Contributor Author

I could not tell if your text is describing the current state or not.
For count all, we already have an optimization in main... but one thing to consider is COUNT + TIME RANGE QUERY.

In that case, the optimization in main won't kick in.
The one in this PR will.

The idea is that if a split is entirely within the range query, we transform the range query into an all query.
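
A minimal sketch of that rewrite with illustrative types; the actual PR presumably operates on quickwit's query representation rather than on raw ranges.

use std::ops::RangeInclusive;

enum SimplifiedQuery {
    All,
    TimeRange(RangeInclusive<i64>),
}

fn rewrite_for_split(
    query_range: &RangeInclusive<i64>,
    split_time_range: Option<&RangeInclusive<i64>>,
) -> SimplifiedQuery {
    match split_time_range {
        // The split lies entirely within the queried range: the range filter
        // matches every document, so it can become an AllQuery and the
        // metadata-only count optimization can kick in.
        Some(split_range)
            if query_range.start() <= split_range.start()
                && split_range.end() <= query_range.end() =>
        {
            SimplifiedQuery::All
        }
        _ => SimplifiedQuery::TimeRange(query_range.clone()),
    }
}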

PSeitz added a commit that referenced this issue May 30, 2024
* optimization requests by passing threshold in leaf search
* Execute query.count() instead of QuickwitCollector for count searches

We have 100 concurrent split searches by default, but num_cpus worker
threads. This means most search futures will wait to be
scheduled. When they are scheduled they can check the new threshold from
the preceding searches and maybe skip the search.

Switches to RWLock for the threshold since we read more often now.

Future Work:
We run num_cpu full searches in some cases before the threshold kicks
in. But in some cases we could statically
analyze from which split the best results come and generate count only
requests for the others.

Addresses #5032
PSeitz added a commit that referenced this issue May 30, 2024
PSeitz added a commit that referenced this issue May 31, 2024
PSeitz added a commit that referenced this issue May 31, 2024
* count optimization for multisplits

* add comments
@PSeitz PSeitz added enhancement New feature or request performance and removed bug Something isn't working labels Jun 5, 2024
@PSeitz
Contributor

PSeitz commented Jun 5, 2024

I updated the comment above to make the currently implemented state clearer.

On using the term dictionary for aggregations: this would only work if the same tokenizer is used (which isn't the case by default). We also don't currently store counts in the fast field dictionary. And it's quite a special code path, e.g. it would not support any nested aggregations.

Given how special it is and how easily cacheable the query is, I'm not sure the payoff is high enough to add a new code path in the aggregations for this.

> We could operate directly over the values, instead of filling a doc id buffer and then fetching the values for those doc ids.

I think we could detect whether an incoming block of doc ids is contiguous in collect_block and do some optimized fetching of the values. In that case we would only need the start and end index.
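
A minimal sketch of that contiguity check, assuming the doc ids handed to collect_block are sorted in ascending order; the range-based value fetch itself is left out.

type DocId = u32;

/// Returns the dense doc-id range if the block is contiguous.
fn as_contiguous_range(docs: &[DocId]) -> Option<std::ops::Range<DocId>> {
    let (first, last) = (*docs.first()?, *docs.last()?);
    // With sorted, unique doc ids, a dense block is exactly one whose span
    // equals the number of docs in it.
    if (last - first) as usize + 1 == docs.len() {
        Some(first..last + 1)
    } else {
        None
    }
}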
