
Add optimization for pure count and count aggregation #5032

Open
fulmicoton opened this issue May 25, 2024 · 4 comments
Assignees: PSeitz
Labels: enhancement, performance

Comments

@fulmicoton
Contributor

As observed on airmail.

We are slower than OpenSearch on:

  • _count on an AllQuery
  • _count queries on term queries
  • aggregations on an AllQuery, like the following
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": { "field": "_host", "size": 50 }
    },
    "apps": {
      "terms": { "field": "_app", "size": 50 }
    },
    "ingesters": {
      "terms": { "field": "_ingester", "size": 50 }
    },
    "tags": {
      "terms": { "field": "_tag", "size": 50 }
    }
  }
}

All of the above can be addressed by looking only at the split metadata and/or the term dictionary.
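
For the terms-aggregation case, here is a minimal sketch of how the term dictionary alone could answer it, assuming a single-valued field indexed with the raw tokenizer and no deleted documents; it sums per-term doc frequencies using tantivy's TermDictionary / TermStreamer API (exact signatures should be checked against the tantivy version in use).

use std::collections::HashMap;

use tantivy::schema::Field;
use tantivy::Searcher;

fn terms_agg_from_term_dict(
    searcher: &Searcher,
    field: Field,
    size: usize,
) -> tantivy::Result<Vec<(String, u64)>> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for segment_reader in searcher.segment_readers() {
        let inverted_index = segment_reader.inverted_index(field)?;
        let mut stream = inverted_index.terms().stream()?;
        while stream.advance() {
            let term = String::from_utf8_lossy(stream.key()).into_owned();
            // doc_freq counts documents, not value occurrences, so this is
            // only exact for single-valued fields without deleted documents.
            *counts.entry(term).or_insert(0) += stream.value().doc_freq as u64;
        }
    }
    let mut buckets: Vec<(String, u64)> = counts.into_iter().collect();
    buckets.sort_by(|a, b| b.1.cmp(&a.1));
    buckets.truncate(size);
    Ok(buckets)
}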

@fulmicoton fulmicoton added the bug Something isn't working label May 25, 2024
@fulmicoton fulmicoton self-assigned this May 26, 2024
fulmicoton added a commit that referenced this issue May 26, 2024
fulmicoton added a commit that referenced this issue May 26, 2024
@fulmicoton
Contributor Author

@PSeitz can you take over this issue?

You can ditch or use #5034 if you want.
For the last part, 'term aggregations on an all query', it would be nice to detect that this can be addressed via the term dictionary alone.

Problem 1)
It will require creative software engineering between tantivy and quickwit, as the warm-up is totally different.

Problem 2)
Multiplicity of the same value will cause different results if we have multivalued columns (the same value can appear more than once per document).

Problem 3)
Technically, going through the column is faster if the term dictionary has a lot of distinct values and the size of the aggregation is limited. In practice, we can probably just apply a simple rule on the size of the term dictionary, as sketched below
(e.g. use the term dictionary if it is < 500KB or smaller than the column).
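
A minimal sketch of such a rule; the threshold and the byte-size inputs are placeholders, not existing quickwit or tantivy APIs.

fn should_use_term_dictionary(term_dict_num_bytes: u64, column_num_bytes: u64) -> bool {
    // Placeholder heuristic: prefer the term dictionary path only when the
    // dictionary is small in absolute terms or smaller than the column.
    const TERM_DICT_LIMIT_BYTES: u64 = 500 * 1024; // ~500KB
    term_dict_num_bytes < TERM_DICT_LIMIT_BYTES || term_dict_num_bytes < column_num_bytes
}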

========

Other optimization opportunity
It would be nice to have an optimization for aggregations over columns on all queries.

We could operate directly over the values, instead of filling a doc id buffer and then fetching the values for those doc ids.
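
A minimal sketch of such a value-direct pass for a sum-style aggregation, assuming a tantivy_columnar Column whose values_for_doc yields every value of a (possibly multi-valued) document; this is illustrative, not the aggregation framework's actual code path.

use tantivy_columnar::Column;

fn sum_all_docs(column: &Column<u64>, num_docs: u32) -> u64 {
    let mut sum = 0u64;
    // For an AllQuery every doc id matches, so we can walk the column
    // directly instead of materializing doc-id buffers first.
    for doc in 0..num_docs {
        for value in column.values_for_doc(doc) {
            sum += value;
        }
    }
    sum
}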

@fulmicoton fulmicoton assigned PSeitz and unassigned fulmicoton May 26, 2024
@PSeitz
Contributor

PSeitz commented May 27, 2024

Count All

_count with an AllQuery already works; it uses only the split metadata.

Implemented here:
#4410

Count All + Top N

_count with an AllQuery + top_n currently does a full search.

To support _count + top n, we could go for a hybrid approach (sketched below): get the counts from the split metadata and the top n from a selected split.
The question is which top n we should return for an AllQuery (currently it is sorted by the highest split id).

Implemented here:
#5075
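
A rough sketch of that hybrid, where SplitMetadata, num_docs and search_top_n_in_split are illustrative placeholders rather than quickwit's real types.

struct SplitMetadata {
    split_id: String,
    num_docs: u64,
}

// Placeholder for the per-split leaf search returning doc addresses.
fn search_top_n_in_split(_split_id: &str, _top_n: usize) -> Vec<String> {
    Vec::new()
}

fn count_all_with_top_n(splits: &[SplitMetadata], top_n: usize) -> (u64, Vec<String>) {
    // The total count comes from metadata only; no split is opened for it.
    let total_count: u64 = splits.iter().map(|s| s.num_docs).sum();
    // Serve the hits from a single selected split; picking the highest
    // split id mirrors the ordering mentioned above.
    let hits = splits
        .iter()
        .max_by(|a, b| a.split_id.cmp(&b.split_id))
        .map(|split| search_top_n_in_split(&split.split_id, top_n))
        .unwrap_or_default();
    (total_count, hits)
}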

Count All + Top N + Sort By Date

This optimization could also be extended to _count with an AllQuery + top_n, sorted by timestamp.

Identify splits which can't contain data from the top n using the time_range metadata:

/// If a timestamp field is available, the min / max timestamp in
/// the split, expressed in seconds.
pub time_range: Option<RangeInclusive<i64>>,

Implemented here:
#5075 (detect which splits will deliver the results)
#5048 (passing threshold between splits during search)
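
One conservative way the pruning could look; SplitInfo is an illustrative placeholder combining the time_range and doc count from the split metadata, and a descending sort by timestamp is assumed.

use std::ops::RangeInclusive;

struct SplitInfo {
    time_range: Option<RangeInclusive<i64>>,
    num_docs: u64,
}

/// Returns the indices of splits that cannot contribute to the top n and
/// can therefore be answered by a pure count.
fn prunable_splits_desc(splits: &[SplitInfo], top_n: u64) -> Vec<usize> {
    // Greedily take splits by max timestamp (newest first) until they are
    // guaranteed to hold at least top_n documents.
    let mut order: Vec<usize> = (0..splits.len()).collect();
    order.sort_by_key(|&i| {
        std::cmp::Reverse(
            splits[i].time_range.as_ref().map(|r| *r.end()).unwrap_or(i64::MAX),
        )
    });
    let mut docs_so_far = 0u64;
    let mut threshold: Option<i64> = None;
    for &i in &order {
        // Splits without a time range are never pruned and do not count
        // toward the threshold.
        let Some(range) = &splits[i].time_range else { continue };
        docs_so_far += splits[i].num_docs;
        threshold = Some(threshold.map_or(*range.start(), |t| t.min(*range.start())));
        if docs_so_far >= top_n {
            break;
        }
    }
    // Any split whose newest document is strictly older than the threshold
    // cannot appear in the top n.
    match threshold {
        Some(t) if docs_so_far >= top_n => (0..splits.len())
            .filter(|&i| {
                splits[i]
                    .time_range
                    .as_ref()
                    .map(|r| *r.end() < t)
                    .unwrap_or(false)
            })
            .collect(),
        _ => Vec::new(),
    }
}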

Count All + Top N + (Sort By Date) + Time Range Query

#5048 implements this, but we could also do some early detection of which splits will deliver enough results by extending #5075.

Implemented here:
#5048 (passing threshold between splits during search)

Count All + Top N + Sort By Other Field

We don't have metadata for other fields in the split metadata, so there are two options:

  • Similar to tags, we could flag these fields so that we pull out the min/max values and store them in the split metadata. They would then be used the same way as time_range. This could also be automatic for all defined numeric fields.
  • We already have min/max values in the fast field metadata; in combination with CanSplitBeBetter or similar, we could do some early pruning (sketched after this list).

Not implemented
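
A minimal sketch of the second option for a descending sort, with the per-split max value and the current threshold passed in as plain values; it does not mirror quickwit's actual CanSplitBeBetter logic.

/// `current_worst_hit` is the sort value of the n-th hit found so far.
fn can_split_be_better_desc(split_max_value: Option<u64>, current_worst_hit: Option<u64>) -> bool {
    match (split_max_value, current_worst_hit) {
        // No metadata for the split: we have to search it.
        (None, _) => true,
        // We don't have top_n hits yet: every split may contribute.
        (_, None) => true,
        // Sorting descending: the split is only useful if its max value can
        // beat the worst hit currently in the top n.
        (Some(split_max), Some(worst)) => split_max > worst,
    }
}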

@fulmicoton
Contributor Author

I could not tell if your text is describing the current state or not.
For count all, we already have an optimization in main... but one thing to consider is COUNT + TIME RANGE QUERY.

In that case, the optimization in main won't kick in.
The one in this PR will.

The idea is that if a split is entirely within the range query, we transform the range query into an all query.
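
A minimal sketch of that rewrite with illustrative types; the actual PR presumably operates on quickwit's query representation rather than on raw ranges.

use std::ops::RangeInclusive;

enum SimplifiedQuery {
    All,
    TimeRange(RangeInclusive<i64>),
}

fn rewrite_for_split(
    query_range: &RangeInclusive<i64>,
    split_time_range: Option<&RangeInclusive<i64>>,
) -> SimplifiedQuery {
    match split_time_range {
        // The split lies entirely within the queried range: the range filter
        // matches every document, so it can become an AllQuery and the
        // metadata-only count optimization can kick in.
        Some(split_range)
            if query_range.start() <= split_range.start()
                && split_range.end() <= query_range.end() =>
        {
            SimplifiedQuery::All
        }
        _ => SimplifiedQuery::TimeRange(query_range.clone()),
    }
}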

PSeitz added a commit that referenced this issue May 30, 2024
* optimization requests by passing threshold in leaf search
* Execute query.count() instead of QuickwitCollector for count searches

We have 100 concurrent split searches by default, but num_cpus worker
threads. This means most search futures will wait to be
scheduled. When they are scheduled they can check the new threshold from
the preceding searches and maybe skip the search.

Switches to RWLock for the threshold since we read more often now.

Future Work:
We run num_cpu full searches in some cases before the threshold kicks
in. But in some cases we could statically
analyze from which split the best results come and generate count only
requests for the others.

Addresses #5032
PSeitz added a commit that referenced this issue May 30, 2024
PSeitz added a commit that referenced this issue May 31, 2024
PSeitz added a commit that referenced this issue May 31, 2024
* count optimization for multisplits

* add comments
@PSeitz PSeitz added enhancement New feature or request performance and removed bug Something isn't working labels Jun 5, 2024
@PSeitz
Contributor

PSeitz commented Jun 5, 2024

I updated the comment above to make the currently implemented state clearer.

On using the term dictionary for aggregations: this would only work if the same tokenizer is used (which isn't the case by default). We also don't currently store counts in the fast field dictionary. And it's quite a special code path, e.g. it would not support any nested aggregations.

Given how special it is and how easily cacheable the query is, I'm not sure the payoff is high enough to add a new code path in the aggregations for this.

> We could operate directly over the values, instead of filling a doc id buffer and then fetching the values for those doc ids.

I think we could detect whether an incoming block of doc ids is contiguous in collect_block and do some optimized fetching of the values. In that case we would only need the start and end index.
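
A minimal sketch of that contiguity check, assuming the doc ids handed to collect_block are sorted in ascending order; the range-based value fetch itself is left out.

type DocId = u32;

/// Returns the dense doc-id range if the block is contiguous.
fn as_contiguous_range(docs: &[DocId]) -> Option<std::ops::Range<DocId>> {
    let (first, last) = (*docs.first()?, *docs.last()?);
    // With sorted, unique doc ids, a dense block is exactly one whose span
    // equals the number of docs in it.
    if (last - first) as usize + 1 == docs.len() {
        Some(first..last + 1)
    } else {
        None
    }
}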
