Aggregations: allow `terms` bucket aggregations #130

max-moser · 2023-02-23T16:18:43Z

This PR adds support for bucket aggregations as cousins to the already supported metric aggregations.
Its main purpose is to enable simple document counts based on keywords ("how often does each value for a key occur in all events"), e.g. for the boolean via_api flag for record-view events:

reference payload:

{
  "timestamp": "2023-02-21T00:00:00",
  "unique_id": "recid_mnccv-j8r20",
  "count": 3,
  "unique_count": 1,
  "via_api": {
    "true": 2,
    "false": 1
  },
  "countries": {},
  "recid": "mnccv-j8r20",
  "parent_recid": "b9jbe-qq607"
}

Limitations:

The bucket aggregations here just add "flat" (keyword count) information to the resulting aggregation, but can't be used to do nested bucketing. That would probably increase complexity a lot and isn't required right now.
For now, only terms-type bucket aggregations are supported.

Discussion:

v1:

{
  "timestamp": "2023-02-21T00:00:00",
  "unique_id": "recid_mnccv-j8r20",
  "count": 3,
  "unique_count": 1,
  "via_api": {
    "true": 2,
    "false": 1
  },
  "countries": {
    "portugal": 3,
    "china": 0
  },
  "recid": "mnccv-j8r20",
  "parent_recid": "b9jbe-qq607"
}

v2:

{
  // ...
  "countries": [
    {"key": "portugal", "count": 3},
    {"key": "china", "count": 0}
  ]
}

Use cases:

requests on zenodo support: from which countries was the record viewed/downloaded

Charts:
#120

max-moser · 2023-02-23T18:00:07Z

⚠️ As per discussion with @slint (discord chat here), this feature needs a bit more discussion regarding which use cases we expect in the future & how we want to design the data structure s.t. we're not blocking ourselves in.

* because we don't need the actual query hits, we just need the aggregations * also, remove the unused temporary variable assignment

* because we're reusing parts of the aggregation query results directly, and per default they're wrapped in DSL objects (e.g. AttrList) that aren't JSON-serializable * see: elastic/elasticsearch-dsl-py#913

max-moser · 2023-03-30T11:25:38Z

As we've discussed, we couldn't come up with immediate use-cases that would actually depend on this feature being merged in.
We only found a few nice-to-have future use cases (show from which countries the records/files were downloaded), but they would need to be fleshed out some more.

The tricky part here is that aggregations are only an "intermediate" result, because the queries still need some way of aggregating the aggregations (events -> aggregations -> queries).
As such, the aggregations need to be aggregatable themselves.
🥴

max-moser requested a review from slint February 23, 2023 16:18

max-moser linked an issue Feb 23, 2023 that may be closed by this pull request

Aggregations: Allow calculation of bucket fields #127

Open

max-moser force-pushed the mm/bucket-aggregations branch from 4caeee4 to 8cf84b5 Compare February 23, 2023 16:24

max-moser mentioned this pull request Feb 23, 2023

2.0/3.0 release changes #122

Open

max-moser added 4 commits February 28, 2023 15:55

aggregations: set result size to 0

a5e9daf

* because we don't need the actual query hits, we just need the aggregations * also, remove the unused temporary variable assignment

aggregations: turn agg query results into dictionaries

4f33820

* because we're reusing parts of the aggregation query results directly, and per default they're wrapped in DSL objects (e.g. AttrList) that aren't JSON-serializable * see: elastic/elasticsearch-dsl-py#913

aggregations: add support for "terms" bucket aggregation

6d39f42

global: remove unused imports

da9f4ef

max-moser force-pushed the mm/bucket-aggregations branch from 8cf84b5 to da9f4ef Compare February 28, 2023 14:55

max-moser closed this Mar 30, 2023

max-moser mentioned this pull request Mar 30, 2023

Minor updates #136

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aggregations: allow `terms` bucket aggregations #130

Aggregations: allow `terms` bucket aggregations #130

max-moser commented Feb 23, 2023 •

edited by kpsherva

Loading

max-moser commented Feb 23, 2023 •

edited

Loading

max-moser commented Mar 30, 2023

Aggregations: allow terms bucket aggregations #130

Aggregations: allow terms bucket aggregations #130

Conversation

max-moser commented Feb 23, 2023 • edited by kpsherva Loading

Discussion:

max-moser commented Feb 23, 2023 • edited Loading

max-moser commented Mar 30, 2023

Aggregations: allow `terms` bucket aggregations #130

Aggregations: allow `terms` bucket aggregations #130

max-moser commented Feb 23, 2023 •

edited by kpsherva

Loading

max-moser commented Feb 23, 2023 •

edited

Loading