Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregations: allow terms bucket aggregations #130

Closed

Conversation

max-moser
Copy link
Contributor

@max-moser max-moser commented Feb 23, 2023

This PR adds support for bucket aggregations as cousins to the already supported metric aggregations.
Its main purpose is to enable simple document counts based on keywords ("how often does each value for a key occur in all events"), e.g. for the boolean via_api flag for record-view events:

reference payload:

{
  "timestamp": "2023-02-21T00:00:00",
  "unique_id": "recid_mnccv-j8r20",
  "count": 3,
  "unique_count": 1,
  "via_api": {
    "true": 2,
    "false": 1
  },
  "countries": {},
  "recid": "mnccv-j8r20",
  "parent_recid": "b9jbe-qq607"
}

Limitations:

  1. The bucket aggregations here just add "flat" (keyword count) information to the resulting aggregation, but can't be used to do nested bucketing. That would probably increase complexity a lot and isn't required right now.
  2. For now, only terms-type bucket aggregations are supported.

Discussion:

v1:

{
  "timestamp": "2023-02-21T00:00:00",
  "unique_id": "recid_mnccv-j8r20",
  "count": 3,
  "unique_count": 1,
  "via_api": {
    "true": 2,
    "false": 1
  },
  "countries": {
    "portugal": 3,
    "china": 0
  },
  "recid": "mnccv-j8r20",
  "parent_recid": "b9jbe-qq607"
}

v2:

{
  // ...
  "countries": [
    {"key": "portugal", "count": 3},
    {"key": "china", "count": 0}
  ]
}

Use cases:

  1. requests on zenodo support: from which countries was the record viewed/downloaded

Charts:
#120

@max-moser max-moser requested a review from slint February 23, 2023 16:18
@max-moser max-moser linked an issue Feb 23, 2023 that may be closed by this pull request
@max-moser max-moser force-pushed the mm/bucket-aggregations branch from 4caeee4 to 8cf84b5 Compare February 23, 2023 16:24
@max-moser
Copy link
Contributor Author

max-moser commented Feb 23, 2023

⚠️ As per discussion with @slint (discord chat here), this feature needs a bit more discussion regarding which use cases we expect in the future & how we want to design the data structure s.t. we're not blocking ourselves in.

* because we don't need the actual query hits, we just need the
  aggregations
* also, remove the unused temporary variable assignment
* because we're reusing parts of the aggregation query results directly,
  and per default they're wrapped in DSL objects (e.g. AttrList) that
  aren't JSON-serializable
* see: elastic/elasticsearch-dsl-py#913
@max-moser max-moser force-pushed the mm/bucket-aggregations branch from 8cf84b5 to da9f4ef Compare February 28, 2023 14:55
@max-moser
Copy link
Contributor Author

As we've discussed, we couldn't come up with immediate use-cases that would actually depend on this feature being merged in.
We only found a few nice-to-have future use cases (show from which countries the records/files were downloaded), but they would need to be fleshed out some more.

The tricky part here is that aggregations are only an "intermediate" result, because the queries still need some way of aggregating the aggregations (events -> aggregations -> queries).
As such, the aggregations need to be aggregatable themselves.
🥴

@max-moser max-moser closed this Mar 30, 2023
@max-moser max-moser mentioned this pull request Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggregations: Allow calculation of bucket fields
1 participant