Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clients_daily_scalar_aggregates_v1 contains a lot of redundant values #2712

Open
edugfilho opened this issue Feb 20, 2024 · 0 comments
Open

Comments

@edugfilho
Copy link
Collaborator

From mozilla/bigquery-etl#919

mozilla/bigquery-etl#919 (comment)

clients_daily_scalar_aggregates contains metric name and types for every key, process, agg_type combination in each row:

Screen Shot 2020-04-21 at 11 12 01 AM

This can be improved by using a nested struct for each metric so that the name, type, and key are only stored once per row:

Screen Shot 2020-04-21 at 11 20 17 AM

This would break the existing schema but that should be easy to deal with by modifying the existing table in place and updating the downstream query.

A further improvement is to not store metrics with only null values but I'm not sure if there's some downstream need for retaining those values.

From testing on a 1% sample, the above schema change would reduce the table size to ~29%, not storing nulls would reduce to ~31%, and doing both would reduce to ~12.3%. This reduce storage costs significantly. If we change to a 7 day retention period for clients_daily_scalar, this would save around $150-$180 a week on storage and some more on downstream etl data scanned (maybe another couple hundred). This isn't huge but worth considering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant