-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace materialized views with incremental roll ups #45
base: master
Are you sure you want to change the base?
Replace materialized views with incremental roll ups #45
Conversation
@@ -1,5 +1,4 @@ | |||
-- view with a schema that matches the legacy changesets table | |||
CREATE VIEW changesets AS | |||
CREATE TABLE changesets AS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? As-is, this will duplicate data in raw_changesets
.
The historical reason for raw_changesets
(table) vs changesets
(view) was that I needed new table names for replacement data when I was updating v1. The API (etc.) assume the existence of relations (tables or views) without the raw_
prefix, so creating views was the easiest way to preserve compatibility.
raw_*
should probably go away in favor of tables without the prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning on updating the parts of this project that write raw changeset stats to write to these tables (retiring raw_changesets
)? If so, the API project may also need to be updated to refer to the correct tables.
src/housekeeping.js
Outdated
JOIN raw_changesets c ON c.id = ch.changeset_id | ||
JOIN raw_hashtags h ON h.id = ch.hashtag_id | ||
GROUP BY hashtag | ||
ON CONFLICT do nothing;` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see where you're going with this..nice.
The key to the incremental updates is the ON CONFLICT
clause as well as tracking which augmented diffs / changesets have already had their stats contributed.
@@ -1,5 +1,4 @@ | |||
-- view with a schema that matches the legacy changesets table | |||
CREATE VIEW changesets AS | |||
CREATE TABLE changesets AS | |||
SELECT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than SELECT
ing (and subsequently truncating, if all you wanted was the structure), CREATE TABLE changesets LIKE raw_changesets INCLUDING ALL
will replicate the structure and indices without any of the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Though in this case, some of the column names changed, so yeah... ;-)
@@ -1,5 +1,4 @@ | |||
-- view with a schema that matches the legacy changesets table | |||
CREATE VIEW changesets AS | |||
CREATE TABLE changesets AS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning on updating the parts of this project that write raw changeset stats to write to these tables (retiring raw_changesets
)? If so, the API project may also need to be updated to refer to the correct tables.
JOIN raw_changesets c ON c.id = ch.changeset_id | ||
JOIN raw_hashtags h ON h.id = ch.hashtag_id | ||
GROUP BY hashtag | ||
ON CONFLICT do update;` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks incomplete.
The other aggregated tables also need to be updated incrementally.
More importantly, this doesn't actually update incrementally; it's effectively the same as what REFRESH MATERIALIZED VIEW
does, rewriting each of the rows in the aggregated table (it'll take just as long on a fully-populated table). Instead, it should detect which changesets
rows changed since the last run and INSERT
new aggregated values / UPDATE
existing aggregated values by adding to them / updating counts / updating max values.
From AmericanRedCross/osm-stats#56, started with replacing the materialized views approach. Here's what has been done so far:
On running housekeeping script with this modification I could see new updates in hashtag_stats
Next Actions:
cc @dakotabenjamin @smit1678