Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace materialized views with incremental roll ups #45

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ramyaragupathy
Copy link
Member

From AmericanRedCross/osm-stats#56, started with replacing the materialized views approach. Here's what has been done so far:

  • Use tables instead of views in sql scripts. Initially defined columns for table without populating data, but later just used the original select command to populate the table with data during the first housekeeping run
  • Replaced refresh queries with upsert queries in js file. There was a conflict in running the if condition for josm editor count. For now using the filled up with josm with dummy values.

On running housekeeping script with this modification I could see new updates in hashtag_stats
Next Actions:

  • Talk through the approach w/ @mojodna
  • Identify and fix query related issues in src/houseekeeping.js

cc @dakotabenjamin @smit1678

@@ -1,5 +1,4 @@
-- view with a schema that matches the legacy changesets table
CREATE VIEW changesets AS
CREATE TABLE changesets AS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? As-is, this will duplicate data in raw_changesets.

The historical reason for raw_changesets (table) vs changesets (view) was that I needed new table names for replacement data when I was updating v1. The API (etc.) assume the existence of relations (tables or views) without the raw_ prefix, so creating views was the easiest way to preserve compatibility.

raw_* should probably go away in favor of tables without the prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning on updating the parts of this project that write raw changeset stats to write to these tables (retiring raw_changesets)? If so, the API project may also need to be updated to refer to the correct tables.

JOIN raw_changesets c ON c.id = ch.changeset_id
JOIN raw_hashtags h ON h.id = ch.hashtag_id
GROUP BY hashtag
ON CONFLICT do nothing;`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you're going with this..nice.

The key to the incremental updates is the ON CONFLICT clause as well as tracking which augmented diffs / changesets have already had their stats contributed.

@@ -1,5 +1,4 @@
-- view with a schema that matches the legacy changesets table
CREATE VIEW changesets AS
CREATE TABLE changesets AS
SELECT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than SELECTing (and subsequently truncating, if all you wanted was the structure), CREATE TABLE changesets LIKE raw_changesets INCLUDING ALL will replicate the structure and indices without any of the data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Though in this case, some of the column names changed, so yeah... ;-)

@ramyaragupathy ramyaragupathy changed the title [WIP] Replace materialized views with incremental roll ups Replace materialized views with incremental roll ups Aug 6, 2019
@@ -1,5 +1,4 @@
-- view with a schema that matches the legacy changesets table
CREATE VIEW changesets AS
CREATE TABLE changesets AS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning on updating the parts of this project that write raw changeset stats to write to these tables (retiring raw_changesets)? If so, the API project may also need to be updated to refer to the correct tables.

JOIN raw_changesets c ON c.id = ch.changeset_id
JOIN raw_hashtags h ON h.id = ch.hashtag_id
GROUP BY hashtag
ON CONFLICT do update;`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks incomplete.

The other aggregated tables also need to be updated incrementally.

More importantly, this doesn't actually update incrementally; it's effectively the same as what REFRESH MATERIALIZED VIEW does, rewriting each of the rows in the aggregated table (it'll take just as long on a fully-populated table). Instead, it should detect which changesets rows changed since the last run and INSERT new aggregated values / UPDATE existing aggregated values by adding to them / updating counts / updating max values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants