TRT-1338: WIP: Potential replacement for the remaining 35 minute matview: Test analysis by job #2074

Open

dgoodwin wants to merge 15 commits into master
Conversation

dgoodwin (Contributor) commented Oct 31, 2024

In an attempt to ultimately allow the test details page to show more history than 2 weeks, I went after the last slow matview. The theory was that we were wasting time recalculating past days that no longer change. I wanted to replace it with a daily summary, calculated by BigQuery and stored in a permanent PostgreSQL table.

That is implemented here; the problem is that the insert for a single day takes 25 minutes (about 1.5 million rows per day), versus 35 minutes for the matview covering the prior 14 days. Inserting is very slow.

On a day-by-day basis we could probably live with that; there would just be a many-hour initial load (after which we could immediately go back as far as we want). We should then be able to chart much further back.

However, it is slow even one day at a time.

Alternatively, with this query implemented, we could just live-query BigQuery for that API and begin mixing BigQuery into Sippy classic. We'd just need to filter down to the jobs Sippy knows about. (A variant could actually be helpful there.)
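
For reference, a sketch of the kind of daily rollup described above, in BigQuery SQL. The source table and its columns are assumptions (the PR's actual query isn't shown); only the output columns come from the postgres schema later in this thread:

-- Hypothetical daily summary; `junit.test_results` and its columns are
-- placeholders, not the real dataset.
SELECT
  DATE(modified_time) AS date,
  test_name,
  job_name,
  COUNT(*) AS runs,
  COUNTIF(status = 'pass') AS passes,
  COUNTIF(status = 'flake') AS flakes,
  COUNTIF(status = 'fail') AS failures
FROM `junit.test_results`
WHERE DATE(modified_time) = '2024-10-29'
GROUP BY 1, 2, 3;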

dgoodwin changed the title from "append test analysis by job" to "WIP: Potential replacement for the remaining 35 minute matview: Test analysis by job" on Oct 31, 2024
openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Oct 31, 2024
openshift-ci bot (Contributor) commented Oct 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 31, 2024
stbenjam (Member) commented:

Did you try any bigger batch sizes? These rows are small; I bet you could insert 50K at once.

dgoodwin (Contributor, Author) commented Nov 1, 2024

> Did you try any bigger batch sizes? These rows are small; I bet you could insert 50K at once.

I tried 10k and quickly ran into that parameter size problem we always used to see on postgres. I'll test 5k or so.
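
For context (an aside, not from the PR): PostgreSQL's extended query protocol caps a single statement at 65,535 bind parameters. Assuming the nine-column schema shown later in this thread, a multi-row insert can carry at most floor(65535 / 9) = 7,281 rows, which is why 10k-row batches blow up while ~5k fits:

-- Hypothetical multi-row insert shape; nine bind parameters per row,
-- so the batch size must stay at or below floor(65535 / 9) = 7281 rows.
INSERT INTO test_analysis_by_job_by_dates
    (date, test_id, release, job_name, test_name, runs, passes, flakes, failures)
VALUES
    ($1,  $2,  $3,  $4,  $5,  $6,  $7,  $8,  $9),
    ($10, $11, $12, $13, $14, $15, $16, $17, $18);
    -- ...and so on, up to the batch size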

stbenjam (Member) commented Nov 1, 2024

Oh OK, probably not a huge deal; we can just let it run over a weekend for the initial seeding.

dgoodwin (Contributor, Author) commented Nov 1, 2024

Down to 17m per day by wrapping the whole creation in one transaction instead of many smaller ones. This would only run for one fetchdata per day, once we cross the 8am UTC threshold.
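
A minimal sketch of the shape of that change, assuming the table name and schema from later in the thread (the sample values are made up); committing once per day means a single WAL flush instead of one per batch:

BEGIN;

-- One multi-row INSERT per ~5k-row batch, all inside a single transaction.
INSERT INTO test_analysis_by_job_by_dates
    (date, test_id, release, job_name, test_name, runs, passes, flakes, failures)
VALUES
    ('2024-10-29', 42, '4.18', 'periodic-ci-example-job', 'example test name', 10, 9, 1, 0);
-- ...remaining batches for the day...

COMMIT;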

dgoodwin (Contributor, Author) commented Nov 1, 2024

How do we feel about the general approach? I've got it loading up a week's worth of data to prod now; if we like it, I can try to seed 2 months manually, letting it run for approx. 18 hours.

dgoodwin (Contributor, Author) commented Nov 1, 2024

I've loaded up a week of data in the prod db in 2 hours. I then set up testview to do the variant query against this new table, and it still returns immediately. I suspect this approach will let us chart much longer ranges.

dgoodwin changed the title from "WIP: Potential replacement for the remaining 35 minute matview: Test analysis by job" to "TRT-1338: WIP: Potential replacement for the remaining 35 minute matview: Test analysis by job" on Nov 14, 2024
openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Nov 14, 2024
openshift-ci-robot commented Nov 14, 2024

@dgoodwin: This pull request references TRT-1338 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

(the PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot removed the do-not-merge/work-in-progress label on Nov 14, 2024
dgoodwin (Contributor, Author) commented:

Added a couple of new commits that import one day at a time transactionally and autocreate a partition for each day. Next I will test how queries perform with this in place; hopefully we can do our standard 2-week query and allow the user to select longer ranges and wait if desired, as we discussed.

Creation of the parent partitioned table is still manual; I'll have to figure that out before this can go in:

CREATE TABLE test_analysis_by_job_by_dates (
    date timestamp with time zone,
    test_id bigint,
    release text,
    job_name text,
    runs bigint,
    passes bigint,
    flakes bigint,
    failures bigint,
    test_name text
) PARTITION BY RANGE (date);

CREATE UNIQUE INDEX test_release_date
ON test_analysis_by_job_by_dates (date, test_id, release, job_name);
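
The per-day partition DDL visible in the logs below could be generated on the fly. Here is a pure-SQL sketch of the equivalent statement using a DO block (the PR presumably builds this DDL in Go; this block is only an illustration):

DO $$
DECLARE
    day date := '2024-10-29';  -- hypothetical target day
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS test_analysis_by_job_by_dates_%s
             PARTITION OF test_analysis_by_job_by_dates
             FOR VALUES FROM (%L) TO (%L)',
        to_char(day, 'YYYY_MM_DD'), day, day + 1);
END $$;
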
INFO[2024-11-13T13:54:28.094-04:00] loading variants from bigquery...
INFO[2024-11-13T13:54:35.234-04:00] variants loaded from bigquery in 7.140694503s  jobs=15069
INFO[2024-11-13T13:54:46.678-04:00] job cache created with 7714 entries from database
INFO[2024-11-13T13:54:46.678-04:00] starting 1 loaders...
INFO[2024-11-13T13:54:46.678-04:00] starting loader "prow" with metrics wrapper
INFO[2024-11-13T13:54:46.753-04:00] importing test analysis by job for dates: [2024-10-29 2024-10-30 2024-10-31 2024-11-01 2024-11-02 2024-11-03 2024-11-04 2024-11-05 2024-11-06 2024-11-07 2024-11-08 2024-11-09 2024-11-10 2024-11-11 2024-11-12]
INFO[2024-11-13T13:54:47.398-04:00] job cache created with 7714 entries from database
INFO[2024-11-13T13:54:52.402-04:00] test cache created with 125532 entries from database
INFO[2024-11-13T13:54:52.402-04:00] Loading test analysis by job daily summaries  date=2024-10-29
INFO[2024-11-13T13:54:52.402-04:00] CREATE TABLE IF NOT EXISTS test_analysis_by_job_by_dates_2024_10_29 PARTITION OF test_analysis_by_job_by_dates FOR VALUES FROM ('2024-10-29') TO ('2024-10-30');  date=2024-10-29
INFO[2024-11-13T13:54:52.433-04:00] partition created  date=2024-10-29
INFO[2024-11-13T13:55:15.442-04:00] inserting 1406285 rows  date=2024-10-29
INFO[2024-11-13T14:27:41.439-04:00] insert complete after 32m25.997654534s  date=2024-10-29
INFO[2024-11-13T14:27:41.439-04:00] Loading test analysis by job daily summaries  date=2024-10-30
INFO[2024-11-13T14:27:41.439-04:00] CREATE TABLE IF NOT EXISTS test_analysis_by_job_by_dates_2024_10_30 PARTITION OF test_analysis_by_job_by_dates FOR VALUES FROM ('2024-10-30') TO ('2024-10-31');  date=2024-10-30
INFO[2024-11-13T14:27:41.521-04:00] partition created  date=2024-10-30
INFO[2024-11-13T14:28:06.475-04:00] inserting 1542754 rows  date=2024-10-30
INFO[2024-11-13T14:58:18.866-04:00] insert complete after 30m12.390888535s  date=2024-10-30
INFO[2024-11-13T14:58:18.866-04:00] Loading test analysis by job daily summaries  date=2024-10-31
INFO[2024-11-13T14:58:18.866-04:00] CREATE TABLE IF NOT EXISTS test_analysis_by_job_by_dates_2024_10_31 PARTITION OF test_analysis_by_job_by_dates FOR VALUES FROM ('2024-10-31') TO ('2024-11-01');  date=2024-10-31
INFO[2024-11-13T14:58:18.914-04:00] partition created  date=2024-10-31
INFO[2024-11-13T14:58:39.345-04:00] inserting 1427864 rows  date=2024-10-31
INFO[2024-11-13T15:23:49.438-04:00] insert complete after 25m10.093027879s  date=2024-10-31

dgoodwin (Contributor, Author) commented:

First test: querying Oct 29 - Nov 14 returns in 11s with no caching.
Daily imports take 20-30 minutes; I'm manually triggering them each day. Soon we'll have a good bit more than 2 weeks of data and can see how it would work if someone wanted to extend the date range.

openshift-ci bot (Contributor) commented Nov 18, 2024

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name     Commit   Details  Required  Rerun command
ci/prow/lint  12c6d5b  link     true      /test lint
ci/prow/e2e   12c6d5b  link     true      /test e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
