Ingest Terra metrics #478
Conversation
Codecov Report

@@            Coverage Diff             @@
##           develop     #478      +/-   ##
=============================================
+ Coverage     54.31%   56.75%    +2.43%
- Complexity      267      300       +33
=============================================
  Files            31       31
  Lines          1725     1947      +222
  Branches        143      165       +22
=============================================
+ Hits            937     1105      +168
- Misses          725      762       +37
- Partials         63       80       +17
Quality Gate failed. Failed condition: 0.0% Coverage on New Code (required ≥ 80%).
Looks good... do you have a sense of whether it's slow because of all the API calls fetching the primary descriptor paths, or the posting? Or maybe it's both, or neither, and the size of the data is the reason.
It's probably not worth the optimization effort, but depending on the answer above and on how many different versions of a workflow are run, I wonder if it wouldn't be more efficient to fetch all versions for a workflow and then calculate the primary descriptors. There are a lot of variables there, e.g., how many versions of the same workflow are run, so probably not worth further effort unless it really goes out of control.
There are some workflows that do not have absolute primary paths, see SEAB-5898. Will those work? If not, can we assume they're absolute by prepending a forward slash?
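A minimal sketch of the prepend-a-slash idea, under the assumption that relative primary descriptor paths can simply be treated as absolute; the helper name is illustrative and not part of this PR:

    // Illustrative helper: treat a relative primary descriptor path as absolute by
    // prepending a forward slash so it can be compared against TRS descriptor paths.
    static String normalizePrimaryDescriptorPath(String path) {
        if (path == null || path.isEmpty()) {
            return path;
        }
        return path.startsWith("/") ? path : "/" + path;
    }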
final SourceUrlTrsInfo sourceUrlTrsInfo = sourceUrlToSourceUrlTrsInfo.get(sourceUrl);
final ExecutionsRequestBody executionsRequestBody = new ExecutionsRequestBody().runExecutions(List.of(workflowExecution.get()));
try {
    extendedGa4GhApi.executionMetricsPost(executionsRequestBody, Partner.TERRA.toString(), sourceUrlTrsInfo.trsId(), sourceUrlTrsInfo.version(),
It could be worth having a dry-run type argument, where you can see what would get posted as a sort of sanity check before actually creating S3 data.
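A rough sketch of how such a guard could look around the POST call; the dryRun flag, the logger, and the description variable are assumptions for illustration, not the PR's actual code:

    // Assumed --dry-run style flag: when set, log what would be posted instead of
    // calling the endpoint that writes metrics data to S3.
    if (dryRun) {
        LOG.info("Dry run: would post {} executions for {} version {}",
                executionsRequestBody.getRunExecutions().size(),
                sourceUrlTrsInfo.trsId(), sourceUrlTrsInfo.version());
    } else {
        extendedGa4GhApi.executionMetricsPost(executionsRequestBody, Partner.TERRA.toString(),
                sourceUrlTrsInfo.trsId(), sourceUrlTrsInfo.version(), description);
    }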
Hmm, based on my simple evaluation of measuring the time elapsed before and after the call, the call to fetch descriptors takes anywhere from 0.01 to 0.09 seconds. Posting the metric takes around the same time, in the 0.0x range, but I'll see if grouping executions and posting them together helps.
Hmm, I just tested this and it actually takes about 2 minutes longer for 40,000 rows, likely because we're fetching versions that may never have been executed. I have an optimization that reduces the processing time for 1,000,000 rows from 1h14m29s to 56m22s, which I'll push soon. So with 4,000,000 rows, we're looking at about 4 hours, hopefully less.
Good point, I will double check.
That's probably fine; we're going to run this very rarely.
👍
Still thinking this is odd; can you output a few examples and track this in a new ticket for investigation?
Refresh my memory: what happens when we run this twice with the same data?
comment above
In https://ucsc-cgl.atlassian.net/browse/SEAB-5943, we're changing the S3 file key from the time submitted to the execution ID, so if this is run twice, it will error for the execution IDs that already exist in the same directory. I think we don't want users to accidentally overwrite the old data using the POST metrics endpoint, hence the error.
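For illustration only, the overwrite protection described above could be enforced with an existence check before writing, sketched here with the AWS SDK for Java v2; the class, method, bucket, and key names are assumptions, not the webservice's actual implementation:

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
    import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class MetricsS3WriterSketch {
        // Sketch: refuse to write a metrics object whose key (derived from the execution ID)
        // already exists, so re-running the ingestion cannot silently overwrite old data.
        static void putMetricsIfAbsent(S3Client s3, String bucket, String key, String json) {
            try {
                s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build());
                throw new IllegalStateException("Execution already submitted: " + key);
            } catch (NoSuchKeyException e) {
                s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                        RequestBody.fromString(json));
            }
        }
    }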
OK, thinking we definitely don't have to worry then. Any idea what the aggregation runtimes will be like vs ingestion?
For 200,000 rows, ingestion took about 30 minutes and submitted 170,480 executions. Aggregation of these executions took 1h19m54s, submitting 841 metrics to the database. The aggregation can definitely take advantage of parallel processing, so I'll add that.
Adding parallel processing to the aggregator reduced the time to 26m39s.
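As an illustration of that kind of change (not the PR's actual implementation), per-version aggregation work can be fanned out on a bounded pool with a parallel stream; all names here are placeholders:

    import java.util.List;
    import java.util.concurrent.ForkJoinPool;

    public class ParallelAggregationSketch {
        // Aggregate each workflow version on a bounded ForkJoinPool so the level of
        // parallelism is explicit rather than tied to the common pool's default size.
        static void aggregateAll(List<String> versionsToAggregate) throws Exception {
            ForkJoinPool pool = new ForkJoinPool(8);
            try {
                pool.submit(() -> versionsToAggregate.parallelStream()
                        .forEach(ParallelAggregationSketch::aggregateVersion)).get();
            } finally {
                pool.shutdown();
            }
        }

        static void aggregateVersion(String version) {
            // Placeholder for fetching a version's executions and computing aggregated metrics.
        }
    }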
Noting the following stats for a CSV file of 1,000,000 executions:
Submitted 844,397 executions and skipped 155,603, which makes a lot more sense than the stats for 1,000,000 executions in the draft PR description. There was a bug that was causing it to skip executions it shouldn't (now fixed). This bug fix caused the processing time to increase.
We could also leave it in BigQuery or AWS's equivalent (Redshift?).
minor comments/questions
@@ -23,7 +23,7 @@
     </encoder>
   </appender>

-  <root level="error">
+  <root level="info">
intentional?
Yeah, I wanted to see the INFO logs; otherwise it's eerily quiet on the CLI 😁
}

RunExecution workflowExecution = new RunExecution();
// TODO: uncomment below when the update executions endpoint PR is merged
Note: another alternative is to use AWS's Athena to query the metrics in S3, which feels like a more future-proof solution 💭
Note that builds will fail until I update the Dockstore webservice version to a tag containing the corresponding webservice changes.
Ah OK. Will swing back again after that tag.
final ExecutionsRequestBody executionsRequestBody = new ExecutionsRequestBody().runExecutions(workflowExecutionsToSubmit);
try {
    extendedGa4GhApi.executionMetricsPost(executionsRequestBody, Partner.TERRA.toString(), sourceUrlTrsInfo.trsId(), sourceUrlTrsInfo.version(),
            "Terra metrics from Q4 2023");
Hard-coded and arguably not correct, unless it's the date they gave it to us (metrics are for 2022, but they were given to us in Q4 2023). If we expect them to switch to invoking the API in the future, then maybe we don't need to worry about it; just flagging.
Good point, I think I'll make it an optional argument instead
Made the description more generic and also added description as a command argument.
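For illustration, such an argument might be declared as follows, assuming a JCommander-style CLI; the flag name and default value are assumptions, not necessarily what this PR uses:

    import com.beust.jcommander.Parameter;
    import com.beust.jcommander.Parameters;

    // Illustrative command class; only the description option is shown.
    @Parameters(commandDescription = "Submit Terra metrics to Dockstore")
    public class SubmitTerraMetricsCommandSketch {
        @Parameter(names = "--description", description = "Optional description attached to the submitted metrics")
        private String description = "Terra metrics";

        public String getDescription() {
            return description;
        }
    }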
FYI, #479 needs to be merged first, before updating the webservice version in this PR to 1.15.0-rc.0, because the other PR has changes required for the metrics aggregator to build successfully.
I'm guessing a workflow that was renamed?
That's one possible scenario. Other ones are:
Updated the release checklist with instructions on how to ingest the Terra metrics. Per discussion in #479 (comment), we will be wiping the metrics bucket and re-ingesting the tooltester AGC executions and DNAstack miniwdl metrics. Local stats of running through the checklist:
EDIT: investigating why it skipped 67 metrics.
I created a webservice PR that fixes the aggregator so that it doesn't skip the 67 metrics. See details in dockstore/dockstore#5788. I will still merge this PR, but will skip ingesting the DNAstack metrics in staging until the next RC release, because the metrics that were skipped were some of the DNAstack metrics.
Quality Gate failed. Failed condition: 1 Security Hotspot.
Companion PRs:
PR Description
This PR ingests the metrics that Terra provided.
Approach:
- Group the executions by source_url and send the executions with the same TRS ID and version in one request instead of multiple (one for each execution); a sketch of this grouping step appears after this description. Unfortunately, from my testing, the webservice changes in "Add update execution metrics endpoint" (dockstore#5778), which create a file for each execution, greatly increase the time for this PR. Processing 200,000 executions without the changes in the linked PR takes 1 minute, but with those changes it takes 32 minutes. Optimization efforts will have to be in that PR. EDIT: I will be reverting the changes in that PR that resulted in the performance decrease in metric ingestion. See the comment in that PR for more details.
- Determine the TRS ID and version for each source_url from the published workflows with the matching repository and version, e.g. github.com/broadinstitute/gatk and ah_var_store. We see if any of these workflows have a primary descriptor that matches the descriptor in the source_url.
Note that builds will fail until I update the Dockstore webservice version to a tag containing the corresponding webservice changes.
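A minimal sketch of the grouping step mentioned in the first bullet above, assuming the CSV rows have already been parsed into objects exposing their source_url; the record type and field names are illustrative, not the PR's actual classes:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class GroupBySourceUrlSketch {
        // Illustrative stand-in for a parsed Terra CSV row.
        record TerraExecution(String sourceUrl, String workflowId) { }

        // Group executions so that all rows sharing a source_url (and therefore the same
        // TRS ID and version) can be submitted in a single POST request.
        static Map<String, List<TerraExecution>> groupBySourceUrl(List<TerraExecution> executions) {
            return executions.stream().collect(Collectors.groupingBy(TerraExecution::sourceUrl));
        }
    }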
Draft PR Description
This is a draft PR for ingesting the Terra metrics that the Broad team provided. I'm looking for feedback on the approach; I still need to add a few tests, clean it up, write documentation, etc.
Approach:
- The dateExecuted is taken from the workflow_start. EDIT: in the larger dataset, some executions are skipped because the dateExecuted couldn't be obtained from the workflow_start, even though the workflow_start is valid 🤔. It works okay in a smaller dataset, so something is up. EDIT 2: It was a bug on my end; the program was skipping executions that it shouldn't have. The run time ends up being longer due to this fix.
- Determine the TRS ID and version for each source_url from the published workflows with the matching repository and version, e.g. github.com/broadinstitute/gatk and ah_var_store. We see if any of these workflows have a primary descriptor that matches the descriptor in the source_url.
Review Instructions
Ingest Terra metrics in QA or staging, then aggregate the metrics.
Issue
SEAB-5963
Security
If there are any concerns that require extra attention from the security team, highlight them here.
Please make sure that you've checked the following before submitting your pull request. Thanks!
Run mvn clean install in the project that you have modified (until https://ucsc-cgl.atlassian.net/browse/SEAB-5300 adds multi-module support properly).