TSPS-342 Update reheader seq dict wdl for better efficiency #146
Description
Make reheadering of VCFs with the correct sequence dictionary more efficient, since we'll need to run it on the AoU+AnVIL reference panel VCFs, which will take a long time.
Changes:
Some learnings about timing:
Note: we ran this with a monitoring script and found that the task was nowhere near using all of the available memory (utilization did not exceed 11%), so we don't think optimizing memory would help. We also weren't limited by disk I/O (throughput didn't exceed 40 MiB/s, and the max on the machine is >200 MiB/s).
Evaluation:
Before these changes, this WDL run on the AoU-only chr20 VCF took 29+ hours and cost $2.37.
After these changes, the same input took 23 hours (20% improvement) and cost $1.87 (21% improvement).
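For reference, the improvement percentages above can be reproduced with a quick calculation. This is only a sketch: the helper name `percent_improvement` is made up for illustration, and the "before" runtime is taken as exactly 29 hours even though the actual figure was 29+, so the runtime improvement shown is a lower bound.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Figures from the evaluation above; 29 is a lower bound on the old runtime.
runtime = percent_improvement(29, 23)      # hours
cost = percent_improvement(2.37, 1.87)     # USD

print(f"runtime: {runtime:.1f}%  cost: {cost:.1f}%")
# prints: runtime: 20.7%  cost: 21.1%
```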
Jira Ticket
https://broadworkbench.atlassian.net/browse/TSPS-342