TSPS-342 Update reheader seq dict wdl for better efficiency #146

Merged 8 commits on Oct 24, 2024

@mmorgantaylor mmorgantaylor commented Oct 22, 2024

Description

Make reheadering VCFs with the correct sequence dictionary more efficient. We'll need to run this on the AoU+AnVIL reference panel VCFs, which will take a long time.

Changes:

  • extract the header, fix it, then reheader the original VCF with the fixed header, rather than fixing the header in the original VCF in one step
  • use bcftools instead of tabix to create the index
  • add timestamped echo statements for better progress tracking
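
For readers who want a picture of the flow described in the list above, here is a minimal sketch. It is not the code in this PR. The function names (`fix_header`, `reheader_vcf`) and file names are hypothetical, and the header-fixing step assumes a Picard-style `.dict` file with `@SQ` records; the actual WDL may fix the header differently. Only `bcftools view -h`, `bcftools reheader -h/-o`, and `bcftools index -t` are real bcftools calls.

```shell
# Hypothetical helper: build a corrected header from an extracted header and
# a Picard-style sequence dictionary (assumed input format).
# Keeps every non-contig meta line, regenerates ##contig lines from @SQ
# records, then appends the #CHROM column line.
fix_header() {
  # $1 = extracted VCF header, $2 = sequence dictionary (.dict)
  grep '^##' "$1" | grep -v '^##contig=' || true
  awk -F'\t' '$1 == "@SQ" {
    name = ""; len = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)   # sequence name
      if ($i ~ /^LN:/) len = substr($i, 4)    # sequence length
    }
    printf "##contig=<ID=%s,length=%s>\n", name, len
  }' "$2"
  grep '^#CHROM' "$1"
}

# Hypothetical wrapper: extract the header, fix it, reheader the original
# VCF, and index with bcftools. Runs in a subshell so `set -eu` and the
# variables stay local to this function.
reheader_vcf() (
  set -eu
  in_vcf=$1; seq_dict=$2; out_vcf=$3
  workdir=$(mktemp -d)

  echo "$(date) - extracting header from $in_vcf"
  bcftools view -h "$in_vcf" > "$workdir/old_header.txt"

  echo "$(date) - fixing header"
  fix_header "$workdir/old_header.txt" "$seq_dict" > "$workdir/new_header.txt"

  echo "$(date) - reheadering"
  bcftools reheader -h "$workdir/new_header.txt" -o "$out_vcf" "$in_vcf"

  echo "$(date) - creating index"
  bcftools index -t "$out_vcf"

  echo "$(date) - done"
)
```

A call might look like `reheader_vcf chr20.vcf.gz ref.dict chr20.reheadered.vcf.gz`, with all three paths being placeholders. Working on the small header file alone means the large VCF body is only touched once, by `bcftools reheader`.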

Some learnings about timing:

  • the vast majority of this task's runtime is spent creating the index
  • the next-longest step is localizing the files (1 hour for the chr20 VCF)

Note: we ran this with a monitoring script and found that the task came nowhere near using all the available memory (memory utilization did not exceed 11%), so we don't think optimizing memory would help. We also weren't limited by disk I/O (throughput didn't exceed 40 MiB/s, and the max on the machine is >200 MiB/s).

Evaluation:
Before these changes, running this WDL on the AoU-only chr20 VCF took 29+ hours and cost $2.37.
After these changes, the same input took 23 hours (20% improvement) and cost $1.87 (21% improvement).
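
As a quick check on the percentages above, the arithmetic can be reproduced with a small helper. The function name `pct_improvement` is hypothetical, and because the "before" runtime is reported as 29+ hours, the runtime figure is approximate.

```shell
# Percent improvement from a "before" value to an "after" value,
# computed as (before - after) / before * 100, to one decimal place.
pct_improvement() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_improvement 29 23       # runtime in hours, prints 20.7
pct_improvement 2.37 1.87   # cost in USD, prints 21.1
```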

Jira Ticket

https://broadworkbench.atlassian.net/browse/TSPS-342

sonarcloud bot commented Oct 22, 2024

@mmorgantaylor mmorgantaylor marked this pull request as ready for review October 24, 2024 13:03
@mmorgantaylor mmorgantaylor merged commit e08e533 into main Oct 24, 2024
12 checks passed
@mmorgantaylor mmorgantaylor deleted the TSPS-342_mma_reheader_seq_dict branch October 24, 2024 14:32