Single threaded optimization #15

marissafujimoto · 2025-01-07T00:38:16Z

Description

Addresses some of #13. I ran pgmap under a profiler (vizviewer) and saw that the hamming function and the file parsing were taking the most time. We have been using seq.io for fastx parsing, and it is not optimized for our use case (insofar as it parses and then discards a lot of information about quality, coordinates, etc.). I switched to parsing fastx with just Python built-ins, which gave a ~2x improvement on my machine. I then switched to the Levenshtein package for the sequence alignment functions, which added a small boost as well.
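For anyone reading along, here is a rough sketch of the two ideas above. It is not the code in this PR: the names `open_maybe_gzip`, `read_fastq_sequences`, and `hamming_distance` are hypothetical, and it assumes strict four-line FASTQ records (no wrapped sequence lines). The Levenshtein import is optional here, with a pure-Python fallback so the sketch runs without the package installed.

```python
import gzip
from typing import Iterator

try:
    # Assumption: the Levenshtein package is installed; it exposes hamming().
    from Levenshtein import hamming as _fast_hamming
except ImportError:
    _fast_hamming = None


def open_maybe_gzip(path: str):
    """Open a plain or gzip-compressed file in text mode, based on its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq_sequences(path: str) -> Iterator[str]:
    """Yield only the sequence line of each record.

    Assumes every record is exactly four lines (header, sequence, '+', quality),
    so the sequence is always the second line of each group of four. Headers and
    quality strings are skipped without being parsed.
    """
    with open_maybe_gzip(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()


def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming distance requires equal-length sequences")
    if _fast_hamming is not None:
        return _fast_hamming(a, b)
    # Fallback when the Levenshtein package is not available.
    return sum(x != y for x, y in zip(a, b))
```

The saving in a sketch like this comes from never building record objects or touching the header and quality lines, which is the work a general-purpose parser does whether or not the caller needs it.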

To give a rough idea of performance, we are looking at < 15 minutes to run on ~10 GB fastq files on my MacBook Air M3. On the Hutch cluster it may take about 5x as long, so a bit over an hour for a typical file.

Dependent on #14

Type of change

Optimization. No functional change.

How Has This Been Tested?

Test coverage unchanged.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

Update requirements.txt

cansavvy

This looks great and sounds like it's a great start to making the code quicker!

cansavvy · 2025-01-16T16:29:08Z

I'm going to go ahead and merge unless you have any issues, @marissafujimoto!

marissafujimoto added 2 commits January 6, 2025 15:30

Do fastx parsing without seqio

780b30c

Switch to Levenshtein package for alignment functions

73ad242

marissafujimoto requested a review from cansavvy January 7, 2025 00:39

marissafujimoto marked this pull request as ready for review January 7, 2025 00:58

cansavvy added 4 commits January 15, 2025 15:40

Merge branch 'main' into optimization

8737ada

Update requirements.txt

a65e670

Update readme partially

81d008e

Don't even worry about 3.9

ee2ebab

marissafujimoto mentioned this pull request Jan 16, 2025

Update requirements.txt #17

Merged

12 tasks

cansavvy added 2 commits January 16, 2025 09:35

Got rid of editdistance

2024ae0

Merge pull request #17 from FredHutch/cansavvy/requirements

82be8e6

Update requirements.txt

cansavvy approved these changes Jan 16, 2025

cansavvy merged commit bc8dfca into main Jan 16, 2025
5 checks passed

cansavvy deleted the optimization branch January 16, 2025 16:29
