Description
Addresses some of #13. I ran pgmap under a profiler (vizviewer) and saw that the hamming function and file parsing were taking the most time. We have been using seq.io for fastx parsing, and it is not optimized for our use case: it parses and then discards a lot of information (quality, coordinates, etc.). I switched to parsing fastx with Python built-ins, which gave a ~2x improvement on my machine. I then switched the sequence alignment functions to the levenshtein package, which added a small boost as well.
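For reviewers who want a feel for the approach, here is a minimal sketch of sequence-only FASTQ parsing with Python built-ins, plus a plain-Python hamming distance for comparison. It is not the code in this PR: the function names `read_fastq_sequences` and `hamming` are hypothetical, and it assumes strict 4-line FASTQ records with no wrapped sequence lines.

```python
import io
from itertools import islice


def read_fastq_sequences(handle):
    """Yield only the sequence line of each FASTQ record.

    Assumes every record is exactly 4 lines (header, sequence, '+', quality),
    so sequences sit at line indices 1, 5, 9, ...
    """
    for line in islice(handle, 1, None, 4):
        yield line.rstrip("\r\n")


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


# Tiny in-memory example standing in for a real fastq file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGA\n+\nIIII\n")
sequences = list(read_fastq_sequences(example))
distance = hamming(sequences[0], sequences[1])
```

Skipping the header and quality lines avoids building record objects that are never used, which is where a general-purpose parser spends much of its time. The pure-Python `hamming` above is only for illustration; a C-backed implementation such as the one in the levenshtein package should be faster on long inputs.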
To give a rough idea of performance: a run on ~10 GB fastq files takes < 15 minutes on my MacBook Air M3. On the Hutch cluster it may take about 5x that, so a bit over an hour for a typical file.
Dependent on #14
Type of change
Optimization. No functional change.
How Has This Been Tested?
Test coverage unchanged.
Checklist: