Far too large pull - grid/slurm support, checkpointing, extra features #73
Open: flowers9 wants to merge 57 commits into xiaochuanle:master from flowers9:master
Some minor warnings (signed/unsigned comparisons and such) were fixed, and packed_db has been reformatted in preparation to allow larger read sets.
It was just the position of the entry in the array, which is kinda pointless.
The rand() calls for non-AGCT basepairs in packed_db got replaced by a deterministic function to allow identical output from reruns. Some more reformatting while planning the upcoming change allowing for large fasta files in mecat2cns.
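One way to make reruns reproducible is to derive the substitute base from the position in the read rather than from rand(). The following is a minimal sketch only: `encode_base` is a hypothetical helper, not the actual packed_db function, and the 2-bit codes (A=0, C=1, G=2, T=3) are assumed for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the actual packed_db code: map a base to a 2-bit
// code (assumed ordering A=0, C=1, G=2, T=3). Anything outside ACGT gets a
// code derived from its position, so repeated runs on the same input pack
// identical data.
inline std::uint8_t encode_base(char base, std::size_t position) {
    switch (base) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:
            // Deterministic stand-in for (rand() & 3): depends only on position.
            return static_cast<std::uint8_t>(position & 3);
    }
}
```

Because the result depends only on the input, an `N` at a given offset always packs to the same code, which is what allows output from two runs to be compared byte for byte.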
commented out methods that weren't used anywhere
in mecat2cns; also moved packed_db into mecat2cns, since that's the only place it's used (the bits that were kinda used in common (lookup_table and split_database) shouldn't have been using it, as they were treating it as subroutines with a fixed interface, not a class)
also removed unneeded asserts from dw
slightly worried about the memory footprint so currently using u4_t to hold the read index which limits total reads to 2^32-1, rather than 2^63-1 for the rest of the program. Easy to change, but will up the memory footprint of the reordering by taking 32 bytes per candidate rather than 16
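The 16 versus 32 byte arithmetic can be made concrete with a sketch. The record below is hypothetical: its four fields and their names are invented purely to show how the per-candidate size doubles when the index type widens, and it is not the structure used in mecat2cns.

```cpp
#include <cstdint>

// Hypothetical candidate record; field names and count are assumptions made
// only to illustrate the size arithmetic.
template <typename Index>
struct CandidateSketch {
    Index query_id;
    Index target_id;
    Index query_offset;
    Index target_offset;
};

// Four 32-bit fields take 16 bytes; four 64-bit fields take 32 bytes.
static_assert(sizeof(CandidateSketch<std::uint32_t>) == 16, "32-bit indexes");
static_assert(sizeof(CandidateSketch<std::int64_t>) == 32, "64-bit indexes");

// Bytes needed to hold `count` candidates for a given index type.
template <typename Index>
constexpr std::uint64_t reorder_bytes(std::uint64_t count) {
    return count * sizeof(CandidateSketch<Index>);
}
```

Under these assumptions, a billion candidates would need about 16 GB with 32-bit indexes and about 32 GB with 64-bit ones.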
if there are too many reads to reorder up front, check for it and fall back to the older method (i.e., splitting candidates by read id and reordering inside each partition)
mainly to make reordering optional for now, as it needs more work - it's too memory intensive for something that's supposed to mainly be used when memory is low. Also made sure checks for minimum coverage were always applied regardless of what processing options were chosen.
Changed structures to help lower memory usage, some refactoring to help add read sorting to also help with memory usage
pulling recent changes into branch, since it's never going the other way
Also changed spun off processes to not bother listing options if they're defaults
in the end, it would just take too much memory to hold the read-read pairings, which doesn't work well when the whole point is to limit memory usage
mainly to reduce memory usage (no need to copy the list when I can simply sort it instead)
also in the middle of some memory testing
mostly - two of the subclasses still have small mallocs
…rings however, this does appear to have slowed things down a bit, I suspect mainly because of the clearing/recreating of strings, but that can be addressed now that we're off static arrays
also created unified buffer for output instead of left/right buffers
but now getting free errors in the boost routines, for some reason
finally nailed all the bugs (I hope) I created by changing dw.cpp, and the changes should speed up alignment creation as well as reduce memory usage
also testing d_path as a deque rather than a vector
turns out dynamic allocation comes with a large cost - 33% slower, and not appreciably less memory usage. The other changes made a major speed increase, though, 2.5-3x increase.
though it's not currently settable
renamed some variables, made end of band calculations a bit quicker
use actual error_rate, not .25, and correct align size, which should be based directly off the extend size, not k_offset
changed a few vector<char> to vector<uint1> when they just held values from 0-4; changed pthread mutexes to std::mutex, which requires C++11
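For readers unfamiliar with the C++11 replacement, the pattern looks roughly like this. `SharedCounter` is a hypothetical type used only to show the locking idiom; it does not appear in the pull request.

```cpp
#include <mutex>

// Hypothetical shared counter, used only to illustrate the locking pattern.
struct SharedCounter {
    std::mutex lock;  // stands in for pthread_mutex_t plus its init/destroy calls
    long value = 0;

    void add(long n) {
        // lock_guard releases the mutex when it goes out of scope, replacing
        // an explicit pthread_mutex_lock / pthread_mutex_unlock pair.
        std::lock_guard<std::mutex> guard(lock);
        value += n;
    }
};
```

The guard also releases the lock if the protected code throws, which the explicit lock/unlock pair does not do on its own.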
it's not just a right triangle, it's a bounded one
got rid of argument.*, which was no longer used, added and removed various #includes to better reflect what was actually needed, changed Align() to finish out the inner loop (k_min to k_max) when it hit the termination condition and choose the best of the terminating k values rather than the first one
The lto additions might not be portable, though (particularly the change to src/mecat2cns/main.mk, as I had to specify the plugin location for ar)
Got rid of non-standard basic type definitions in mecat2cns, changed all asserts to assert()
vectorized using SSE2 commands and a touch of assembly; more conversion of idx_t to int64_t
improved both the vectorized and non-vectorized string comparison in the inner loop; vectorization relies on sse2 gnu intrinsics and the bsfl/bsrl assembly commands; vectorized version is roughly 10% faster
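The general technique (compare 16 bytes at a time, then bit-scan for the first mismatch) can be sketched as follows. This is not the code from the pull request: `match_length` is a hypothetical helper, and it uses the GCC/Clang builtin `__builtin_ctz` in place of inline bsfl assembly. It falls back to a plain scalar loop when SSE2 is unavailable.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: length of the common prefix of a and b, where both
// buffers hold at least n readable bytes.
inline std::size_t match_length(const char* a, const char* b, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Bit j of mask is set when byte j of the two blocks is equal.
        const unsigned mask =
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)));
        if (mask != 0xFFFFu) {
            // The lowest clear bit marks the first mismatch; counting trailing
            // zeros of the inverted mask is a forward bit scan.
            return i + static_cast<std::size_t>(__builtin_ctz(~mask & 0xFFFFu));
        }
    }
#endif
    // Scalar tail (and full fallback when SSE2 is not available).
    for (; i < n && a[i] == b[i]; ++i) {}
    return i;
}
```

Unaligned loads keep the sketch simple; whether aligned loads or hand-written assembly are worth the extra care would need to be measured on the real inner loop.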
some variables renamed to be more expressive, some int64_t changed to int
In brief, this pull provides basic support for grid/slurm (and possibly other remote queueing packages) with the -G and -S options (it supports job queueing and tracking of job completion). It does checkpointing for mecat2pw and mecat2cns, allowing restarting of failed jobs (but only for can runs, not m4). It also allows correction against (in mecat2pw) and of (in mecat2cns) a subset of the given reads with the -R option.
The -k option of mecat2cns has been changed to default to zero rather than 10, with the assumption that a quicker partitioning is better (and setting -k to zero is now the same as using a negative value, rather than creating an infinite loop).
It also changes index_t to idx_t (to avoid a Solaris namespace conflict) and arbitrarily changes the code style, in the code I needed to modify, to one I can read more easily.
There was a small bug fix to findErrors.C as well to prevent crashes from chunks with no matching reads.