Far too large pull - grid/slurm support, checkpointing, extra features #73

flowers9 · 2019-02-19T21:22:53Z

In brief, this pull provides basic support for grid/slurm (and possibly other remote queueing packages) with the -G and -S options (it supports job queueing and tracking of job completion). It does checkpointing for mecat2pw and mecat2cns, allowing restarting of failed jobs (but only for can runs, not m4). It also allows correction against (in mecat2pw) and of (in mecat2cns) a subset of the given reads with the -R option.

The -k option of mecat2cns has been changed to default to zero rather than 10, with the assumption that a quicker partitioning is better (and setting -k to zero is now the same as using a negative value, rather than creating an infinite loop).

It also changes index_t to idx_t (to avoid a Solaris namespace conflict) and arbitrarily changes the code style, in code I needed to modify, to one I can read more easily.

There was a small bug fix to findErrors.C as well to prevent crashes from chunks with no matching reads.

in mecat2cns; also moved packed_db into mecat2cns, since that's the only place it's used (the bits that were kinda used in common, lookup_table and split_database, shouldn't have been using it, as they were treating it as subroutines with a fixed interface, not a class)

slightly worried about the memory footprint, so currently using u4_t to hold the read index, which limits total reads to 2^32-1, rather than 2^63-1 for the rest of the program. Easy to change, but will up the memory footprint of the reordering by taking 32 bytes per candidate rather than 16

mainly to make reordering optional for now, as it needs more work - it's too memory intensive for something that's supposed to mainly be used when memory is low. Also made sure checks for minimum coverage were always applied regardless of what processing options were chosen.

got rid of argument.*, which was no longer used, added and removed various #includes to better reflect what was actually needed, changed Align() to finish out the inner loop (k_min to k_max) when it hit the termination condition and choose the best of the terminating k values rather than the first one

Dave Flowers added 30 commits February 19, 2019 14:49

Mild reformat and warning removal

5268334

Some minor warnings (signed/unsigned comparisons and such) were fixed, and packed_db has been reformatted in preparation to allow larger read sets.

removed id from seq index

c121957

It was just the position of the entry in the array, which is kinda pointless.

remove rand() calls, reformatting

d97eada

The rand() calls for non-AGCT basepairs in packed_db got replaced by a deterministic function to allow identical output from reruns. Some more reformatting while planning the upcoming change allowing for large fasta files in mecat2cns.
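The commit does not show the replacement function, so the following is only a minimal sketch of one way to get reproducible output: derive the 2-bit code for a non-ACGT base from its position instead of calling rand(). The name `encode_base`, the A=0/C=1/G=2/T=3 mapping, and the bit-mixing constants are all assumptions for illustration, not what packed_db actually does.

```cpp
#include <cstdint>

// Hypothetical stand-in for "rand() % 4": derive a 2-bit base code (0-3) from
// the base's position, so the same input always packs to the same output.
// Assumption: the real function in packed_db may use a different rule.
inline uint8_t encode_base(char base, uint64_t position) {
    switch (base) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: {
            // Mix the position bits (splitmix64-style finalizer), keep 2 bits.
            uint64_t x = position + 0x9E3779B97F4A7C15ULL;
            x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
            x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
            x ^= x >> 31;
            return static_cast<uint8_t>(x & 3);
        }
    }
}
```

Because the result depends only on the character and its position, rerunning on the same input gives byte-identical packed data.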

changed min/max read id to overlap counts

21fc258

minor cleanup

48db467

cleaned up packdb

c0c3cd9

commented out methods that weren't used anywhere

refactored candidate thread loop

ef131a7

also removed unneeded asserts from dw

made fasta to pac conversion restartable

c74943c

added branch for number of reads vs u4_t

7ef87a8

if there are too many reads to reorder up front, check for it and fall back to the older method (i.e., splitting candidates by read id and reordering inside each partition)

more prep towards reordering properly

4a2f1d6

added option for binary output to mecat2pw

c272e9d

shifting candidate handling to new format

04003aa

Changed structures to help lower memory usage, some refactoring to help add read sorting to also help with memory usage

Merge branch 'master' of https://github.com/xiaochuanle/MECAT

502f7b5

pulling recent changes into branch, since it's never going the other way

added MECAT2 defaults, fixed memory reduction bugs

e57eb43

Also changed spun off processes to not bother listing options if they're defaults

random cleanup: stderr -> cerr, timer cleanup

3a71f31

trashed reorder attempt

dfac5c6

in the end, it would just take too much memory to hold the read-read pairings, which doesn't work well when the whole point is to limit memory usage

changed sorting ec_list to better order reads

64b0266

mainly to reduce memory usage (no need to copy the list when I can simply sort it instead)
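As a rough illustration of sorting in place rather than copying, the sketch below orders a candidate list with std::sort and a comparator. The `Candidate` struct, its fields, and the ordering rule are hypothetical; the real ec_list entries and sort key are not shown in this pull.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical candidate record; the real ec_list entries have other fields.
struct Candidate {
    int64_t read_id;  // read being corrected
    int64_t mate_id;  // read it overlaps
    int score;        // overlap score
};

// Sort in place (no second copy of the list) so that candidates for the same
// read are adjacent, with the best score first within each read.
inline void sort_candidates(std::vector<Candidate>& list) {
    std::sort(list.begin(), list.end(),
              [](const Candidate& a, const Candidate& b) {
                  if (a.read_id != b.read_id) return a.read_id < b.read_id;
                  return a.score > b.score;
              });
}
```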

cleaned up some comparison functions

dd464d0

added k/m/g suffixes to integer options

8f773ad
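For readers unfamiliar with suffixed options, here is a minimal sketch of how such parsing could work. The helper name `parse_size_option` is made up, and treating k/m/g as powers of 1024 is an assumption; the commit does not say whether it uses 1000 or 1024.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse "500k", "2m", "1g" or a plain integer.
// Assumption: suffixes are powers of 1024; the pull does not state the base.
inline int64_t parse_size_option(const std::string& text) {
    std::size_t used = 0;
    const long long value = std::stoll(text, &used);  // throws on non-numeric input
    if (used == text.size()) return value;             // no suffix present
    if (used + 1 != text.size()) {
        throw std::invalid_argument("trailing characters: " + text);
    }
    switch (std::tolower(static_cast<unsigned char>(text[used]))) {
        case 'k': return value * (1LL << 10);
        case 'm': return value * (1LL << 20);
        case 'g': return value * (1LL << 30);
        default: throw std::invalid_argument("unknown suffix: " + text);
    }
}
```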

changed some fprintf to use cerr instead

cf26d54

also in the middle of some memory testing

convert id_list to vector<>

dc89bb8

changed cns_table to vector

5e38cb8

whoops, forgot to check for errors before committing

cf9127a

changed CnsAln and CnsAlns to vectors

d924f37

removed some duplicate and obsolete code

d872961

converted M5Record to use vector instead of malloc

83c5425

Dave Flowers added 27 commits April 19, 2019 16:29

changed DiffRunningData from malloc to vectors

3b43bc2

mostly - two of the subclasses still have small mallocs

added max_aln_size back in after accidentally removing it

21349b5

finished changes from defined (large) array buffers to vectors and strings

625a676

however, this does appear to have slowed things down a bit, I suspect mainly because of the clearing/recreating of strings, but that can be addressed now that we're off static arrays

remove extra buffer read/write when getting alignments

5199f25

also created unified buffer for output instead of left/right buffers

moved to copy() over memcpy, unified left/right buffers

56383b1

but now getting free errors in the boost routines, for some reason

in the middle of testing, this is a save point

6b1fa4f

still working on bug, but got rid of bsearch in dw.cpp

1e82d14

fixed bugs, improved speed of alignment finding

7d62c5d

finally nailed all the bugs (I hope) I created by changing dw.cpp, and the changes should speed up alignment creation as well as reduce memory usage

increase RM from 100k to 500k

7025c3c

also testing d_path as a deque rather than a vector

back to static allocation

80fdb91

turns out dynamic allocation comes with a large cost - 33% slower, and not appreciably less memory usage. The other changes made a major speed increase, though, 2.5-3x increase.

perform better static sizing of GetAlignment() buffers

dab77e2

added exit files to allow detection of failed execution of sub-tasks

36cf1e7
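One possible shape for such exit files is sketched below: each sub-task records its exit status in a small file, and the parent treats a missing or non-zero file as a failure. The `.exit` naming and both function names are hypothetical, not taken from the commit.

```cpp
#include <fstream>
#include <string>

// Hypothetical convention: a sub-task writes its exit status to "<task>.exit"
// when it finishes. The real file names and contents may differ.
inline bool write_exit_file(const std::string& task, int status) {
    std::ofstream out(task + ".exit");
    out << status << '\n';
    return static_cast<bool>(out.flush());
}

// A missing, unreadable, or non-zero exit file all count as a failed task,
// so a job that was killed before writing anything is detected too.
inline bool task_succeeded(const std::string& task) {
    std::ifstream in(task + ".exit");
    int status = -1;
    return (in >> status) && status == 0;
}
```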

general cleanup, moved error_rate to options

bcd394e

though it's not currently settable

cleaned up Align() a bit

c2aa22a

renamed some variables, made end of band calculations a bit quicker

minor pre-alloc fix

da5b08c

use actual error_rate, not .25, and correct align size, which should be based directly off the extend size, not k_offset

char -> uint1, pthread mutex -> c++11

d53721f

changed a few vector<char> to vector<uint1> when they just held values from 0-4; changed pthread mutexes to std::mutex, which requires c++11
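The mutex change follows the usual C++11 pattern, sketched here with a made-up `ReadCounter` type rather than anything from mecat2cns: a std::mutex member replaces pthread_mutex_t, and std::lock_guard replaces the explicit lock/unlock pair.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter guarded by std::mutex instead of
// pthread_mutex_t; std::lock_guard unlocks automatically on scope exit.
struct ReadCounter {
    std::mutex lock;
    long processed = 0;

    void add(long n) {
        std::lock_guard<std::mutex> guard(lock);
        processed += n;
    }
};

// Run `threads` workers that each add 1 to the counter `per_thread` times.
inline long count_in_parallel(int threads, int per_thread) {
    ReadCounter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) counter.add(1);
        });
    }
    for (std::thread& worker : pool) worker.join();
    return counter.processed;
}
```

Unlike pthread_mutex_t, std::mutex needs no init/destroy calls, which removes one class of cleanup bugs. Building code that uses std::thread may still require the platform's thread flag (for example -pthread with GCC).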

minor cleanup and reformatting

9b7c353

narrowed d_path buffer to reflect cutoff

5d8af26

it's not just a right triangle, it's a bounded one

comments, refined d_path size

6cc539c

more cleanup of header file includes

b0fe741

added timing to output

e9f3a9f

Minor cleanup, added lto, moved if out of inner Align loop

33c64a3

The lto additions might not be portable, though (particularly the change to src/mecat2cns/main.mk, as I had to specify the plugin location for ar)

type cleanup, header file cleanup

c96a93b

Got rid of non-standard basic type definitions in mecat2cns, changed all asserts to assert()

vectorization of mecat2cns inner loop, some cleanup

9c9a0cc

vectorized using SSE2 commands and a touch of assembly; more conversion of idx_t to int64_t

improved inner loop

c21c2fc

improved both the vectorized and non-vectorized string comparison in the inner loop; vectorization relies on SSE2 GNU intrinsics and the bsfl/bsrl assembly commands; the vectorized version is roughly 10% faster
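To show the general technique, and not the code in this commit, here is a minimal sketch of a 16-bytes-at-a-time match-length comparison using SSE2 intrinsics. The function name `match_length` is hypothetical, and `__builtin_ctz` (a GCC/Clang builtin that typically compiles to a bit-scan-forward) stands in for the hand-written bsfl assembly. It assumes an x86 target with SSE2.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: count how many leading bytes of a and b match,
// comparing 16 bytes per iteration with a scalar loop for the tail.
inline int match_length(const uint8_t* a, const uint8_t* b, int len) {
    int i = 0;
    for (; i + 16 <= len; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // One mask bit per byte: set where the two bytes are equal.
        const unsigned equal =
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)));
        if (equal != 0xFFFFu) {
            // The lowest clear bit marks the first mismatching byte.
            return i + __builtin_ctz(~equal & 0xFFFFu);
        }
    }
    for (; i < len && a[i] == b[i]; ++i) {}  // scalar tail, fewer than 16 bytes
    return i;
}
```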

random cleanup

8e172fa

some variables renamed to be more expressive, some int64_t changed to int
