Chess sim command runtime improvement heuristics? #18
Comments
Hi @jmcbroome,
The conversion will take a while, but you'll only have to do it once; after that, loading the data will be a lot faster.
@jmcbroome, do you get an acceptable runtime after the conversion?
Sorry for not following up more quickly; I've been trying to make it work. I've been precalculating O/E matrices with the HiCExplorer implementation (hicTransform in obs_exp mode), which only takes a few minutes. However, CHESS appears to hang while reading in the reference contact data: it has spent 3 days so far reading in an obs-exp transformed chicken HiC matrix as reference. It's only using one thread for this step, even though I gave it access to 18 cores. This seems to be a serious bottleneck. Since this step is not multithreaded in the software, I may try breaking my data into a series of syntenic chromosome pairs and associated syntenic region pairs and starting several parallel CHESS runs, so that each individual run has to read in less total contact data. Current output: It is now 2020-11-23 and no additional lines have been printed.
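The splitting step described above could be sketched roughly as follows. This is only an illustration, not part of CHESS: the function name `split_bedpe_by_chromosome_pair` is hypothetical, and it assumes the region pairs file is a tab-separated BEDPE with the two chromosome names in columns 1 and 4.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_bedpe_by_chromosome_pair(bedpe_path, out_dir):
    """Write one BEDPE file per (chrom1, chrom2) pair and return {pair: path}.

    Hypothetical helper: assumes tab-separated BEDPE rows with the reference
    chromosome in column 1 and the query chromosome in column 4.
    """
    groups = defaultdict(list)
    with open(bedpe_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and comment/header lines.
            if not row or row[0].startswith("#"):
                continue
            groups[(row[0], row[3])].append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for (chrom1, chrom2), rows in sorted(groups.items()):
        out_path = out_dir / f"{chrom1}_{chrom2}.bedpe"
        with open(out_path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        outputs[(chrom1, chrom2)] = out_path
    return outputs
```

Each resulting file could then be given to its own `chess sim` run as the pairs file. Note that this only splits the region pairs: for a run to actually read less contact data, the matrices themselves would also have to be subset per chromosome, which this sketch does not do.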
Yeah, this is indeed a bottleneck. Besides the O/E calculation, CHESS currently also reads everything in the reference and query matrices into memory. That also means a lot of data is read that will never be used, because it lies outside of the region pairs of interest. For txt file input, this is a difficult problem to solve, as one has to iterate over the entire file anyway to see which contacts are relevant. For FAN-C compatible matrices (i.e. Juicer and FAN-C; Cooler does not support expected values on file), however, we might be able to retrieve regions on the fly. Then the
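To illustrate the txt-file case, here is a minimal sketch of streaming a sparse matrix file and keeping only contacts inside regions of interest. It is not CHESS code: the function name is hypothetical, and the three-column `bin1 bin2 value` layout with integer bin indices is an assumption made for the example. Memory use stays small, but every line still has to be read, which is why this format cannot avoid a full pass over the file.

```python
def stream_contacts_in_regions(path, regions):
    """Yield (bin1, bin2, value) for contacts with both bins inside one region.

    Hypothetical helper. Assumes a whitespace-separated file with three
    columns (bin1, bin2, value); `regions` is a list of half-open
    (start_bin, end_bin) ranges.
    """
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            bin1, bin2, value = int(fields[0]), int(fields[1]), float(fields[2])
            for start, end in regions:
                if start <= bin1 < end and start <= bin2 < end:
                    yield bin1, bin2, value
                    break  # emit each contact at most once
```

An indexed format, by contrast, can jump directly to the requested region instead of scanning the whole file.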
Hello again! I have not yet had time to refactor the CHESS code to load submatrices from file on demand. I'm going to need @nickmachnik's help for this, too, particularly for the multiprocessing part. However, I worked on the Cooler compatibility layer in FAN-C last night and found some inefficiencies, which I fixed. On my machine (SSD), I now get a reading speed of 15-20 million pixels per minute (don't ask what it was before...). This should load high-res Cooler matrices much more quickly. The changes are already available via GitHub and PyPI (FAN-C version
Cheers, Kai
I updated the fanc dependency to 0.9.8 and will put this on PyPI with the next patch (0.3.6).
I’m encountering significant runtime problems when calculating a similarity score for a single target region. I’m comparing two species over a single syntenic region, using balanced (but not O/E-transformed) Cooler-format inputs and 18 threads, though it doesn’t seem to be using most of those threads most of the time. I’m using the --background-query and --limit-background options.
It has been running for more than six days at this point and still has not returned any output. Is there a way to speed this up significantly, for example by restricting the background in some other way or by using a different file format? I will need to run the sim command many more times on many more regions if I want to incorporate it into my project, and at this speed that is infeasible.
Output below:
2020-11-09 11:28:04,305 INFO Running '/home/jmcbroome/anaconda3/bin/chess sim --background-query -p 18 --limit-background human_chr1_hic_test.corrected.cool chicken_hic.corrected.cool chess_test.bedpe chess_single_test.txt'
2020-11-09 11:28:09,670 INFO CHESS version: 0.3.5
2020-11-09 11:28:09,670 INFO FAN-C version: 0.9.7
2020-11-09 11:28:09,673 INFO Loading reference contact data
Expected 100% (20923106 of 20923106) |################################################################################################################################################################| Elapsed Time: 1:01:49 Time: 1:01:49
2020-11-09 17:21:29,197 INFO Loading query contact data
Expected 99% (25409016 of 25665672) |###################################################################################################################################################### | Elapsed Time: 1 day, 19:55:08 ETA: 0:00:45
Expected 100% (25665672 of 25665672) || Elapsed Time: 6 days, 18:56:44 Time: 6 days, 18:56:44