make install does not work, pufferfish binary segfaults #15
@mathog — thanks for the detailed report. @mohsenzakeri, can you please take a look at this? The easiest way may be to use a CentOS 8 container to reproduce the same environment and build an MWE that we can play with.
Yes @rob-p, I'll be looking into this now.
Hi @mathog, I followed all the commands you provided here to build a pufferfish binary from scratch in a CentOS 8 container created by "docker pull centos" (this creates a container with CentOS Linux release 8.1.1911). At the end of the process, the binary located at "$TOPDIR/bin/pufferfish" seems to work fine. It also builds the index on the toy example you provided in this post. Are you running the binary at this location too?
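For reference, the reproduction recipe above might be scripted roughly as follows. The package list and exact build commands are my assumptions, not taken verbatim from this thread, so treat it as a sketch (the script below only prints the commands rather than executing them):

```shell
# Sketch of reproducing the CentOS 8 build environment described above.
# The package list is a guess; the repository URL is the real pufferfish repo.
recipe=$(cat <<'EOF'
docker pull centos
docker run -it centos /bin/bash
# inside the container:
dnf install -y gcc-c++ make cmake git zlib-devel bzip2-devel
git clone https://github.com/COMBINE-lab/pufferfish.git
cd pufferfish && mkdir build && cd build && cmake .. && make
EOF
)
printf '%s\n' "$recipe"
```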
It does the same thing no matter where I run it. Interestingly, it runs under valgrind, but valgrind also reports some errors, which are presumably causing the segfault when it runs on its own. I rebuilt it like this (accidentally overwriting
Unfortunately not too many line numbers resulted; at least the routine names are visible. See the attached file. I'm using a fairly old machine with only 8GB of RAM for this (it was a spare compute node I kept for parts, but took home to work on during the epidemic). The processor is "Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz". I suppose something about the pipeline might be causing the difference between a crash on this machine and no crash on a newer one. There are no "-march" flags, so it isn't compiling for something incompatible, and I verified that all the -msse* modes called for are supported by that CPU. There are some lines like:
In any case, I suspect that those valgrind-detected errors are the problem, and that they only cause segfaults on certain machines due to CPU or other differences. This is probably also why salmon blows up on this machine, as the valgrind messages for it include many of the same sections of code (and, for some reason, more line numbers):
Hi @mathog, These logs are very useful; thank you! Two quick questions (we will look into the log). (1) Do the pre-compiled versions of salmon run on your machine? Specifically, either this binary, or the bioconda binary. (2) How were you successfully able to run valgrind? There is a specific (stated) incompatibility between jemalloc and valgrind (see e.g. here). This is the reason we can't reliably use valgrind when linking with jemalloc (though we do debug using valgrind by explicitly changing the CMake to not link jemalloc). Thanks!
The precompiled salmon binary does work. I believe the problem with valgrind and jemalloc concerns finding leaks when the program exits, but that wasn't what it showed here; rather, it showed an access violation or use of an uninitialized variable. The precompiled binary has the same (or at least similar) valgrind messages. Whether it explodes or not would depend upon what happens to be at the improperly accessed locations, and that need not be the same for the two binaries.
I agree that the logs here show access to an uninitialized variable. That may, indeed, be happening (and we will look into this ASAP). I also agree this is an issue that would depend on program layout, and what is in the relevant locations in memory space when the program runs. However, my understanding from the relevant issues with jemalloc and valgrind is that they are simply incompatible in the large (not just for leak checking). Specifically, since both jemalloc and valgrind replace default system allocator calls, you cannot reliably use valgrind on an executable compiled and linked against jemalloc and expect correct results. We've run into this when doing memory validation ourselves, where we've had to compile without jemalloc to even get valgrind to not immediately throw an error and kill the process.
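As an aside, the class of error being discussed can be reproduced with a tiny, self-contained program. This toy is hypothetical (not taken from pufferfish); under valgrind the branch on the uninitialized value is flagged, and whether it misbehaves when run normally depends on whatever value happens to be in that memory, matching the machine-dependence described above:

```shell
# Write a minimal C program containing an uninitialized read (illustrative only).
cat > /tmp/uninit.c <<'EOF'
#include <stdio.h>
int main(void) {
    int x;          /* never initialized */
    if (x > 0)      /* valgrind: "Conditional jump ... uninitialised value(s)" */
        puts("positive");
    return 0;
}
EOF
# Compile it if a compiler is present; valgrind would then report the bad read:
#   valgrind --track-origins=yes /tmp/uninit
if command -v gcc >/dev/null 2>&1; then
    gcc -g -O0 -o /tmp/uninit /tmp/uninit.c
fi
```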
Hi again @mathog, Thanks again for the detailed report of the problem you are experiencing. Thanks,
transcripts.fasta is from the sample_data.tgz in the salmon distribution. Running salmon the same way, but with the "some.fasta" from the first post, is also full of valgrind errors, though not exactly the same ones. So I don't think the sequence is the problem. Yes, it is possible that the valgrind messages are just noise due to interactions with jemalloc. But since the binaries built here do segfault when run normally or in gdb, there must be something wrong with them which is in some way memory related. It looked like the source code for salmon only used malloc, realloc, and calloc, and not any of the other jemalloc-specific functions, so I relinked it without the jemalloc library (well, maybe; the usual link command but minus "/usr/lib64/libjemalloc.so"):
and then ran without segfaulting like:
and it worked. It probably has nothing to do with not having jemalloc in there; access violations that fail in one place and not another will often flip between working and failing across rebuilds or relinks. So next:
Valgrind flagged the same lines in that run as in this one (with jemalloc).
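A generic version of that relinking trick: capture the verbose link line (e.g. via `make VERBOSE=1`), strip the jemalloc entries, and re-run it by hand. The filter below is a sketch, and the sample link line is made up for illustration:

```shell
# Remove jemalloc entries from a captured link command (illustrative only;
# the library path matches the one quoted in the report above).
strip_jemalloc() {
  sed -e 's| /usr/lib64/libjemalloc\.so||g' -e 's| -ljemalloc||g'
}

echo 'g++ -o salmon obj.o /usr/lib64/libjemalloc.so -lpthread' | strip_jemalloc
# prints: g++ -o salmon obj.o -lpthread
```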
Hi @mathog, Thanks again for all of the detailed information. This has helped us track down the source of these uninitialized reads. We have eliminated them, and valgrind now runs cleanly (with 0 reported errors) for [...]. In one case, part of a byte at the end of [...]. When you have a chance, could you please pull from the [...]
Rebuilt from develop branch. Test with
Segfaults. Relinked pufferfish without jemalloc.so (removed the two jemalloc.so entries on the linker command line) to make pufferfish_nojemalloc. Test that:
Ran normally. Ran `valgrind --track-origins=yes` on both versions. Both ran to completion and neither reported any errors. (Attached.) Finally ran the jemalloc version in gdb (`gdb --args /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar`), which received signal SIGSEGV, Segmentation fault. I am trying to obtain the src.rpm for jemalloc to rebuild it. There seems to be something wrong with it (on my test system, anyway).
Thank you for the detailed new report @mathog! So, this is behavior we have seen before. On our builds, it's sort of rare, but every once in a while we will get a jemalloc segfault from just that function. Every time we attempt to link without jemalloc to use valgrind (or the address and memory sanitizers), we get a clean profile. I am often reticent to point a finger at a widely-used library like jemalloc, but I have been unable to reproduce such issues under the system allocator, or valgrind's, or even other multithreaded allocators like tcmalloc.
Clearly something strange is going on with this jemalloc. Today I rebuilt it and reinstalled it:
Then reran the test case:
But the rebuild also made the debug RPMS, so
That's right: adding the debug information resulted in gdb itself vaporizing during the run. That is not the sign of healthy software! Let's revisit the CentOS container example from above. Where did the jemalloc used in that case come from? It isn't a standard part of the operating system; mine came from the EPEL 8 repository. Perhaps the container used a different version of jemalloc? This is what mine looks like (after installing the rebuild):
In the interests of finally being done with this (without having to wait for some obscure jemalloc issue to be hunted down at those developers' leisure), is there some world-shattering problem that will occur when using pufferfish/salmon without jemalloc? I know these programs do not build that way by default, but it is easy enough to do by hand, and doing so has the distinct advantage that the programs will actually run on this system! Also, I just double-checked, and the salmon binary your lab distributes is not actually linked against jemalloc; neither are any of the libraries which ship with it.
If built with just the system malloc, the programs will run fine (i.e. they will produce the correct output). The main reason we link against jemalloc is that both pufferfish and salmon have embarrassingly parallel phases, and so they benefit greatly from an allocator optimized for multithreaded execution; in that context, scalable memory allocation becomes a bottleneck. So, I would expect the performance profile to change when leaving out jemalloc. We've not tested extensively in that configuration, so I can't give you a good estimate of how much. Alternatively, there are other "drop-in" replacements for malloc and free that you could substitute, like tcmalloc or mimalloc (which actually looks to perform better than jemalloc under most metrics). However, these are all runtime differences, which are secondary to having the program actually run!
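One low-effort way to try such a drop-in allocator without relinking is LD_PRELOAD. This is a sketch, not anything from the thread: the library path is an assumption and will vary by system and allocator:

```shell
# Try tcmalloc at run time via LD_PRELOAD, if both pieces are present.
alloc=/usr/lib64/libtcmalloc.so    # hypothetical install location
if [ -e "$alloc" ] && [ -x ./pufferfish ]; then
    LD_PRELOAD="$alloc" ./pufferfish index -r transcripts.fasta -o /tmp/pufftest
else
    echo "tcmalloc or pufferfish not found here; nothing preloaded"
fi
```

Because the allocator is swapped at run time, the same binary can be tested under both allocators to see whether the segfault follows jemalloc.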
P.S. As you can see, I reported this upstream to the jemalloc folks. One suggestion they have is to compile jemalloc with |
The releases for pufferfish and salmon have not been updated since the nature of the jemalloc/libtbb problem was identified. Are there fixed versions in one of the branches on GitHub?
Yes, the fix is on the develop branch of both. The next salmon release (1.3.0) has a few other improvements and fixes and will likely be out later this week or next week.
Pufferfish's develop branch compiled (with warnings) and the resulting binary worked. No segfault, even though it is linked to both libjemalloc and libtbbmalloc_proxy: "pufferfish index -r transcripts.fasta -o /tmp/pufftest" completed. This patch eliminates the compiler warnings outside of htslib: It is hard to apply, though, because some of the patched files don't appear until during the build (the external file is downloaded then). So this is how I had to build it:
Note that "make install" did not actually install the binaries into the target; that had to be done later. There were many compiler warnings in htslib (I fixed one, then decided to stop). When htslib 1.10 is installed on this system (for samtools) it builds cleanly; the htslib in pufferfish seems to be older. Perhaps at some point it can be updated? That will remove the remaining compiler warnings. Here are the log files from the build;
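The patch mechanics themselves are straightforward once the files exist. Here is a toy, self-contained illustration (all paths are temporary and hypothetical) of generating a unified diff and applying it after the fact:

```shell
# Create a file, produce a unified diff against a fixed copy, and apply it.
workdir=$(mktemp -d)
printf 'int x;\n' > "$workdir/a.c"
printf 'int x = 0;\n' > "$workdir/a.c.fixed"
# diff exits nonzero when files differ, which is expected here:
diff -u "$workdir/a.c" "$workdir/a.c.fixed" > "$workdir/fix.patch" || true
patch -s "$workdir/a.c" "$workdir/fix.patch"
cat "$workdir/a.c"
# prints: int x = 0;
```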
Thanks for the patch and the detailed logs. The version of HTSlib is due to the recursive dependency in seqlib. Regarding applying the patch, there is an option to perform a "patch" step on a downloaded and decompressed
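For reference, and assuming the dependency is fetched with CMake's ExternalProject module (I have not verified how pufferfish's build actually wires this up), the download step does support a patch hook that runs after extraction. A hypothetical sketch, with a placeholder URL and patch file name:

```cmake
include(ExternalProject)

ExternalProject_Add(htslib_external
  URL https://example.org/htslib.tar.gz        # placeholder URL
  # PATCH_COMMAND runs in the source dir after the download/extract step,
  # which is exactly when the to-be-patched files first exist:
  PATCH_COMMAND patch -p1 -i ${CMAKE_CURRENT_SOURCE_DIR}/fix-warnings.patch
  CONFIGURE_COMMAND ""
  BUILD_IN_SOURCE 1
  BUILD_COMMAND make
  INSTALL_COMMAND ""
)
```

Note that `patch -i` is used rather than shell redirection, since ExternalProject commands are not run through a shell.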
Pufferfish does not "make install" fully. This is on CentOS 8.
After that these are the contents of $TOPDIR:
It is unclear to me which, if any, of the files that install initially placed under $TOPDIR really should be there. The documentation for building pufferfish only describes up to "make".
The other issue - pufferfish built this way does not run.