-
Notifications
You must be signed in to change notification settings - Fork 340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FAKETIME_SKIP_CMDS does not seem to work in certain automake cases #355
Comments
Thanks for pointing this out, interesting use case you have there! Indeed we have a long-standing incompatibility with jemalloc, as discussed in some of the (closed) issues you already pointed to, and back then we concluded that there is no viable way on libfaketime's side to fix this problem. The I think the problem here is that However, even for a "skipped command", libfaketime is LD_PRELOADED, and thus clashes with jemalloc during initialisation, leading to the lock-up. So, unfortunately, using I know that this requires a much more fine-grained approach to start some programs with libfaketime LD_PRELOADed and other without in any non-trivial build process, and this is unsatisfying, but this sort of control must be exerted from outside libfaketime, because once libfaketime becomes active, it already is too late. One possible approach would be a wrapper, so programs are not directly started, but through this wrapper, and the wrapper decides (based on the name of the program to start) whether libfaketime should be LD_PRELOADed or not. That's not very elegant, and adds another layer of complexity to the build process that might not be worth the effort, but it's also somewhat simple and straight-forward. |
Still need to do more testing but the master branch with the PR for jemalloc seems to solve the issue. |
Just an update I am using the patch for libjemalloc. I am not sure if this was merged or not. It seems to solve the issue for me... it seem to be gone from the PR page. However the issues I am seeing now is that in -j based builds I tend to see with high job/thread values is
I don't know if there is something that can be done to address this. |
The PR (#333) was closed by @ronrother without merging as it does not (and probably can't) work in all cases. I can't tell how close your new problem is related to this (if you applied the PR). Basically, during initialisation libfaketime accesses semaphores and shared memory for inter-process synchronisation of settings. It tries 3 times to either access existing ones or to re-create those, which are supposed to already exist (because their filename was passed in the environment variable FAKETIME_SHARED). If all attempts fail, libfaketime exits with error code 1. You could try to increase the number of attempts in Line 527 in f26242b
or add short delays before re-tries. Usually, the cleanest solution is to use libfaketime on a more long-lived parent process (whois PID is then used as part of the semaphore filename) |
I will give it a try and let you know what I see. I am happy that I have 100% reproducible builds now! it just the random failures I am seeing which seem related to fixing the jemalloc issues |
I am still testing.. but so far it is still failing... I admit I don't understand why we need to have a shared semaphore in this code. Can you explain a little? It seems that it would be simpler without this. I am sure there is a good reason related to the spawning process or something. I am getting in the logs stuff like this:
which suggests some process making lots of subprocesses and they are failing as they are referring to a parent process PID value? ( I think that is what is happening) Is the path being used bad? it seem making a file in root '/' is going to fail |
Ok .. looking around and have a better idea now... I don't this code works as you stated... the ft_shim_init() functions check for the ENV var as you stated and does the open call as it "should" exist from a running parent process with the
This fails so it does the else statement of Lines 533 to 536 in f26242b
however, it just calls itself. There is no recreation attempt. it just tries to reopen an existing item that we know does not exist. Looking at what should be called to recreate the item being tested for |
I have been digging into the issue more today. I think I see the issue. The My question is do you feel we should tweak this to try to create an object with the parent PIP defined via FAKETIME_SHARED or should FAKETIME_SHARED be cleared and try to recreate it based on the child PID? |
|
Right.. so the fix is to re create() via the parent or the child. The current logic tries to create on the child then returns without FAKETIME_SHARED being updated to the child value.. so it fails as it always mismatched. It seems if i read between the lines the expected logic is to use the child pid and have FAKETIME_SHARED updated correctly |
Can you check your |
I have lots of files .. 1000s |
Please delete them. They must be left over from something that went wrong (so |
I applied this patch
This seems to have solved the issue. I feel I need to test it more.. but the issue has not shown itself again so far. As far as files in |
There should not be any semaphore files except for those processes which are still running. It'd be interesting to figure out why they are left over, especially whether it's application crashes or a libfaketime-internal error that causes abnormal process termination. I get the intention of your patch and appreciate it. However, |
I think the main issue is that on short processes there is a race in which the parent dies and the children are still running. This errors out in the child as the FAKETIME_SHARED values are invalid, and the process tries to open a file that does not exist, and if it tries to re-create the file it uses a PID value that is a mismatch. This is clearly a bug at the moment in the main code. |
I'm really failing to follow you on where you clearly see a bug. libfaketime tries to open the file whose name was passed to it explicitly through an environment variable by its parent process. If the semaphore file does not exist any more, because the parent process already cleaned up and exited, a new file is created, obviously using the current process' PID to make it unique, because the current process could become the parent process of yet other spawned processes later. Using the parent's PID instead of its own PID for this purpose seems like a horrible idea, because that other process has already exited and its PID might be re-used by a completely different process sooner or later. You're right that one basic assumption is that libfaketime initialisation is run while the parent process still exists. That's usually the case when applications are started, e.g., via an interactive shell, or when a wrapper like |
I agree. The patch I had above clears the value of FAKETIME_SHARED so when the |
Just as a note. I did change the n value to 100. it was clear in the dump of test that the parent ID was always used, not the new child ID value that was made in ft_shm_create(). Everytime the ft_shm_init() was called within the process it was getting the old parent ID value from the getenv() call. I was testing this on centos 7 and rhel 8 |
You mentioned |
So for me at the moment everything seems to be working. The CI was set up to just run over and over again and test the outputs.. not one error. I agree with what you are saying, and was not sure what was wrong. However, it is hard to debug as you can imagine and it might be related to the jemalloc patch as well. I have to deal with some other items today, but I think I can post here a small example of the setup I have, that would at least show the leaking of the files in |
Good to know, thanks. Files are typically left in |
This might be a user error issue, or a question might be a real issue. I am trying to use libfaketime to help with certain reproducible issues I have when building cmake and automaker-based projects. In general, this tool seems to work great. I use the value of
FAKETIME_SKIP_CMDS=ls
to address automake configure script issues related to "sane" build environments. However, I have found a few cases related to components that use jemalloc in which configure will lock up when the configure script tries to check for jemalloc values. This seems to be easy to work around as the configure script is trying to run a
conftest
program. I tried adding this to the FAKETIME_SKIP_CMDS variable, however, it does not seem to be accepted. I didFAKETIME_SKIP_CMDS=ls,conftest
As stated this locks up as described in issue #327 In this case I was compiling Apache Trafficserver (https://github.com/apache/trafficserver) with jemalloc support. I am unclear if I need to use a different value to allow conftest to be ignored. Ideally, conftest should be ignored in most cases as it testing some value we normally don't need to tweak with libfaketime. For me, this would address the jemalloc issue for building these components as I just need to ignore the tests the configure script is compiling and running.
The text was updated successfully, but these errors were encountered: