tasks fail to restart because of failure to launch logmon #13198
Comments
Specifically, at line 160 of the attached log Nomad starts throwing logmon errors and everything goes sideways for all Nomad jobs:

Task hook failed: logmon: Unrecognized remote plugin message: This usually means that the plugin is either invalid or simply needs to be recompiled to support the latest protocol
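To narrow down this kind of reattach failure, it can help to check whether logmon processes from before the restart are still running and whether they were launched from the same Nomad binary that is running now. A minimal sketch, assuming a Linux host and the default setup where logmon runs as a subcommand of the nomad binary:

```sh
# List surviving logmon sidecar processes and the command line they were
# launched with; logmon normally runs as "nomad logmon".
pgrep -af "nomad logmon"

# Compare against the client's current binary; a binary swapped out by an
# in-place upgrade is one way the plugin handshake can fail on reattach.
nomad version
```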
@krishnaprateek6 I've edited your original request because all the information was hidden inside an HTML comment for some reason. The relevant bits from the attached logs look to be the logmon error quoted above.
It's a little unclear to me from your description how the situation starts. Is this happening after restarting the Nomad client agent, or after restarting the Docker container? If you mean the Docker container, was it restarted because it failed, or was it restarted through a [...]?

Possible duplicate of #11939 or #6715, except it looks like this is a non-Windows case.
Possibly something related to #5577, but it looks like that was fixed in an older version of Nomad. After we start seeing these logmon errors in the Nomad client logs is when we notice odd behavior: Nomad cannot re-attach to an existing job even though `docker ps` shows that the container is up and running.
@tgross looks like I have more concrete info now. The Nomad job itself is running, but the job running on a static port becomes inaccessible after we see the logmon errors above. It's as if Nomad cannot release static ports after the job restarts itself. This may be useful for you: https://discuss.hashicorp.com/t/nomad-job-unreachable-in-bridge-mode-networking-even-though-container-runs-fine/38202/2
@krishnaprateek6 let's try to avoid hypothesizing for now and focus on describing the sequence of events and symptoms you're seeing. So far we only know that: [...]

Stuff that would be helpful for me to debug: [...]
@tgross to answer your first question: no, the client did not get restarted, but yes, the task got restarted by Nomad. When we first looked at it, it was a DB connection issue, which we resolved. The second time it happened there were no errors; the task was running in Nomad, but the port on which the task was running became inaccessible.

FYI, we have all firewall rules open for the static ports. Nomad somehow, and we're not sure why, holds on to the static port, so when the job tries to restart on Nomad, even though no port is allocated, it thinks the port is already allocated, and the service running on that static port becomes inaccessible. Unfortunately, we are unable to replicate the missing-jobs issue at will.

Output of docker version: 19.03.13, build 4484c46d9d (this is on a CentOS 7 VM)
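To separate a port-allocation problem from a plain networking one, it is worth checking what is actually bound on the host. A rough sketch, assuming a Linux client, the stock CNI portmap plugin for bridge-mode networking, and 8080 as a stand-in for the real static port:

```sh
# Is any process actually listening on the static port on the host?
ss -ltnp | grep 8080

# Bridge-mode port mappings are DNAT rules rather than listeners; check
# whether a stale forwarding rule survived the task restart.
sudo iptables -t nat -L CNI-HOSTPORT-DNAT -n --line-numbers | grep 8080
```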
@krishnaprateek6 if you have a running task that can't be reached on the network, that's a completely different problem from the logmon attachment failure. Please open a separate issue for that.
Is there any info or progress on why `systemctl restart nomad` takes so long for the Nomad service to restart, please?
@krishnaprateek6 open a new issue detailing that problem. Please don't pollute existing issues with unrelated questions. |
Hi @tgross, we are running into the same problem. We're running 1.4.1 on Debian 11. It happens on a Nomad client restart. We pretty much only restart clients when upgrading Nomad, but I imagine it can happen on a normal restart as well.

Up to now we've only detected it via expiring Vault tokens which are still in use by allocations that failed to restart after receiving new secrets. Looking at the logs, it appears that the logmon error is happening to most, if not all, allocations after a Nomad client restart. However, the number of allocations that continue using old (expired) Vault tokens after that is very small (most recently, 1 allocation out of 14 on one node, out of 3 nodes total with roughly 14-15 allocations per node), so it would appear that most are restarting successfully in the end anyway.

It's a strange issue and we're not sure how to proceed. Is there anything you can suggest to debug further?
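One way to find the affected allocations directly, rather than waiting for tokens to expire, is to ask Vault about each token Nomad handed out. A sketch, run as root, assuming the vault CLI is configured to reach your Vault server and the default alloc directory layout under a data_dir of /opt/nomad/data (adjust the path for your client config):

```sh
# Nomad writes each task's Vault token into the task's secrets dir.
for tok in /opt/nomad/data/alloc/*/*/secrets/vault_token; do
  echo "== $tok"
  # An expired or revoked token makes the lookup fail.
  vault token lookup "$(cat "$tok")" | grep -E 'ttl|expire' \
    || echo "   token invalid or expired"
done
```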
Hi there,
Nomad version
v1.2.6
Operating system and Environment details
CentOS 7
Issue
Once a Docker container is restarted successfully, Nomad fails to re-attach the job. This issue was reported a long time ago in #5296, but it still seems to be a problem on Nomad v1.2.6. The Docker container is still running on the VM, but Nomad does not show the job as running; this seems to be an intermittent problem where containers keep running but jobs in Nomad magically disappear. Another issue we notice is that running Docker containers are emitting tons of "unable to read json" warnings in Docker, and when we do `systemctl restart nomad` it takes much longer than expected, but eventually starts. We are hoping these issues are interconnected. The big problem is that we are unable to reproduce this issue at will, but it keeps coming up every now and then, and then we have to react. Restarting Docker and Nomad seems to fix it, but we need to do this every time we see the problem.
Reproduction steps
`nomad stop; nomad run` — the Docker container is up and running, but the job fails to show up in the Nomad UI or in `nomad status`, even though `docker ps` shows the container is up. It is not consistent, but it seems to happen every now and then.
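For reference, the reproduction above spelled out as commands, with a hypothetical job name ("myjob") standing in for the real one:

```sh
nomad job stop myjob
nomad job run myjob.nomad

docker ps               # container shows as up and running...
nomad job status myjob  # ...but intermittently the job is missing here
```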
Expected Result
Jobs should show up in Nomad whenever their Docker containers are fully up.
Actual Result
`docker ps` shows the container is up, but it does not show up in Nomad as a registered job.
Job file (if appropriate)
Nomad Client logs (if appropriate)
I have attached the Nomad client logs below.