
tasks fail to restart because of failure to launch logmon #13198

Open
krishnaprateek6 opened this issue Jun 1, 2022 · 13 comments

Comments

@krishnaprateek6

krishnaprateek6 commented Jun 1, 2022

Hi there,

Nomad version

v1.2.6

Operating system and Environment details

Centos 7

Issue

Once a Docker container is restarted successfully, Nomad fails to re-attach the job. This was reported a long time ago in #5296, but it still appears to be an issue on Nomad v1.2.6. The Docker container is still running on the VM, but Nomad does not show the job as running; it seems to be an intermittent problem where containers keep running but the corresponding Nomad jobs simply disappear. Another issue we notice is that running Docker containers cause dockerd to log a large number of JSON-decoding warnings (see the "got error while decoding json" lines below), and systemctl restart nomad takes much longer than expected, although Nomad does eventually start. We suspect these issues are interconnected. The big problem is that we are unable to reproduce this at will; it keeps coming up every now and then and we have to react. Restarting Docker and Nomad fixes it, but we have to do that every time we see the problem.

Reproduction steps

nomad stop followed by nomad run: the Docker container comes up and is running, but the job fails to show up in the Nomad UI or in nomad status, even though docker ps shows the container is up. This is not consistent, but it seems to happen every now and then.
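
A rough sketch of the sequence, using placeholder names (my-job and my-job.nomad are not from the original report):

nomad job stop my-job          # stop the running job (placeholder job name)
nomad job run my-job.nomad     # resubmit the job file
docker ps                      # the container shows as up and running
nomad job status my-job        # ...but the job does not show as running
nomad status                   # ...and it is missing from the overall job list as well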

Expected Result

Jobs should show up in Nomad whenever their Docker containers are fully up.

Actual Result

docker ps shows the container is up, but it does not show up in Nomad as a registered job.

Job file (if appropriate)

dockerd[30372]: time="2022-06-01T09:15:25.317857180-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=170
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317871168-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=171
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317884580-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=172
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317899266-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=173
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317913751-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=174
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317928000-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=175
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317942272-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=176
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317956568-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=177
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317980363-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=178
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318001335-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=179
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318019534-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=180
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318034561-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=181
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318048961-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=182
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318068103-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=183
Jun 01 09:15:25 dockerd[30372]: time="2022-06-01T09:15:25.318085230-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=184
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318099485-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=185
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318113410-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=186
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318128129-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=187
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.323296331-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=188

Nomad Client logs (if appropriate)

I have attached nomad logs below

@krishnaprateek6
Author

docker logs

@krishnaprateek6
Author

NomadFailure (1).txt

Specifically, at line 160 of the attached log Nomad starts throwing logmon errors and everything goes sideways for all Nomad jobs.

@krishnaprateek6
Author

Task hook failed logmon: Unrecognized remote plugin message: This usually means that the plugin is either invalid or simply needs to be recompiled to support the latest protocol
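
For context, logmon is not a separate binary: the Nomad client launches it as a subprocess of the nomad binary itself, and this message comes from the go-plugin handshake with that subprocess failing. A rough sketch of checks that might narrow it down, assuming Nomad runs under a systemd unit named nomad (as the systemctl commands above suggest):

nomad version                                              # confirm which client binary is actually on PATH
pgrep -af 'nomad logmon'                                   # list any logmon subprocesses the client has running
journalctl -u nomad | grep -c 'failed to launch logmon'    # how often the error has been logged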

@tgross
Member

tgross commented Jun 6, 2022

@krishnaprateek6 I've edited your original request because all the information was hidden inside an HTML comment for some reason. The relevant bits from the attached logs look to be here:

May 31 06:06:50 test-server nomad[3720]:  client.driver_mgr.docker: created container: driver=docker container_id=6fd72c65b07b8791b2693ec3170a69fa3a9004e1d3bcf029c8
May 31 06:06:50 test-server sh[3714]: 2022-05-31T06:06:50.459-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=6fd72c65b07b8791b
May 31 06:07:31 test-server nomad[3720]: client.driver_mgr.docker: failed to start container: driver=docker container_id=ea0fea5546d4058a99d92ba88f1bbc5ab29fa3737ed
May 31 06:07:31 test-server sh[3714]: 2022-05-31T06:07:31.281-0600 [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=ea0fea554
May 31 06:07:31 test-server nomad[3720]: client.alloc_runner.task_runner: running driver failed: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 task=service error=
May 31 06:07:31 test-server nomad[3720]:  client.alloc_runner.task_runner: restarting task: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 task=service reason="Res
May 31 06:07:31 test-server sh[3714]: 2022-05-31T06:07:31.293-0600 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=fd98c3f3-a865-c30d-14db-
May 31 06:07:31 test-server sh[3714]: 2022-05-31T06:07:31.293-0600 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fd98c3f3-a865-c30d-14db-7d8420
May 31 06:07:46 test-server nomad[3720]: client.alloc_runner.task_runner.task_hook: failed to launch logmon process: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 t
                                          error=
                                          | Unrecognized remote plugin message:
                                          |
                                          | This usually means that the plugin is either invalid or simply
                                          | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server sh[3714]: 2022-05-31T06:07:46.960-0600 [ERROR] client.alloc_runner.task_runner.task_hook: failed to launch logmon process: alloc_id=fd98
May 31 06:07:46 test-server sh[3714]: error=
May 31 06:07:46 test-server sh[3714]: | Unrecognized remote plugin message:
May 31 06:07:46 test-server sh[3714]: |
May 31 06:07:46 test-server sh[3714]: | This usually means that the plugin is either invalid or simply
May 31 06:07:46 test-server sh[3714]: | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server nomad[3720]: client.alloc_runner.task_runner: prestart failed: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 task=service
                                          error=
                                          | prestart hook "logmon" failed: Unrecognized remote plugin message:
                                          |
                                          | This usually means that the plugin is either invalid or simply
                                          | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server sh[3714]: 2022-05-31T06:07:46.971-0600 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=fd98c3f3-a865-c30d-14db-7d8420
May 31 06:07:46 test-server sh[3714]: error=
May 31 06:07:46 test-server sh[3714]: | prestart hook "logmon" failed: Unrecognized remote plugin message:
May 31 06:07:46 test-server sh[3714]: |
May 31 06:07:46 test-server sh[3714]: | This usually means that the plugin is either invalid or simply
May 31 06:07:46 test-server sh[3714]: | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server sh[3714]: 2022-05-31T06:07:46.971-0600 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fd98c3f3-a865-c30d-14db-7d8420

It's a little unclear to me from your description how the situation starts. Is this happening after restarting the Nomad client agent, or after restarting the Docker container? If you mean the Docker container, was it restarted because it failed or was it restarted through a docker command?

Possible duplicate of #11939 or #6715, except it looks like this is a non-Windows case.

@krishnaprateek6
Author

Possibly something related to #5577, but it looks like that was fixed in an older version of Nomad. Once we start seeing these logmon errors in the Nomad client logs, that's when we notice the odd behavior where Nomad cannot re-attach an existing job even though docker ps shows the container is up and running.

@krishnaprateek6
Author

krishnaprateek6 commented Jun 7, 2022

@tgross I have more concrete info now. The Nomad job itself is running, but the job, which runs on a static port, becomes inaccessible after we see the logmon errors above. It's as if Nomad cannot release the static port after the job restarts itself. This may be useful for you: https://discuss.hashicorp.com/t/nomad-job-unreachable-in-bridge-mode-networking-even-though-container-runs-fine/38202/2
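
A rough sketch of how the port situation can be cross-checked on the client host (my-job, the alloc ID placeholder, and port 8080 are illustrative, not values from this report):

nomad job status my-job                        # note the alloc ID of the running allocation
nomad alloc status <alloc_id>                  # substitute the alloc ID; shows the ports Nomad thinks are allocated
docker ps --format '{{.Names}} {{.Ports}}'     # Docker's view of the published port mappings
ss -tlnp | grep 8080                           # which process, if any, is actually listening on the static port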

@tgross
Member

tgross commented Jun 8, 2022

@krishnaprateek6 let's try to avoid hypothesizing for now and focus on describing the sequence of events and symptoms you're seeing. So far we only know that:

  • The nomad logs show that a task is restarting, followed by logmon launch errors.
  • The dockerd logs show a bunch of JSON errors.
  • The nomad logs show that we may be launching a docker container but then it's immediately getting killed because Nomad can't run logmon.

Stuff that would be helpful for me to help debug:

  • Earlier I asked if you knew what started the problem in the first place. Did the client host get restarted, or did the task restart for some other reason? Did this problem start appearing after an upgrade?
  • Can you provide the output of docker --version? (Also: is this upstream Docker or RedHat's podman pretending to be docker?)
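
One quick way to check the Docker-versus-podman question, assuming a RHEL-family host where the podman-docker compatibility shim might be installed:

docker --version                   # upstream Docker prints "Docker version <x.y.z>, build <sha>"
file "$(command -v docker)"        # a real Docker CLI is an ELF binary; the podman shim is a shell script
rpm -q podman-docker               # reports whether the podman compatibility package is installed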

@krishnaprateek6
Author

@tgross To answer your first question: no, the client did not get restarted, but the task did get restarted by Nomad. The first time we looked at it, it was a DB connection issue, which we resolved. The second time it happened there were no errors: the task was running in Nomad, but the port on which the task was running became inaccessible. FYI, we have all firewall rules open for the static ports. For some reason Nomad seems to hold on to the static port, so when the job restarts, even though no port is actually allocated, Nomad thinks the port is already allocated, and the service running on that static port becomes inaccessible. Unfortunately we are unable to replicate the missing-jobs issue in Nomad.

docker --version output: 19.03.13, build 4484c46d9d (this is on a CentOS 7 VM).

@tgross
Member

tgross commented Jun 8, 2022

@krishnaprateek6 if you have a running task that can't be reached on the network, that's a completely different problem than the logmon attachment. Please open a separate issue for that.

@tgross
Member

tgross commented Jun 8, 2022

I don't have a solution for this but I'm going to mark this issue for further investigation. There may be something related to #11939 or #6715 here but we'll need to build an independent reproduction for this one I think. I've retitled the issue for clarity.

@tgross tgross removed their assignment Jun 8, 2022
@tgross tgross changed the title from "Nomad cannot re-attach Job inspite of docker containers still running on VM & healthy" to "tasks fail to restart because of failure to launch logmon" Jun 8, 2022
@krishnaprateek6
Author

Is there any info or progress on why systemctl restart nomad takes so long for the Nomad service to restart, please?

@tgross
Member

tgross commented Sep 21, 2022

@krishnaprateek6 open a new issue detailing that problem. Please don't pollute existing issues with unrelated questions.

@nobert

nobert commented Oct 13, 2022

Hi @tgross, we are running into the same problem. We're running 1.4.1 on Debian 11. It happens on a nomad client restart. We pretty much only restart clients when upgrading nomad, but I imagine it can happen on a normal restart as well.

Up to now we've only detected it via expiring Vault tokens that are still in use by allocations which failed to restart after receiving new secrets. Looking at the logs, it appears the logmon error happens to most, if not all, allocations after a Nomad client restart. However, the number of allocations that continue using old (expired) Vault tokens after that is very small (most recently, 1 allocation out of 14 on one node, out of 3 nodes total with roughly 14-15 allocations per node), so it would appear that most are restarting successfully in the end anyway.

It's a strange issue and we're not sure how to proceed. Is there anything you can suggest to debug further?
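
A rough sketch of one way to quantify this from the client logs, assuming Nomad runs under a systemd unit named nomad and logs to journald (the grep patterns are taken from the log messages quoted earlier in this issue):

# allocations that hit the logmon error after the client restart
journalctl -u nomad --since "1 hour ago" | grep 'failed to launch logmon' | grep -o 'alloc_id=[0-9a-f-]*' | sort -u

# allocations the task runner subsequently restarted
journalctl -u nomad --since "1 hour ago" | grep 'restarting task' | grep -o 'alloc_id=[0-9a-f-]*' | sort -u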
