logmon "Unrecognized remote plugin" errors #6715
Thanks @zackiles for reaching out. That's indeed a bug we must address. In the example above, mind if you post
One issue I note from the logs is that the Nomad child processes seem to have died, possibly killed by Ansible in the process of restarting the task. When spinning up a task, Nomad spins up some auxiliary processes (docker_logger, logmon, executor) that are expected to keep running beyond the Nomad main process. If these processes die while the Nomad client is down, the recovery process may be hindered. That doesn't explain the bug, I'm afraid, but I suspect it's a contributing cause.
@notnoop - I will try and get the logs.
@notnoop - I was unable to get the logs, and the machine on which it happened is no more 😢
Hi @notnoop - does this suffice?
The job output (a reproduction of the above in a new environment) is:
@zonnie that's very useful indeed - I'm very surprised by the "Unrecognized remote plugin message" event. Let me do some further research and reach out if I have any questions. Thanks a lot.
I've encountered similar behavior around re-starting client nodes. I'm uncertain as to the steps I took to achieve this, as it happened some time during a larger docker version migration, and I've been unable to reproduce it. The behavior is similar in that a task seems to be stuck in "starting", while having only allocations that are status "running" with desired "run".
In the above output, the job has been re-started twice, with one queued alloc failing due to resource exhaustion after draining some nodes experimentally. The behavior - one "starting" in the task group - began before then and persisted after. The job is configured to only have one allocation running as well. Because all of my allocations are "running" instead of "pending", I'm not sure if my issue is related to the above at all.
We've found that we see this issue particularly on nodes where there are high counts of allocations or when the node itself is fairly busy. Allocation-count-wise, a teammate of mine was able to replicate the issue when there were more than 24 allocations on a given node. Marking a node as ineligible and stopping the particular allocation to get it to move to a node that has fewer allocations has been our way of getting the allocation in a better place.
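For reference, that workaround maps to CLI steps along these lines (the node and allocation IDs are placeholders; the eligibility toggle assumes a Nomad version that supports it, 0.8+):

```sh
# Keep new placements off the busy node
nomad node eligibility -disable <node-id>

# Stop the stuck allocation so the scheduler places it on a less busy node
nomad alloc stop <alloc-id>

# Re-enable scheduling on the node once it has recovered
nomad node eligibility -enable <node-id>
```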
@notnoop Any luck looking into the
Hi folks! This issue has been open a while, so I wanted to drop by and add some context. We've seen similar reports about the loggers from customers on Nomad 0.10.x and 0.11.x clusters. The reports we've received are all with the Docker driver, and there's some correlation with both client restarts and the number of allocations running on the client. When we're getting logs for Docker tasks, we're hitting the Docker API. This involves a whole lot of JSON serialization/deserialization, and I've experienced production issues at previous jobs when the Docker API gets hammered by logs, so that's a possible reproduction clue. I don't see any reason to suspect we wouldn't see this problem on the current version of Nomad as well. The error is bubbling up from:

```go
case line := <-linesCh:
	// Trim the line and split by "|" in order to get the parts of
	// the output.
	line = strings.TrimSpace(line)
	parts := strings.SplitN(line, "|", 6)
	if len(parts) < 4 {
		err = fmt.Errorf(
			"Unrecognized remote plugin message: %s\n\n"+
				"This usually means that the plugin is either invalid or simply\n"+
				"needs to be recompiled to support the latest protocol.", line)
		return
	}
```

Our colleague @davemay99 has pointed out that this may be the result of spurious empty lines, but we don't know why we'd be seeing that either. The plugin may be writing to stdout or dying before it gets to write the socket/port info. But the error message isn't all that useful either and could probably be re-worded to something like:

Which wouldn't help Nomad users so much but would definitely make it more clear to us what could be going wrong. I'm going to rename and re-tag this issue a bit so that we have a different focus on it. If folks who've come by have stuck "starting" tasks without seeing the logger errors, let's open up a new issue for that.
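To make the failure mode above concrete, here's a minimal standalone sketch (not the actual go-plugin source; the example handshake line is illustrative) of how a well-formed handshake line parses into its parts while a blank line trips the "Unrecognized remote plugin message" branch:

```go
package main

import (
	"fmt"
	"strings"
)

// parseHandshake mimics the check quoted above: a handshake line in the form
// "CORE-VERSION|APP-VERSION|NETWORK|ADDRESS|PROTOCOL" must split into at
// least four parts; anything shorter (e.g. an empty line) is rejected.
func parseHandshake(line string) error {
	line = strings.TrimSpace(line)
	parts := strings.SplitN(line, "|", 6)
	if len(parts) < 4 {
		return fmt.Errorf("Unrecognized remote plugin message: %s", line)
	}
	fmt.Printf("core=%s app=%s network=%s addr=%s\n",
		parts[0], parts[1], parts[2], parts[3])
	return nil
}

func main() {
	_ = parseHandshake("1|2|tcp|127.0.0.1:10001|grpc") // a well-formed handshake line
	fmt.Println(parseHandshake(""))                     // an empty line reproduces the error
}
```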
@tgross Hi, I would like to add to your last message. After about 50-60 allocations, we receive this error message and no more allocations can be scheduled on that specific client. Please let me know if and how I can help you investigate this.
Thanks @Nickleby9. Learning that it's not Docker-specific is actually a huge help because that eliminates a broad set of possible sources.
@tgross
Hi @Nickleby9! Sorry, I don't have any updates on this issue. Let me brainstorm a bit with some of my colleagues on this and see what I can come back with. If you have debug-level Nomad client logs for the clients that are impacted, that might help us out.
@tgross I sent a debug-level monitor from a client that is experiencing this issue to the email address on your GitHub profile.
Thanks @Nickleby9. For future reference, there's a "one way" mailing list [email protected] that can receive logs so that they go to HashiCorp engineers only, without going through our personal email addresses. Looking at one of the log entries:
This definitely looks like we're getting a spurious empty line from
@tgross Thank you, I didn't know that, sorry for using your personal email.
Ok, so for Nomad 1.2.6 we're using go-plugin 1.4.3, and the error is coming from the same parsing code quoted above. So instead of getting a handshake from logmon, we're getting a blank line. Interestingly, I see a closed issue hashicorp/go-plugin#164 that dismisses this kind of thing as obviously a plugin author error. On Linux clients I could probably give you a bpftrace program that could find out where the spurious write to stdout is coming from. But this at least gives me a clue where to look next. I'll report back once I know more.
Ok, I wasn't able to find anywhere in either the Nomad logmon "server" code or the go-plugin code where we could be emitting a spurious unprintable character. So then I started digging into how we set up the logmon process and noticed that we have a different fifo pipe being set up for plugin communication on Windows (ref ). Given that you're seeing the problem only once there are a large number of allocations running, I was worried we might have collisions between tasks when creating those files, but we could create ~100M of them before hitting the birthday paradox, so unless your cluster is very busy 😀 I think we're ok there. But as it turns out, go-plugin doesn't actually use those pipes. If we take a look at the go-plugin code, we see that on Windows we use a TCP listener (ref ). But if they're both set to 0, the loop where we could potentially retry (ref ). I don't yet have a firm mechanism in mind as to what could be going wrong, but there are two tests that might give us more data:
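Separate from those tests, to illustrate the port-selection point above: under the assumption that an unset min/max port range behaves like binding to port 0 (a sketch, not the actual go-plugin listener code), the OS hands each listener its own ephemeral port, which is why collisions seem unlikely even with many allocations on one client:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Port 0 asks the OS to pick any free ephemeral port for each listener,
	// so several plugin processes started at once still get distinct ports.
	for i := 0; i < 3; i++ {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			panic(err)
		}
		defer l.Close()
		fmt.Println("listener bound to", l.Addr().String())
	}
}
```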
@tgross It has been a while since I looked into this issue. I had a hunch that the plugin crashes, such that we either get corrupted output or stdout closes before emitting the port info, which Nomad interprets as an empty line. I wasn't able to reproduce it, so no proof.
@tgross Ok, I still had this issue this morning, so first I saved the output of . I have increased the . I will keep you updated once the issue happens again, and I will perform more tests.
Thanks for those logs @Netlims! Ok, there are a lot of ports in use on this machine, so I did a little data cleaning and then slurped the whole thing into sqlite for analysis. In both logs, there are a large number of Nomad ports with a client named
All the ports where both applications are bound are within 10000-10059 and 14000-14046, and the high ports in the 49000+ range have only the
There are a few "blocks" of port ranges being used with very few gaps, but there's plenty of room between those blocks, so that suggests a port collision is unlikely. (As an aside, it looks like you're running Vault and Consul on the same host. They're both bound to the private IP and not localhost. If you're running Vault agent or Consul client on your Nomad clients, you might want to bind to localhost instead. Depending on your environment, we probably wouldn't recommend running the Vault/Consul server on the same host as a Nomad client either, although I suppose you could be running them as Nomad jobs?)
If you were to watch/capture Process Monitor process activity, it might help verify if the logmon process is exiting with a known exit code without making its way over to the Nomad host.
@tgross Indeed SAASQABE02 is the client itself, where the logs are taken from. By the way,
@tgross We have this issue again.
Thanks @Nickleby9! The smoking gun is here:
Logmon is trying to spawn a conhost process, and I want to dig into why that is (I think there's some library operation that's doing that). But according to the NTSTATUS Values docs, exit code
Which honestly doesn't make a whole lot of sense to me, as if there were a linking error in the build I'd assume it would never work, instead of working for a while and then failing. Good ol' Stack Overflow has this question, but that also reads to me as the sort of problem I'd expect to see all the time, not just as we're seeing it here. I found a truly ancient Windows 2008 hotfix that fixes a problem similar to the symptoms you're describing, but obviously that doesn't apply here! 😬 So I don't quite have an answer yet, but we've definitely got a clue here for investigating. I'm going to look into why the conhost process is getting spawned and get back to you here once I have more info. Thanks for your patience!
Hi @tgross! How is the investigation going?
Hi @tgross, I noticed the other issue you mentioned.
Hi folks, sorry, I don't have any new information on this one. We'll update when we've had a chance to dig in further.
@tgross Any news on this? We've been tracking it for a while. We are seeing this cripple workloads when the number of running processes is more than a few dozen on Windows boxes. This is seriously causing us pain, as we must fiddle with process allocation and nodes to get things back to normal every single time, sometimes with embarrassing results. Not sure how widespread this problem is, but based on this thread alone, it seems to be a very serious, possibly critical problem with the ability to scale using Nomad as a scheduler. Any chance this can get more focused attention from HC?
@moshemarciano as noted above, we'll update when we have more information. We've sunk a bunch of effort into trying to track it down, and while there have been some interesting discoveries about specific OS behaviors, it doesn't seem to be isolated to a particular OS and so it's likely a problem in logmon proper and/or the go-plugin architecture. While we understand this is important to you, please understand we're always juggling a lot of priorities at once.
Hi everyone! We're also encountering this issue from time to time on Windows (Nomad v0.10.5). The head process
Hi @tgross, any news maybe? Is there anything else you need from us?
I see this problem on Linux. Around the time when Nomad allocations fail, I have a gap (09:17:01-0:00:35) in the system journal:
Errors (09:10:04-10:00:25)
Also I see lots of these on the machine:
The solution to our problem was to increase the amount of memory allocated for non-interactive processes on Windows. In the registry value, after "SharedSection", there are three numbers separated by commas; the third number is responsible for the non-interactive memory space.
@Nickleby9 do you know or have a link to docs on what that value is? We could add that to the install guide for Windows.
@tgross sure. You can read about it here: Nomad uses the non-interactive space (session 0), so that's the only value that needs to be modified.
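For anyone else hitting this on Windows: the value in question is the Windows value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems, and it typically looks something like the following (the exact string and defaults vary by Windows version, so treat this as an illustration rather than a tested recommendation). The third SharedSection number is the non-interactive desktop heap size in KB, and increasing it (for example from 768 to 2048) is the change described above; a reboot is required for it to take effect.

```
%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ...
```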
Thanks @Nickleby9. There's an internal discussion around recommendations for Windows setups and general Windows performance issues, in which I'll raise this. There are obviously also some reports in this issue that this problem impacts Linux users (although maybe less often?). But that's helpful new information.
The doc states:
Are we sure that it's doing anything for OS versions later than Windows 7?
@RoadTrain Maybe there is a bug in how Windows is trying to handle the allocation, and that's why increasing it manually solves that 🤷♂️
A job task is stuck in Starting state while it seems to be running fine

Nomad version
Nomad v0.9.5-gc-1

Operating system and Environment details
Linux machine-1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue
Looking for the status of c-1 I see the following:

This appears to be running according to Nomad. But when I dig deeper I see it's still in Starting - for some reason, the single task that exists is in Starting. The problem is that the container seems to be running fine. This completely fails a process in which I wait for my containers to be in Running state (using the job/job-id/summary endpoint).
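As an illustration of that readiness check (a rough sketch, not the reporter's actual tooling; the agent address and job ID are placeholders), polling Nomad's /v1/job/:job_id/summary endpoint until the task group leaves Starting might look like:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// jobSummary models just the fields we need from /v1/job/:job_id/summary.
type jobSummary struct {
	Summary map[string]struct {
		Starting int
		Running  int
	}
}

// waitForRunning polls the job summary until every task group reports at
// least one running allocation and none still starting.
func waitForRunning(nomadAddr, jobID string) error {
	for attempt := 0; attempt < 30; attempt++ {
		resp, err := http.Get(fmt.Sprintf("%s/v1/job/%s/summary", nomadAddr, jobID))
		if err != nil {
			return err
		}
		var s jobSummary
		err = json.NewDecoder(resp.Body).Decode(&s)
		resp.Body.Close()
		if err != nil {
			return err
		}
		ready := len(s.Summary) > 0
		for _, tg := range s.Summary {
			if tg.Running == 0 || tg.Starting > 0 {
				ready = false
			}
		}
		if ready {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("job %s did not reach running state", jobID)
}

func main() {
	// Placeholder local Nomad agent address and the job from this report.
	if err := waitForRunning("http://127.0.0.1:4646", "c-1"); err != nil {
		fmt.Println(err)
	}
}
```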
Reproduction steps
c-1 was running
c-1 is in the above state

Job file (if appropriate)

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)
Don't have those