
raw_exec plugin windows: "logmon: Unrecognized remote plugin message" #11939

Open
cattuz opened this issue Jan 26, 2022 · 3 comments


cattuz commented Jan 26, 2022

Nomad version

Nomad v1.2.2 (78b8c171a211f967a8b297a88a7e844b3543f2b0)

Operating system and Environment details

Windows Server 2016 on an Azure VM

Issue

I get the following message intermittently when using the raw_exec plugin on Windows:

logmon: Unrecognized remote plugin message: This usually means that the plugin is either invalid or simply needs to be recompiled to support the latest protocol.

It happens multiple times in a row and only appears to be resolved by moving the job to another VM. Other identical allocations can still start after one fails like this, so it seems to be something in the plugin itself that is not working.


Reproduction steps

Unable to reproduce reliably.

Expected Result

No plugin error.

Actual Result

Plugin error.

Job file (if appropriate)

I've tried this:

task "example" {
  driver = "raw_exec"
  
  artifact {
    source      = "..."
    destination = "${NOMAD_ALLOC_DIR}\\services"
  }

  resources {
    cpu    = 100
    memory = 100
    memory_max = 200
  }

  config {
    command = "${NOMAD_ALLOC_DIR}\\services\\example.exe"
  }
}

and wrapped in a script:

task "example" {
  driver = "raw_exec"
  
  artifact {
    source      = "..."
    destination = "${NOMAD_ALLOC_DIR}\\services"
  }

  resources {
    cpu    = 100
    memory = 100
    memory_max = 200
  }

  template {
    destination     = "local/run.ps1"
    left_delimiter  = "%%"
    right_delimiter = "%%"
    data            = <<EOH
$exe = "%% env "NOMAD_ALLOC_DIR" %%\services\example.exe"
$p = Start-Process $exe -ArgumentList $args -Wait -NoNewWindow -PassThru
Exit $p.ExitCode
EOH
  }

  config {
    command = "C:/Windows/System32/WindowsPowerShell/v1.0/powershell.exe"
    args = ["-File", "local/run.ps1"]
  }
}

Both with the same result.

Nomad Server logs (if appropriate)

No errors or warnings in server logs.

Nomad Client logs (if appropriate)

No errors or warnings in client logs.


Amier3 commented Jan 28, 2022

Hey @cattuz

This issue has come up a few times, but we haven't found a solid root cause yet. One of the recent users to encounter this with the same plugin noted:

After about 50-60 allocations, we receive this error message and no more allocations can be scheduled on that specific client.
You mentioned here that this could be linked to the number of allocations running on the client, so I tried manually removing some allocations and the failed ones eventually succeeded in running.
When I tried bringing back the allocations I had stopped, they failed to run with the same error message.

Based on your number of allocations, do you think this is the bug you're experiencing?
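
If it helps to compare, the per-client allocation count can be read from the CLI. Assuming a reasonably recent Nomad, something like:

nomad node status -allocs

prints a running-allocation count for each client node.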


cattuz commented Jan 29, 2022

The allocation counts are as follows:
[screenshot: allocation counts per client]

My experience is that it has to do with how long Nomad or the allocations have been running. There is no problem starting around 40 allocations on a fresh VM, but when an allocation gets scheduled on the same VM after a few hours, it fails with the logmon error.

I'm running Nomad as a Windows service on the VM. Would using a scheduled task or something else be better?

Edit:
I will try a scheduled task that restarts the Nomad service hourly to see if anything changes:
schtasks /create /tn "Restart_Nomad" /tr "cmd /C 'sc stop Nomad && sc start Nomad'" /sc hourly /ru system /rl highest
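
To confirm the task was registered, something like this should work (assuming the ScheduledTasks PowerShell module that ships with Server 2016):

Get-ScheduledTask -TaskName "Restart_Nomad" | Select-Object TaskName, State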

Edit 2:
No luck with the periodic restart, so it seems it might be to do with the number of allocations rather than some memory/resource leak. Maybe it has something to do with Windows service subprocess limits?
https://stackoverflow.com/questions/17472389/how-to-increase-the-maximum-number-of-child-processes-that-can-be-spawned-by-a-w

I will try increasing the subprocess limit and see if that changes anything.
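
For reference, the current limit can be inspected before changing anything, by reading the same registry value the Stack Overflow thread points at:

# Print the raw csrss.exe startup string; the SharedSection=x,y,z triple embedded in it
# holds the desktop heap sizes, and the third number applies to non-interactive window
# stations, i.e. services.
Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" -Name "Windows" | Select-Object -ExpandProperty Windows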


cattuz commented Feb 1, 2022

Indeed, the Windows "desktop heap" seems to have been the problem. There is more discussion in the service-limits Stack Overflow thread linked above.

After increasing the service desktop heap limit from 768 KB to 4096 KB by adding the following to my VM setup script...

# Increase the non-interactive desktop heap (the third SharedSection value, in KB), which effectively controls how many subprocesses a service can spawn
$heapLimits = "%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,4096 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16"
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" -Name "Windows" -Value $heapLimits

...all the logmon issues have gone away. It does seem quite logical in hindsight, although I had never encountered these Windows service limits before.

There are some caveats: the change affects every service running on the VM and may let them consume more resources, but so far I have noticed no adverse effects.
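
One more note for anyone applying the same change: the value is read by csrss.exe at boot, so a reboot is required before the new heap sizes take effect. A quick sanity check afterwards (a sketch):

# Extract the SharedSection triple: system-wide heap, interactive desktop heap (KB),
# and non-interactive desktop heap (KB); the last one is the limit services run into.
$win = (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" -Name "Windows").Windows
if ($win -match 'SharedSection=(\d+),(\d+),(\d+)') {
    "Non-interactive desktop heap: $($Matches[3]) KB"
}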
