HostManager failed to remove container #373

Open
mhasself opened this issue Jan 18, 2024 · 2 comments
Labels
agent: hostmanager; bug (Something isn't working); needs triage (Cause of bug still unknown, needs investigation)

Comments

@mhasself
Member

Odd problem today (finally resolved at 2024-01-18 14:30:06 UTC): the ACU agent was misbehaving, and the HostManager "down" command could not kill it. This happened twice, though the first attempt was brief. The HM logs show:

2024-01-18T14:23:51+0000 start called for update
2024-01-18T14:23:51+0000 update:140 Status is now "starting".
2024-01-18T14:23:51+0000 update:140 Status is now "running".
2024-01-18T14:23:51+0000 update:140 Update requested.
2024-01-18T14:23:51+0000 update:140 Status is now "done".
2024-01-18T14:23:52+0000 manager:0 Requesting termination of ACUAgent:acu
2024-01-18T14:23:57+0000 manager:0 Agent instance ACUAgent:acu refused to die.
2024-01-18T14:23:58+0000 manager:0 Detected unexpected session for ACUAgent:acu (probably docker); it will ...
2024-01-18T14:24:03+0000 manager:0 Requesting termination of ACUAgent:acu
2024-01-18T14:24:08+0000 manager:0 Agent instance ACUAgent:acu refused to die.
2024-01-18T14:24:09+0000 manager:0 Detected unexpected session for ACUAgent:acu (probably docker); it will ...
2024-01-18T14:24:13+0000 manager:0 Requesting termination of ACUAgent:acu
2024-01-18T14:24:19+0000 manager:0 Agent instance ACUAgent:acu refused to die.
...

But running docker-compose rm --stop --force ocs-acu manually, as the ocs user, did bring down the container without any issues. (That command should be exactly what the HM runs.)
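For anyone who hits this state again, it may help to capture the exit code and output of that same command rather than only noting that it worked. Below is a minimal sketch, assuming the command line quoted above is what needs to be run. The helper names `compose_down_command` and `run_logged` are made up for illustration and are not part of the HM code.

```python
import subprocess


def compose_down_command(service):
    """Build the docker-compose command quoted above for one service."""
    return ["docker-compose", "rm", "--stop", "--force", service]


def run_logged(cmd, timeout=30.0):
    """Run cmd and return (returncode, stdout, stderr).

    returncode is None if the command did not finish within `timeout`
    seconds (subprocess.run kills the child in that case).
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None, "", ""
    return done.returncode, done.stdout, done.stderr


# Example (requires docker-compose on PATH and the right working directory):
#   rc, out, err = run_logged(compose_down_command("ocs-acu"))
```

A `None` return code here would mean the command itself hung, which would be a different failure from the command returning quickly while the container stays up.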

I have no useful analysis yet...
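As a small aid for comparing occurrences, the retry cadence in the HM log excerpt above can be tallied with a few lines of Python. This is only a sketch: it assumes the line layout shown in the excerpt (timestamp, source, message) and matches on the two message strings that appear there. The function name `failed_attempts` is hypothetical.

```python
from datetime import datetime

# Request / failure lines copied from the log excerpt above.
LOG = """\
2024-01-18T14:23:52+0000 manager:0 Requesting termination of ACUAgent:acu
2024-01-18T14:23:57+0000 manager:0 Agent instance ACUAgent:acu refused to die.
2024-01-18T14:24:03+0000 manager:0 Requesting termination of ACUAgent:acu
2024-01-18T14:24:08+0000 manager:0 Agent instance ACUAgent:acu refused to die.
2024-01-18T14:24:13+0000 manager:0 Requesting termination of ACUAgent:acu
2024-01-18T14:24:19+0000 manager:0 Agent instance ACUAgent:acu refused to die.
"""


def failed_attempts(log_text):
    """Return (request_time, seconds_until_failure) for each failed stop."""
    attempts = []
    pending = None  # time of the most recent unanswered termination request
    for line in log_text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        stamp, _source, message = parts
        try:
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
        except ValueError:
            continue  # skip lines without a leading timestamp, e.g. "..."
        if message.startswith("Requesting termination of"):
            pending = when
        elif "refused to die" in message and pending is not None:
            attempts.append((pending, (when - pending).total_seconds()))
            pending = None
    return attempts


delays = [seconds for _, seconds in failed_attempts(LOG)]
```

For the excerpt above, `delays` comes out as 5, 5 and 6 seconds, so each attempt is declared failed about 5 to 6 seconds after the request.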

@BrianJKoopman
Member

BrianJKoopman commented Jun 6, 2024

This happened again today on satp2 with a Lakeshore372 agent. Running a normal docker stop <container> manually did stop the container. After that I was able to start the agent with the HM, and subsequent "down" commands from the HM worked fine.

EDIT: I should also note that the 372 agent was in a semi-crashed state, having had a broken pipe error in the main acq process.

@BrianJKoopman
Member

This problem continues to occur. It'd be great to figure out a way to reproduce this state. Recent occurrences I've seen have coincided with the agent's loss of connection to its device. For instance, when the cryomech agent had a broken pipe to the compressor, or the 372 agent to the 372.
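One possible starting point for a reproducer, offered purely as a guess: a stand-in process that ignores SIGTERM, stopped by a helper that escalates to SIGKILL after a grace period. Nothing in this thread establishes that the stuck agents ignore SIGTERM, or that the HM signals them this way. The sketch only imitates the "refused to die" symptom on a POSIX system, and `CHILD`, `stop_with_escalation` and `demo` are hypothetical names.

```python
import subprocess
import sys

# Stand-in for a wedged agent: ignores SIGTERM, then sleeps.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_escalation(proc, grace=1.0):
    """Send SIGTERM; if the process outlives `grace` seconds, send SIGKILL.

    Returns "terminated" if SIGTERM was enough, otherwise "killed".
    """
    proc.terminate()
    try:
        proc.wait(timeout=grace)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return "killed"


def demo():
    """Start the stand-in process and stop it; return (outcome, returncode)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # block until the child has installed its handler
    outcome = stop_with_escalation(proc, grace=1.0)
    proc.stdout.close()
    return outcome, proc.returncode
```

If the real agents turn out to be stuck for a different reason (for example, blocked on the broken pipe to the device), the stand-in script would need to imitate that instead.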

BrianJKoopman added the bug and needs triage labels Nov 27, 2024