Sentry: the runner will be destroyed and replaced #651
Thanks for analyzing and identifying the root cause for the behavior we observed. I have two thoughts on this (or similar) issue(s):
The underlying issue, if I got this right, is that the runner got recreated after a Nomad agent restart. Then, I am wondering about the effectiveness of our …
We can add a Sentry Tag or adjust the Sentry Log Level (info/debug). The tag can be more self-explanatory, but the Log Level would be visible in the Issue overview.
In this case, just the Nomad Server was restarted. This led to the Event Stream being aborted. When re-establishing the event stream, we do an environment and runner recovery.
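For illustration, a minimal Go sketch of that reconnect-and-recover loop; the interface and method names (WatchEventStream, RecoverEnvironments, RecoverRunners) are placeholders, not Poseidon's actual API:

```go
// Package recovery sketches the reconnect-and-recover flow described above.
// All names are illustrative and do not match Poseidon's actual code.
package recovery

import (
	"context"
	"log"
)

// NomadManager is a stand-in for Poseidon's Nomad API wrapper.
type NomadManager interface {
	// WatchEventStream blocks until the event stream aborts,
	// e.g. because the Nomad server restarted.
	WatchEventStream(ctx context.Context) error
	RecoverEnvironments(ctx context.Context) error
	RecoverRunners(ctx context.Context) error
}

// KeepEventStreamAlive re-establishes the event stream whenever it aborts
// and then recovers environments and runners from Nomad's current state.
func KeepEventStreamAlive(ctx context.Context, m NomadManager) {
	for ctx.Err() == nil {
		if err := m.WatchEventStream(ctx); err != nil {
			log.Printf("event stream aborted, recovering: %v", err)
		}
		if err := m.RecoverEnvironments(ctx); err != nil {
			log.Printf("environment recovery failed: %v", err)
		}
		if err := m.RecoverRunners(ctx); err != nil {
			log.Printf("runner recovery failed: %v", err)
		}
	}
}
```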
Should we just continue using the old Nomad Runner reference and destroy the new one that does not have any executions?
If I understand you right, you are criticising the following process?
Yeah, that's imaginable. However, it would introduce complexity to the Nomad Event Handling, which already causes us trouble.
Sounds good. The log level would feel somewhat more "appropriate", but we don't send any info / debug events to Sentry. In this case, it might be easier to just apply a tag?
Ah, I see. My bad, sorry.
I am not sure about the consequences and why we decided in the first place to use the new reference. Also, I am currently unsure what you mean by "reference" -- what is changing there? The allocation ID (and so on) should not change, so it's "just" the object in Poseidon's memory? Resetting (or not resetting) the Runner Inactivity Timeout doesn't feel too important to me. If an execution is still running, I would expect the Runner Inactivity Timeout to be relatively low (assuming the execution was started just a few seconds ago). If no execution is started, the timer continues and will ensure this used runner is regularly replaced (it is still counted as used, I assume). Finally, I was wondering how an execution could survive a Nomad server restart. Does this work in practice, or would an execution stop anyway when the active Nomad server Poseidon is connected to restarts (for example, because the WebSocket connection is "proxied" through the Nomad server)?
Yes, exactly. This sounds like a good thing for a follow-up issue / PR.
Let's move this discussion to the new issue. I would imagine that an event is sent as soon as a Nomad agent becomes ineligible (to be verified). Then, we could store this in a temporary list and double-check a runner's location before starting an execution. CodeOcean would need to handle this case (which is not done right now except for file copies). And as soon as either the agent becomes eligible again or Nomad places a new allocation on said agent, we can remove it from the temporary list.
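A rough Go sketch of that temporary list, assuming we key it by Nomad node ID; all names are hypothetical and the wiring to Nomad's node events is omitted:

```go
// Package nodetracking sketches the "temporary list" of ineligible Nomad
// agents discussed above. Names are hypothetical.
package nodetracking

import "sync"

// IneligibleNodes remembers agents reported as ineligible by the event stream.
type IneligibleNodes struct {
	mu    sync.RWMutex
	nodes map[string]struct{} // keyed by Nomad node ID
}

func NewIneligibleNodes() *IneligibleNodes {
	return &IneligibleNodes{nodes: map[string]struct{}{}}
}

// MarkIneligible is called when a node event reports the agent as ineligible.
func (i *IneligibleNodes) MarkIneligible(nodeID string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.nodes[nodeID] = struct{}{}
}

// Remove is called when the agent becomes eligible again or Nomad places a
// new allocation on it.
func (i *IneligibleNodes) Remove(nodeID string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	delete(i.nodes, nodeID)
}

// Contains is checked before starting an execution on a runner that is
// located on the given node.
func (i *IneligibleNodes) Contains(nodeID string) bool {
	i.mu.RLock()
	defer i.mu.RUnlock()
	_, ok := i.nodes[nodeID]
	return ok
}
```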
Then, let's go with the Sentry Log Level. In Poseidon, both the tag and the log level would have to be mapped through a custom field. We would not change the Poseidon Log Levels. The Sentry hook would translate the custom field into either a tag or a different log level.
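A minimal sketch of such a hook, assuming a logrus-style hook and the Sentry Go SDK; the field name expected_error and the exact wiring are assumptions, not Poseidon's actual implementation:

```go
// Package logging sketches how a custom log field could be translated into a
// Sentry tag or a different Sentry level, without changing Poseidon's own
// log levels.
package logging

import (
	"github.com/getsentry/sentry-go"
	"github.com/sirupsen/logrus"
)

// expectedErrorField marks errors we expect, such as the runner recreation
// after a Nomad server restart. The name is a placeholder.
const expectedErrorField = "expected_error"

type SentryHook struct{}

func (h *SentryHook) Levels() []logrus.Level {
	return []logrus.Level{logrus.ErrorLevel, logrus.WarnLevel}
}

func (h *SentryHook) Fire(entry *logrus.Entry) error {
	event := sentry.NewEvent()
	event.Message = entry.Message
	event.Level = sentry.LevelError

	if expected, ok := entry.Data[expectedErrorField].(bool); ok && expected {
		// Either tag the event so it can be filtered in Sentry ...
		event.Tags = map[string]string{expectedErrorField: "true"}
		// ... or lower the Sentry level for the Issue overview.
		event.Level = sentry.LevelWarning
	}

	sentry.CaptureEvent(event)
	return nil
}
```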
Yes
Yes
In theory, every attribute of the Nomad Jobs can change due to actors other than Poseidon (e.g. an unattended upgrade). When the event stream is aborted, we will not be notified about such changes.
Good question in this context! It could not! The WebSocket connection is proxied through the server. When the server restarts, the WebSocket connection receives an … Because we would not gain more functionality with changes at this point, we just separate the …
Thanks for replying to all my questions and helping me to resolve my confusion. I am now clearer about the execution workflow and the involvement of the different components 👍.
Yes, that's true. For most attributes, however, I don't expect a real change (since no other actor is expected to change allocation / job settings). Hence, only further restarts (as caused by unattended upgrades) could happen, e.g., providing a clean runner rather than a used one... This could be missed by Poseidon, when the event stream is no longer connected, of course, but it also doesn't feel too bad.
Shouldn't this be recognized when Poseidon restarts (and recovers the environments and runners)? In this case, it could happen that a Nomad Job is stopped, but Poseidon would catch up and update its memory structures upon reconnecting to Nomad (that's at least what I thought, in which case it wouldn't be too bad either).
That's true. I also wouldn't expect a change, but agree that we don't have much experience with this feature. Let's ignore a potential change for now and return to this problem when using ports.
Thanks for testing and clarifying. A minor follow-up question: does every Nomad server restart cause this issue, or only some? Could you help fill out the following table?
Either way, knowing that we receive an …
You mean because the WebSocket connection to the runner is interrupted anyway (as your experiment has shown)? Do we ensure that no execution is started between the loss of the Nomad event stream and the completion of the (environment and runner) recovery? Or, asked differently: would it be possible that a new execution is successfully started (connected to Nomad, …) and shortly afterwards the recovery forcefully stops it (because only then is the event stream restored and all information recovered)?
Separating sounds good, yes. And, depending on the previous answer, "accepting" / "counting" is probably fine (but I really want to ensure we are not killing executions unnecessarily).
I would expect that the port mapping can also change on restarts. When we miss this, it might lead to unavailability or broken isolation.
Yes, also when the Event Stream reconnects.
Yes
I added the "DNS Resolves" flag indicating whether our …
Unfortunately, it is hard to differentiate between the different error types here (see poseidon/internal/nomad/api_querier.go, lines 146 to 152 in 8390b90).
Yes
No
Yes, there is a timeframe of about one second to do so.
Yes, that's true (but doesn't make it easier in this case). Since the discussion is getting lengthy here, and I want to navigate towards a solution: Can we do some kind of "in-place" update (in memory) of the runner reference? Either completely (without stopping a running execution, if any) or simply by updating a few pre-defined fields (port mapping, runner "marked as used" status, …)?
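Something like the following sketch, perhaps; the struct and field names are made up for illustration and only stand in for whatever attributes we decide to refresh:

```go
// Package runner sketches the partial "in-place" update idea from above.
// Struct and field names are illustrative, not Poseidon's actual ones.
package runner

import "sync"

// NomadJob stands in for the in-memory runner reference.
type NomadJob struct {
	mu           sync.Mutex
	portMapping  map[string]int // port label -> host port
	markedAsUsed bool
}

// UpdateFromRecovery copies the few pre-defined attributes that may have
// changed on the Nomad side into the existing runner object, instead of
// destroying and recreating it (which would kill a running execution).
func (j *NomadJob) UpdateFromRecovery(recovered *NomadJob) {
	j.mu.Lock()
	defer j.mu.Unlock()
	j.portMapping = recovered.portMapping
	// Keep the "used" marker if either side considers the runner used.
	j.markedAsUsed = j.markedAsUsed || recovered.markedAsUsed
}
```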
Awesome, and very useful table -- I like that!
That's true. Since we are now handling the execute error separately (and have an upstream issue), I would just "forget" about it and focus on the others. You're right, of course, that "Nomad had some error that made it drop the connection" comes without further information. So, my two cents: let's keep this for now and move on.
Not ideal, I think. Should/can we do something about this? Like denying further execution requests when the event stream is not connected, or so? Just thinking aloud...
To dos:
Since we decided to update the runner partially, we no longer need to prohibit executions (since we won't forcefully stop them anymore).
POSEIDON-3W was regressed due to the following error: "the runner will be destroyed and replaced: the destruction should not cause external changes".
@mpass99 wrote on Aug 16th, 15:01 CEST:
I'm glad that we have gained visibility on such rare errors. The last event happened because of an unattended upgrade that restarted some Nomad servers. In the process of re-establishing the Event Stream and recovering all runners, all runners got recreated. In this case, the recreation happened while an execution connection was running.
This error is nothing unexpected. We can either …