Sentry: the runner will be destroyed and replaced #651

Closed
MrSerth opened this issue Aug 16, 2024 · 8 comments · Fixed by #667
Labels
enhancement New feature or request

Comments

MrSerth (Member) commented Aug 16, 2024

POSEIDON-3W regressed due to the following error: "the runner will be destroyed and replaced: the destruction should not cause external changes".


@mpass99 wrote on Aug 16th, 15:01 CEST:

I'm glad that we have gained visibility into such rare errors. The last event happened because of an unattended upgrade that restarted some Nomad servers. While re-establishing the Event Stream and recovering all runners, every runner got recreated. For this execution, the recreation happened while the execution connection was still active.
This error is not unexpected. We can either

  • track (and ignore) it in this issue
  • or catch it in Poseidon
    • and throw a different error
    • or throw no error, but lose visibility into the number of executions that get aborted this way.
@MrSerth MrSerth added the enhancement New feature or request label Aug 16, 2024
MrSerth (Member, Author) commented Aug 16, 2024

Thanks for analyzing and identifying the root cause for the behavior we observed.

I have two thoughts on this (or similar) issue(s):

  1. First, it would be great if we could quickly identify Sentry issues where we only have to "track" some behavior (i.e., count it) and where no immediate action is required. Can we manually append some label or tag to these?
  2. While I understand the argument for ignoring the issue (either with a Sentry event or without), this doesn't feel right to me. From a user perspective, the requested execution failed and a red error was shown. It shouldn't make a difference to a user whether this is a "planned" event (such as a process restart) or some other failure. Therefore, I would tend to dig a little further here.

The underlying issue, if I got this right, is that the runner got recreated after a Nomad agent restart. This makes me wonder about the effectiveness of our drain_on_shutdown strategy on Nomad and whether Poseidon could track these events, too. Since live migration of Docker containers (with their changes on disk and in RAM) is not yet supported by Nomad, we might need to "exclude" some runners from new executions as soon as the respective Nomad agent is about to shut down. Is there some indication this could work and would solve the underlying issue?

mpass99 (Contributor) commented Aug 17, 2024

Can we manually append some label or tag to these?

We can add a Sentry Tag or adjust the Sentry Log Level (info/debug). The tag can be more self-explanatory, but the Log Level would be visible in the Issue overview.

the runner got recreated after a Nomad agent restart

In this case, just the Nomad Server was restarted. This led to the Event Stream being aborted. When re-establishing the event stream, we do an environment and runner recovery:

  • Load Environments
  • Keep Runners Synchronized
  • Load Runners
    • Fetch all Runner Jobs
    • Create a Nomad Runner for each
    • Handle replacement of old Runner references
      • Delete the old Nomad Runner locally and use the new Nomad Runner (reference)
        • Pro: PortMapping is updated and Runner Inactivity Timeout is reset
        • Contra: All running Executions are stopped
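
To make this concrete, here is a minimal Go sketch of that replacement step. All type and function names are hypothetical stand-ins, not Poseidon's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// errDestroyedAndReplaced is a hypothetical sentinel for the error discussed in this issue.
var errDestroyedAndReplaced = errors.New("the runner will be destroyed and replaced")

// Runner is a minimal stand-in for Poseidon's in-memory Nomad runner reference.
type Runner struct {
	ID          string
	PortMapping map[int]int
}

// Destroy stands in for aborting all executions still attached to the old reference.
func (r *Runner) Destroy(reason error) {
	fmt.Printf("destroying runner %s: %v\n", r.ID, reason)
}

// recoverRunners replaces every locally known runner reference with a freshly
// fetched one, mirroring the flow above: the port mapping and inactivity
// timeout are refreshed, but all executions on the old reference are stopped.
func recoverRunners(local map[string]*Runner, fetched []*Runner) {
	for _, fresh := range fetched {
		if old, ok := local[fresh.ID]; ok {
			old.Destroy(errDestroyedAndReplaced)
		}
		local[fresh.ID] = fresh
	}
}

func main() {
	local := map[string]*Runner{"runner-1": {ID: "runner-1", PortMapping: map[int]int{8080: 41000}}}
	fetched := []*Runner{{ID: "runner-1", PortMapping: map[int]int{8080: 42000}}}
	recoverRunners(local, fetched)
	fmt.Println(local["runner-1"].PortMapping) // updated mapping, old executions aborted
}
```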

Should we just continue using the old Nomad Runner reference and destroy the new one that does not have any executions?

the effectiveness of our drain_on_shutdown strategy

If I understand you right, you are criticising the following process?!

  • On shutdown, the Nomad agent becomes ineligible and no new runners are scheduled.
  • Within the drain_on_shutdown deadline, all running executions have time to finish.
    • ⚡ We still start new executions in runners on the draining agent that may not have enough time to finish.
  • After that, the Nomad agent shuts down.

Is there some indication this could work and would solve the underlying issue?

Yeah, that's imaginable. However, it would add complexity to the Nomad event handling, which already causes us trouble.

MrSerth (Member, Author) commented Aug 17, 2024

We can add a Sentry Tag or adjust the Sentry Log Level (info/debug). The tag can be more self-explanatory, but the Log Level would be visible in the Issue overview.

Sounds good. The log level would feel somewhat more "appropriate", but we don't send any info / debug events to Sentry. In this case, it might be easier to just apply a tag?

In this case, just the Nomad Server was restarted.

Ah, I see. My bad, sorry.

Should we just continue using the old Nomad Runner reference and destroy the new one that does not have any executions?

I am not sure about the consequences and why we decided to use the new reference in the first place. Also, I am currently unsure what you mean by "reference" -- what is changing there? The allocation ID (and so on) should not change, so it's "just" the object in Poseidon's memory?
Regarding the port mapping: This is the mechanism through which we can reach an exposed port of the container, isn't it? So far, I thought this would only be "proxied" by the Nomad agent, not the Nomad server (in which case it wouldn't change when the Nomad server is restarted). Therefore, I am wondering what could potentially change (or become outdated)?

Resetting (or not resetting) the Runner Inactivity Timeout doesn't feel too important to me. If an execution is still running, I would expect the Runner Inactivity Timeout to be relatively low (assuming the execution was started just a few seconds ago). If no execution is started, the timer continues and will ensure this used runner is regularly replaced (it is still counted as used, I assume).

Finally, I was wondering how an execution could survive a Nomad server restart. Does this work in practice, or would an execution stop anyway when the active Nomad server Poseidon is connected to restarts (for example, because the WebSocket connection is "proxied" through the Nomad server)?

If I understand you right you are criticising the following process?!

  • ⚡ We still start new Executions in runners on the draining agent that may not have enough time to finish

Yes, exactly. This sounds like a good thing for a follow-up issue / PR.

Yeah, that's imaginable. However, it would introduce complexity to the Nomad Event Handling that already causes trouble to us.

Let's move this discussion to the new issue. I would imagine that an event is sent as soon as a Nomad agent becomes ineligible (to be verified). Then, we could store this in a temporary list and double-check a runner's location before starting an execution. CodeOcean would need to handle this case (not done right now except for file copies). And as soon as either the agent becomes eligible again or Nomad places a new allocation on said agent, we can remove it from the temporary list.
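
A rough Go sketch of that temporary list and the pre-execution check (all names here are hypothetical, not Poseidon's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAgentDraining is a hypothetical error for rejecting executions on draining agents.
var ErrAgentDraining = errors.New("the runner's Nomad agent is draining; refusing new executions")

// drainingAgents is the proposed temporary list of ineligible agents.
// Entries are added on "node ineligible" events and removed once the agent
// becomes eligible again or Nomad places a new allocation on it.
type drainingAgents struct {
	mu    sync.Mutex
	nodes map[string]struct{}
}

func (d *drainingAgents) MarkIneligible(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.nodes[nodeID] = struct{}{}
}

func (d *drainingAgents) MarkEligible(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.nodes, nodeID)
}

func (d *drainingAgents) IsDraining(nodeID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	_, ok := d.nodes[nodeID]
	return ok
}

// checkBeforeExecute would be called before starting an execution on a runner.
func checkBeforeExecute(d *drainingAgents, runnerNodeID string) error {
	if d.IsDraining(runnerNodeID) {
		return ErrAgentDraining
	}
	return nil
}

func main() {
	agents := &drainingAgents{nodes: map[string]struct{}{}}
	agents.MarkIneligible("node-a")                   // "node ineligible" event received
	fmt.Println(checkBeforeExecute(agents, "node-a")) // execution rejected
	agents.MarkEligible("node-a")                     // agent eligible again or new allocation placed
	fmt.Println(checkBeforeExecute(agents, "node-a")) // <nil>: execution may start
}
```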

mpass99 (Contributor) commented Aug 18, 2024

The log level would feel somewhat more "appropriate", but we don't send any info / debug events to Sentry. In this case, it might be easier to just apply a tag?

Then, let's go with the Sentry Log Level. In Poseidon, both the tag and the log level would have to be mapped through a custom field. We would not change the Poseidon Log Levels. The Sentry hook would translate the custom field into either a tag or a different log level.
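
A minimal sketch of such a hook, assuming logrus and sentry-go and a hypothetical custom field name: the event's Sentry level is lowered and a tag is added, while the Poseidon log level itself stays untouched.

```go
package main

import (
	"github.com/getsentry/sentry-go"
	"github.com/sirupsen/logrus"
)

// expectedFailureField is a hypothetical custom log field used only to steer the Sentry hook.
const expectedFailureField = "expected_failure"

// sentryHook forwards warnings/errors to Sentry and translates the custom
// field into a lower Sentry level plus a tag.
type sentryHook struct{}

func (h *sentryHook) Levels() []logrus.Level {
	return []logrus.Level{logrus.WarnLevel, logrus.ErrorLevel}
}

func (h *sentryHook) Fire(entry *logrus.Entry) error {
	event := sentry.NewEvent()
	event.Message = entry.Message
	event.Level = sentry.LevelWarning
	if expected, ok := entry.Data[expectedFailureField].(bool); ok && expected {
		// Downgrade the Sentry level and tag the event so these "tracked only"
		// issues are easy to spot in the issue overview.
		event.Level = sentry.LevelInfo
		event.Tags[expectedFailureField] = "true"
	}
	sentry.CaptureEvent(event)
	return nil
}

func main() {
	// sentry.Init(...) is omitted in this sketch; without it, CaptureEvent is a no-op.
	logrus.AddHook(&sentryHook{})
	logrus.WithField(expectedFailureField, true).
		Warn("the runner will be destroyed and replaced")
}
```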

The allocation ID (and so on) should not change, so it's "just" the object in Poseidon's memory?

Yes

This is a mechanism we could reach an exposed port for the container, isn't it?

Yes

So far, I thought this would only be "proxied" by the Nomad agent, not the Nomad server (in which case it wouldn't change when the Nomad server is restarted), isn't it?

Yes

I am wondering what could potentially change (or become outdated)?

In theory, every attribute of the Nomad Jobs can change due to actors other than Poseidon (e.g., an unattended upgrade). While the event stream is aborted, we will not be notified about such changes.
A likely change scenario would be Nomad Jobs being stopped. A change of the exposed ports or the port mapping does not seem that likely (however, we also have not gained much experience with that feature yet).

how an execution could survive a nomad server restart

Good question in this context! It could not! The WebSocket connection is proxied through the server. When the server restarts, the WebSocket connection receives an unexpected EOF error. I tested this in my local environment.

Because we would not gain more functionality with changes at this point, shall we just separate the "the runner will be destroyed and replaced" error from the general "Error executing the request" error, decrease the Sentry Log Level for it, and "accept" the error?!
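
For illustration, a minimal sketch of that separation with a hypothetical sentinel error (not Poseidon's actual implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// errDestroyedAndReplaced is a hypothetical sentinel error Poseidon could return
// when a runner is recreated while an execution is still connected.
var errDestroyedAndReplaced = errors.New("the runner will be destroyed and replaced")

// classifyExecutionError separates the "destroyed and replaced" case from the
// generic execution error so it can be reported with its own fingerprint and a
// lower Sentry level.
func classifyExecutionError(err error) (fingerprint string, level string) {
	switch {
	case errors.Is(err, errDestroyedAndReplaced):
		// Tracked for visibility only; not treated as a generic request failure.
		return "runner-destroyed-and-replaced", "info"
	default:
		return "execution-request-error", "error"
	}
}

func main() {
	err := fmt.Errorf("aborting execution: %w", errDestroyedAndReplaced)
	fp, lvl := classifyExecutionError(err)
	fmt.Println(fp, lvl) // runner-destroyed-and-replaced info
}
```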

follow-up issue

#659

MrSerth (Member, Author) commented Aug 18, 2024

Thanks for replying to all my questions and helping me to resolve my confusion. I am now clearer about the execution workflow and the involvement of the different components 👍.

In theory, every attribute of the Nomad Jobs can change due to other actors than Poseidon (e.g. unattended upgrade).

Yes, that's true. For most attributes, however, I don't expect a real change (since no other actor is expected to change allocation / job settings). Hence, only further restarts (as caused by unattended upgrades) could happen, e.g., providing a clean runner rather than a used one... Poseidon could miss this when the event stream is no longer connected, of course, but that doesn't feel too bad either.

A likely scenario about changes would be that Nomad Jobs stopped.

Shouldn't this be recognized when Poseidon restarts (and recovers the environments and runners)? In this case, it could happen that a Nomad Job is stopped, but Poseidon would catch up and update its memory structures upon reconnecting to Nomad (that's at least what I thought, in which case it wouldn't be too bad either).

The exposed ports or the mapping change do not seem that likely (however, we also have not gained that much experience with that feature).

That's true. I also wouldn't expect a change, but agree that we don't have much experience with this feature. Let's ignore a potential change for now and return to this problem when using ports.

Good question in this context! It could not! The WebSocket connection is proxied through the server. When the server restarts, the WebSocket connection receives an unexpected EOF error. I tested, this in my local environment.

Thanks for testing and clarifying. A minor follow-up question: Does each Nomad server restart cause this issue, or only some? Could you help fill out the following table?

| Role | Restart? | Problem? |
| --- | --- | --- |
| Leader | Restart | <potentially problematic> |
| Leader | No Restart | <potentially fine> |
| Follower | Restart | <potentially problematic> |
| Follower | No Restart | <unknown> |

Either way, it's good to know that we receive an unexpected EOF error there. How is this handled or logged? I just want to ensure it's not confused with the other unexpected EOF we improved with #643.
Besides that, I quickly checked the Nomad logs and didn't find any useful setting to modify the current behavior (of WebSocket connections getting terminated on server restarts). Thus, we'll probably need to "accept" it for now, I assume (and ideally document it somewhere 🙏).

Because we would not gain more functionally with changes at this point [...]

You mean because the WebSocket connection to the runner is interrupted anyway (as your experiment has shown)? Do we ensure that no execution is started between the loss of the Nomad event stream and the completion of the (environment and runner) recovery? Or, asked differently: would it be possible that a new execution is successfully started (connected to Nomad....) and shortly afterwards the recovery forcefully stops it (because only then is the event stream restored and all information recovered)?

we just separate the "the runner will be destroyed and replaced" error from the general "Error executing the request" error, decrease the Sentry Log Level for it, and "accept" the error?!

Separating sounds good, yes. And, depending on the previous answer, "accepting" / "counting" is probably fine (but I really want to ensure we are not killing executions unnecessarily).

mpass99 (Contributor) commented Aug 20, 2024

Hence, only further restarts (as caused by unattended upgrades) could happen, e.g., providing a clean runner rather than a used one

I would expect that the port mapping can also change on restarts. If we miss this, it might lead to unavailability or broken isolation.

Shouldn't this be recognized when Poseidon restarts?

Yes, also when the Event Stream reconnects.

Poseidon would catch-up and update its memory structures upon reconnecting to Nomad

Yes

Is each Nomad server restart causing this issue or only some? Could you help to fill out the following table?

| Restart? | Role | DNS Resolves | WebSocket Problem? | Event Stream Problem? |
| --- | --- | --- | --- | --- |
| Restart | Leader | Yes | problematic | problematic |
| Restart | Leader | No | problematic | fine |
| Restart | Follower | Yes | problematic | problematic |
| Restart | Follower | No | fine | fine |
| No Restart | Leader | Yes | fine | fine |
| No Restart | Leader | No | fine | fine |
| No Restart | Follower | Yes | fine | fine |
| No Restart | Follower | No | fine | fine |

I added the "DNS Resolves" flag indicating whether our nomad. domain resolves to the restarted server.
In general, we can say that if the server does not restart, we have not been able to identify issues with the WebSocket connection so far.
When the leader restarts, the WebSocket connection aborts.
When the server that DNS resolves to restarts, the WebSocket connection aborts, and the Nomad Event Stream gets reconnected, which triggers a recovery.

How is this handled or logged? I just want to ensure it's not confused with the other unexpected EOF we improved with #643.

Unfortunately, it is hard to differentiate between different types of unexpected EOF errors. The error just tells us that Nomad had some error that made it drop the connection.
Therefore, I really hope that the upstream issue related to #643 is fixed soon, so we can remove the case handling. That would allow us to identify different reasons for the unexpected EOF error such as this one.

```go
case errors.Is(err, io.ErrUnexpectedEOF), strings.Contains(err.Error(), io.ErrUnexpectedEOF.Error()):
	// The unexpected EOF is a generic Nomad error. However, our investigations have shown that all the current
	// events of this error are caused by fsouza/go-dockerclient#1076. Because this error happens at the very end,
	// it does not affect the functionality. Therefore, we don't propagate the error.
	log.WithContext(ctx).WithError(err).
		WithField(logging.SentryFingerprintFieldKey, []string{"nomad-unexpected-eof"}).
		Warn("Unexpected EOF for Execute")
	return 0, nil
```

document it somewhere

#660

You mean because the WebSocket connection to the runner is interrupted anyway (as your experiment has shown)?

Yes

Do we ensure that is no execution (started) between the loss of the Nomad event stream and the completion of the (environment and runner) recovery?

No

Would it be possible that a new execution is successfully started (connected to Nomad....) and shortly after the recovery is forcefully stopping it?

Yes, there is a timeframe of about one second to do so.

Separating sounds good, yes

#661

MrSerth (Member, Author) commented Aug 20, 2024

I would expect that the port mapping can also change on restarts. When we miss this, it might lead to unavailability or broken isolation.

Yes, that's true (but doesn't make it easier in this case). Since the discussion is getting lengthy here, and I want to navigate towards a solution: Can we do some kind of "in-place" update (in memory) of the runner reference? Either completely (without stopping a running execution, if any) or simply by updating a few pre-defined fields (port mapping, runner "marked as used" status, ....)?
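
A minimal sketch of such a partial in-place update (field and type names are hypothetical, not Poseidon's actual ones):

```go
package main

import "fmt"

// Runner is a minimal stand-in for Poseidon's in-memory Nomad runner reference.
type Runner struct {
	ID          string
	PortMapping map[int]int
	executions  int // running executions attached to this reference
}

// updateInPlace copies only selected attributes from the freshly fetched job
// into the existing reference, instead of replacing the reference entirely.
// Running executions stay attached to the old (still valid) reference.
func (r *Runner) updateInPlace(fresh *Runner) {
	r.PortMapping = fresh.PortMapping
	// Intentionally not touching r.executions: they keep running undisturbed.
}

func main() {
	old := &Runner{ID: "runner-1", PortMapping: map[int]int{8080: 41000}, executions: 1}
	fresh := &Runner{ID: "runner-1", PortMapping: map[int]int{8080: 42000}}
	old.updateInPlace(fresh)
	fmt.Println(old.PortMapping, old.executions) // map[8080:42000] 1
}
```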

Could you help to fill out the following table? [...]

Awesome, and very useful table -- I like that!

Unfortunately, it is hard to differentiate between different types of unexpected EOF errors.

That's true. Since we are now handling the execute error separately (and have an upstream issue), I would just "forget" about it and focus on the others. You're right, of course, that "Nomad had some error that made it drop the connection" doesn't give us further information. So, my two cents: let's keep this for now and move on.

Yes, there is a timeframe of about one second to do so.

Not ideal, I think. Should / can we do something about this? Like denying further execution requests when the event stream is not connected or so? Just thinking aloud...

MrSerth (Member, Author) commented Aug 21, 2024

To dos:

  • Extend the table with another column to differentiate between WebSocket and event stream problems
  • Update the port mapping in-place to minimize the risk of forcefully stopping a running execution (if started within said "one-second window")
  • Close PR Separate Sentry Issues #661 and add a sentence explaining why we don't expect this to happen any longer

Not ideal, I think. Should / can we do something about this? Like denying further execution requests when the event stream is not connected or so? Just thinking aloud...

Since we decided to update the runner partially, we no longer need to prohibit executions (as we won't stop them forcefully anymore).
