Logs / k8s events from scheduler that failed to start? #244

Open
consideRatio opened this issue Apr 10, 2020 · 1 comment

@consideRatio
Collaborator

If a cluster is created and the scheduler then crashes for some reason, for example because it contained an old version of dask-gateway, that error is very hard to find out about.

Is there a way to capture something about this? Here are the loglevel: DEBUG logs from such a sequence.

I know that when, for example, cert-manager's controller works to obtain a certificate from Let's Encrypt because it found a cert-manager Certificate resource, it emits events that can be seen on that Certificate. Perhaps a nice thing to do would be to emit k8s events from the controller describing what it observed when the dask cluster failed to start.
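To make the idea concrete, here is a minimal sketch of what such an event could look like, written as a plain core/v1 Event manifest attached to the cluster object. The helper name `build_failure_event`, the `gateway.dask.org/v1alpha1` apiVersion, the `dask-gateway-controller` component name, and the `SchedulerFailed` reason are assumptions for illustration, not something the controller is known to emit today.

```python
from datetime import datetime, timezone


def build_failure_event(cluster_name, namespace, cluster_uid, reason, message):
    """Build a core/v1 Event manifest that points at a DaskCluster object.

    Hypothetical helper: the apiVersion and component name below are assumptions.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "apiVersion": "v1",
        "kind": "Event",
        "metadata": {
            # The API server appends a random suffix to generateName.
            "generateName": f"{cluster_name}.",
            "namespace": namespace,
        },
        # involvedObject is what makes the event show up on the cluster resource.
        "involvedObject": {
            "apiVersion": "gateway.dask.org/v1alpha1",  # assumed CRD group/version
            "kind": "DaskCluster",
            "name": cluster_name,
            "namespace": namespace,
            "uid": cluster_uid,
        },
        "type": "Warning",
        "reason": reason,
        "message": message,
        "firstTimestamp": now,
        "lastTimestamp": now,
        "count": 1,
        "source": {"component": "dask-gateway-controller"},  # assumed name
    }


event = build_failure_event(
    cluster_name="example-cluster",
    namespace="default",
    cluster_uid="00000000-0000-0000-0000-000000000000",
    reason="SchedulerFailed",
    message="Scheduler pod exited before the cluster became ready",
)
```

A body like this could presumably be submitted with the Kubernetes Python client's `CoreV1Api.create_namespaced_event`, after which the event would be listed alongside the DaskCluster object, much like the cert-manager events on a Certificate.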

@jcrist
Member

jcrist commented Apr 10, 2020

This is a bit tricky; there are many ways a failure could occur. If the failure is at the k8s level, it's easier for us to make a note in the DaskCluster object status that includes meaningful information. However, if the failure is in the pod execution (e.g. scheduler fails, bad image, etc.), it's harder for us to make a note about why that pod failed. We currently delete the pods immediately on failure, so the pod logs aren't persisted unless they are stored externally (in e.g. stackdriver). Perhaps we could add a configurable way to keep the pods around for a period (JupyterHub Kubespawner supports this)? Maybe this behavior should be different depending on whether the cluster failed or was intentionally stopped? Suggestions welcome, I'm not sure what's best here.
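One possible shape for this, as a rough sketch only: read a short failure reason out of the pod's container statuses before the pod goes away, and delete failed pods only after a retention period while still deleting intentionally stopped ones right away. The function names and the `keep_failed_for` parameter are hypothetical; the pod is a plain dict using the standard `status.containerStatuses[].state` fields.

```python
from datetime import datetime, timedelta, timezone


def summarize_failure(pod):
    """Return a short reason string for a failed pod, or None if nothing stands out."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        terminated = state.get("terminated")
        # A container that exited with a non-zero code (e.g. the scheduler crashed).
        if terminated and terminated.get("exitCode", 0) != 0:
            reason = terminated.get("reason", "Unknown")
            return f"{cs['name']}: {reason} (exit code {terminated['exitCode']})"
        waiting = state.get("waiting")
        # A container that never started because the image could not be pulled.
        if waiting and waiting.get("reason") in ("ErrImagePull", "ImagePullBackOff"):
            return f"{cs['name']}: {waiting['reason']}"
    return None


def should_delete_pod(cluster_failed, finished_at, now, keep_failed_for):
    """Delete stopped clusters' pods immediately; keep failed ones for a while."""
    if not cluster_failed:
        return True
    return now - finished_at >= keep_failed_for


crashed_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "dask-scheduler",
                "state": {"terminated": {"exitCode": 1, "reason": "Error"}},
            }
        ]
    }
}
summary = summarize_failure(crashed_pod)

finished = datetime(2020, 4, 10, 12, 0, tzinfo=timezone.utc)
retention = timedelta(hours=1)  # hypothetical retention setting
delete_early = should_delete_pod(True, finished, finished + timedelta(minutes=5), retention)
delete_late = should_delete_pod(True, finished, finished + timedelta(hours=2), retention)
delete_stopped = should_delete_pod(False, finished, finished, retention)
```

The summary string could be written into the DaskCluster object status, so some information survives even when the pod itself is eventually removed.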
