Can't create Dask Cluster on GKE using Artifact Registry #341

Open
WaterKnight1998 opened this issue Oct 20, 2020 · 2 comments
Comments


WaterKnight1998 commented Oct 20, 2020

What happened:

I am deploying Dask Gateway on a GKE cluster using the helm chart.

I have set up a custom image for the schedulers and workers, stored in Artifact Registry. To point the chart at it, I set the key gateway.backend.image.name in values.yaml, with an example value of europe-west6-docker.pkg.dev/project1/images/dask.

When I try to create a cluster with:

from dask_gateway import Gateway

# Gateway server reached at http://localhost:8000/ through a port-forward
gateway = Gateway(
    address="http://localhost:8000/"
)
options = gateway.cluster_options()
options
cluster = gateway.new_cluster(options)
cluster.scale(1)

Please keep in mind that I am using localhost because I set up a port-forward to the traefik-dask-gateway pod.

The code raises the following error:

---------------------------------------------------------------------------
GatewayClusterError                       Traceback (most recent call last)
<ipython-input-4-25985748968f> in <module>
----> 1 cluster = gateway.new_cluster(options)
      2 cluster.scale(1)

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    641             cluster_options=cluster_options,
    642             shutdown_on_close=shutdown_on_close,
--> 643             **kwargs,
    644         )
    645 

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, public_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    816             shutdown_on_close=shutdown_on_close,
    817             asynchronous=asynchronous,
--> 818             loop=loop,
    819         )
    820 

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, public_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    912             self.status = "starting"
    913         if not self.asynchronous:
--> 914             self.gateway.sync(self._start_internal)
    915 
    916     @property

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    337             )
    338             try:
--> 339                 return future.result()
    340             except BaseException:
    341                 future.cancel()

~/miniconda3/envs/Dask/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/miniconda3/envs/Dask/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
    926             self._start_task = asyncio.ensure_future(self._start_async())
    927         try:
--> 928             await self._start_task
    929         except BaseException:
    930             # On exception, cleanup

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
    944         # Connect to cluster
    945         try:
--> 946             report = await self.gateway._wait_for_start(self.name)
    947         except GatewayClusterError:
    948             raise

~/miniconda3/envs/Dask/lib/python3.7/site-packages/dask_gateway/client.py in _wait_for_start(self, cluster_name)
    576                     raise GatewayClusterError(
    577                         "Cluster %r failed to start, see logs for "
--> 578                         "more information" % cluster_name
    579                     )
    580                 elif report.status is ClusterStatus.STOPPED:

GatewayClusterError: Cluster 'dask-gateway.31f49455387e410ead07c154aa126e1a' failed to start, see logs for more information

I can't find any logs with a more detailed error.
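One way to narrow the search is to pull the cluster name out of the error message and look for it in whatever gateway logs have been saved to a file. The following is only a minimal sketch: the helper names are hypothetical, and it assumes (not confirmed in this thread) that the log lines mention the identifier that follows the last dot of the cluster name.

```python
import re
from typing import Iterable, List, Optional

# Matches the message format shown in the traceback above.
CLUSTER_NAME = re.compile(r"Cluster '([^']+)' failed to start")


def cluster_name_from_error(message: str) -> Optional[str]:
    """Return the quoted cluster name from the error text, or None if absent."""
    match = CLUSTER_NAME.search(message)
    return match.group(1) if match else None


def matching_log_lines(lines: Iterable[str], cluster_name: str) -> List[str]:
    """Keep only the log lines that mention the cluster.

    Assumption: the logs contain the identifier after the last dot of the
    cluster name (this also covers lines containing the full name).
    """
    short_id = cluster_name.rsplit(".", 1)[-1]
    return [line for line in lines if short_id in line]


# Usage sketch ("gateway.log" is a hypothetical file of saved pod logs):
#
#     try:
#         cluster = gateway.new_cluster(options)
#     except Exception as exc:
#         name = cluster_name_from_error(str(exc))
#         with open("gateway.log") as f:
#             print("".join(matching_log_lines(f, name)))
```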

However, if I store the same image in other Docker image registries, it works.

Which Kubernetes service account does Dask Gateway use to pull the images? I don't see any pod being launched, not even one that fails with an image pull error such as ErrImagePull.

What you expected to happen:
I would expect the cluster to be created, as it is with images stored in other registries.

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version:
  • Operating System:
  • Install method (conda, pip, source):
@WaterKnight1998 WaterKnight1998 changed the title Can´t create Dask Cluster on GKE using Artifact Registry [BUG] Can´t create Dask Cluster on GKE using Artifact Registry Oct 20, 2020
@WaterKnight1998 WaterKnight1998 changed the title [BUG] Can´t create Dask Cluster on GKE using Artifact Registry Can´t create Dask Cluster on GKE using Artifact Registry Oct 20, 2020
Member

jcrist commented Oct 21, 2020

I'm sorry you're having trouble here. I don't have any experience deploying with a custom registry; perhaps @droctothorpe has some thoughts?


Dask-Gateway is pretty quick about deleting failed pods (we should probably add a config option to delay this to help with debugging). If you start a watch for pods in the namespace you might catch something. Something like this might work:

kubectl get pod -w -l app.kubernetes.io/name=dask-gateway

I'd look for pods getting created, then check the status field to see if there's any interesting info there. Piping the above to tee log.txt might also help so you don't have to browse through the terminal when debugging.
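For anyone who would rather drive this from Python, here is a rough sketch of the same idea: run the watch command above and copy every line to both the terminal and a log file. The function names and the optional namespace argument are hypothetical additions, and it assumes kubectl is on the PATH and already pointed at the right cluster.

```python
import subprocess
from typing import IO, Iterable, Iterator, List, Optional


def build_watch_command(
    selector: str = "app.kubernetes.io/name=dask-gateway",
    namespace: Optional[str] = None,
) -> List[str]:
    """Build the kubectl watch command, optionally restricted to a namespace."""
    cmd = ["kubectl", "get", "pod", "-w", "-l", selector]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def tee_lines(lines: Iterable[str], log: IO[str]) -> Iterator[str]:
    """Yield each line unchanged while also writing it to the log."""
    for line in lines:
        log.write(line)
        log.flush()  # keep the file current in case the watch is interrupted
        yield line


def watch_pods(log_path: str = "log.txt", namespace: Optional[str] = None) -> None:
    """Print pod status changes and append them to log_path until interrupted."""
    cmd = build_watch_command(namespace=namespace)
    with open(log_path, "a") as log, subprocess.Popen(
        cmd, stdout=subprocess.PIPE, text=True
    ) as proc:
        for line in tee_lines(proc.stdout, log):
            print(line, end="")
```

Calling watch_pods() in one terminal before running gateway.new_cluster(options) in another should leave any short-lived pod statuses in log.txt, even if the pods are deleted quickly.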

Which Kubernetes service account does Dask Gateway use to pull the images?

Pods are created by the dask-gateway controller; the RBAC entry is created here: https://github.com/dask/dask-gateway/blob/master/resources/helm/dask-gateway/templates/controller/rbac.yaml.

@droctothorpe
Contributor

We have been using Artifactory to store and retrieve images without issue.

The values.yaml looks like this:

gateway:
  replicas: 2
  resources:
    limits:
      cpu: 100m
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 256Mi
  loglevel: DEBUG
  image:
    name: <internal-artifactory-url>/dask-gateway-server
    tag: <tag>
    pullPolicy: Always
  podDisruptionBudget:
    minAvailable: 1
...
  backend:
    image:
      name: <internal-artifactory-url>/dask-gateway-worker
      tag: <tag>
      pullPolicy: Always
    namespace: <namespace>

Are you passing in the image tag? Have you enabled debug logs? I recommend doing so and monitoring logs from the API and controller. You can do so individually with kubectl or collectively with stern: stern -l app.kubernetes.io/name=dask-gateway.
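To make the tag question concrete, here is a small sketch of a check on the image settings before deploying. It assumes the final image reference is formed as name:tag from the two values shown above; the helper names and the example tag "1.0" are hypothetical.

```python
def has_explicit_tag(image_name: str) -> bool:
    """Return True if the reference already carries a :tag or an @digest.

    Only the last path segment is inspected, so a registry port such as
    localhost:5000/dask is not mistaken for a tag.
    """
    last_segment = image_name.rsplit("/", 1)[-1]
    return ":" in last_segment or "@" in last_segment


def image_reference(name: str, tag: str = "") -> str:
    """Join name and tag as name:tag, refusing to guess when no tag is given."""
    if has_explicit_tag(name):
        return name
    if not tag:
        raise ValueError(f"no tag given for image {name!r}")
    return f"{name}:{tag}"


# Example with the image name from the original report and a hypothetical tag:
print(image_reference("europe-west6-docker.pkg.dev/project1/images/dask", "1.0"))
```

Raising on a missing tag is a design choice for this sketch, so that a reference is never built against a tag that was not pushed to the registry.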
