-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Nemo Curator K8s example #39
Conversation
instructions for how to a Curator workload on k8s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Thanks for doing this! Just left a few minor comments on expanding/reformatting some of the documentation. Also I noticed the DCO and precommit checks are failing. You also might not have signed your commits with a GPG key. Make sure you use git commit-sS
in the future and check the output of precommit to see what's wrong. The output of the DCO bot should tell you how to retroactively sign and signoff on your commits.
|
||
.. tab-set:: | ||
|
||
.. tab-item:: Running Existing Module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tab items don't really stand out on GitHub. They might on the NeMo user guide, but I'd prefer if we could get these docs to look good on both the user guide and GitHub. Can you change this so Running Existing Module
is more noticeable (like a subheading or something)?
# Finished! | ||
|
||
|
||
.. tab-item:: Running Custom Module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as Running Existing Module
.. note:: | ||
If you are creating another Dask cluster with the same ``--name <name>``, first delete it via:: | ||
|
||
kubectl delete daskcluster <name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could you wrap this command in a code block or an inline code formatting?
# Replace <...> with a path on your local machine | ||
LOCAL_WORKSPACE=<...> | ||
|
||
# This copies $LOCAL_WORKSPACE/big_english to /big_english within the PVC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could you change all big_english
names to something that more clearly designates a dataset? Perhaps something as simple as my_dataset
would be good.
name_to_path[name] = path | ||
return name_to_path | ||
|
||
parser = argparse.ArgumentParser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a help section to all of the args?
examples/k8s/create_dask_cluster.py
Outdated
parser.add_argument("-g", "--n_gpus_per_worker", type=int, default=None) | ||
parser.add_argument("-c", "--n_cpus_per_worker", type=int, default=None) | ||
parser.add_argument( | ||
"-i", "--image", type=str, default="nvcr.io/nvidia/nemo:24.01.framework" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Update to 24.03. Really, we'll need to update this to 24.05 ASAP once that drops since that will be the first version of the container with most of the dask stuff in it. But we'll do that once it releases.
Setup Python Environment | ||
------------------------ | ||
|
||
Setup a virtual environment: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand on what this environment is used for? Users might get confused and wonder if the need to install NeMo Curator in the environment or just dask kubernetes.
cc: @jacobtomlinson if you're interested in taking a look |
.. code-block:: bash | ||
|
||
# Creates a CPU Dask cluster with 1 worker | ||
python create_dask_cluster.py \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command appears to have a lot of overlap with the built-in dask kubernetes gen cluster
command. I wonder if things can be made more flexible there so it can be used directly instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I wasn't aware of that. It looks like the flags are similar to what make_cluster_spec
accepts
dask kubernetes gen cluster --help
Usage: dask kubernetes gen cluster [OPTIONS]
Generate a DaskCluster spec, see ``dask_kubernetes.operator.make_cluster_spec``.
Options:
--name TEXT Name of the DaskCluster [required]
--image TEXT Container image to use (default
'ghcr.io/dask/dask:latest')
--n-workers INTEGER Number of workers
--resources TEXT Resource constraints
-e, --env TEXT Environment variables
--worker-command TEXT Worker command (default 'dask-worker')
--scheduler-service-type TEXT Service type for scheduler (default 'ClusterIP')
--jupyter Start Jupyter on the scheduler (default
So someone still needs to edit the spec to, say, add PVC mounts or remove resources
from the scheduler (e.g., GPUs or IB devices), they'd need a script like this to do that. In light of this, do you still see a path of using dask kubernetes gen cluster
by adding some of these configuration options, or do you think this opinionated script still has value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we might be able to satisfy most of what you need by adding a couple more flags.
If all you need is a flag to set pvc mounts and a flag to set scheduler resources independently then I could see us adding that quite easily.
No description provided.