Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Nemo Curator K8s example #39

Closed
wants to merge 6 commits into from
Closed

Adds Nemo Curator K8s example #39

wants to merge 6 commits into from

Conversation

terrykong
Copy link
Contributor

No description provided.

Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Thanks for doing this! Just left a few minor comments on expanding/reformatting some of the documentation. Also I noticed the DCO and precommit checks are failing. You also might not have signed your commits with a GPG key. Make sure you use git commit-sS in the future and check the output of precommit to see what's wrong. The output of the DCO bot should tell you how to retroactively sign and signoff on your commits.


.. tab-set::

.. tab-item:: Running Existing Module
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tab items don't really stand out on GitHub. They might on the NeMo user guide, but I'd prefer if we could get these docs to look good on both the user guide and GitHub. Can you change this so Running Existing Module is more noticeable (like a subheading or something)?

# Finished!


.. tab-item:: Running Custom Module
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as Running Existing Module

.. note::
If you are creating another Dask cluster with the same ``--name <name>``, first delete it via::

kubectl delete daskcluster <name>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could you wrap this command in a code block or an inline code formatting?

# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>

# This copies $LOCAL_WORKSPACE/big_english to /big_english within the PVC.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could you change all big_english names to something that more clearly designates a dataset? Perhaps something as simple as my_dataset would be good.

name_to_path[name] = path
return name_to_path

parser = argparse.ArgumentParser()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a help section to all of the args?

parser.add_argument("-g", "--n_gpus_per_worker", type=int, default=None)
parser.add_argument("-c", "--n_cpus_per_worker", type=int, default=None)
parser.add_argument(
"-i", "--image", type=str, default="nvcr.io/nvidia/nemo:24.01.framework"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Update to 24.03. Really, we'll need to update this to 24.05 ASAP once that drops since that will be the first version of the container with most of the dask stuff in it. But we'll do that once it releases.

Setup Python Environment
------------------------

Setup a virtual environment:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand on what this environment is used for? Users might get confused and wonder if the need to install NeMo Curator in the environment or just dask kubernetes.

@ayushdg
Copy link
Collaborator

ayushdg commented Apr 23, 2024

cc: @jacobtomlinson if you're interested in taking a look

@terrykong terrykong closed this by deleting the head repository Apr 23, 2024
.. code-block:: bash

# Creates a CPU Dask cluster with 1 worker
python create_dask_cluster.py \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command appears to have a lot of overlap with the built-in dask kubernetes gen cluster command. I wonder if things can be made more flexible there so it can be used directly instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I wasn't aware of that. It looks like the flags are similar to what make_cluster_spec accepts

dask kubernetes gen cluster --help
Usage: dask kubernetes gen cluster [OPTIONS]

  Generate a DaskCluster spec, see ``dask_kubernetes.operator.make_cluster_spec``.

Options:
  --name TEXT                    Name of the DaskCluster  [required]
  --image TEXT                   Container image to use (default
                                 'ghcr.io/dask/dask:latest')
  --n-workers INTEGER            Number of workers
  --resources TEXT               Resource constraints
  -e, --env TEXT                 Environment variables
  --worker-command TEXT          Worker command (default 'dask-worker')
  --scheduler-service-type TEXT  Service type for scheduler (default 'ClusterIP')
  --jupyter                      Start Jupyter on the scheduler (default 

So someone still needs to edit the spec to, say, add PVC mounts or remove resources from the scheduler (e.g., GPUs or IB devices), they'd need a script like this to do that. In light of this, do you still see a path of using dask kubernetes gen cluster by adding some of these configuration options, or do you think this opinionated script still has value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we might be able to satisfy most of what you need by adding a couple more flags.

If all you need is a flag to set pvc mounts and a flag to set scheduler resources independently then I could see us adding that quite easily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants