Add basic tutorial for usage of KubeRay without CodeFlare #122

Open — anishasthana wants to merge 1 commit into opendatahub-io:main from anishasthana:add_kuberay_only

+388 −0
File 1/3 — KubeRay quickstart README:

@@ -0,0 +1,13 @@
# KubeRay quickstart

There is a demo notebook in the tutorial directory that you can run to get started with KubeRay. It will walk you through the process of setting up a cluster and running a simple example.

## Setup

You can follow the instructions in the [base readme](../README.md) to install the Distributed Workloads stack.

## Notebook

At this point you should be able to go to your notebook spawner page and select your notebook image of choice.

You can access the spawner page through the Open Data Hub dashboard. The default route should be `https://odh-dashboard-<your ODH namespace>.apps.<your cluster's uri>`. Once you are on your dashboard, select "Launch application" on the Jupyter application. This will take you to your notebook spawner page. After that, simply upload the notebook and Ray cluster template from the tutorial directory and you should be good to go.
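Once the notebook and template are uploaded, the connection step boils down to pointing the Ray client at the head service created by the template. A minimal sketch, assuming the default service name and namespace used by the template in this PR (`imdb-ray-test-head-svc` in `opendatahub`):

```python
# Minimal connection check, assuming the defaults from the bundled template.
# Adjust the service name and namespace if you changed them in the YAML.
import ray

ray_cluster_uri = "ray://imdb-ray-test-head-svc.opendatahub.svc:10001"

ray.init(address=ray_cluster_uri)
print("Ray cluster is up and running:", ray.is_initialized())
ray.shutdown()
```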
File 2/3 — demo notebook (tutorial directory):

@@ -0,0 +1,251 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "66fe6c71-2cb2-4425-a12a-c5b531a28155",
   "metadata": {},
   "source": [
    "# Using KubeRay to run Distributed Workloads without CodeFlare\n",
    "\n",
    "This notebook demonstrates a quick workflow using Ray from a notebook without the codeflare-sdk.\n",
    "The current usage patterns for KubeRay require manual oc commands to be run from your notebook, so you will need to authenticate manually. We recommend using the codeflare-sdk alongside CodeFlare for an easier experience. An example notebook showing an almost identical use case can be found at https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/3_basic_interactive.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "163b9d63-709e-435e-9933-988328831eba",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade ray==\"2.5.0\"\n",
    "!pip install pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b436207-9d87-4ce1-8909-6116e729a753",
   "metadata": {
    "tags": []
   },
   "source": [
    "## You need to get a token to authenticate to the OpenShift cluster.\n",
    "\n",
    "1. Go to the OpenShift Console\n",
    "2. Click on the arrow next to your username\n",
    "3. Click on \"Copy login command\"\n",
    "4. Once authenticated, copy the entire section under \"Log in with this token\". It will look similar to the following:\n",
    "oc login --token=<token> --server=<url>\n",
    "5. Run the following cell, making sure to use your token and server. The \"!\" at the beginning of the command is required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ea2c953-a9b7-4f19-941b-517a832ff379",
   "metadata": {},
   "outputs": [],
   "source": [
    "!oc login --token=<token> --server=<url>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "88eb9d0a-846d-43ce-8730-020ac05cd4e9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "raycluster.ray.io \"imdb-ray-test\" deleted\n",
      "raycluster.ray.io/imdb-ray-test created\n"
     ]
    }
   ],
   "source": [
    "!oc delete -f test.yaml\n",
    "!oc apply -f test.yaml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb671ebf-7317-4ae2-bb65-81b16b1f78e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "!oc get pods -o wide | grep imdb-ray-test | awk '{print $1, $6, $7 }'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e8952d2-1633-4094-903c-a422b96ffbf5",
   "metadata": {
    "tags": []
   },
   "source": [
    "As you can see from the output above, the Ray cluster has two worker nodes and a head node. Each has its own IP address and has been scheduled on a different physical node."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d53c20b0-19b4-437d-9694-174a6d443426",
   "metadata": {},
   "outputs": [],
   "source": [
    "!oc get svc | grep imdb-ray-test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11768c18-20c8-407c-9b4a-320264a0b8c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ray\n",
    "from ray.air.config import ScalingConfig\n",
    "\n",
    "# Copy the service name from above. If you are using the default service and namespace,\n",
    "# the ray_cluster_uri is ray://imdb-ray-test-head-svc.opendatahub.svc:10001\n",
    "\n",
    "ray_cluster_uri = \"ray://imdb-ray-test-head-svc.opendatahub.svc:10001\"\n",
    "\n",
    "# Install additional libraries that will be required for model training\n",
    "runtime_env = {\"pip\": [\"transformers\", \"datasets\", \"evaluate\", \"pyarrow<7.0.0\", \"accelerate\"]}\n",
    "\n",
    "# NOTE: This will work for in-cluster notebook servers (RHODS/ODH), but not for local machines\n",
    "# To see how to connect from your laptop, go to demo-notebooks/additional-demos/local_interactive.ipynb\n",
    "\n",
    "ray.init(address=ray_cluster_uri, runtime_env=runtime_env)\n",
    "\n",
    "print(\"Ray cluster is up and running: \", ray.is_initialized())"
   ]
  },

> Review comment on the local_interactive.ipynb reference above: This notebook is not in this repo. We should use the complete url.

  {
   "cell_type": "code",
   "execution_count": null,
   "id": "566eba0c-6be2-4cb4-9aa4-9e147433642e",
   "metadata": {},
   "outputs": [],
   "source": [
    "@ray.remote\n",
    "def train_fn():\n",
    "    from datasets import load_dataset\n",
    "    import transformers\n",
    "    from transformers import AutoTokenizer, TrainingArguments\n",
    "    from transformers import AutoModelForSequenceClassification\n",
    "    import numpy as np\n",
    "    from datasets import load_metric\n",
    "    import ray\n",
    "    from ray import tune\n",
    "    from ray.air.config import ScalingConfig\n",
    "    from ray.train.huggingface import HuggingFaceTrainer\n",
    "\n",
    "    dataset = load_dataset(\"imdb\")\n",
    "    tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n",
    "\n",
    "    def tokenize_function(examples):\n",
    "        return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
    "\n",
    "    tokenized_datasets = dataset.map(tokenize_function, batched=True)\n",
    "\n",
    "    # Using a fraction of the dataset, but you can run with the full dataset\n",
    "    small_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(100))\n",
    "    small_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(100))\n",
    "\n",
    "    print(f\"len of train {len(small_train_dataset)} and test {len(small_eval_dataset)}\")\n",
    "\n",
    "    ray_train_ds = ray.data.from_huggingface(small_train_dataset)\n",
    "    ray_evaluation_ds = ray.data.from_huggingface(small_eval_dataset)\n",
    "\n",
    "    def compute_metrics(eval_pred):\n",
    "        metric = load_metric(\"accuracy\")\n",
    "        logits, labels = eval_pred\n",
    "        predictions = np.argmax(logits, axis=-1)\n",
    "        return metric.compute(predictions=predictions, references=labels)\n",
    "\n",
    "    def trainer_init_per_worker(train_dataset, eval_dataset, **config):\n",
    "        model = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-uncased\", num_labels=2)\n",
    "\n",
    "        training_args = TrainingArguments(\"/tmp/hf_imdb/test\", eval_steps=1, disable_tqdm=True,\n",
    "                                          num_train_epochs=1, skip_memory_metrics=True,\n",
    "                                          learning_rate=2e-5,\n",
    "                                          per_device_train_batch_size=16,\n",
    "                                          per_device_eval_batch_size=16,\n",
    "                                          weight_decay=0.01,)\n",
    "        return transformers.Trainer(\n",
    "            model=model,\n",
    "            args=training_args,\n",
    "            train_dataset=train_dataset,\n",
    "            eval_dataset=eval_dataset,\n",
    "            compute_metrics=compute_metrics\n",
    "        )\n",
    "\n",
    "    scaling_config = ScalingConfig(num_workers=2, use_gpu=False)  # num_workers is the number of training workers (one per GPU when use_gpu=True)\n",
    "\n",
    "    # We are using the Ray-native HuggingFaceTrainer, but you can swap in the plain Hugging Face Trainer; both have the same method signature.\n",
    "    # The Ray-native HuggingFaceTrainer has built-in support for scaling to multiple GPUs.\n",
    "    trainer = HuggingFaceTrainer(\n",
    "        trainer_init_per_worker=trainer_init_per_worker,\n",
    "        scaling_config=scaling_config,\n",
    "        datasets={\"train\": ray_train_ds, \"evaluation\": ray_evaluation_ds},\n",
    "    )\n",
    "    result = trainer.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4458b86-8699-44b7-9785-d8ae8d0e29d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep a reference to the task so it can be cancelled later if needed\n",
    "ref = train_fn.remote()\n",
    "ray.get(ref)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a525ec28-c485-43f5-afc1-155de4ed4149",
   "metadata": {},
   "outputs": [],
   "source": [
    "ray.cancel(ref)\n",
    "ray.shutdown()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00fe92e0-abb8-47d5-8a6d-b49c78bd230c",
   "metadata": {},
   "outputs": [],
   "source": [
    "!oc delete -f test.yaml"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
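Before kicking off the full fine-tuning job above, it can be worth smoke-testing the cluster with a trivial remote task. A minimal sketch, assuming you are already connected via `ray.init` as in the notebook:

```python
import ray

# Assumes ray.init(...) has already been called against the cluster,
# as in the notebook above.

@ray.remote
def ping(i: int) -> int:
    # Trivial task: runs on whichever worker Ray schedules it on.
    return i * 2

# Fan a few tasks out across the workers and collect the results.
print(ray.get([ping.remote(i) for i in range(4)]))  # [0, 2, 4, 6]
```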
File 3/3 — Ray cluster template (test.yaml):

@@ -0,0 +1,124 @@
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: '1.0'
  name: imdb-ray-test
  namespace: opendatahub
spec:
  autoscalerOptions:
    idleTimeoutSeconds: 60
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 500m
        memory: 512Mi
    upscalingMode: Default
  enableInTreeAutoscaling: false
  headGroupSpec:
    rayStartParams:
      block: 'true'
      dashboard-host: 0.0.0.0
      num-gpus: '0'
    serviceType: ClusterIP
    template:
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: RAY_USE_TLS
            value: '0'
          - name: RAY_TLS_SERVER_CERT
            value: /home/ray/workspace/tls/server.crt
          - name: RAY_TLS_SERVER_KEY
            value: /home/ray/workspace/tls/server.key
          - name: RAY_TLS_CA_CERT
            value: /home/ray/workspace/tls/ca.crt
          image: quay.io/project-codeflare/ray:2.5.0-py38-cu116
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-head
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: 2
              memory: 16G
              nvidia.com/gpu: 0
            requests:
              cpu: 2
              memory: 16G
              nvidia.com/gpu: 0
        imagePullSecrets: []
  rayVersion: 2.5.0
  workerGroupSpecs:
  - groupName: small-group-jobtest
    maxReplicas: 2
    minReplicas: 2
    rayStartParams:
      block: 'true'
      num-gpus: '0'
    replicas: 2
    template:
      metadata:
        annotations:
          key: value
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: RAY_USE_TLS
            value: '0'
          - name: RAY_TLS_SERVER_CERT
            value: /home/ray/workspace/tls/server.crt
          - name: RAY_TLS_SERVER_KEY
            value: /home/ray/workspace/tls/server.key
          - name: RAY_TLS_CA_CERT
            value: /home/ray/workspace/tls/ca.crt
          image: quay.io/project-codeflare/ray:2.5.0-py38-cu116
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: machine-learning
          resources:
            limits:
              cpu: 1
              memory: 16G
              nvidia.com/gpu: 0
            requests:
              cpu: 1
              memory: 16G
              nvidia.com/gpu: 0
        imagePullSecrets: []
        initContainers:
        - command:
          - sh
          - -c
          - until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
            do echo waiting for myservice; sleep 2; done
          image: busybox:1.28
          name: init-myservice
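The init container above simply blocks each worker pod until the head service resolves in cluster DNS. For readers who want the same readiness check from Python (for example, before calling `ray.init` from a notebook), here is a hedged sketch; the hostname assumes this template's defaults (service `imdb-ray-test-head-svc` in namespace `opendatahub`):

```python
import socket
import time

# Same idea as the template's init container: wait until the Ray head
# service resolves in cluster DNS before proceeding.
# The hostname assumes this template's defaults; adjust if you changed
# the cluster name or namespace.
HEAD_SVC = "imdb-ray-test-head-svc.opendatahub.svc.cluster.local"

def wait_for_head(host: str, timeout_s: float = 120.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            socket.gethostbyname(host)  # succeeds once DNS has the record
            return
        except socket.gaierror:
            print("waiting for head service...")
            time.sleep(2)
    raise TimeoutError(f"{host} did not resolve within {timeout_s}s")

wait_for_head(HEAD_SVC)
```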