
KEP-2170: Create model and dataset initializers #2303

Merged

Conversation


@andreyvelich andreyvelich commented Oct 23, 2024

Fixes: #2210

I created the model and dataset initializers.
Initially, we will only support HF for demo purposes.
I will create a dedicated issue to support more providers.

/assign @kubeflow/wg-training-leads @varshaprasad96 @akshaychitneni @deepanker13 @helenxie-bit @Electronic-Waste @saileshd1402 @kannon92

Signed-off-by: Andrey Velichkevich <[email protected]>
coveralls commented Oct 23, 2024

Pull Request Test Coverage Report for Build 11517758882

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
  • Change from base Build 11507477280: 0.0%
  • Covered Lines: 77
  • Relevant Lines: 77

💛 - Coveralls

@kannon92 (Contributor)

Should we consider unit or e2e tests for this?

@andreyvelich (Member, Author)

> Should we consider unit or e2e tests for this?

Yeah, I will open a dedicated issue for it.

Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

# Get DataClass config from the environment variables.
# Env names must be equal to the DataClass parameters.
def get_config_from_env(config) -> Dict[str, str]:
Contributor

Do you want to add some type hints for `config`?

Member Author

Actually, I wasn't able to find a Python type that can represent a dataclass, as described here: https://stackoverflow.com/questions/54668000/type-hint-for-an-instance-of-a-non-specific-dataclass#:~:text=Despite%20its%20name%2C%20dataclasses.dataclass%20doesn%27t%20expose%20a%20class%20interface..
Any ideas on how we can add the type hint for it, @kannon92?

Contributor

Ah I see. Then it’s fine to go without.

Contributor

I’m not much of a Python dev these days, so I don’t know about the typing for dataclasses.

Contributor

Shall we use something like this?

`config: Union[DataClassA, DataClassB]`

Member Author

I tried this; it won't work. E.g. I can see this error from Pylance:

Argument of type "type[HuggingFaceDatasetConfig]" cannot be assigned to parameter "config" of type "HuggingFaceDatasetConfig" in function "get_config_from_env"
  "type[type]" is not assignable to "type[HuggingFaceDatasetConfig]"

@deepanker13 (Contributor) commented Oct 25, 2024

from dataclasses import dataclass, fields
from typing import Dict, Optional, Union
import os

@dataclass
class HuggingFaceModelInputConfig:
    storage_uri: str
    access_token: Optional[str] = None


def get_config_from_env(config: Union[HuggingFaceModelInputConfig]) -> Dict[str, str]:
    config_from_env = {}
    for field in fields(config):
        config_from_env[field.name] = os.getenv(field.name.upper())

    return config_from_env

cf1 = get_config_from_env(HuggingFaceModelInputConfig)
print(cf1)

This is working @andreyvelich

Member Author

@deepanker13 What type checker do you use in your IDE?
My VSCode complains with the error that I added above.
I am using the Pylance VSCode extension: https://github.com/microsoft/pylance-release
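For what it's worth, one pattern from the linked Stack Overflow discussion is a `typing.Protocol` that structurally matches any dataclass via its `__dataclass_fields__` attribute. This is only a sketch, not part of the PR; the config class and its fields below are illustrative:

```python
import os
from dataclasses import dataclass, fields
from typing import Any, ClassVar, Dict, Optional, Protocol, Type


class DataclassType(Protocol):
    # Every class decorated with @dataclass gains this attribute,
    # so this Protocol structurally matches any dataclass.
    __dataclass_fields__: ClassVar[Dict[str, Any]]


@dataclass
class HuggingFaceDatasetConfig:
    storage_uri: str
    access_token: Optional[str] = None


def get_config_from_env(config: Type[DataclassType]) -> Dict[str, Optional[str]]:
    # Env var names are the upper-cased dataclass field names.
    return {f.name: os.getenv(f.name.upper()) for f in fields(config)}


os.environ["STORAGE_URI"] = "hf://my-dataset"
print(get_config_from_env(HuggingFaceDatasetConfig))
```

Whether Pylance accepts `fields(config)` with this annotation would need checking; the Protocol itself is the standard structural-typing workaround.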

repo_id=model_uri,
local_dir=constants.VOLUME_PATH_MODEL,
allow_patterns=["*.json", "*.safetensors", "*.model"],
ignore_patterns=["*.msgpack", "*.h5", "*.bin"],
Contributor

We can consider ignoring `.pt` and `.pth` files as well.

Member Author

@saileshd1402 Do we know what files have the `.pt` and `.pth` extensions?

@saileshd1402 (Contributor) commented Oct 24, 2024

They store model weights similar to `.safetensors` and `.bin`, but are outdated now. For example, in meta-llama/Llama-3.1-8B there is a folder called "/original" where consolidated.00.pth stores the same data as the `.safetensors` files but is outdated. I would like to know whether others think those files are important, though.

Member Author

@lizzzcai Do you have any thoughts on whether we can exclude `.pth` and `.pt` files from the model download?
I noticed that you suggested adding the same ignore_patterns to the KServe storage initializer: kserve/kserve#3584 (comment)


Hi @andreyvelich, I don't see an issue with excluding `.pth`, `.pt`, and `.bin`, as safetensors should be the preferred format and has better security compared to the others. This assumes that the model being downloaded provides the safetensors format (most of the popular models should).

If you want to support multiple formats, you can check how vLLM supports it.

Member Author

I see, thanks for the info!
I think in the future we should follow the vLLM approach.

@@ -30,6 +30,14 @@ jobs:
dockerfile: cmd/training-operator.v2alpha1/Dockerfile
platforms: linux/amd64,linux/arm64,linux/ppc64le
tag-prefix: v2alpha1
- component-name: model-initiailizer-v2
@deepanker13 (Contributor) commented Oct 24, 2024

There is a typo in multiple places: it should be `initializer`.

Member Author

Great catch! Let me fix that.

huggingface_hub.snapshot_download(
repo_id=dataset_uri,
repo_type="dataset",
local_dir=constants.VOLUME_PATH_DATASET,
@deepanker13 (Contributor) commented Oct 24, 2024

To speed things up, should we set `max_workers` equal to the number of files being downloaded? Currently it downloads 8 files in parallel.

@andreyvelich (Member, Author) commented Oct 25, 2024

Do we have any benchmarks showing that setting `max_workers` to the number of files speeds up download time?
What if we don't have enough CPUs for all the concurrent threads?
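As a rough sketch of that trade-off (purely illustrative, not part of the PR): bound the thread count by both the file count and the available CPUs before passing it to `snapshot_download`'s `max_workers` parameter. The helper name `choose_max_workers` is hypothetical:

```python
import os


def choose_max_workers(num_files: int, default: int = 8) -> int:
    # Bound the download thread count by both the number of files and
    # the CPUs available; fall back to the library default of 8 when
    # the CPU count cannot be determined.
    cpus = os.cpu_count() or default
    return max(1, min(num_files, cpus))


# Hypothetical usage:
# huggingface_hub.snapshot_download(..., max_workers=choose_max_workers(num_files))
print(choose_max_workers(3))
```

This avoids oversubscribing CPUs on repos with many files while still parallelizing small downloads fully.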

WORKDIR /workspace

# Copy the required Python modules.
COPY cmd/initiailizer_v2/model/requirements.txt .
Contributor

Why is the folder named `cmd`? It contains Dockerfiles and requirements.txt.

Member Author

We want to be consistent across all Training Operator components, like we do in Katib: https://github.com/kubeflow/katib/tree/master/cmd
E.g.:
`cmd` contains binaries/Dockerfiles for execution.
`pkg` contains the actual backend.

logging.info(f"Config for HuggingFace dataset initiailizer: {config_dict}")
self.config = HuggingFaceDatasetConfig(**config_dict)

def download_dataset(self):
Contributor

Do we have unit tests for these, or will that be taken care of in e2e? Downloading models from HF had issues previously, so it would be helpful to have it tested.

Contributor

Never mind, I just realised we have #2305.

Signed-off-by: Andrey Velichkevich <[email protected]>
huggingface_hub.snapshot_download(
repo_id=model_uri,
local_dir=constants.VOLUME_PATH_MODEL,
allow_patterns=["*.json", "*.safetensors", "*.model"],


For "*.safetensors" and "*.model" in the allow_patterns, I would say it works in most cases.
However, a model like mistralai/Mistral-7B-Instruct-v0.3 has consolidated.safetensors and tokenizer.model.v3 (the so-called v3 format in Mistral; check their download example here).

snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)

In this case, downloading the above Mistral model with the current allow_patterns results in a download of 29 GB (double the actual size of 14.5 GB). You probably need some logic to handle the Mistral model.

Member Author

Thanks for sharing! Let me add it to the TODO list.
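A hedged sketch of what such logic could look like (everything here is illustrative; the repo file list might come from `huggingface_hub.list_repo_files`): restrict the download to the safetensors patterns only when the repo actually ships safetensors weights, otherwise fall back to downloading everything:

```python
import fnmatch
from typing import List, Optional

# Patterns used when a repo provides safetensors weights.
SAFETENSORS_PATTERNS = ["*.json", "*.safetensors", "*.model"]


def choose_allow_patterns(repo_files: List[str]) -> Optional[List[str]]:
    # huggingface_hub treats allow_patterns=None as "download all files",
    # so returning None is the permissive fallback for non-safetensors repos.
    if any(fnmatch.fnmatch(f, "*.safetensors") for f in repo_files):
        return SAFETENSORS_PATTERNS
    return None


print(choose_allow_patterns(["config.json", "model.safetensors"]))
print(choose_allow_patterns(["params.json", "consolidated.00.pth"]))
```

This still would not deduplicate repos like the Mistral example above that ship both sharded and consolidated safetensors, so extra per-repo logic would remain a TODO.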

@Electronic-Waste (Member) left a comment

I wonder if we can add a comment somewhere to let users know we only support downloading models/datasets from HuggingFace for now?

It might be more user-friendly :)

@andreyvelich (Member, Author)

> I wonder if we can add some comment somewhere to let users know we only support downloading models/datasets from HuggingFace now?

Yes, we are planning to add the supported dataset and model providers to the website.
Additionally, I will create a tracking issue to support more providers (S3, GCS, etc.).

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich (Member, Author)

Are there any other comments before we can move forward with this initial PR?
/assign @kannon92 @Electronic-Waste @deepanker13 @varshaprasad96 @saileshd1402


@andreyvelich: GitHub didn't allow me to assign the following users: varshaprasad96, saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

> Are there any other comments before we can move forward with this initial PR ?
> /assign @kannon92 @Electronic-Waste @deepanker13 @varshaprasad96 @saileshd1402

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y (Member) left a comment

Let's address other issues in a follow up.
/lgtm
/approve

@@ -0,0 +1,13 @@
FROM python:3.11-alpine
@tenzen-y (Member) commented Oct 27, 2024

Suggested change:
- FROM python:3.11-alpine
+ FROM python:3.11-slim-bookworm

@andreyvelich Could you use the Debian image, since Alpine has a performance penalty due to musl libc?
Python still depends on C code.

Member Author

Sure, let me create an issue
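For reference, a minimal sketch of what the suggested Debian-based image could look like, assuming the same WORKDIR/requirements layout shown earlier in the diff (the path below uses the corrected `initializer` spelling and is illustrative):

```dockerfile
FROM python:3.11-slim-bookworm

WORKDIR /workspace

# Copy and install the initializer's Python dependencies.
COPY cmd/initializer_v2/model/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```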


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Successfully merging this pull request may close these issues.

KEP-2170: Create dataset and model initializers
9 participants