What would you like to be added?
I want to create a base Docker image that will be used for training different models. It shouldn't include any training-specific files and should serve only as a base layer; I would then use it as base_image in TrainingClient().create_job().
Then I write the training code in my Kubeflow notebook or from my local machine:
However, only the source of train_func itself gets copied onto the Kubernetes cluster, so I can't import anything from my local modules; I can only use pip packages that are already installed in my base Docker image. I also can't import anything from the file that train_func is defined in:
```python
from kubeflow.training import TrainingClient

def train():
    ...

def train_func():
    from my_module import dataset  # fails: my_module only exists locally
    x = dataset
    y = train()  # fails: train() is defined outside train_func
```

Unresolved import error in both cases.
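The failure mode can be sketched with a minimal stdlib example, assuming (as described above) that the SDK captures only train_func's own source as text and executes it on the node:

```python
# Roughly what gets shipped to the cluster: only train_func's own source,
# captured as a string. Helpers and local modules stay on the local machine.
shipped = (
    "def train_func():\n"
    "    y = train()\n"   # train() was defined outside train_func
    "    return y\n"
)

node_globals = {}          # a fresh interpreter on the training node
exec(shipped, node_globals)
try:
    node_globals["train_func"]()
except NameError as e:
    print(e)               # name 'train' is not defined
```

The same applies to `from my_module import dataset`: the import statement ships, but my_module.py never leaves the local machine, so it raises ModuleNotFoundError on the node.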
Is there any way to include multiple files in TrainingClient().create_job() or TrainingClient().train() without using YAML configs and kubectl, and without adding them to my Docker image?
Why is this needed?
It avoids having to rebuild your Docker image or write YAML configs each time you want to run a new training job.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.
Since the my_module.py file will be located in the TrainJob, you should be able to run the training script.
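The comment above holds once the module file actually sits next to the entrypoint on the node, since the script's directory is on sys.path; a minimal stdlib sketch (file names and contents are hypothetical):

```python
import os
import subprocess
import sys
import tempfile

# If my_module.py is shipped into the TrainJob alongside the entrypoint,
# the import resolves because the entrypoint's directory is on sys.path.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "my_module.py"), "w") as f:
        f.write("dataset = [1, 2, 3]\n")
    with open(os.path.join(d, "entry.py"), "w") as f:
        f.write("from my_module import dataset\nprint(dataset)\n")
    result = subprocess.run([sys.executable, "entry.py"], cwd=d,
                            capture_output=True, text=True)

print(result.stdout.strip())  # [1, 2, 3]
```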
Improve our Kubeflow Training SDK to automatically build a Docker image with your source code and use that image in the distributed training nodes (this is what Fairing did before: https://github.com/kubeflow/fairing). However, this requires a Docker runtime to be running in your environment.
We are exploring various options for distributing the user's training code into TrainJob resources.
If you have any other suggestions, please let us know @u66u