
Include multiple files for TrainingClient().create_job() #2233

Open
u66u opened this issue Aug 23, 2024 · 2 comments

Comments

u66u commented Aug 23, 2024

What you would like to be added?

I want to create a base Docker image that will be used for training different models. It shouldn't include any training-specific files and should serve only as a base layer; I then pass it as base_image to TrainingClient().create_job().

Then I write the training code in my Kubeflow notebook or on my local machine:

from kubeflow.training import TrainingClient

def train_func():
    ...

TrainingClient().create_job(
    train_func=train_func,
    ...
)

However, only the source code of train_func gets copied onto the Kubernetes cluster, so I can't import anything from my local modules; I can only use pip packages that are already in my base Docker image. I also can't import anything from the file that train_func itself is defined in:

from kubeflow.training import TrainingClient

def train():
    ...

def train_func():
    from my_module import dataset  # unresolved inside the job
    x = dataset
    y = train()  # unresolved too: train() is defined outside train_func
Unresolved import error in both cases^
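To see why this happens: the SDK ships only the function's source text and executes it inside the job's container (the exact mechanism is assumed here), so any name defined outside the function body never arrives. A minimal local simulation of that behavior, with my_module standing in for a module that exists locally but not in the image:

```python
import textwrap

# The text the SDK would copy -- just the function body, nothing around it.
source = textwrap.dedent("""
    def train_func():
        from my_module import dataset
        x = dataset
        y = train()
""")

# Execute the copied text in a fresh namespace, as the job roughly does:
ns = {}
exec(source, ns)
try:
    ns["train_func"]()
except ImportError as e:
    # my_module is not installed in the "container" namespace
    print("fails remotely:", e)
```

Both failing imports in the snippet above come down to this: the job sees only the text of train_func, not the module it was written in.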

Is there any way to include multiple files in TrainingClient().create_job() or TrainingClient().train() without using yaml configs and kubectl and without adding them to my docker image?

Why is this needed?

It avoids having to rebuild your Docker image or write YAML configs each time you want to run a new training job.

Love this feature?

Give it a 👍 We prioritize the features with most 👍

@andreyvelich
Member

Thank you for creating this @u66u! We have discussed exactly this capability with various users. I think we have two options to solve this problem:

  1. Unify the file-system (share PVC) between Kubeflow Notebook and Training Job. In that case, you can use the SDK like this:
from kubeflow.training import TrainingClient

def submit_to_trainjob():
    from my_model import train
    train()

TrainingClient().train(
    name="my-job",
    train_func=submit_to_trainjob,
)

Since the my_model.py file will be located inside the TrainJob, you should be able to run the training script.
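A rough local sketch of why the shared-filesystem option works: once my_model.py is on a filesystem the job can see, a plain import resolves it. Here a temporary directory stands in for the PVC mount point; all paths and names are illustrative:

```python
import os
import sys
import tempfile

# Simulate the shared volume locally: write my_model.py into a directory
# that plays the role of the PVC mount.
shared = tempfile.mkdtemp()
with open(os.path.join(shared, "my_model.py"), "w") as f:
    f.write("def train():\n    return 'trained'\n")

def submit_to_trainjob():
    # Inside the real TrainJob, the PVC mount path would already exist;
    # here we add our stand-in directory to the import path.
    sys.path.insert(0, shared)
    from my_model import train
    return train()

print(submit_to_trainjob())  # -> trained
```

The key point is that nothing about my_model travels with the function source; it is found on disk at call time.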

  2. Improve our Kubeflow Training SDK to automatically build a Docker image with your source code, and use this Docker image in the distributed training nodes (this is what Fairing did before: https://github.com/kubeflow/fairing). However, this would require a Docker runtime in your environment.
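For the image-building option, the SDK might generate something like the following Dockerfile on the user's behalf. This is only a sketch: the function name, base image, file list, and layout are all hypothetical, not a real SDK API.

```python
# Hypothetical helper: layer the user's source files on top of their
# existing base image so the job can import them.
def build_dockerfile(base_image, source_files):
    lines = [f"FROM {base_image}"]
    lines += [f"COPY {path} /app/{path}" for path in source_files]
    lines.append('ENV PYTHONPATH="/app:${PYTHONPATH}"')
    return "\n".join(lines)

print(build_dockerfile("my-registry/train-base:latest",
                       ["my_module.py", "train.py"]))
```

The generated image would then replace base_image in create_job(), so local modules resolve with ordinary imports.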

We are looking for various options on how to distribute the user's training code into TrainJob resources.
If you have any other suggestions, please let us know @u66u

cc @kubeflow/wg-training-leads @shravan-achar

@andreyvelich
Member

/remove-label lifecycle/needs-triage
/area sdk
