@vsoch and @alecbcs: Thanks for putting this together. I have a few questions:
-
@alecbcs and I chatted today about the needs of Ramble / Benchpark in the context of cloud (Kubernetes), and since we don't have a CLA and cannot contribute (yet!), we want to include some of our notes here.
Workflow Tool or Something Else?
We talked about "What is ramble" and we like the idea / use case of being able to run ramble from your local machine and have it submit jobs to Kubernetes. Part of this might include building containers (e.g., with a spack or other base) and then also submitting Jobs for them. However, what we can't do is submit something to Kubernetes, with the cluster already up, and then wait for any kind of build. So we are proposing the following idea:
Stage 1: Prepare Container Bases
Note that this could be run in CI, meaning that we don't have the result / output files for the various configs. The reason we don't have ramble workspace setup is because we aren't going to be saving the experiment yaml config files. ramble build would share logic with ramble workspace setup, and a user could do both steps by way of:
Also note that we are doing ramble build instead of ramble workspace build because potentially we could build other things.
Potential output:
=> You have successfully built my-experiment into ghcr.io/dinosaur-is-the-best/my-experiment:latest
# Then I (as the user) push to a registry. Note that this could be done in a CI workflow
docker push ghcr.io/dinosaur-is-the-best/my-experiment:latest
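As a rough illustration of what the build stage could produce (this is a sketch, not an existing Ramble feature), the snippet below renders a Dockerfile from a base image and a list of spack specs. The function name render_dockerfile, the base image tag, and the example spec are all assumptions for illustration, and it assumes the base image has spack on its PATH.

```python
def render_dockerfile(base_image, spack_specs):
    """Return Dockerfile text that installs each spack spec on top of base_image.

    Hypothetical helper: this only sketches what a build step might generate.
    """
    lines = [f"FROM {base_image}"]
    for spec in spack_specs:
        # One RUN layer per spec so a failed install is easy to locate.
        lines.append(f"RUN spack install {spec}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed base image and spec, chosen only as an example.
    print(render_dockerfile("spack/ubuntu-jammy:latest", ["lammps"]), end="")
```

Redirecting the output to a file named Dockerfile and running docker build -t ghcr.io/dinosaur-is-the-best/my-experiment:latest . would then give an image to push as shown above.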
Stage 2: Run Workflow on Kubernetes
First do steps to get your cluster. Once you have your cluster nodes, still from our local machine
Ramble would then submit Kubernetes jobs (or other abstractions like operators) associated with each experiment. The containers would be deployed, generate some result, and that would be saved to a mounted RWX storage or some other artifact cache.
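To make the idea of one Kubernetes Job per experiment more concrete, here is a minimal sketch that builds batch/v1 Job manifests as plain dictionaries, each mounting a shared PersistentVolumeClaim for results. The helper name experiment_job, the claim name ramble-results, the mount path /results, and the experiment names are assumptions for illustration; nothing here reflects how Ramble is actually implemented.

```python
import json


def experiment_job(name, image, command, pvc_name="ramble-results", mount_path="/results"):
    """Build a batch/v1 Job manifest (as a dict) for a single experiment.

    pvc_name and mount_path are hypothetical defaults; the claim is assumed
    to exist already and to support ReadWriteMany (RWX) access.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed experiment
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "experiment",
                            "image": image,
                            "command": command,
                            "volumeMounts": [
                                {"name": "results", "mountPath": mount_path}
                            ],
                        }
                    ],
                    "volumes": [
                        {
                            "name": "results",
                            "persistentVolumeClaim": {"claimName": pvc_name},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    image = "ghcr.io/dinosaur-is-the-best/my-experiment:latest"
    # Hypothetical experiment names; each one becomes its own Job.
    names = ["my-experiment-n1", "my-experiment-n2"]
    jobs = [
        experiment_job(n, image, ["/bin/sh", "-c", f"echo done > /results/{n}.out"])
        for n in names
    ]
    # Wrap the Jobs in a v1 List so they can be applied in one call.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": jobs}, indent=2))
```

Since kubectl accepts JSON on stdin, the output could be piped into kubectl apply -f - from the local machine once the cluster is up. A real implementation would more likely talk to the Kubernetes API directly, or hand off to an operator.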
Pulling Results
After an experiment has run (that I've submitted from my laptop), how do I get results? If the workflow above has a "push" action to some namespaced registry or storage, then ramble could have an equivalent "pull" or some derivative of "ramble analyze" to get the results and then analyze them.
To be clear, in the above:
The complexity of the above is really the number of different compute APIs (from VMs to Kubernetes) that warrant being submitted to - it's not always just a simple script.
@douglasjacobsen and @pearce8 let us know what you think! We can't contribute directly but if there is an indirect way we can try / test that would be great.