Unable to build docker image with dockerfile #27
Hello Tuan, thanks for reaching out. The base image is actually multi-arch, and depending on which architecture you are building on, Docker will pull the image for the corresponding architecture. What is your setup for building the image? Are you trying to build it on an Arm platform (for example Grace Hopper, or a Mac with an M-series CPU) and then run it on an x86 platform? In that case, you can in principle pull the image for the other arch, using […]
Thanks for the reply. This is exactly what I’m doing. I saw that Docker enables cross-platform builds. Have you tried to see whether the Dockerfile builds cross-platform?
Hello Tuan, I had a look at cross-compilation and it is not really trivial; check out https://docs.docker.com/build/building/multi-platform/#cross-compiling-a-go-application if you are interested. It seems the way to get it to work is to create a host-arch build container and then invoke cross-compilation for each target. I am not sure this is so simple; I have had very bad experiences with cross-compilation in the past.

What should work, though, is building the image on the target arch directly, pushing it to some registry (for example Docker Hub), and pulling it where you need it. Also, some systems support SquashFS images (for example systems using Pyxis for container launches). In that case you could build your image on some x86 system, dump it into an sqsh file with enroot (see https://github.com/NVIDIA/enroot), and then rsync the sqsh file over.

Lastly, you can also build TorchFort natively on the system if it does not have container build support. Building TorchFort should be rather smooth once you have built all the dependencies, such as PyTorch. Let me know if you have any questions.
Hi Thorsten, thanks for these details. I managed to build the container and run an example. Do I understand correctly that in the training loop, say in […]

P.S.: since this is no longer an issue, we could move this discussion somewhere else and close the issue, if you’d like. Best, Tuan.
Hello Tuan, thanks for reaching out. TorchFort is designed to allow online learning or online inference in simulations, for example for cases where you want to call NNs like library functions for analytics, or where you want to train on data you generate. Another use case is steering a simulation with reinforcement learning, in which case you also need a tight feedback loop between the simulation and the training/inference. What is your use case, if I may ask? I can tell you whether TorchFort is the right package for you. Best, and thanks
In my collaboration we have a C++-based simulator that does very low-level and precise simulation of particle-detector readout data, from which we want to train ML models to reconstruct physics objects. We want to try training models simultaneously with generating data. Since the simulator has rather long latency, it might become a bottleneck during training if the train step has to wait for a simulation step. That’s the main concern (aside from how to make all this work :) ).
Ok, so generally you need data to train on. Either that data is offline, in which case you can use standard tools like PyTorch directly, or you want to train while you generate, in which case you need to have some data somewhere. What you can do in the latter case is implement a replay buffer (check the RL part of the code for how it can be done) and run training by pulling data from the replay buffer; every once in a while, when a new sample is ready, push it to the buffer and continue training. You can also mimic this behavior with other setups, for example running a regular PyTorch training with data pulled from a database which changes while training runs.

The most efficient setup will depend on how fast you can generate data and whether you can afford to train on old data while waiting for new data (in the sense that you want to avoid overfitting on the old data, which could be a problem if data generation is too slow). For example, if a simulation step takes 1 hour but a training step takes less than a second, then the limiter is the simulation, but that is also what you would have in the offline case, right?
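The replay-buffer idea described above can be sketched in plain Python. This is just an illustration of the pattern, not TorchFort code; the `ReplayBuffer` class, its capacity, and the toy samples are all made up for the example:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer: the oldest samples are evicted as new ones arrive."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, sample):
        # Called whenever the simulator finishes producing a new sample.
        self.buffer.append(sample)

    def sample(self, batch_size):
        # Draw a random minibatch with replacement, so sampling stays valid
        # even while the buffer is still filling up.
        return [random.choice(self.buffer) for _ in range(batch_size)]

buf = ReplayBuffer(capacity=1000)
for step in range(10):
    buf.push((step, 2 * step))   # hypothetical (input, label) pairs from the simulator
batch = buf.sample(4)            # minibatch for one training step
```

The training loop would pull a minibatch from `buf` on every step, while the simulator pushes new samples whenever they become ready; the fixed capacity bounds memory and gradually ages out stale data.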
Yes, exactly. We want to set up a buffer to which the simulated data is dumped, and simultaneously run PyTorch training which draws data from this buffer. That is what TorchFort is written to handle, right? |
TorchFort is written to handle tight integration of PyTorch training/inference with a simulation: it lets you plug that directly into your C/C++/Fortran code without having to run a Python interpreter. If you are using C++, you can also implement this with libtorch directly, but you would need to implement some of the pieces we have written on your own. Generally, TorchFort does not currently have a replay buffer for supervised training, but we could implement one easily if there is demand.
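As a rough illustration of the overall pattern being discussed (a simulator dumping samples into a shared buffer while a training loop simultaneously draws from it), here is a minimal sketch using Python threads and standard-library primitives only. The "simulator" and all names are hypothetical stand-ins; a real setup would run an optimizer step where the comment indicates:

```python
import threading
import time
import random
from collections import deque

buffer = deque(maxlen=100)   # shared replay buffer
lock = threading.Lock()
done = threading.Event()

def simulator():
    # Stand-in for the (slow) C++ simulator: produces (input, label) pairs.
    for i in range(20):
        sample = (i, 2 * i)          # hypothetical generated sample
        with lock:
            buffer.append(sample)
        time.sleep(0.001)            # simulated per-sample latency
    done.set()

def trainer(steps):
    trained = 0
    while trained < steps:
        with lock:
            if not buffer:
                continue             # wait until the first sample arrives
            batch = random.sample(list(buffer), min(4, len(buffer)))
        # ... run one optimizer step on `batch` here ...
        trained += 1
    return trained

t = threading.Thread(target=simulator)
t.start()
steps_run = trainer(50)              # training proceeds while simulation runs
t.join()
```

Because the trainer resamples from whatever the buffer currently holds, it can take many training steps per simulation step, which is exactly the regime where slow data generation risks overfitting on old samples.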
Hello Tuan, I wanted to ask if I can close this issue, or do you have more questions?
I am struggling to build a Docker image with the Dockerfile provided. Apparently the issue is that the base image is arm64 but the HPC pack is x86. Could you test again to make sure the image can be built from the Dockerfile?