
Unable to build docker image with dockerfile #27

Open
tuanmp opened this issue Oct 9, 2024 · 10 comments

@tuanmp

tuanmp commented Oct 9, 2024

I am struggling to build a docker image with the provided dockerfile. Apparently the issue is that the base image is arm64 but the HPC pack is x86. Could you test again to make sure the image can be built from the dockerfile?

@azrael417
Collaborator

azrael417 commented Oct 10, 2024

Hello Tuan, thanks for reaching out. The base image is actually multi-arch and depending on what architecture you are building the image, docker will pull the image for the corresponding architecture. What is your setup for building the image? Are you trying to build it on an Arm platform (for example Grace Hopper, on a Mac with M-type CPU, etc.) and then run it on an x86 platform?

In this case, you can in principle pull the image for the other arch using docker pull --platform=<arch>; however, I would recommend building the image directly on a machine with the target arch.
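For illustration, a sketch of the two approaches mentioned above (the image name and tag are placeholders, not the actual base image of this project):

```shell
# Pull the x86_64 variant of a multi-arch image explicitly on an Arm host:
docker pull --platform=linux/amd64 my-registry/my-base-image:latest

# Or request the target platform at build time:
docker build --platform=linux/amd64 -t torchfort:amd64 .
```

Note that building for a foreign architecture only works if the host has emulation set up (e.g. QEMU/binfmt); otherwise any RUN step executing foreign-arch binaries will fail, which is one reason building natively on the target arch is the more reliable path.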

@tuanmp
Author

tuanmp commented Oct 10, 2024

Thanks for the reply. This is exactly what I’m doing. I saw that Docker supports cross-platform builds. Have you tried whether the dockerfile builds cross-platform?

@azrael417
Collaborator

Hello Tuan,

I had a look at cross-compilation and it is not really trivial. Check out https://docs.docker.com/build/building/multi-platform/#cross-compiling-a-go-application if you are interested. It seems the way to get it to work is to build a host-arch build container and then invoke cross-compilation for each target. I am not sure this is so simple; I have had very bad experiences with cross-compilation in the past.

What should work, though, is building the image on the target arch directly, pushing it to some registry (for example Docker Hub), and pulling it where you need it. Also, some systems support SquashFS images (for example systems using Pyxis for container launches). In that case you could build your image on some x86 system, dump it into an sqsh file with enroot (check https://github.com/NVIDIA/enroot), and then rsync the sqsh file over.

Lastly, you can also build TorchFort natively on the system if it does not have container build support. Building TorchFort should be rather smooth once you have built all the dependencies such as PyTorch.
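The SquashFS workflow described above could look roughly like this (paths, hostnames, and the Slurm invocation are illustrative):

```shell
# On an x86 build machine: build the image and export it as a SquashFS file.
docker build -t torchfort:latest .
enroot import -o torchfort.sqsh dockerd://torchfort:latest

# Copy the image to the target system and run it there, e.g. via Pyxis/Slurm:
rsync -avP torchfort.sqsh user@cluster:/path/to/images/
# srun --container-image=/path/to/images/torchfort.sqsh ...
```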

Let me know if you have any questions
Best
Thorsten

@tuanmp
Author

tuanmp commented Oct 31, 2024

Hi Thorsten,

Thanks for these details. I managed to build the container and run an example.

Do I understand correctly that in the training loop, say in examples/fortran/simulation/train.f90, the simulation is called with every train step? Does this mean that the training must wait for the simulation?

P/S: since this is no longer an issue, we could move this discussion somewhere else and close the issue, if you’d like.

Best,

Tuan.

@azrael417
Collaborator

Hello Tuan,

thanks for reaching out. TorchFort is designed to enable online learning or online inference in simulations, for example for cases where you want to call neural networks like library functions for analytics, or where you want to train on data you generate. Another use case is steering a simulation with reinforcement learning, where you also need a tight feedback loop between the simulation and the training/inference.

What is your use case, if I may ask? I can tell you whether TorchFort is the right package for you.

Best and thanks
Thorsten

@tuanmp
Author

tuanmp commented Oct 31, 2024

In my collaboration we have a C++-based simulator that does very low-level and precise simulation of particle detector readout data, from which we want to train ML models to reconstruct physics objects. We want to try training models simultaneously with generating data. Since the simulator has rather long latency, it might become a bottleneck if the train step has to wait for a simulation step. That’s the main concern (aside from how to make all this work :) ).

@azrael417
Collaborator

Ok, so generally you need data to train on. If that data exists offline, you can use standard tools like PyTorch directly. If you want to train while you generate, you still need to have some data somewhere. What you can do in this case is implement a replay buffer (check the RL part of the code for how it can be done) and run training by pulling data from the replay buffer. Every once in a while, when a new sample is ready, push it to the buffer and continue training.

You can also mimic this behavior with other setups, for example running a regular PyTorch training with data pulled from a database that is changed while training runs. The most efficient setup will depend on how fast you can generate data and whether you can afford to train on old data while waiting for new data (in the sense that you want to avoid overfitting on the old data, which could be a problem if data generation is too slow). So for example, if a simulation step takes 1 hr but a training step less than a second, then the limiter is the simulation, but that is also what you would have in the offline case, right?
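A minimal sketch of the producer/consumer pattern described above, using a plain Python deque as the replay buffer. All names and the toy "simulation" loop are illustrative, not TorchFort API:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer: oldest samples are evicted as new ones arrive."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, sample):
        self.buffer.append(sample)

    def sample(self, batch_size):
        # Draw a random mini-batch; training can keep reusing old samples
        # while the (slow) simulation produces new ones.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=1000)

# Simulation side: push a new sample whenever one is ready.
for step in range(10):
    buf.push((f"input_{step}", f"target_{step}"))

# Training side: pull mini-batches independently of the simulation rate.
batch = buf.sample(batch_size=4)
print(len(buf), len(batch))  # 10 4
```

In a real setup the two loops would run concurrently (separate processes or threads), with the simulation pushing and the trainer sampling; the deque's `maxlen` gives the eviction of stale samples for free.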

@tuanmp
Author

tuanmp commented Oct 31, 2024

Yes, exactly. We want to set up a buffer to which the simulated data is dumped, and simultaneously run PyTorch training which draws data from this buffer. That is what TorchFort is written to handle, right?

@azrael417
Collaborator

TorchFort is written to handle a tight integration of PyTorch training/inference with a simulation. Basically, it allows you to plug that directly into your C/C++/Fortran code without having to run a Python interpreter. If you are using C++, you can also implement this with libtorch directly, but you would need to implement some of the things we have written on your own.

Generally, TorchFort does not have a replay buffer for supervised training at the moment, but we could implement one easily if there is demand.

@azrael417
Collaborator

Hello Tuan, I wanted to ask if I can close this issue or do you have more questions?
