Memory issues #97
Comments
This is interesting; how much RAM are you allocating?
I'm using an A100, so 80GB (@task(queue='gpu', executor_config={'--mem-per-gpu': '80G', '-G': '1'})). I'm not sure that we are specifying an amount of RAM. Should we be using os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = 2 or higher? (found this in another thread)
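(As an aside: XLA_PYTHON_CLIENT_MEM_FRACTION controls how much of the GPU's memory JAX preallocates, not host RAM, and it has to be set as a string before jax is imported. A minimal sketch with illustrative values, not a BindCraft-specific recommendation:)

```python
# Illustrative sketch of JAX's GPU memory environment variables.
# These affect GPU memory preallocation only, not system RAM.
import os

# Must be set before jax is imported to take effect.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate GPU memory on demand
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"   # fraction of GPU memory, passed as a string

import jax
print(jax.devices())
```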
I meant system RAM; the code compilation and PyRosetta still require a decent amount of memory to run. We normally use 32 GB.
I had one run with untrimmed EGFR (6aru) as the target and an 80GB A100 that was held in our shared computing system because it exceeded the 100GB of RAM I requested. It completed when I reran it and requested 200GB of RAM.
How big was that target? So far I have never needed that much RAM.
Chain A of 6aru is 622 residues. I tried lengths of 50-250 with hotspots and default settings. Our system uses cgroups to monitor and enforce resource sharing, and this is part of the error message I got.
I can't confirm whether that was the actual memory usage, because it was running as a batch job that I wasn't monitoring.
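(One generic way to get that number without live monitoring is to have the job record its own peak memory use. A minimal sketch using only the Python standard library, not part of BindCraft:)

```python
# Generic sketch for recording a batch job's peak memory so usage can be
# checked after the fact without live monitoring.
import resource
import sys

def log_peak_memory(label="peak RSS"):
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: {peak_kb / 1024**2:.1f} GB", file=sys.stderr)

# Example: call periodically, or once at the end of the run.
log_peak_memory()
```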
Okay, thanks a lot for the report. I will keep an eye on the memory usage settings over the coming weeks, but this should not happen.
Do these happen for you only within Docker environments, or also outside of Docker?
I've only run inside an Apptainer environment.
My error was not inside Docker or any other container.
Thank you for providing this plot! I will try to artificially limit the RAM on our system and see if I also run into it, but I will probably only have time to look at it next week.
Can you re-download bindcraft.py and functions/__init__.py from the main repo and see if it still occurs?
Hi,
I am finding that after I run a job, it seems to run out of memory after 1-2 days. I don't think this is a problem with the machine not having enough memory to run at all, since it gets through ~21-23 trajectories in the trajectory_stats.csv file. I'm also able to run the PDL1 example, but I get a similar error after some time if I try to generate 200 binders (66 trajectories). Given a previous comment that we might need to run 200-300 trajectories for "easy" targets and 2000-3500 for harder ones, I'm wondering what might be causing this.
I'm getting an error that looks like this:
<Signals.SIGKILL: 9>.; 2341581)
I'm wondering if the best practice might be to do something like shutting BindCraft down and then restarting it every 12 hours or so?
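(For reference, a rough sketch of that kind of periodic-restart wrapper. The script name, settings path, and flags are placeholders, and it assumes a rerun with the same settings simply continues adding to the existing output directory, which should be confirmed first.)

```python
# Rough sketch of the periodic-restart workaround described above.
# Assumes rerunning BindCraft with the same settings resumes/continues
# in the same output directory (worth verifying); paths and flags are
# placeholders, not the actual CLI.
import subprocess
import time

CMD = ["python", "bindcraft.py", "--settings", "settings_target/my_target.json"]
RESTART_EVERY = 12 * 60 * 60  # seconds

while True:
    proc = subprocess.Popen(CMD)
    try:
        ret = proc.wait(timeout=RESTART_EVERY)
        if ret == 0:
            break              # finished normally
        time.sleep(30)         # crashed (e.g. OOM-killed); restart
    except subprocess.TimeoutExpired:
        proc.terminate()       # stop the run and start a fresh process
        proc.wait()
        time.sleep(30)         # give the GPU a moment to free memory
```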