
Memory issues #97

Open
TimCraigCGPS opened this issue Nov 11, 2024 · 13 comments
Labels: help wanted (Extra attention is needed)

Comments

@TimCraigCGPS

Hi,
I'm finding that jobs run out of memory after 1-2 days. I don't think the machine simply lacks enough memory to run at all, since a job gets through roughly 21-23 trajectories in the trajectory_stats.csv file before failing. I can also run the PDL1 example, but I get a similar error after some time if I try to get 200 binders (66 trajectories completed). Given a previous comment that "easy" targets might need 200-300 trajectories and harder ones 2000-3500, I'm wondering what might be causing this.

I'm getting an error that looks like this:
<Signals.SIGKILL: 9>.; 2341581)

Would the best practice be to do something like shut BindCraft down and restart it every 12 hours or so?
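In case it helps, this is a minimal sketch of what I mean by a scheduled restart. The bindcraft.py flags are placeholders for whatever invocation you normally use, and it assumes a relaunched run picks up from its existing output directory:

```python
import subprocess

# Placeholder invocation; substitute your usual bindcraft.py command and flags.
CMD = ["python", "bindcraft.py", "--settings", "settings_target/PDL1.json"]
RESTART_SECONDS = 12 * 3600  # kill and relaunch every 12 hours

while True:
    try:
        # Assumes BindCraft continues from the existing output folder on relaunch.
        subprocess.run(CMD, timeout=RESTART_SECONDS)
        break  # the process exited on its own before the timeout
    except subprocess.TimeoutExpired:
        continue  # hit the 12 h limit; start a fresh process to release leaked RAM
```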

@martinpacesa
Owner

This is interesting; how much RAM are you allocating?

@TimCraigCGPS
Author

TimCraigCGPS commented Nov 12, 2024

I'm using an A100, so 80 GB (@task(queue='gpu', executor_config={'--mem-per-gpu': '80G', '-G': '1'})). I'm not sure that we are specifying an amount of RAM anywhere.

Should we be using os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "2" or higher? (I found this in another thread.)
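For context, this is how those XLA environment variables are usually set; a sketch based on my understanding of JAX rather than anything BindCraft-specific. Note they control GPU memory preallocation, not host RAM, the values must be strings, and they must be set before jax is imported:

```python
import os

# Controls how XLA/JAX allocates GPU memory; values are examples, not recommendations.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate GPU memory on demand
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.90"  # or preallocate up to 90% (default ~0.75)

import jax  # import only after the variables are set so XLA picks them up
```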

@martinpacesa
Owner

I meant computer RAM; the code compilation and PyRosetta still require a decent amount of memory to run. We normally use 32 GB.

@agitter

agitter commented Nov 16, 2024

I had one run with untrimmed EGFR (6aru) as the target on an 80 GB A100 that was held by our shared computing system because it exceeded the 100 GB of RAM I requested. It completed when I reran it and requested 200 GB of RAM.

@martinpacesa
Owner

How big was that target? So far I have never needed that much RAM.

@agitter

agitter commented Nov 18, 2024

Chain A of 6aru is 622 residues. I tried binder lengths of 50-250 with hotspots and default settings.

Our system uses cgroups to monitor and enforce resource sharing, and this is part of the error message I got:

Job has gone over cgroup memory limit of 102400 megabytes. Last measured usage: 261 megabytes.  Consider resubmitting with a higher request_memory.

I can't confirm whether that reported usage was accurate because the job ran in a batch setting that I wasn't actively monitoring.
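If it helps to confirm the numbers, here is a rough sketch of how the job's resident memory could be logged over time while it runs. It assumes psutil is available, and the bindcraft.py invocation is a placeholder:

```python
import subprocess
import time

import psutil

proc = subprocess.Popen(["python", "bindcraft.py"])  # placeholder invocation
parent = psutil.Process(proc.pid)

with open("memory_log.csv", "w") as log:
    log.write("seconds,rss_gb\n")
    start = time.time()
    while proc.poll() is None:
        time.sleep(60)
        try:
            # Sum resident memory of the main process and any workers it spawned.
            rss = parent.memory_info().rss
            rss += sum(c.memory_info().rss for c in parent.children(recursive=True))
        except psutil.NoSuchProcess:
            continue  # a process exited mid-measurement; try again next interval
        log.write(f"{time.time() - start:.0f},{rss / 1e9:.2f}\n")
        log.flush()
```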

@martinpacesa
Owner

Okay, thanks a lot for the report. I will keep an eye on memory usage over the coming weeks; this should not be happening.

@martinpacesa added the "help wanted" label on Nov 18, 2024
@martinpacesa
Owner

Does this happen for you within Docker environments, or also outside of Docker?

@agitter

agitter commented Dec 5, 2024

I've only run it inside an Apptainer environment.

@TimCraigCGPS
Author

My error did not occur inside Docker or any other container.

@Masterchiefm

[screenshot: plot of memory usage climbing steadily over the course of the run]
An interesting thing: memory usage steadily increases over time. Some data must be accumulating continuously without being cleared in time. A scheduled restart can temporarily resolve the issue.

@martinpacesa
Owner

Thank you for providing this plot! I will try to artificially limit the RAM on our system and see if I run into it as well, but I will probably only have time to look at it next week.
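For what it's worth, one way to impose such a limit from within Python on Linux is resource.setrlimit. This is only a sketch: RLIMIT_AS caps virtual address space rather than resident memory, so it approximates rather than reproduces a cgroup limit.

```python
import resource

# Cap the process's virtual address space at roughly 32 GB (example value).
limit_bytes = 32 * 1024**3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

# Running the normal BindCraft entry point after this makes allocations beyond
# the cap fail with MemoryError instead of waiting for the cluster to SIGKILL the job.
```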

@martinpacesa
Owner

Can you update bindcraft.py and functions/__init__.py from the main repo and see if it still occurs?
