BUG: fit_elastic_tensor hangs #659
Comments
Hi @katnykiel, we saw this issue too. My colleague @danielzuegner opened a PR to fix the schema issue at #651. I think the issue is that the schema expects the fitting to succeed, and when it does not, the document cannot be created; i.e., I think it will only affect "failed" calculations. Tagging @mjwen as someone who most recently worked on this, and @utf for visibility/merging the fix.
Fixed in #651
This issue was marked as completed, but I am still having the same issue of fit_elastic_tensor running for hours without completing. I have noticed the following behavior:
Any ideas why this could be happening? I re-installed atomate2 directly from the repository, after the changes yesterday were committed:
I then had to apply two quick fixes for other atomate2 errors:
All of the previous fireworks in this workflow complete; I have checked the run directories and there are no errors returned by VASP.
Thanks for flagging this again @katnykiel. Would you be willing to share one of the structures that is taking a long time, and also the code you're using to submit the workflow?
Thanks for sharing that code. I just tried running the elastic workflow myself on the structure you shared. I have a feeling you're right that the issue is to do with the database access. It could be resolving the output references which is taking a long time. Are you able to run the following code and let me know how long it takes? All this does is use the database to get the inputs for the fit_elastic_tensor job.

```python
from fireworks import LaunchPad
from jobflow import SETTINGS

# connect to the job store and launchpad
store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

# change this to be the fw id of the "fit_elastic_tensor" job
fw_id = 1

fw = lpad.get_fw_by_id(fw_id)
# the jobflow Job is stored inside the firework's first firetask
job = fw.tasks[0]["job"]
job.resolve_args(store=store)
```
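If it helps to put a number on "how long it takes", here is a minimal timing sketch under the same assumptions as the snippet above (a working launchpad/job store configuration; the fw_id is a placeholder):

```python
import time

from fireworks import LaunchPad
from jobflow import SETTINGS

# connect to the job store and launchpad, as in the snippet above
store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

fw_id = 1  # placeholder: set this to the fw id of the fit_elastic_tensor firework
fw = lpad.get_fw_by_id(fw_id)
job = fw.tasks[0]["job"]

# time how long resolving the job's input references takes
start = time.perf_counter()
job.resolve_args(store=store)
print(f"resolve_args took {time.perf_counter() - start:.1f} s")
```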
Alex, thanks for your help on this! It's a bit of a relief to have narrowed down an error that has been eluding me for weeks. I was able to run the code you sent; it finished in about 50 minutes, confirming that this is indeed an issue in accessing the database. I tried running both locally (M2 mac) and in an HPC environment, and both took the same time. I am able to access the database through other methods in reasonable times (mongosh, lpad, mongoDB playground, pymongo). Is there something specific about how the job store queries the database that could explain this? In the meantime I will dive into the code to try and identify what might be the issue.
I have a strong feeling that the issue is related to materialsproject/jobflow#408. I will do my best to implement a fix for that issue in the next few days.
Thanks for re-raising this @katnykiel. Can I clarify and ask if this is a performance issue you're seeing for all versions of atomate2/jobflow, or just new versions? Regarding database size, 20,000 documents is not typically a "large" database, but it can be if running in a resource-constrained environment, and it can be useful to find issues like this which will affect everyone, even if they're running on a larger server. Not sure if it'd be useful, but you can look at where the database call is actually happening when you run the test above; I see this query filters on the job UUID. Regardless, if it's just a document size issue, there may be no mystery here beyond what's already described in the jobflow issue.
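If document size is a suspect, one quick way to check it is to measure the BSON-encoded size of a single output document; a minimal sketch, assuming pymongo/bson are installed and using a placeholder UUID:

```python
import bson  # ships with pymongo

from jobflow import SETTINGS

store = SETTINGS.JOB_STORE
store.connect()

uuid = "replace-with-a-job-uuid"  # placeholder: a real job UUID from your store

# fetch the full output document and report its BSON-encoded size
doc = store.query_one({"uuid": uuid})
if doc is not None:
    size_mb = len(bson.encode(doc)) / 1e6
    print(f"output document is ~{size_mb:.1f} MB")
```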
@mkhorton I have only calculated elastic constants with atomate2's 0.0.12 release (or later), but I can try downgrading to older versions to see if that changes the performance. I do know that when my database only contained ~2000 documents, the fit_elastic_tensor fireworks ran much faster (<1 min). Thank you for the insight on the database call; that is definitely helpful in my debugging. I'll update here with any results I find. In the immediate short term I am just building an ElasticDocument directly from queried runs, in the interest of obtaining results in a timely manner for my PI.
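For anyone needing a similar stopgap, one possible shape of that workaround is to fit the tensor directly with pymatgen from queried strain/stress pairs. This is a sketch under assumptions (the helper name is hypothetical, the strains/stresses are assumed to come from your own query of the completed deformation runs, and the kBar-to-GPa conversion assumes VASP stresses); it is not necessarily the approach used here:

```python
from pymatgen.analysis.elasticity.elastic import ElasticTensor
from pymatgen.analysis.elasticity.strain import Strain
from pymatgen.analysis.elasticity.stress import Stress


def fit_from_queried_runs(raw_strains, raw_stresses):
    """Fit an elastic tensor from strain/stress pairs (3x3 arrays) queried
    from the completed deformation calculations (hypothetical helper)."""
    strains = [Strain(s) for s in raw_strains]
    # assumption: stresses come from VASP in kBar with the opposite sign
    # convention, so convert to GPa; drop the factor if already in GPa
    stresses = [-0.1 * Stress(s) for s in raw_stresses]
    return ElasticTensor.from_independent_strains(strains, stresses)
```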
@katnykiel, I don't think @mkhorton is suggesting to use an older version of atomate2. Nothing has changed in atomate2 that could be causing this issue. Instead, it seems like your database could be resource limited (e.g., not enough RAM or not configured correctly). The jobflow issue I mentioned will exacerbate the problem, but for a well-configured database it shouldn't be a limitation with only 20k documents. To test your database, can you try running:

```python
from fireworks import LaunchPad
from jobflow import SETTINGS

# connect to the job store and launchpad
store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

# change this to be the fw id of the "fit_elastic_tensor" job
fw_id = 1

fw = lpad.get_fw_by_id(fw_id)
job = fw.tasks[0]["job"]
uuid = job.input_references[0].uuid

# test 1
_ = store.query_one({"uuid": uuid})

# test 2
_ = store.query_one({"uuid": uuid}, properties=["output.structure"])

# test 3
_ = store.get_output(uuid, load=True)
```

There are 3 tests in there; it would be useful to know the timings for each of them. If the database is configured OK, each one should take substantially less than 30 seconds (realistically 5-10 seconds max). To clarify, this time we are only querying the database for the output of 1 calculation (the initial relaxation) rather than the full 25 calculations that are used to fit the elastic tensor (i.e., what we tested in the previous timing test).
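To collect those timings in one go, a small wrapper like the following could be used; a minimal sketch under the same assumptions as above (placeholder fw_id, a working launchpad/job store connection):

```python
import time

from fireworks import LaunchPad
from jobflow import SETTINGS

store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

fw_id = 1  # placeholder: the fw id of the fit_elastic_tensor firework
fw = lpad.get_fw_by_id(fw_id)
job = fw.tasks[0]["job"]
uuid = job.input_references[0].uuid

# the three tests from the snippet above, wrapped for timing
tests = {
    "test 1 (full document)": lambda: store.query_one({"uuid": uuid}),
    "test 2 (structure only)": lambda: store.query_one(
        {"uuid": uuid}, properties=["output.structure"]
    ),
    "test 3 (get_output)": lambda: store.get_output(uuid, load=True),
}

for name, run in tests.items():
    start = time.perf_counter()
    run()
    print(f"{name}: {time.perf_counter() - start:.1f} s")
```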
@utf Thank you for the clarification on using older versions. I ran the three tests you listed and each completed in <30 seconds (22s, 23s, 25s). However, when I submitted 3 scripts in parallel, that time increased to ~40 seconds per test. I will look into the database deployment to ensure I'm configuring it correctly. Thanks y'all for your help!
Thanks @katnykiel. When you consider that the outputs from 25 calculations are needed to fit the elastic tensor, that equates to about 25 seconds * 50 requests (due to a quirk in how dynamic workflow results are stored, two database requests are required for each elastic relaxation job) ~= 20 minutes. Obviously, this is still below your previous observation of 40 minutes, but perhaps the fact that the real workflow is not requesting the same output repeatedly slows things down further. The minimal difference between the test times (test 1, test 2, test 3) also indicates to me that fixing the jobflow issue won't impact what you're seeing here; i.e., that fix would reduce the amount of data transferred, but your tests show minimal change in the time whether the full output document is requested or just a single field such as the structure. I think trying to optimise your database is the best bet for now. If you have access to the server, you could check the RAM usage and whether you're having to use swap space during database requests. Out of interest, how much RAM does your server have?
@utf I'm running the database from a K8s deployment of a mongoDB docker image through my university HPC system. I believe I have access to 16 GB RAM, of which I am using about 20%. I will reach out to my university's HPC support staff and see if they are able to help pin down any errors in the configuration of the database.
Sometimes it can be something trivial like the Docker daemon not having access to enough memory even if the host itself has sufficient memory, e.g. flags that limit the memory available to the container.
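If useful, MongoDB's own view of its memory budget can be checked from Python; a minimal sketch, assuming pymongo is installed, the connection string below is a placeholder for your deployment, and the connecting user has permission to run serverStatus:

```python
from pymongo import MongoClient

# placeholder connection string; use the same host/credentials as your job store
client = MongoClient("mongodb://localhost:27017")

# serverStatus reports memory usage and the WiredTiger cache configuration
status = client.admin.command("serverStatus")
cache = status.get("wiredTiger", {}).get("cache", {})

# WiredTiger's cache is the main in-memory working set for MongoDB
print("configured cache bytes:", cache.get("maximum bytes configured"))
print("bytes currently in cache:", cache.get("bytes currently in the cache"))
print("resident memory (MB):", status.get("mem", {}).get("resident"))
```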
I wanted to follow up on this thread: I configured my database deployment to use more memory (from about 4 GB to 16 GB) and now the database calls are running much faster. A single call takes around a second, and an entire fit_elastic_tensor firework now completes in a reasonable time. It seems this issue was specific to my database deployment, so feel free to close out the issue. Thanks y'all for your help!
That's great to hear! Thanks for all the detailed debugging.
issue
I have been encountering more issues while using the ElasticMaker workflow; specifically, in the fit_elastic_tensor firework. All previous fireworks in this workflow complete without error. The fit_elastic_tensor firework results in one of the following outcomes:
- runs until it reaches the job wall time (~4 hrs)
- fizzles and returns the following error:
I have tried BaseVaspMaker and M3GNetRelaxMaker workflows, but both seem to experience all three of these outcomes.
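For reference, a minimal sketch of how such a workflow is typically built and submitted through FireWorks; this assumes the standard VASP ElasticMaker with default settings and a POSCAR on disk, which may differ from the submission script actually used in this report:

```python
from fireworks import LaunchPad
from jobflow.managers.fireworks import flow_to_workflow
from pymatgen.core import Structure

from atomate2.vasp.flows.elastic import ElasticMaker

# hypothetical input structure; the structure shared in this issue is not reproduced here
structure = Structure.from_file("POSCAR")

# build the elastic constant flow (relaxation, deformations, fit_elastic_tensor)
flow = ElasticMaker().make(structure)

# convert the jobflow Flow to a FireWorks workflow and add it to the launchpad
wf = flow_to_workflow(flow)
lpad = LaunchPad.auto_load()
lpad.add_wf(wf)
```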
I don't recall fit_elastic_tensor fireworks taking ~1 hr in the past; is there a bug somewhere in atomate2/common/schemas/elastic.py?
environment
installed with pip install "atomate2[strict]"