
BUG: fit_elastic_tensor hangs #659

Closed
katnykiel opened this issue Dec 23, 2023 · 17 comments

@katnykiel

issue

I have been encountering more issues while using the ElasticMaker workflow, specifically in the fit_elastic_tensor firework. All previous fireworks in this workflow complete without error.

The fit_elastic_tensor firework results in one of the following outcomes:

  1. It completes (after ~1 hr) with the following warning:

/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/pymatgen/core/tensors.py:327: UserWarning: Tensor is not symmetric, information may be lost in voigt conversion.
  warnings.warn("Tensor is not symmetric, information may be lost in voigt conversion.")

  2. It runs until it reaches the job wall time (~4 hrs).

  3. It fizzles and returns the following error:

Traceback (most recent call last):
  File "/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/fireworks/core/rocket.py", line 261, in run
    m_action = t.run_task(my_spec)
  File "/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/jobflow/managers/fireworks.py", line 177, in run_task
    response = job.run(store=store)
  File "/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/jobflow/core/job.py", line 583, in run
    response = function(*self.function_args, **self.function_kwargs)
  File "/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/atomate2/common/jobs/elastic.py", line 220, in fit_elastic_tensor
    return ElasticDocument.from_stresses(
  File "/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/atomate2/common/schemas/elastic.py", line 220, in from_stresses
    derived_properties = DerivedProperties(**property_dict)
  File "/home/knykiel/.conda/envs/2022.10-py39/atomate2/lib/python3.9/site-packages/pydantic/main.py", line 164, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 7 validation errors for DerivedProperties
trans_v
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type
long_v
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type
snyder_ac
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type
snyder_opt
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type
snyder_total
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type
cahill_thermalcond
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type
debye_temperature
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.4/v/float_type

I have tried workflows using both BaseVaspMaker and M3GNetRelaxMaker, and both seem to experience all three of these outcomes.

I don't recall fit_elastic_tensor fireworks taking ~1 hr in the past; is there a bug somewhere in atomate2/common/schemas/elastic.py?

environment

installed with pip install "atomate2[strict]"

FireWorks                 2.0.3
jobflow                   0.1.14
atomate2                  0.0.12
pydantic                  2.4.2
pydantic_core             2.10.1
pydantic-settings         2.0.3
monty                     2023.9.25
@mkhorton
Member

mkhorton commented Jan 2, 2024

Hi @katnykiel, we saw this issue too. My colleague @danielzuegner opened a PR to fix the schema issue at #651.

I think the issue is that the schema expects the fitting to succeed, and when it does not, the document cannot be created; i.e., I think it will only affect "failed" calculations.
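For illustration, a minimal sketch of that kind of schema change (field names taken from the traceback above; not necessarily the exact diff in #651) would be to make the derived-property fields optional, so the document can still be built when the fit fails:

from typing import Optional

from pydantic import BaseModel


class DerivedProperties(BaseModel):
    """Sketch only: optional fields so a failed fit no longer raises a ValidationError."""

    trans_v: Optional[float] = None
    long_v: Optional[float] = None
    snyder_ac: Optional[float] = None
    snyder_opt: Optional[float] = None
    snyder_total: Optional[float] = None
    cahill_thermalcond: Optional[float] = None
    debye_temperature: Optional[float] = None


# A failed fit can now produce a document with None values instead of erroring out
doc = DerivedProperties(trans_v=None, long_v=None)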

Tagging @mjwen as someone who most recently worked on this, and @utf for visibility/merging the fix.

@utf
Member

utf commented Jan 8, 2024

Fixed in #651

@utf utf closed this as completed Jan 8, 2024
@katnykiel
Author

This issue was marked as completed, but I am still seeing the same problem: fit_elastic_tensor runs for hours without completing.

I have noticed the following behavior:

  • when I submit a single fit_elastic_tensor firework, it completes in ~30 min
  • when I run two in parallel, it takes ~50 min per job
  • when I run 5 or more in parallel, it takes 4+ hours per job

Any ideas why this could be happening?

I reinstalled atomate2 directly from the repository after yesterday's changes were committed:

`pip install git+https://github.com/materialsproject/atomate2.git`

I then had to apply two quick fixes for other atomate2 errors.

All of the previous fireworks in this workflow complete; I have checked the run directories and there are no errors returned by VASP.

(image: example atomate2 elastic workflow)

@utf utf reopened this Jan 9, 2024
@utf
Member

utf commented Jan 9, 2024

Thanks for flagging this again @katnykiel. Would you be willing to share one of the structures that is taking a long time, and also the code you're using to submit the workflow?

@katnykiel
Author

Sure, I'd be happy to help debug. Here's one of the structures I'm using (after symmetrizing and relaxing, before deformation):

Cr4 N3                                  
   1.00000000000000     
     8.5680778519564473   -1.4792873745411490   -0.0018342218140873
     8.5088945191243308    1.4792873745411490   -0.0018342218140873
     8.2826457879158113    0.0000000000000000    2.5480767767381529
   Cr   N 
     4     3
Direct
  0.3950358591452771  0.3618704190799000  0.3794856738126970
  0.6049641408547232  0.6381295809201002  0.6205143261873030
  0.7908590609919883  0.7825369904459157  0.7867311594790335
  0.2091409390080115  0.2174630095540844  0.2132688405209664
  0.0851482349035691  0.0650117409188657  0.0757213462645470
  0.5000000000000000  0.5000000000000000  0.5000000000000000
  0.9148517650964308  0.9349882590811343  0.9242786537354530

And an image for reference:

(image: symmetrized structure)

I have calculated the elastic tensor of this structure before with no issues. The fit_elastic_tensor firework takes a different amount of time to complete depending on whether other fit_elastic_tensor jobs are running at the same time. They're all separate SLURM jobs on different resources.

I wonder if the bottleneck might be related to the size of the MongoDB database I'm using (>20,000 documents). I'm running a MongoDB Docker image, and as I'm out of my depth here I may have misconfigured some setting.

I'm submitting the workflow using the following code:

from fireworks import LaunchPad
from jobflow import SETTINGS
from jobflow.managers.fireworks import flow_to_workflow
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer
from atomate2.vasp.flows.elastic import ElasticMaker
from atomate2.vasp.powerups import update_user_incar_settings


# Connect to the job store and launchpad
store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

# Query for structures in the job store
results = store.query({"some": "criteria"})  # placeholder query criteria
structures = []
for result in results:
    struct = Structure.from_dict(result["output"]["structure"])
    structures.append(struct)

# For each structure, submit elastic workflow
for struct in structures:
    # Symmetrize the structure
    sga = SpacegroupAnalyzer(struct)
    struct = sga.get_primitive_standard_structure()

    # Create the elastic constant workflow
    elastic_flow = ElasticMaker(name=f"DFT elastic: {struct.formula}").make(struct)

    # Update the INCAR parameters
    incar_updates = {"NCORE": 9, "GGA": "PE"}
    elastic_flow = update_user_incar_settings(elastic_flow, incar_updates)

    # Add the workflow to the launchpad
    wf = flow_to_workflow(elastic_flow)
    for fw in wf.fws:
        fw.spec.update(
            {
                "tags": [
                    "some tag",
                ]
            }
        )
    lpad.add_wf(wf)

@utf
Member

utf commented Jan 9, 2024

Thanks for sharing that code. I just tried running the elastic workflow myself on the structure you shared, and the fit_elastic_tensor job finished in 3 seconds.

I have a feeling you're right that the issue is to do with database access. It could be that resolving the output references is what is taking a long time. Are you able to run the following code and let me know how long it takes? All this does is use the database to fetch the inputs for the fit_elastic_tensor job; it does not actually do the fitting.

from fireworks import LaunchPad
from jobflow import SETTINGS

# Connect to the job store and launchpad
store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

# change this to be the fw id of the "fit_elastic_tensor" job
fw_id = 1

fw = lpad.get_fw_by_id(fw_id)
job = fw.tasks[0]["job"]
job.resolve_args(store=store)
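If it helps, you can wrap that last call in a simple timer (plain Python, nothing atomate2-specific), e.g. replacing the final line with:

import time

start = time.perf_counter()
job.resolve_args(store=store)
print(f"resolve_args took {time.perf_counter() - start:.1f} s")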

@katnykiel
Author

Alex, thanks for your help on this! It's a bit of a relief to have narrowed down an error that has been eluding me for weeks.

I was able to run the code you sent. It finished in about 50 minutes, confirming that this is indeed an issue with database access. I tried running it both locally (M2 Mac) and in an HPC environment; both took about the same time.

I am able to access the database through other methods (mongosh, lpad, MongoDB Playground, pymongo) in reasonable times. Is there something specific about how the fit_elastic_tensor job queries the database that would make it significantly slower?

In the meantime I will dive into the code to try and identify what might be the issue.

@utf
Member

utf commented Jan 9, 2024

I have a strong feeling that the issue is related to materialsproject/jobflow#408

I will do my best to implement a fix for that issue in the next few days.

@mkhorton
Member

mkhorton commented Jan 9, 2024

Thanks for re-raising this @katnykiel. To clarify, is this a performance issue you're seeing with all versions of atomate2/jobflow, or just with newer versions?

Regarding database size, 20,000 documents is not typically a "large" database, but it can be when running in a resource-constrained environment, and it can be useful for finding issues like this, which will affect everyone, even those running on a larger server.

Not sure if it'd be useful, but for more granular profiling you can look at where the database call actually happens when you run resolve_args(), for example by adding some print statements to capture the query and then re-running the same query via mongosh etc. Adding the explain() command to your query can give you useful information about what the database is actually doing.

I see this query has a sort, which really should not affect performance here (both because the index field should be indexed and because the matching document list should be small), but stranger things have happened. I've certainly encountered non-intuitive database issues in the past, so I always test these things manually. The other thing you can do is inspect your database via the serverStatus command, which contains info on the amount of memory used, number of page faults, etc., and could indicate whether you're under-resourced.
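For example, with pymongo that could look something like the sketch below (the connection string, database name, collection name, and uuid are placeholders; adjust them to your deployment):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["jobflow"]  # placeholder database name

# Explain the kind of uuid lookup + sort that resolving references performs
plan = db["outputs"].find({"uuid": "some-uuid"}).sort("index", -1).explain()
stats = plan.get("executionStats", {})
print(stats.get("executionTimeMillis"), "ms;", stats.get("totalDocsExamined"), "docs examined")

# Server-level health: memory usage, page faults, connections, etc.
status = db.command("serverStatus")
print(status.get("mem"))
print(status.get("extra_info", {}).get("page_faults"))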

Regardless, if it's just a document size issue, there may be no mystery here beyond what's already described in the jobflow issue.

@katnykiel
Author

@mkhorton I have only calculated elastic constants with atomate2's 0.0.12 release (or later), but I can try downgrading to older versions to see if that changes the performance. I do know that when my database only contained ~2000 documents the fit_elastic_tensor fireworks ran much faster (<1 min).

Thank you for the insight on the database call; that is definitely helpful for my debugging. I'll update here with any results I find. In the immediate short term I am just building an ElasticDocument directly from the queried runs, in the interest of obtaining results for my PI in a timely manner.

@utf
Member

utf commented Jan 10, 2024

@katnykiel, I don't think @mkhorton is suggesting to use an older version of atomate2. Nothing has changed in atomate2 that could be causing this issue. Instead, it seems like your database could be resource limited (e.g., not enough RAM or not configured correctly). The jobflow issue I mentioned will exacerbate the problem, but for a well-configured database it shouldn't be a limitation with only 20k documents.

To test your database, can you try running:

from fireworks import LaunchPad
from jobflow import SETTINGS

# Connect to the job store and launchpad
store = SETTINGS.JOB_STORE
store.connect()
lpad = LaunchPad.auto_load()

# change this to be the fw id of the "fit_elastic_tensor" job
fw_id = 1

fw = lpad.get_fw_by_id(fw_id)
job = fw.tasks[0]["job"]
uuid = job.input_references[0].uuid

# test 1
_ = store.query_one({"uuid": uuid})

# test 2
_ = store.query_one({"uuid": uuid}, properties=["output.structure"])

# test 3
_ = store.get_output(uuid, load=True)

There are 3 tests in there; it would be useful to know the timing for each of them. If the database is configured OK, each one should take substantially less than 30 seconds (realistically 5-10 seconds max). To clarify, this time we are only querying the database for the output of 1 calculation (the initial relaxation) rather than the full 25 calculations used to fit the elastic tensor (i.e., what we tested in the previous timing test).
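If you want exact numbers for each one, you could wrap the tests in a small timing helper, e.g.:

import time


def timed(label, func):
    """Run func() and print how long it took."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.1f} s")
    return result


timed("test 1", lambda: store.query_one({"uuid": uuid}))
timed("test 2", lambda: store.query_one({"uuid": uuid}, properties=["output.structure"]))
timed("test 3", lambda: store.get_output(uuid, load=True))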

@katnykiel
Author

@utf Thank you for the clarification on using older versions.

I ran the three tests you listed and each completed in <30 seconds (22s, 23s, 25s). However, when I submitted 3 scripts in parallel, that time increased to ~40 seconds per test.

I will look into the database deployment to ensure I'm configuring it correctly. Thanks y'all for your help!

@utf
Member

utf commented Jan 10, 2024

Thanks @katnykiel. When you consider that the outputs from 25 calculations are needed to fit the elastic tensor, that equates to about 25 seconds * 50 requests ~= 20 minutes (due to a quirk in how dynamic workflow results are stored, two database requests are required for each elastic relaxation job). Obviously, this is still below your previous observation of ~50 minutes, but since the real job requests a different output each time rather than the same one repeatedly, there is no caching benefit, which could slow things down further.
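Spelling that estimate out (all numbers taken from the discussion above):

n_calcs = 25             # deformation calculations needed for the fit
requests = n_calcs * 2   # two database requests per elastic relaxation job -> 50 requests
total_s = requests * 25  # at ~25 seconds per request -> 1250 seconds
print(total_s / 60)      # roughly 21 minutes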

The minimal difference between the test times (test 1, test 2, test 3) also indicates to me that fixing the jobflow issue won't impact what you're seeing here; i.e., that fix would reduce the amount of data transferred, but your tests show minimal change in time whether the full output document is requested or just a single field such as the structure.

I think trying to optimise your database is the best bet for now. If you have access to the server, you could check the RAM usage and whether you're having to use swap space during database requests. Out of interest, how much RAM does your server have?

@katnykiel
Author

@utf I'm running the database from a K8s deployment of a MongoDB Docker image on my university HPC system. I believe I have access to 16 GB of RAM, of which I am using about 20%. I will reach out to my university's HPC support staff and see if they are able to help pin down any errors in the database configuration.

@mkhorton
Member

Sometimes it can be something trivial, like the Docker daemon not having access to enough memory even if the host itself has sufficient memory (e.g., flags like --shm-size). I would have thought 16 GB would be enough.

@katnykiel
Author

I wanted to follow up on this thread: I configured my database deployment to use more memory (from about 4 GB to 16 GB) and now the database calls are running much faster. A single call takes around a second, and an entire fit_elastic_tensor firework takes about two minutes to run.

It seems this issue was specific to my database deployment, so feel free to close out the issue. Thanks y'all for your help!

@utf
Member

utf commented Jan 20, 2024

That's great to hear! Thanks for all the detailed debugging.

@utf utf closed this as completed Jan 20, 2024