Resolving output references without querying the full output document #408

Open
utf opened this issue Aug 29, 2023 · 4 comments


@utf
Member

utf commented Aug 29, 2023

Currently, when resolving output references, the full output of a previous job is queried from the database, even if only a small part of the document is needed. For example, when executing the following atomate2 code with FireWorks or the local manager, the full VASP task document of job1 is returned before the structure is accessed.

from atomate2.vasp.jobs.core import RelaxMaker

job1 = RelaxMaker().make(structure)  # structure: a pymatgen Structure defined elsewhere
job2 = RelaxMaker().make(job1.output.structure)

This is wasteful and puts unnecessary strain on the database, especially when the task document contains large items such as band structures.

Problem

Part of the difficulty is that there is no way to know in advance if the full output is actually needed to access the specific field requested. Take this example:

job2.output.vasp_objects["bandstructure"].nb_bands

There are two complicating factors here:

  1. The band structure object is stored in the data store, not the docs store.
  2. The full band structure object is needed to access nb_bands, as this is not something that is stored in the serialized band structure directly but is instead an attribute of the BandStructure class.

Accordingly, simply restricting the query to return output.vasp_objects.bandstructure.nb_bands will fail on both counts.
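To make the second point concrete, here is a minimal sketch using a hypothetical ToyBandStructure class (a stand-in, not pymatgen's actual BandStructure). It assumes, as described above, that the attribute is computed when the object is constructed and is absent from the serialized dictionary, so following the full dotted path through the stored document finds nothing. The helper get_by_path is also an illustration and only imitates a projection on a plain dict.

```python
class ToyBandStructure:
    """Hypothetical stand-in for a class whose attributes are derived on construction."""

    def __init__(self, bands):
        self.bands = bands
        # Derived attribute: computed here, never written to the serialized dict.
        self.nb_bands = len(bands)

    def as_dict(self):
        return {"@module": "toy", "@class": "ToyBandStructure", "bands": self.bands}

    @classmethod
    def from_dict(cls, d):
        return cls(d["bands"])


def get_by_path(doc, path):
    """Follow a dot-notation path through nested dicts; return None if it is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


serialized = ToyBandStructure([[0.1], [0.5], [1.2]]).as_dict()
stored = {"output": {"vasp_objects": {"bandstructure": serialized}}}

# Following the full requested path finds nothing, because nb_bands was never stored.
missing = get_by_path(stored, "output.vasp_objects.bandstructure.nb_bands")

# Fetching the whole serialized object and rebuilding it makes the attribute available.
rebuilt = ToyBandStructure.from_dict(
    get_by_path(stored, "output.vasp_objects.bandstructure")
)
nb_bands = rebuilt.nb_bands
```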

Another complicating factor is that the output database can sometimes contain references to other job outputs, e.g., in dynamic workflows. Jobflow automatically resolves these references under the hood, but again this requires first getting the full output, finding any references in it, resolving those references, and finally returning the specific item that was requested.

Proposed solution

The best way I can think of to solve this is:

  1. Store the following information in the job output metadata:
     • The location of any blobs (i.e., items not stored in the docs store), recorded as the MongoDB dot-notation path to each item, e.g., output.vasp_objects.bandstructure.
     • The location of any output references.
     • The location of any objects that are not Python built-in types or pydantic data classes. These can be identified by the monty {"@module", "@class"} signature.
  2. When resolving outputs, first query the database for the above locations. Next, check whether the requested output overlaps with any of those paths. E.g., in the example above, output.vasp_objects.bandstructure.nb_bands would overlap with output.vasp_objects.bandstructure.
  3. Next, query the database for the full object (not the specific output requested). E.g., this would query for output.vasp_objects.bandstructure. Now resolve any blobs from the data store and any output references.
  4. Finally, return the specified output.
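The overlap check in the steps above could be sketched roughly as follows. This is only an illustration, not jobflow code: the function name split_request is hypothetical, and it assumes the stored locations are plain dot-notation strings. Paths are compared key by key rather than as raw strings, so that output.structures is not mistaken for a child of output.structure.

```python
def split_request(requested, locations):
    """Decide which path to query and which attributes to apply afterwards.

    If a stored location is an ancestor of the requested path, the whole object
    at that location must be fetched, and the trailing attribute names are
    applied after it has been rebuilt. Otherwise the request is queried as is.
    """
    req = requested.split(".")
    best = req
    for location in locations:
        loc = location.split(".")
        n = min(len(req), len(loc))
        # Overlap: one path is a key-by-key prefix of the other.
        if req[:n] == loc[:n] and len(loc) < len(best):
            best = loc
    return ".".join(best), req[len(best):]


locations = ["output.vasp_objects.bandstructure", "output.structure"]

query_path, remainder = split_request(
    "output.vasp_objects.bandstructure.nb_bands", locations
)
# query_path is the enclosing object; remainder holds the attributes still to access.
```

When the request is an ancestor of a stored location (for example, requesting output as a whole), this sketch returns the request unchanged; any blobs or references inside it would still need resolving afterwards.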

As I see it, this approach has two disadvantages:

  1. It will always require two database requests instead of one.
  2. A lot of additional information will get stored alongside the job. For example, in the VASP workflows, we will also be storing the locations of every structure in the metadata. This might not be that bad though.

However, I can't see a cleaner way of solving this bug, and I imagine this would result in a speedup even with the extra database request.

@gpetretto
Contributor

Thanks for bringing this up and proposing a solution. Here are some first thoughts/questions that came to mind.

  • The first one concerns the requirement of performing multiple queries. I expect that, to resolve the references, you would first run:

    store.query_one({"uuid": uuid}, properties=["index", "metadata"], sort={"index": -1})

    In the example you mentioned, job2 = RelaxMaker().make(job1.output.structure), one could imagine running this instead:

    store.query_one({"uuid": uuid}, properties=["index", "metadata", "output.structure"], sort={"index": -1})

    If the output document already contains the structure the second query could be avoided. I am not sure if this would put much more strain on the DB, but I expect that most of the second queries could be skipped.

  • I am not sure if you are suggesting to put the additional information in the current metadata field. If that is the case, I would suggest storing this additional data under a separate key. The current metadata will also contain data from the user, so it may be confusing for an inexperienced user to have all those additional dictionaries in the metadata.

  • It is not entirely clear to me if you want to store the metadata for all the elements with @module, @class, even for nested objects. For example, would the band structure require entries for the BandStructureSymmLine, Structure, and Lattice objects, or only for the top-level one (i.e., BandStructureSymmLine)?

@utf
Member Author

utf commented Aug 29, 2023

For your first point, that is also a potential option. However, if the first query fails, you will need to fall back on the method I proposed, leading to a total of 3 queries in the worst case scenario. However, since most queries will be directly resolvable maybe this is a reasonable thing to do. Either way, I think we will still need to implement the method I proposed, so perhaps this could be added on later.

> I am not sure if you are suggesting to put the additional information in the current metadata field

No, sorry, that was poor wording. I didn't mean the metadata field; this should definitely be stored separately.

> It is not entirely clear to me if you want to store the metadata for all the elements with @module, @class, even for nested objects.

My initial thought was to only store the top level, e.g., BandStructureSymmLine. Storing subsequent levels should also work, but I fear that this will result in too many locations getting stored. E.g., within a Structure there are also PeriodicSite objects, then Element objects, etc.
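As a rough sketch of the top-level-only idea, the following hypothetical helper (collect_locations is not an existing jobflow function) walks a serialized output and records the path of each outermost dictionary carrying the monty "@module" and "@class" keys, without descending into it. Addressing list elements by index is an assumption of this sketch; whether such paths can be used directly in a database projection would need checking.

```python
def collect_locations(obj, path="output"):
    """Return dot-notation paths of the outermost monty-serialized objects in obj."""
    locations = []
    if isinstance(obj, dict):
        if "@module" in obj and "@class" in obj:
            # Stop here: nested objects (lattice, sites, ...) are deliberately not recorded.
            return [path]
        for key, value in obj.items():
            locations.extend(collect_locations(value, f"{path}.{key}"))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            locations.extend(collect_locations(value, f"{path}.{index}"))
    return locations


# Abbreviated example output; the serialized objects are truncated to their signatures.
output = {
    "energy": -1.0,
    "structure": {
        "@module": "pymatgen.core.structure",
        "@class": "Structure",
        "lattice": {"@module": "pymatgen.core.lattice", "@class": "Lattice"},
        "sites": [],
    },
    "vasp_objects": {
        "bandstructure": {
            "@module": "pymatgen.electronic_structure.bandstructure",
            "@class": "BandStructureSymmLine",
        }
    },
}

locations = collect_locations(output)
```

Here the nested Lattice is skipped because it sits inside the Structure, so only two locations are recorded.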

@gpetretto
Contributor

gpetretto commented Aug 29, 2023

> For your first point, that is also a potential option. However, if the first query fails, you will need to fall back on the method I proposed, leading to a total of 3 queries in the worst case scenario. However, since most queries will be directly resolvable maybe this is a reasonable thing to do. Either way, I think we will still need to implement the method I proposed, so perhaps this could be added on later.

Maybe I am missing something, but my idea was to query both the "metadata" and output.structure in the first query, so that one could fall back directly to the original plan and perform just one more query. This rests on the assumption that the metadata will be a small addition to the queried properties.
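A minimal sketch of this two-step flow, under several assumptions: TinyStore is a hypothetical in-memory stand-in (not a maggma store) that only counts queries and ignores sorting, the stored locations live under a made-up output_locations key, and resolve returns the still-serialized object together with the attribute names left to apply after deserialization.

```python
def get_by_path(doc, path):
    """Follow a dot-notation path through nested dicts; return None if it is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


class TinyStore:
    """Hypothetical in-memory stand-in for a document store; it counts queries."""

    def __init__(self, docs):
        self.docs = docs
        self.n_queries = 0

    def query_one(self, criteria, properties):
        self.n_queries += 1
        doc = next(d for d in self.docs if all(d.get(k) == v for k, v in criteria.items()))
        # Simplification: return a flat mapping from each requested path to its value.
        return {p: get_by_path(doc, p) for p in properties}


def enclosing_location(requested, locations):
    """Return a stored location that is a strict ancestor of the requested path, if any."""
    req = requested.split(".")
    for location in locations:
        loc = location.split(".")
        if len(loc) < len(req) and req[: len(loc)] == loc:
            return location
    return None


def resolve(store, uuid, requested):
    # First query: the stored locations plus an optimistic projection of the request.
    first = store.query_one({"uuid": uuid}, ["output_locations", requested])
    parent = enclosing_location(requested, first["output_locations"] or [])
    if parent is None:
        # The request does not sit inside a special object, so the projection is usable.
        return first[requested], []
    # Second query: fetch the whole enclosing object; the caller deserializes it
    # and then applies the remaining attribute names.
    second = store.query_one({"uuid": uuid}, [parent])
    return second[parent], requested.split(".")[len(parent.split(".")):]


doc = {
    "uuid": "abc",
    "index": 1,
    "output_locations": ["output.structure", "output.bandstructure"],
    "output": {
        "structure": {"@module": "m", "@class": "Structure", "sites": []},
        "bandstructure": {"@module": "m", "@class": "B", "bands": [[0.1], [0.2]]},
    },
}
store = TinyStore([doc])

structure, rest = resolve(store, "abc", "output.structure")
queries_for_structure = store.n_queries

bands_obj, rest_bs = resolve(store, "abc", "output.bandstructure.nb_bands")
queries_for_nb_bands = store.n_queries - queries_for_structure
```

In this toy setup the structure request is answered by the first query alone, while the nb_bands request needs the second one. Blob and output-reference resolution are left out of the sketch.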

Anyway, this could indeed also be added at a later stage.

@utf
Member Author

utf commented Aug 29, 2023

Ah I missed that. Yes, that should work!
