-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design of a universal base DFT task document #741
Comments
@utf , we merged in some of @esoteric-ephemera's work on this in emmet (#971) yesterday as a checkpoint for me to start working on up-versioning the schema of all the task docs MP's existing tasks collection. A couple outstanding questions on my end:
The fields in question, for quick reference:
|
Hi @tsmathis. Great to hear this is being picked up.
|
Okay, got it. For 1., I am going to take a pretty hard stance against allowing users to inject arbitrary data into MP's base task doc. I would rather lean towards first pushing users towards creating a new derived document schema for their workflow. And if the workflow becomes pervasive enough to be a default, MP could then add the new fields explicitly to the base task doc schema (e.g., transformations, etc.). TLDR: I am open to adding fields, but I want them to be explicit. For 2. I agree that ease of query-ability is important, so I am somewhat neutral. But @esoteric-ephemera I think has an idea of how he might want those fields populated/formatted to reduce duplication |
For 1, that's perfectly fine for MP but I believe that should be handled during ingestion of task docs, rather than as limitation to atomate2. Just to note that ALL vasp jobs have the same schema, even for different calculation types, so creating a new schema just for one specific VASP calculation wouldn't be a realistic solution. This is already a feature I use for some of my own workflows, so I'd rather it not be removed. A general point is that these task documents have been in use in atomate2 for quite some time. I would strongly recommend that you don't make any breaking changes to avoid disrupting existing workflows. |
I would also like to emphasize that the atomate2 use cases go much beyond MP data creation, even though this is indeed a very important use case. Thus, I would also recommend that major breaking chances should only be done very carefully |
For (1), since the TaskDoc schema in emmet allows extra fields, the user could add an No breaking changes then, but additional fields would be populated for atomate2 calcs that might later be stripped on ingestion by MP For (2), I think it makes sense to have There's already similar logic implemented for atomate2's |
Agreed about point 2, that makes a lot of sense. My issue with point 1 is that we have a schemas for several reasons: i) to have documentation of what the fields contain, ii) to have a way of looking up what fields are present, and iii) to validate the task document. This suggestion breaks all of those reasons. Can I ask the motivation for removing it? |
On MP's side, we are trying 1) to standardize our schema and 2) transition to using TaskDoc for everything. Including updating the schema for the existing tasks collection on MP ahead of our full re-compute. The TaskDoc schema has downstream effects on how MP is now parsing and storing the task documents as well as using them in builders (hence why we are digging into this now). A lot of the legacy tasks in MP's collection use outdated schemas with missing fields - none of these have So, yes, we don't want to inhibit atomate2 users from extending the model for their own use cases, as was mentioned about the For your points (And my thoughts here are strictly from the emmet side, since the emmet base is TaskDoc):
That doesn't actually explain what is contained in that field, since anything JSONable is allowed. ii. "to have a way of looking up what fields are present". iii. "to validate the task document." And to be clear, I do agree MP should be handling this during ingestion, which is why I would push for the And correct me if I'm wrong, but the theme of this gh issue is to define a common code agnostic schema for atomate2 workflows? So then shouldn't that schema inherit from the emmet classes/base schemas? |
Just to confirm, is the main reason you want to remove it for cosmetic reasons? I take your point about validation but I disagree with the other reasons. The description describes what the field should contain, and I often navigate the schemas on emmet to find where a particular field lives. A broader issue I have is that we have entrusted a lot of power to emmet by putting the task documents there. Atomate2 is already constantly breaking because someone changed something in pymatgen or emmet without considering the downstream consequences (and for this reason, I'm very appreciative that you've posted your questions on this thread). I'd really like to avoid any unnecessary changes to such a core part of the atomate2 code base. Modifying things for cosmetic reasons is not a strong justification in my opinion. |
Hi @utf, the main reasons for removing the
Since the emmet BaseModel permits extra model fields, it seems mostly harmless to move the I'll also note there are identical |
A nice feature of atomate2 is the universal workflows which can be run using multiple DFT calculators (e.g., elastic, phonon, defect workflows). As the number of DFT codes grows (currently including VASP, CP2K, FHI-aims, Abinit, QChem), there is a need to standardise the output task documents to maintain interoperability.
I'm opening this PR so we can start a discussion about what fields should be included in the base schema. Obviously, we should not break the existing VASP Task Document, so I've listed the fields below as a starting point. This issue is related to #145.
Some open questions:
The text was updated successfully, but these errors were encountered: