Aliases now also work for nested fields; Only retrieve data required for constructing a response from the database. #1304

JPBergsma · 2022-08-17T20:29:00Z

I have put some of the changes I made in Barcelona and some more things in this PR.
I got a bit carried away; in hindsight I should probably have split this into two or more PRs, as it has become quite large and bundles changes that could have been separated.

I am also still a bit worried that I may have changed too much and that there could be problems with code depending on the optimade python tools.

It now allows mapping to arbitrary nesting levels as long as the fields are separated by ".", which is similar to how they are handled in MongoDB.
Example: "aliases": { "structures": {"OPTIMADE_field": "field.nested_field"} }
It also allows adding database-specific fields to top-level OPTIMADE dictionaries such as "species".
So if you add "species.oxidation_state" to the provider_fields, it will be presented as "species._exmpl_oxidation_state".
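
As a rough illustration of the dotted-alias idea, here is a minimal sketch of resolving a "."-separated alias against a nested backend document. The helper names `get_nested` and `apply_aliases` and the sample document are hypothetical and are not the functions added in this PR; only the "." separator convention comes from the description above.

```python
from typing import Any, Dict


def get_nested(doc: Dict[str, Any], dotted_key: str, default: Any = None) -> Any:
    """Follow a "."-separated path through nested dicts (hypothetical helper)."""
    current: Any = doc
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def apply_aliases(doc: Dict[str, Any], aliases: Dict[str, str]) -> Dict[str, Any]:
    """Build a flat view of `doc`; `aliases` maps OPTIMADE field -> backend path."""
    return {
        optimade_field: get_nested(doc, backend_path)
        for optimade_field, backend_path in aliases.items()
    }


# Hypothetical backend document and alias table, mirroring the example above.
backend_doc = {"field": {"nested_field": 42}, "other": "x"}
aliases = {"OPTIMADE_field": "field.nested_field"}

result = apply_aliases(backend_doc, aliases)
print(result)
```

A path that does not exist in the document simply yields the default value instead of raising.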

The mapping is now done in two steps. One step for removing/adding the prefixes and one step for mapping the optimade field to the backend specific field.

The all_aliases method therefore now only returns the mapping between the optimade fields and the backend fields.
The all_prefixed_fields now contains the pairs of prefixed and unprefixed optimade fields.
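
The two steps could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the constants `PREFIX`, `PROVIDER_FIELDS` and `ALIASES` and the three function names are made up for the example, and aliases are matched on the full field name only.

```python
PREFIX = "_exmpl_"  # hypothetical provider prefix
PROVIDER_FIELDS = {"species.oxidation_state", "band_gap"}  # unprefixed names
ALIASES = {"chemical_formula_reduced": "formula.reduced"}  # OPTIMADE -> backend


def remove_prefix(field: str) -> str:
    """Step 1 (incoming): strip the provider prefix from every path component."""
    parts = field.split(".")
    return ".".join(p[len(PREFIX):] if p.startswith(PREFIX) else p for p in parts)


def add_prefix(field: str) -> str:
    """Step 1 (outgoing): prefix the last component of a known provider field."""
    if field not in PROVIDER_FIELDS:
        return field
    head, _, last = field.rpartition(".")
    return f"{head}.{PREFIX}{last}" if head else f"{PREFIX}{last}"


def to_backend(field: str) -> str:
    """Step 2: map an unprefixed OPTIMADE field to its backend field."""
    return ALIASES.get(field, field)


# A requested field passes through both steps on its way to the backend.
requested = "species._exmpl_oxidation_state"
backend_field = to_backend(remove_prefix(requested))
print(backend_field)
```

Keeping the steps separate means the alias table never has to mention the provider prefix.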

Another change in this PR is that only the requested fields are now retrieved from the Mongo database. This caused some issues with validation, because fields that are required for a normal response are no longer present when the response_fields parameter is used. I therefore had to make the validation less strict in a few places.
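
For illustration, a MongoDB inclusion projection for the requested fields could be built along these lines. The function `build_projection`, its `always_include` default and the alias table are assumptions made for this sketch and are not taken from the PR.

```python
from typing import Dict, Iterable


def build_projection(
    response_fields: Iterable[str],
    aliases: Dict[str, str],
    always_include: Iterable[str] = ("id", "type"),  # assumed minimum for a response
) -> Dict[str, bool]:
    """Translate requested OPTIMADE fields into a MongoDB inclusion projection."""
    fields = set(response_fields) | set(always_include)
    # Map each OPTIMADE field to its backend name before projecting.
    projection = {aliases.get(field, field): True for field in sorted(fields)}
    # "_id" is the only field MongoDB allows to be excluded in an inclusion projection.
    projection.setdefault("_id", False)
    return projection


projection = build_projection(
    ["nsites", "chemical_formula_reduced"],
    {"chemical_formula_reduced": "formula.reduced"},  # hypothetical alias
)
print(projection)
```

The resulting dictionary is what would be passed as the `projection` argument of a pymongo `find` call; anything not listed is never read from the database, which is why required-but-unrequested fields are absent at validation time.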

codecov · 2022-08-17T20:37:34Z

Codecov Report

Merging #1304 (f174850) into master (105b501) will increase coverage by 0.15%.
The diff coverage is 99.24%.

@@            Coverage Diff             @@
##           master    #1304      +/-   ##
==========================================
+ Coverage   91.35%   91.51%   +0.15%     
==========================================
  Files          74       74              
  Lines        4374     4454      +80     
==========================================
+ Hits         3996     4076      +80     
  Misses        378      378

| Flag | Coverage Δ | |
|---|---|---|
| project | `91.51% <99.24%> (+0.15%)` | ⬆️ |
| validator | `91.57% <100.00%> (+0.11%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| optimade/utils.py | `71.56% <97.22%> (+13.35%)` | ⬆️ |
| optimade/models/jsonapi.py | `93.93% <100.00%> (+0.44%)` | ⬆️ |
| optimade/models/links.py | `100.00% <100.00%> (ø)` | |
| optimade/models/references.py | `100.00% <100.00%> (+2.27%)` | ⬆️ |
| optimade/server/entry_collections/elasticsearch.py | `97.53% <100.00%> (ø)` | |
| ...made/server/entry_collections/entry_collections.py | `97.98% <100.00%> (+0.19%)` | ⬆️ |
| optimade/server/mappers/entries.py | `99.29% <100.00%> (+0.19%)` | ⬆️ |
| optimade/server/routers/utils.py | `96.39% <100.00%> (-0.10%)` | ⬇️ |
| optimade/validator/validator.py | `83.48% <100.00%> (+0.02%)` | ⬆️ |

JPBergsma · 2022-09-14T15:04:50Z

I still made some edits in this area during my stay in Barcelona, so I would like to add those changes to this PR as well.

ml-evs · 2022-11-16T12:11:39Z

Is this ready for review now then @JPBergsma ?

JPBergsma · 2022-11-16T13:31:00Z

Yes, you can review it now. I have also updated the text at the start of this PR, so you may want to reread it.

ml-evs

Thanks for the hard effort on this @JPBergsma! I've focused this first review on the response fields aspect without looking too deeply at the aliases. Whilst the nested aliases would be nice, we should make sure they don't affect performance for servers that don't use them (perhaps just hiding all of that functionality behind a switch based on whether there are dotted aliases in the server config).

I agree that this is definitely two separate PRs, but understand if it is now too hard to disentangle them.

I think selecting which fields to grab from the database could be useful going forward (though I guess in the trajectories case they might be so big as to live in a different collection/table anyway?). However, I think the current implementation falls a bit short in terms of performance and in the changes it makes to the existing Python API that others have built their servers on. Whilst we shouldn't feel completely married to the existing way we do things (i.e., we can introduce breaking changes where necessary before v1), we should bear in mind that breaking changes will probably be a prohibitive blocker for APIs using this package (Materials Project, for example, is still using v0.13.3 from 18 months ago).

optimade/models/links.py

docs/getting_started/setting_up_an_api.md

ml-evs · 2022-11-17T11:46:05Z

optimade/server/entry_collections/entry_collections.py

+        for result in results:
+            for field in include_fields:
+                set_field_to_none_if_missing_in_dict(result["attributes"], field)


This is going to incur a significant performance overhead, but I guess you want to do it so that you don't have to pull e.g., entire trajectories from the database each time, yet you still want to deserialize the JSON into your classes? I think I would suggest we instead have a per-collection deserialization flag, as presumably you only want to deserialize trajectories once (on database insertion) anyway. Does that make sense?

If you want to retain this approach, it might be cleaner to do it at the pydantic level, e.g., a root_validator that explicitly sets all missing fields to None (see also https://pydantic-docs.helpmanual.io/usage/models/#creating-models-without-validation as an option).

I do not think this is particularly heavy compared to all the other things we do in the code.
It is what we previously did in handle_response_fields. I only moved it here, so we can do it before the deserialization and added code to handle nested fields.
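
To make the nested case concrete, here is a minimal sketch of what filling in a missing dotted field could look like. The function name `set_missing_to_none` and its handling of absent parents are assumptions for illustration; this is not the `set_field_to_none_if_missing_in_dict` implementation from the PR.

```python
from typing import Any, Dict


def set_missing_to_none(doc: Dict[str, Any], dotted_field: str) -> None:
    """Set `dotted_field` to None in `doc` if it is absent; existing values are kept."""
    parts = dotted_field.split(".")
    current = doc
    for part in parts[:-1]:
        child = current.get(part)
        if not isinstance(child, dict):
            # Parent is missing (or not a dict): mark the parent itself as
            # missing and stop. An existing non-dict parent is left untouched.
            current.setdefault(part, None)
            return
        current = child
    current.setdefault(parts[-1], None)


attributes = {"nsites": 2, "nested": {"a": 1}}
for field in ["nsites", "elements", "nested.a", "nested.b", "other.c"]:
    set_missing_to_none(attributes, field)
print(attributes)
```

Each call does work proportional to the depth of the field, so the overall cost is the number of results times the number of included fields times their nesting depth.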

For biomolecular data, a structure can easily have 10,000 atoms, so retrieving them from the database and putting them in the model would take some time. This way we can avoid that if the species_at_sites and cartesian_site_positions are not in the response_fields. (I also made a patch in the code for Barcelona that allowed them to specify the default response fields, so they can choose to not have these fields in the response by default.)

I did not want to make the change even bigger by bypassing the rest of the validator (as in your second suggestion).
But from a performance viewpoint, bypassing the validation would be good.
Do you want me to add this to this PR or to a future PR ?

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Hmmm, fair enough, just looks a bit scarier as a double for loop alongside the recursive descent into dictionaries to get the nested aliases. It's quite hard to reason about this, so I might set up a separate repo for measuring performance in the extreme limits (1 structure of 10000 atoms vs 10000 structures of a few atoms -- i.e., what we have now, ignoring pagination of course).
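
A rough sketch of how the two extremes could be timed. Everything here is hypothetical scaffolding: `make_structures`, `fill_missing` and `time_fill` are made-up names, the structures are synthetic, and `fill_missing` is a flat stand-in for the double loop in the diff above rather than the PR's nested-aware helper.

```python
import time
from typing import Dict, List


def make_structures(n_structures: int, n_atoms: int) -> List[Dict]:
    """Create synthetic entries; only the shape matters, not the values."""
    return [
        {
            "attributes": {
                "nsites": n_atoms,
                "cartesian_site_positions": [[0.0, 0.0, 0.0] for _ in range(n_atoms)],
            }
        }
        for _ in range(n_structures)
    ]


def fill_missing(results: List[Dict], include_fields: List[str]) -> None:
    """Flat stand-in for the double loop under discussion."""
    for result in results:
        for field in include_fields:
            result["attributes"].setdefault(field, None)


def time_fill(n_structures: int, n_atoms: int, include_fields: List[str]) -> float:
    """Return the wall-clock seconds spent filling in missing fields."""
    results = make_structures(n_structures, n_atoms)
    start = time.perf_counter()
    fill_missing(results, include_fields)
    return time.perf_counter() - start


FIELDS = ["nsites", "elements", "species_at_sites", "lattice_vectors"]
few_large = time_fill(1, 10_000, FIELDS)
many_small = time_fill(10_000, 4, FIELDS)
print(f"1 structure x 10000 atoms:  {few_large:.6f} s")
print(f"10000 structures x 4 atoms: {many_small:.6f} s")
```

A real benchmark would also need to time database retrieval and model deserialization, which this sketch deliberately leaves out.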

> I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Did you use @root_validator(pre=True) to make sure it gets run before anything else? Perhaps bypassing the validation altogether can wait for another PR, as you suggest, but I'd like to make the performance tests first at least (doesn't necessarily hold up this PR but it might hold up the next release).

I just noticed I made a mistake in my test script.
So it may be possible to do this with a root_validator after all.

I have added a root_validator to the attributes class that, if a flag is set, checks whether all required fields are present and if not adds them and sets them to 0.

I'll try to put the handling of the other include fields back to the place where it happened originally so the code changes less.

optimade/server/entry_collections/entry_collections.py

ml-evs · 2022-11-17T11:55:28Z

optimade/server/mappers/entries.py

-            + cls.ALIASES
+            ]
+            + list(cls.PROVIDER_FIELDS)


The diff is getting very confused between these similar functions. Could you add the new code below all_aliases instead?

ml-evs · 2022-11-17T11:58:01Z

optimade/server/mappers/entries.py

-    def deserialize(
-        cls, results: Union[dict, Iterable[dict]]
-    ) -> Union[List[EntryResource], EntryResource]:
-        if isinstance(results, dict):
-            return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results))
+    def add_alias_and_prefix(cls, doc: dict) -> dict:


Same issue here with the diff: it is hard to see that deserialize has been changed as well as moved. Could you put your new code underneath it?

ml-evs · 2022-11-17T11:58:52Z

optimade/server/mappers/entries.py

+    def deserialize(cls, results: Iterable[dict]) -> List[EntryResource]:
+        return [cls.ENTRY_RESOURCE_CLASS(**result) for result in results]


Just generally I don't like that we are changing the functionality of existing methods (no longer doing the mapping) like deserialize as mappers are used elsewhere beyond just the server (and potentially by other users).

ml-evs · 2022-11-17T12:00:23Z

optimade/server/routers/utils.py

-def handle_response_fields(
+def remove_exclude_fields(


Is the removal of handle_response_fields really necessary? Again, my server code uses this independently of this package. We need to be careful not to force people into a rewrite of their existing code for the sake of new features (that they may or may not use).

ml-evs · 2022-11-17T12:01:57Z

setup.py

@@ -105,6 +105,7 @@
        "pydantic~=1.10,>=1.10.2",
        "email_validator~=1.2",
        "requests~=2.28",
+        "pyyaml>=5.4, <7",  # Keep at pyyaml 5.4 for aiida-core support


Did you pin down which part of the non-server code needs pyyaml? I assume it was previously only needed for reading the config files in YAML format.

ml-evs · 2022-12-02T17:44:23Z

optimade/models/jsonapi.py

+    @root_validator(pre=True)
+    def set_missing_to_none(cls, values):
+        if "set_missing_to_none" in values and values.pop("set_missing_to_none"):
+            for field in cls.schema()["required"]:
+                if field not in values:
+                    if (
+                        field == "structure_features"
+                    ):  # It would be nice if there would be a more universal way to handle special cases like this.
+                        values[field] = []
+                    else:
+                        values[field] = None
+        return values
+


This looks on the right track! I've just played around with something too after spotting something in the pydantic docs about default factories. Would the snippet also solve this issue?

>>> from pydantic import BaseModel, Field
>>> from typing import Optional
>>> class Model(BaseModel):
...     test: Optional[str] = Field(default_factory=lambda: None)
...
>>> Model()
Model(test=None)
>>> Model(test=None)
Model(test=None)
>>> Model(test="value")
Model(test='value')
>>> Model.schema()
{'title': 'Model', 'type': 'object', 'properties': {'test': {'title': 'Test', 'type': 'string'}}}

We can then patch the underlying OptimadeField and StrictField wrappers to default to having a default_factory that returns null in cases where there is no default value for the field to fall back on, and we can do this without modifying the schema or the models.

The only concern is that this functionality might get removed from pydantic:

The default_factory argument is in beta, it has been added to pydantic in v1.5 on a provisional basis. It may change significantly in future releases and its signature or behaviour will not be concrete until v2.

Though it has already lasted a few versions.

Alow aliases to be nested fields.

b949945

JPBergsma added 2 commits August 17, 2022 22:43

Added a line to docs to explain how to alias nested fields.

7adbf4a

Merge branch 'master' into JPBergsma/allow_nested_aliases

8c05579

JPBergsma marked this pull request as ready for review August 18, 2022 12:16

JPBergsma requested review from CasperWA and ml-evs as code owners August 18, 2022 12:16

JPBergsma added 2 commits August 27, 2022 17:50

Merge branch 'master' into JPBergsma/allow_nested_aliases

6c8294b

Merge branch 'master' into JPBergsma/allow_nested_aliases

0ebc724

JPBergsma added the on-hold label Sep 14, 2022

JPBergsma mentioned this pull request Oct 19, 2022

Add support for Python 3.11 #1361

Closed

JPBergsma added 7 commits October 19, 2022 14:52

Merge branch 'Materials-Consortia:master' into JPBergsma/allow_nested…

4388c9d

…_aliases

Merge branch 'master' into JPBergsma/allow_nested_aliases

c29cdbb

1. Added support for more versatile aliassing allowing nested fields.…

368792a

… 2.Now only the requested fields are retrieved from the backend.

Merge branch 'master' into JPBergsma/allow_nested_aliases

38fcde7

Added type hint to alias_and_prefix function.

5cec6d6

Added missing..

cbba008

Fixed bug in elastic search where the requested fields were determine…

67f2dea

…d again eventhough they had already been determined in find in entry_collections.py

JPBergsma requested a review from markus1978 as a code owner November 15, 2022 16:16

JPBergsma added 5 commits November 15, 2022 18:01

Use https://github.com/pycqa/flake8 instead of https://gitlab.com/pyc…

775009d

…qa/flake8

Added pyyaml to install requirements and added python 3.11 classifier.

1358b9d

Made a small change to the descriptions of the nested dict functions …

bb228ee

…to trigger tests.

Removed pyyaml from serverdeps as I placed it already under installre…

8245884

…quirements.

Removed get_value function from entries.py as it was no longer needed…

7b2320b

… and removed fallback value for get_non_optional_fields.

JPBergsma changed the title ~~Alow aliases to be nested fields.~~ Aliases now also work for nested fields; Only retrieve data required for constructing a response from the database. Nov 16, 2022

JPBergsma added the enhancement and config labels and removed the on-hold label Nov 16, 2022

Removed get_value function from entries.py as it was no longer needed…

d87bc1b

… and removed fallback value for get_non_optional_fields.

JPBergsma added 4 commits November 16, 2022 13:15

Simplified set_field_to_none_if_missing_in_dict and moved it out of t…

6b99822

…he class.

Moved functions related to handling nested dicts to utils.py.

841520d

Updated docstring remove_exclude_fields and removed unneccesary brack…

c22089f

…ets.

solved merge conflict.

1146273

ml-evs reviewed Nov 17, 2022

JPBergsma and others added 3 commits November 17, 2022 14:39

Updateof the explanation of the handling of nested fields

4c46f55

Co-authored-by: Matthew Evans <[email protected]>

Merged changes from master.

3a908bc

fix bug introduced by merge with master.

1e49fb8

JPBergsma force-pushed the JPBergsma/allow_nested_aliases branch from 925a5b7 to 1e49fb8 Compare November 30, 2022 13:37

Made changes to satisfy mypy.

596bb73

JPBergsma force-pushed the JPBergsma/allow_nested_aliases branch from 0a551aa to 596bb73 Compare November 30, 2022 14:02

JPBergsma added 2 commits November 30, 2022 18:38

Added option to automatically set missing but required fields to the …

703a9df

…Attributes class.

Removed get_non_optional_fields and get_schema as they are no longer …

f174850

…needed.

ml-evs reviewed Dec 2, 2022
