Regression module for isolated atoms energies #45

FNTwin · 2024-03-11T14:13:42Z

Checklist:

Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added or an existing one is deleted.

Separate regression logic into an independent regressor module
Implement Linear and Ridge Regression
Add flag to recompute statistics without redownloading the datasets
NaN handling for the solver
Handling multiple levels of theory
Subsample logic to calculate big datasets
Caching of the calculated dictionary and retrieval
Docstrings
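
To illustrate the idea behind the items above, here is a minimal NumPy sketch of regressing per-element energies from composition counts. It is not the code in openqdc/utils/regressor.py: the function name `fit_atomic_energies`, its parameters, and the toy data are all assumptions made for illustration. It covers the linear and ridge solves, NaN handling, and subsampling for a single level of theory.

```python
import numpy as np


def fit_atomic_energies(counts, energies, ridge_lambda=0.0, max_samples=None, seed=0):
    """Hypothetical helper: regress one energy per element.

    counts:   (n_molecules, n_elements) number of atoms of each element
    energies: (n_molecules,) total energies for one level of theory
    """
    counts = np.asarray(counts, dtype=float)
    energies = np.asarray(energies, dtype=float)

    # NaN handling: skip molecules with a missing energy.
    mask = ~np.isnan(energies)
    X, y = counts[mask], energies[mask]

    # Subsampling: solve on a random subset when the dataset is large.
    if max_samples is not None and len(y) > max_samples:
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y), size=max_samples, replace=False)
        X, y = X[idx], y[idx]

    # Normal equations; ridge_lambda=0 reduces to ordinary least squares.
    A = X.T @ X + ridge_lambda * np.eye(X.shape[1])
    b = X.T @ y
    return np.linalg.lstsq(A, b, rcond=None)[0]


# Toy data (made up): columns are H, C, O counts per molecule.
e0_true = np.array([-0.5, -37.8, -75.0])
toy_counts = np.array([[2, 0, 1], [4, 1, 0], [0, 1, 2], [2, 1, 1], [6, 2, 0]])
toy_energies = toy_counts @ e0_true
toy_energies[3] = np.nan  # one missing label, ignored by the solver

fitted = fit_atomic_energies(toy_counts, toy_energies)
```

For several levels of theory, the same function could be called once per energy column, since each column may have a different NaN pattern.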

shenoynikhil · 2024-03-14T00:27:45Z

Looks good to me. I have tested it on openMLIP in my branch and it seems to be helpful. Related PR on openMLIP https://github.com/OpenDrugDiscovery/openMLIP/pull/149

@prtos waiting for your review.

prtos

Great PR. There are some structural choices that are worth discussing, but we can merge it and continue improving the structure.

prtos · 2024-03-14T19:16:43Z

openqdc/datasets/base.py

@@ -192,34 +222,67 @@ def _precompute_statistics(self, overwrite_local_cache: bool = False):
    def _compute_average_nb_atoms(self):
        self.__average_nb_atoms__ = np.mean(self.data["n_atoms"])

+    def _set_linear_e0s(self):
+        new_e0s = [np.zeros((max(self.numbers) + 1, 21)) for _ in range(len(self.__energy_methods__))]


Why is the second dimension 21?

Because the regressed isolated atom energies don't depend on the charge, and the isolated atom energy factory returns a matrix whose columns correspond to the charge of a species, I just fill the whole row from -10 to 10 to avoid an index error.
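
A small sketch of what that reply describes, under assumptions: the helper name `broadcast_e0s`, the constant `MAX_CHARGE`, and the example values are hypothetical and not taken from the PR. Each regressed energy is copied across all 21 charge columns, so a lookup with any charge between -10 and 10 returns the same value.

```python
import numpy as np

MAX_CHARGE = 10  # charges -10..10 give 2 * 10 + 1 = 21 columns


def broadcast_e0s(regressed_e0, atomic_numbers):
    """Hypothetical helper: one row per atomic number, one column per charge."""
    n_charges = 2 * MAX_CHARGE + 1
    matrix = np.zeros((max(atomic_numbers) + 1, n_charges))
    for z, e0 in zip(atomic_numbers, regressed_e0):
        # The regressed energy is charge independent, so fill the whole row.
        matrix[z, :] = e0
    return matrix


# Made-up energies for H, C, O (atomic numbers 1, 6, 8).
e0_matrix = broadcast_e0s([-0.5, -37.8, -75.0], [1, 6, 8])
```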

prtos · 2024-03-14T19:22:13Z

openqdc/utils/io.py

@@ -66,7 +66,7 @@ def push_remote(local_path, overwrite=True):
    """
    remote_path = local_path.replace(get_local_cache(), get_remote_cache(write_access=overwrite))
    gcp_filesys.mkdirs(os.path.dirname(remote_path), exist_ok=False)
-    # print(f"Pushing {local_path} file to {remote_path}, ({gcp_filesys.exists(os.path.dirname(remote_path))})")
+    print(f"Pushing {local_path} file to {remote_path}, ({gcp_filesys.exists(os.path.dirname(remote_path))})")


Better to use logger.info here, no?

Forgot to remove it. Thanks for flagging.
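
For reference, the reviewer's suggestion could look roughly like the sketch below. It uses the standard-library `logging` module as an assumption; the project may use a different logger, and `log_push` and the logger name are hypothetical stand-ins for the line inside `push_remote`.

```python
import logging

logger = logging.getLogger("openqdc.io.sketch")  # hypothetical logger name


def log_push(local_path: str, remote_path: str, remote_dir_exists: bool) -> str:
    """Hypothetical helper: emit the message through a logger instead of print."""
    message = f"Pushing {local_path} file to {remote_path}, ({remote_dir_exists})"
    # Only shown when the logging configuration enables INFO level.
    logger.info(message)
    return message
```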

prtos · 2024-03-14T19:34:02Z

openqdc/utils/constants.py

@@ -8,7 +8,7 @@

 BOHR2ANG: Final[float] = 0.52917721092

-POSSIBLE_NORMALIZATION: Final[List[str]] = ["formation", "total", "inter"]
+POSSIBLE_NORMALIZATION: Final[List[str]] = ["formation", "total", "inter", "regression", "regression_inter"]


This suggests that we should have a separate preprocessing class that takes any dataset and returns it with a normalized energy entry during iteration, no?

POSSIBLE_NORMALIZATION is a list that I included to stay in sync with openMLIP, since openMLIP depends on openQDC. It doesn't affect the data in any way; it is only checked when you call the get_statistics method.

Right now we avoid modifying total_energy; the "normalization" is done in openMLIP. IMO this is the most flexible way to handle it, since other people can try to reconstruct the total_loss.
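
A rough sketch of the kind of check described in that reply, assuming the list as it appears in the diff above. The function `check_normalization` is hypothetical and is not the actual get_statistics implementation. It only validates a name and never touches the data.

```python
from typing import Final, List

# Copied from the diff in this thread.
POSSIBLE_NORMALIZATION: Final[List[str]] = ["formation", "total", "inter", "regression", "regression_inter"]


def check_normalization(normalization: str) -> str:
    """Hypothetical validation step: reject unknown normalization names."""
    if normalization not in POSSIBLE_NORMALIZATION:
        raise ValueError(
            f"Unknown normalization {normalization!r}, expected one of {POSSIBLE_NORMALIZATION}"
        )
    return normalization
```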

S-Thaler · 2024-03-15T14:20:36Z

Thanks for this great PR @FNTwin - this brings us a big step towards a robust implementation of the different normalizations in openMLIP.

FNTwin added 6 commits February 21, 2024 09:09

Regressor utilities + ridge

5fc0074

per atom norm stats

d70d4f8

Implemented force_mask

03be712

Dummy fix

c12ccba

Force mask improvement

9ac602e

Regressor docstrings, merged develop, multi method solve

500f752

FNTwin added the enhancement New feature or request label Mar 11, 2024

Remove prints

baf91bd

prtos reviewed Mar 14, 2024

S-Thaler reviewed Mar 15, 2024

FNTwin added 2 commits March 15, 2024 14:32

Adressed docstring comments, normalization names + nanstd issue

9f23613

Made stephan happy :)

8b4253b

FNTwin mentioned this pull request Mar 15, 2024

Switch custom regressor for sklearn #47

Open

FNTwin added 2 commits March 19, 2024 14:42

Merged origin/force_mask + made it backward compatible with older MLIP versions

a7a0847

Solved merge issue with develop

2b75de3

prtos approved these changes Mar 20, 2024

FNTwin mentioned this pull request Mar 20, 2024

[On hold] Force mask #43

Closed

3 tasks

FNTwin merged commit c3a0b49 into develop Mar 20, 2024
5 checks passed

FNTwin deleted the lin_atom_en branch March 20, 2024 16:52
