Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression module for isolated atoms energies #45

Merged
merged 11 commits into from
Mar 20, 2024
Merged

Regression module for isolated atoms energies #45

merged 11 commits into from
Mar 20, 2024

Conversation

FNTwin
Copy link
Collaborator

@FNTwin FNTwin commented Mar 11, 2024

Checklist:

  • Was this PR discussed in a issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation is a new function is added or an existing one is deleted.

  • Separate regression logic into an independent regressor module
  • Implement Linear and Ridge Regression
  • Add flag to recompute statistics without redownloading the datasets
  • Nan handling for the solver
  • Handling multiple level of theory
  • Subsample logic to calculate big datasets
  • Caching of the calculated dictionary and retrieval
  • Docstrings

@FNTwin FNTwin added the enhancement New feature or request label Mar 11, 2024
@shenoynikhil
Copy link
Collaborator

Looks good to me. I have tested it on openMLIP in my branch and it seems to be helpful. Related PR on openMLIP https://github.com/OpenDrugDiscovery/openMLIP/pull/149

@prtos waiting for your review.

Copy link
Contributor

@prtos prtos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR, There are some structural choices that are worth discussing but we can merge it and continue improving the structure.

@@ -192,34 +222,67 @@ def _precompute_statistics(self, overwrite_local_cache: bool = False):
def _compute_average_nb_atoms(self):
self.__average_nb_atoms__ = np.mean(self.data["n_atoms"])

def _set_linear_e0s(self):
new_e0s = [np.zeros((max(self.numbers) + 1, 21)) for _ in range(len(self.__energy_methods__))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the second dimension is 21?

Copy link
Collaborator Author

@FNTwin FNTwin Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the regressed isolated energies don't change with the charges and the isolated atom energy factory returns a matrix which columns define the charge of a species, I just fill up fill the row from -10 to 10 to avoid an index error

@@ -66,7 +66,7 @@ def push_remote(local_path, overwrite=True):
"""
remote_path = local_path.replace(get_local_cache(), get_remote_cache(write_access=overwrite))
gcp_filesys.mkdirs(os.path.dirname(remote_path), exist_ok=False)
# print(f"Pushing {local_path} file to {remote_path}, ({gcp_filesys.exists(os.path.dirname(remote_path))})")
print(f"Pushing {local_path} file to {remote_path}, ({gcp_filesys.exists(os.path.dirname(remote_path))})")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use logger.info here no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to remove it. Thanks for flagging

@@ -8,7 +8,7 @@

BOHR2ANG: Final[float] = 0.52917721092

POSSIBLE_NORMALIZATION: Final[List[str]] = ["formation", "total", "inter"]
POSSIBLE_NORMALIZATION: Final[List[str]] = ["formation", "total", "inter", "regression", "regression_inter"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggest that we should have a separate preprocessing class that take any dataset and return it with a normalized energy entry during iteration no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The POSSIBLE_NORMALIZATION is a list that I included to sync with openMLIP as openMLIP is dependent from openQDC and doesn't affect the data in any way. It is just checked when you call the get_statistics method.

Right now we are avoiding to modify the total_energy and the "normalization" is done on openMLIP and imo it is the most flexible way to handle this stuffs as other people can try to reconstruct the total_loss

@S-Thaler
Copy link
Collaborator

Thanks for this great PR @FNTwin - this brings us a big step towards a robust implementation of the different normalizations in openMLIP.

@FNTwin FNTwin mentioned this pull request Mar 20, 2024
3 tasks
@FNTwin FNTwin merged commit c3a0b49 into develop Mar 20, 2024
5 checks passed
@FNTwin FNTwin deleted the lin_atom_en branch March 20, 2024 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants