Discussion on a matbench-generative benchmark: what it might look like and where to put it #150
cc @kjappelbaum
I wonder if one couldn't also try to make a guacamol-type library for materials. Some of the tasks could even be identical (e.g., novelty, similarity).
@kjappelbaum great resource. Based on some papers, I had been thinking about the novelty/validity/uniqueness style of metrics used in molecular generative applications. Planning to take a closer look at guacamol.
For now, I created matbench-genmetrics (name pending) to get some of the metrics implemented, i.e., take a list of generated structures as input and calculate values for a range of benchmark tasks. See sparks-baird/matbench-genmetrics#3. The biggest issue for me right now is surfacing the metric computation functionality from CDVAE. cc @JosephMontoya-TRI in the context of "Novel inorganic crystal structures predicted using autonomous simulation agents", which mentions in the abstract:
Similar to Guacamol is MOSES: https://github.com/molecularsets/moses. I came across it in https://dx.doi.org/10.1002/wcms.1608.
Just stumbled across this issue whilst checking in on matbench after reading 10.1002/advs.202200164. Maybe of interest is my current plan to take any open datasets of hypothetical structures (e.g., from the autonomous agent paper linked above; I have briefly discussed this with Joseph Montoya) and make them accessible via OPTIMADE APIs (I will probably host them myself for now, in lieu of the data creators hosting them themselves). My own aim is to use these hypothetical structures in experimental auto XRD refinements, plus doing some property prediction on everyone else's fancy new materials! Might this be a useful endeavor for this discussion too? The functionality for querying every OPTIMADE database simultaneously for, say, a given formula unit, is pretty much there now (database hosting flakiness aside). Perhaps this leads to materials discovery by committee, where if enough autonomous agents independently discover something, someone should probably try to actually make it!
I like the idea of a committee "vote" for certain compounds. Lots of adaptive design algorithms out there - it would probably reduce some of the risk/uncertainty of synthesizing new compounds. That's great to hear about the progress with OPTIMADE. Related: figshare: NOMAD Chemical Formulas and Calculation IDs. I went with NOMAD (directly via their API) due to some trouble with querying OPTIMADE at the time; see the discussion in "Extract chemical formulas, stability measure, identifier from all NOMAD entries excluding certain periodic elements". If you have a specific dataset + split + metric(s) that you'd want to add to matbench-genmetrics, I'd be happy to see a PR there or to adapt a usage example if you provide one.
Hey @sgbaird, I just read through this now. We have been brainstorming with TRI about what a good improvement on matbench would look like (generative, adaptive learning, etc.). These sorts of generative tests would be a great addition in my opinion. I think it could be merged into the core functionality of matbench, and I actually don't think it would require a ton of changes to the core code. However, it might fit more naturally into MOSES/Guacamol (those also being much more popular than matbench), which is fine too. Let me know what you decide and I'm glad to help or brainstorm more.
That's great to hear. I've been working separately on benchmarks for both generative metrics and adaptive learning. The latter is in the spirit of duck-typing: "if it looks like a materials optimization problem and it behaves like a materials optimization problem ..." For the generative metrics, @kjappelbaum has been helping out, and for the optimization task, I've been working with @truptimohanty and @jeet-parikh.
If you mean integrating directly into MOSES or Guacamol, I'd probably need to host a separate webpage that just follows a similar format, since MOSES stands for "Molecular Sets" and the "Mol" in GuacaMol refers to molecules. I'm curious to hear your thoughts about how integration with Matbench might work. Maybe @JoshuaMeyers or @danpol could comment on their experience getting generative metrics, benchmarks, and leaderboards set up.
Sure.

Codebase

In terms of actual implementation and code, I don't think it would require too many changes. We would need to define the datasets to be used, create a single validation file to define the schema for validation, and make some additions to the core code.

Regular problem (Fold 1)
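For context, the regular (supervised) workflow below is roughly what a single fold already looks like with the current matbench API; this is only a sketch, and `my_model` plus the chosen task subset are placeholders:

```python
from matbench.bench import MatbenchBenchmark

# One task as an example; any matbench task name works here.
mb = MatbenchBenchmark(autoload=False, subset=["matbench_expt_gap"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        # Train on the fold's training split
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        my_model.fit(train_inputs, train_outputs)  # placeholder model

        # Predict on the held-out split and record the predictions
        test_inputs = task.get_test_data(fold, include_target=False)
        predictions = my_model.predict(test_inputs)
        task.record(fold, predictions)

mb.to_file("my_results.json.gz")
```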
Adaptive problem (Fold 1)
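Purely as a hypothetical sketch of the adaptive case (none of `get_candidate_pool`, `reveal`, or `my_optimizer` exist in matbench; they are placeholders for whatever API gets designed), a single adaptive fold could be an iterative suggest-and-reveal loop with a fixed evaluation budget:

```python
budget = 100  # allowed number of objective-function evaluations per fold

for fold in task.folds:
    train_inputs, train_outputs = task.get_train_and_val_data(fold)
    candidates = task.get_candidate_pool(fold)  # placeholder: unlabeled candidate structures

    my_optimizer.initialize(train_inputs, train_outputs)  # placeholder optimizer
    for _ in range(budget):
        suggestion = my_optimizer.suggest(candidates)
        observation = task.reveal(fold, suggestion)  # placeholder: returns the hidden objective value(s)
        my_optimizer.update(suggestion, observation)

    # Record the whole trajectory so "candidates found vs. evaluations" can be scored later
    task.record(fold, my_optimizer.history)
```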
Then repeat with a different seed.

Generative problem

This might be a bit harder. Not sure what the best, or even an acceptable, way to do this is. I imagine the changes for the adaptive problems could be done without adding more than 100-200 SLOC to the entire codebase. The generative one might require more changes, but I still think it could be done.

Evaluation and Visualization (Adaptive)

We record how many solutions were found per number of function evaluations. This scheme would also work with multiple metrics (i.e., a multiobjective optimization, where Pareto-optimal points are considered "solutions"), which, as I have come to find, is basically the setup for every single materials discovery problem. I.e., the material must be ...

Then the simplest way to compare algorithms is to just plot the average number of function evaluations needed to reach a desired number of solutions, like the graph below:

[figure: average number of function evaluations vs. number of candidates found, per algorithm]

We could also show the individual objective-function response metrics on the y axis instead of "candidates found", so you could compare algorithms on more than the single dimension of "candidates found". Something like

[figure: objective metric on the y axis vs. number of function evaluations]
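As a rough sketch of how the "candidates found per number of function evaluations" comparison could be computed and plotted (synthetic data; nothing here is an existing matbench schema):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic example: for each algorithm and seed, a boolean array marking whether
# each successive function evaluation yielded a "solution" (e.g., met all targets).
rng = np.random.default_rng(0)
results = {
    "random search": rng.random((5, 200)) < 0.02,    # 5 seeds x 200 evaluations
    "adaptive design": rng.random((5, 200)) < 0.05,
}

for name, hits in results.items():
    found = np.cumsum(hits, axis=1)     # cumulative candidates found per evaluation
    mean_found = found.mean(axis=0)     # average over seeds
    plt.plot(np.arange(1, hits.shape[1] + 1), mean_found, label=name)

plt.xlabel("number of function evaluations")
plt.ylabel("avg. candidates found")
plt.legend()
plt.show()
```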
There are lots of other cool and informative visualizations we could show on the website. Here's an example of another one for a multiobjective problem:

[figure: example visualization for a multiobjective problem]

Now, I know people don't always like comparing based only on the number of objective-function evaluations, so we could also consider other metrics. For example, @JoesephMontoya-TRI suggested also accounting for the time it takes to make a prediction. Faster algorithms are preferred over slower algorithms.

Evaluation and Visualization (Generative)

I think I should probably ask you about this one...

Disclaimer

I haven't worked on this kind of optimization or generative stuff in a while and haven't put as much scientific rigor into this as I did for the original matbench paper, so apologies if it is missing something.
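To make the "Pareto-optimal points are considered solutions" idea above concrete, here is a minimal, illustrative helper (not part of matbench; it assumes every objective is to be maximized) for flagging which observed points are non-dominated:

```python
import numpy as np

def pareto_mask(Y):
    """Boolean mask of non-dominated rows of Y, shape (n_points, n_objectives), all maximized."""
    Y = np.asarray(Y, dtype=float)
    mask = np.ones(len(Y), dtype=bool)
    for i, y in enumerate(Y):
        # A point is dominated if some other point is >= in every objective and > in at least one.
        dominated_by = np.all(Y >= y, axis=1) & np.any(Y > y, axis=1)
        mask[i] = not dominated_by.any()
    return mask

# Example with two objectives (say, bulk modulus and band gap) for five candidates:
Y = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [0.5, 0.5], [2.0, 2.0]]
print(pareto_mask(Y))  # only the last candidate is non-dominated in this toy example
```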
@ardunn thanks for the response! Still digging into this a bit more and will give a detailed response soon. Also tagging @JosephMontoya-TRI since it looks like the tag above had a typo. |
This is a great discussion! To elaborate a bit on my thoughts re: time, the motivation for this on our side was both the efficiency of the selection algorithm and the time cost of the experiments themselves. Having this in after-the-fact benchmarks might be too complicated; it's a luxury of DFT simulations that the time is easily quantifiable, but it might be harder with real experiments. Time is also only one aspect of cost, so the number of experiments/observations/samples might be the way to go just because it's simple, at least in the first iteration. Also, are people calling this adaptive learning now? I like it, definitely a lot more than "active learning" (which has unfortunate overlap with an important pedagogical term) or "sequential learning" (which sounds vague). Also nice that you could keep the acronym AL.
This helps! Thanks
That's an interesting way of approaching it. In the scenario above, would the task for a single fold involve a single next suggested experiment or a budget of multiple iterations? And any thoughts on how many folds? In other words:

```python
folds = ?     # how many folds?
num_iter = ?  # how large an iteration budget per fold?
for fold in task.folds:
    train_inputs, train_outputs = task.get_train_and_val_data(fold)
    test_candidates = task.get_test_data(fold, include_target=False)
    my_model.attach_trials(train_inputs, train_outputs)
    suggested_candidates = my_model.optimize(candidates=test_candidates, num_iter=num_iter)
    task.record(fold, suggested_candidates)
```
This certainly has some appeal in that it works for hard-to-visualize 3+ objectives in a multi-objective problem, as you mentioned. I agree that virtually any realistic materials optimization problem is multi-objective, which introduces either the notion of Pareto fronts/hypervolumes or is worked around via the (usually) simpler scalarization of objectives*. I'm a bit on the fence about even including single-objective tasks. I think we could be selective about which adaptive design problems to include. I recommend looking through https://github.com/stars/sgbaird/lists/optimization-benchmarks. Physical-sciences tasks are in Olympus, Design-Bench, and maybe some others. See also #2 (comment).

generative models

For evaluation, I think it's also inherently a multi-objective problem (maybe eventually someone will suggest a single metric that becomes de facto like MAE; maybe that's already there with KL divergence, but I think the question is how to reliably compute KL divergence for crystals, compounds, etc.). If I had to choose a single metric, it would probably be rediscovery rate, in part because it's the most difficult one. For example:
I think it would make sense to base the visualizations on MOSES (see below) such as comparing distributions of properties. I'll give this some more thought.
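As a rough illustration of what a rediscovery-style metric could look like for crystals, here is a sketch using pymatgen's StructureMatcher; this is not the matbench-genmetrics implementation, and the default matcher tolerances plus the notion of a held-out "known" set are assumptions:

```python
from pymatgen.analysis.structure_matcher import StructureMatcher

def rediscovery_rate(generated, held_out, matcher=None):
    """Fraction of held-out known structures matched by at least one generated structure."""
    matcher = matcher or StructureMatcher()  # default tolerances; would need tuning in practice
    n_rediscovered = sum(
        any(matcher.fit(target, gen) for gen in generated) for target in held_out
    )
    return n_rediscovered / len(held_out)

# Usage sketch: `generated` and `held_out` are lists of pymatgen Structure objects.
# rate = rediscovery_rate(generated, held_out)
```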
Thanks for joining in! How would you picture integrating the experiment time cost into a benchmark like this?
I find myself using "adaptive design" more naturally, and I usually mention synonyms (or at least terms used to mean the same thing). I think Citrine still calls it active learning.

*I still need to understand some of the details behind [Chimera](https://github.com/aspuru-guzik-group/chimera), a scalarization framework. In some ways, maybe Pareto hypervolume acquisition functions could also be considered a type of "scalarization".
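For what it's worth, the simplest flavor of scalarization mentioned above is just a weighted sum over normalized objectives; a minimal sketch (arbitrary weights, all objectives maximized; Chimera's hierarchy-based approach is more involved than this):

```python
import numpy as np

def weighted_sum_scalarize(Y, weights):
    """Collapse an (n_points, n_objectives) array into one score per point."""
    Y = np.asarray(Y, dtype=float)
    # Min-max normalize each objective so the (arbitrary) weights are comparable.
    Y_norm = (Y - Y.min(axis=0)) / (Y.max(axis=0) - Y.min(axis=0) + 1e-12)
    return Y_norm @ np.asarray(weights, dtype=float)

# Example: trade off two objectives with a 70/30 weighting and pick the best candidate.
scores = weighted_sum_scalarize([[1.0, 200.0], [2.0, 150.0], [1.5, 180.0]], weights=[0.7, 0.3])
print(int(scores.argmax()))
```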
btw, happy to set up a brainstorming and planning meeting |
Just had a meeting with @computron about how to integrate another add-on for materials stability prediction into matbench.
@janosh I think I can make either of those work! |
I can do either of those as well @janosh, just shoot me a calendar invite!! |
@ardunn glad to have (virtually) met you and a few others during the meeting. As a follow-up, do you mind taking a look at and running https://colab.research.google.com/github/sparks-baird/matbench-genmetrics/blob/main/notebooks/1.0-matbench-genmetrics-basic.ipynb? Curious to hear your thoughts on integration. |
Citrine Informatics manuscript - a suggestion to use "discovery yield" and "discovery probability" metrics for assessing how well a model does on adaptive design materials discovery tasks: https://arxiv.org/abs/2210.13587
Originally posted by @sgbaird in #2 (comment)
Originally posted by @txie-93 in #2 (comment)
@ardunn what do you think?
matbench-generative: hosted on https://matbench.materialsproject.org/, a separate website (maybe linked to as one of the tabs on https://matbench.materialsproject.org/ if it gets enough traction) but with the core functionality in matbench, a separate project permanently, or a separate project/fork later to be incorporated into matbench? Some combination of these?

I figured this is a good place to do some brainstorming.