Suggestions for assessing performance #22
Thanks! I think I might not have 100% understood the underlying problem - still not so sure about all the details ;-). The HGD can be used to benchmark the success of a method rather than the error. Why would you do that? Two reasons come to mind.
First, I find it somehow more intuitive to play around with sample population size vs. target population size vs. number of draws to see what to expect in terms of a success rate. Of course, an error rate also tells you that you are better off with a random draw early on and that Bayesian methods catch up at a later point, but in my opinion the success-rate view fits naturally into laboratory practice, where the question often is how much effort is required to reach a certain goal (e.g. to hit a certain target within a 5% tolerance) at a certain success rate (e.g. in 90% of the cases). We described some of this in our paper last year and also provided the equations (https://www.researchgate.net/publication/353340371_Sequential_learning_to_accelerate_discovery_of_alkali-activated_binders).
Second, while you can of course draw some of the same conclusions from error rates, they are just not normalized like success rates, so they can be a bit more difficult to communicate or to compare between different data sets.
I hope this makes sense. All just opinions - apologies if I went completely off topic and missed the point 😬
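To make the success-rate framing concrete, here is a minimal sketch (pool sizes are illustrative assumptions, not taken from the paper) of the random-draw baseline: the probability of having found at least one acceptable candidate after a given number of draws without replacement, computed from the hypergeometric distribution via `scipy.stats.hypergeom`.

```python
import numpy as np
from scipy.stats import hypergeom

# Hypothetical finite candidate pool (numbers are assumptions for illustration).
N_pool = 100    # total number of candidate formulations
N_target = 5    # candidates that hit the target within the tolerance

draws = np.arange(1, N_pool + 1)
# Success rate of a purely random process: P(at least one target candidate
# among the first d validated samples), drawing without replacement.
success_rate = 1.0 - hypergeom.pmf(0, N_pool, N_target, draws)

for d in (5, 10, 20, 50):
    print(f"{d:>3} draws -> success rate {success_rate[d - 1]:.2f}")
```

Any sequential-learning method can then be benchmarked by how far its empirical success-rate curve sits above this random baseline.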
@BAMcvoelker thanks for the comments!
What is HGD?
Good point - I thought about comparing to a normalized distribution (maybe I'll add this as a kwarg), and I'm planning to use the Wasserstein metric instead of MAE as a more robust comparison of the discrete spectra. Though technically I think Wasserstein might normalize it already - will need to check on that. For "was really the best of the batch", something I'm planning is to do repeat validation runs for the "best predicted sample" to verify whether it was actually the best or just a result of noise in the measurement.
The number of samples that fall within some constraint seems nice from a materials discovery perspective - pretty related to @ardunn's comment in materialsproject/matbench#150 (comment), and something that would probably make sense to implement here. Thanks again for the feedback!
Thank you for your comments and for the discussion. I find what you are doing truly interesting and I hope to read more in the future. All my comments are aimed at a scenario where you have access to the "complete" information, i.e. you know the whole space and you "only" run virtual experiments to compare different algorithms. Online benchmarking is another nut we haven't started to crack yet.
Forgot to mention: HGD = Hypergeometric Distribution. For completeness, I've attached an example I found on my hard drive: success rate vs. draws, comparing a random process (RP) and different models.
@BAMcvoelker thanks! For the attached figure, what does 100% required experiments mean? I assume 100% success rate means all samples are within the target threshold.
Note to self: the Wasserstein distance uses the CDF, which assumes a probability distribution (i.e. it normalizes the distribution), so I won't be resolving any differences in brightness for the same RGB value that way. Maybe something from https://github.com/cjekel/similarity_measures would help, or I could remove the normalization from …
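A small sketch of why this normalization matters, assuming discrete spectra stored as intensity arrays over a shared wavelength grid (all names and numbers below are hypothetical, not from this repo): `scipy.stats.wasserstein_distance` normalizes the weights to a probability distribution, so two spectra that differ only by an overall intensity (brightness) factor come out identical, while a plain MAE still sees the difference.

```python
import numpy as np
from scipy.stats import wasserstein_distance

wavelengths = np.linspace(400, 700, 31)                   # hypothetical grid (nm)
target = np.exp(-0.5 * ((wavelengths - 550) / 30) ** 2)   # hypothetical target spectrum
observed = 2.0 * target                                   # same shape, twice as bright

# Wasserstein on the wavelength axis with intensities as weights:
# the weights get normalized, so the brightness difference disappears.
d_wass = wasserstein_distance(wavelengths, wavelengths,
                              u_weights=target, v_weights=observed)
mae = np.mean(np.abs(target - observed))

print(f"Wasserstein: {d_wass:.4f}")  # ~0.0 despite the 2x intensity difference
print(f"MAE:         {mae:.4f}")     # nonzero: intensity difference is visible
```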
I think there is a general misunderstanding here. If I understand you correctly, you are aiming at a sort of generative approach, where statistical methods are applied to find (generate) reasonable solutions in an infinite possibility space. I come from a field where the expert (materials scientist) formulates a finite solution space in terms of a certain number of material recipes. Our goal is simply to find the best solution among the given solutions.
Therefore, 100% required experiments means that 100% of the (given) recipes are validated. In other words: if you need to create 100 material recipes and validate all of them in the lab to find the given target, you need 100% of the possible experiments. This would be the worst possible result and is not the case with any of the tested methods. For example, if we consider a success rate of 90% (where the target is reached in 90% of many randomly generated cases), the worst of the tested methods (GPR = Gaussian Process Regression) requires the validation of about 50% of the given materials.
Of course, you can only express the required experiments in relative terms if you have a finite solution space - which doesn't seem to be the case for you. Sorry for misunderstanding that. It took some time to get that right. Now I'm a bit smarter - thanks for that!
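To tie this back to the HGD: a short sketch (pool sizes are assumptions for illustration, not from the paper) of the "required experiments" a purely random process would need in a finite recipe pool to reach, say, a 90% success rate.

```python
from scipy.stats import hypergeom

N_pool = 100   # number of given recipes (assumed)
N_target = 3   # recipes that actually meet the target (assumed)
goal = 0.90    # desired success rate

# Smallest fraction of the pool that must be validated (at random, without
# replacement) so that P(at least one target recipe found) >= goal.
for n_validated in range(1, N_pool + 1):
    p_success = 1.0 - hypergeom.pmf(0, N_pool, N_target, n_validated)
    if p_success >= goal:
        print(f"Random baseline: {n_validated}/{N_pool} recipes "
              f"= {100 * n_validated / N_pool:.0f}% required experiments")
        break
```

A sequential-learning method that reached the same success rate after validating, say, 20% of the recipes would then be reported as needing 20% required experiments.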
@BAMcvoelker that makes sense with the difference between a continuous vs. a finite solution space. Oftentimes, formulation-based materials discovery problems have a technically continuous search space (aside from experimental limitations in measurement and dosage resolution). Thanks for clarifying! I think this could still be relevant, and I'll keep it in mind. I think favoring the ability to find many good or near-optimal solutions (rather than focusing exclusively on finding the "best" solution) is important from a materials discovery perspective, especially since the notion of a single "best" can be limiting.
This issue was moved to a discussion. You can continue the conversation there.
Suggestion by @BAMcvoelker, see Twitter post:
I think this is similar to (if not the same as) this Towards Data Science post comparing grid, random, and Bayesian search.
Oh, also realizing I might have misinterpreted the suggestion (and caused some confusion by not including enough info). MAE here refers to the MAE between some fixed target spectrum and an observed spectrum. Hence the MAE isn't describing model performance from a regression-quality perspective, but rather how well we match a (discrete) target spectrum. @BAMcvoelker, ignore this comment if that was already clear.
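For clarity, the quantity meant here is just the mean absolute difference between the two discretized spectra, e.g. (arrays below are hypothetical placeholders):

```python
import numpy as np

# Fixed target spectrum and an observed spectrum on the same discrete grid
# (values are made up for illustration).
target_spectrum = np.array([0.10, 0.40, 0.90, 0.50, 0.20])
observed_spectrum = np.array([0.20, 0.50, 0.70, 0.40, 0.30])

mae = np.mean(np.abs(target_spectrum - observed_spectrum))
print(f"MAE vs. target spectrum: {mae:.3f}")
```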