
Suggestions for assessing performance #22

Closed
sgbaird opened this issue Aug 22, 2022 · 7 comments

Comments


sgbaird commented Aug 22, 2022

Suggestion by @BAMcvoelker; see the Twitter post:

You could also plot the sample performance (e.g. in terms of performance quantile) instead of model performance (in terms of MAE). The plot would show success rate vs. draws. This would have the advantage of being dataset-invariant for random and grid search.

I think this is similar to (if not the same as) this Towards Data Science post comparing grid, random, and Bayesian search.

Oh, I'm also realizing I might have misinterpreted the suggestion (and caused some confusion by not including enough info). MAE refers to the MAE between some fixed target spectrum and an observed spectrum. So the MAE isn't referring to model performance from a regression-quality perspective, but rather to how well we match a (discrete) target spectrum. @BAMcvoelker, ignore this comment if that was already clear.
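For concreteness, here is a minimal sketch of what "MAE" means in this context. The variable names and values are hypothetical; the actual channel layout depends on the sensor used:

```python
import numpy as np

# Hypothetical discrete spectra measured on the same wavelength channels
target_spectrum = np.array([0.12, 0.45, 0.80, 0.33, 0.05])    # fixed target
observed_spectrum = np.array([0.10, 0.50, 0.70, 0.40, 0.08])  # current observation

# "MAE" here is the mean absolute error between the two spectra,
# not a regression metric for a surrogate model.
mae = np.mean(np.abs(target_spectrum - observed_spectrum))
print(f"MAE vs. target spectrum: {mae:.3f}")
```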


iterateccvoelker commented Aug 22, 2022

Thanks! I might not have completely understood the underlying problem; I'm still not so sure about all the details ;-).

The HGD can be used to benchmark the success of a method rather than the error. Why would you do that? Two reasons come to mind:

  1. What if your task cannot be expressed as an error, for instance if you are just interested in a certain shape or ordering of your spectrum and can accept a large offset? One example from our practice is when we are looking for good (successful) materials with zero-shot learning or transfer learning. Here it's not important whether the prediction had low error, but rather whether the predicted best material really was the best of the batch.
  2. Because success can, for instance, be defined as falling within a certain acceptable margin of error. This often-realistic assumption makes it surprisingly easy to be successful just by chance. Here is an example: if you have 2000 possible spectra and you allow a 10% deviation from the exact solution, the odds of being successful in the very first round are about 1 to 9 (i.e., 10% vs. 90%). (This may seem surprising and counterintuitive when you think about drawing just one sample out of 2000 possible spectra, but the math works out; see the sketch after this list.)
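To make the numbers concrete, here is a minimal sketch of the by-chance success probability using the hypergeometric distribution, assuming (as in the example above) that roughly 10% of the 2000 candidates fall within the acceptable margin; the counts are illustrative only:

```python
from scipy.stats import hypergeom

M = 2000  # total number of candidate spectra (finite design space)
K = 200   # candidates within the acceptable margin (~10% of M, assumed)

# Probability of drawing at least one acceptable candidate in N random draws
# (sampling without replacement), i.e. 1 - P(zero successes).
for N in [1, 5, 10, 50]:
    p_success = 1 - hypergeom.pmf(0, M, K, N)
    print(f"{N:>3} random draws -> success probability ~ {p_success:.2f}")
```

For a single draw this gives 0.10, i.e. the 1-to-9 odds mentioned above.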

Of course, an error rate also tells you that you are better off with random draws early on and that Bayesian methods catch up later. I find it somehow more intuitive to play around with sample population size vs. target population size vs. number of draws to see what to expect in terms of success rate. In my opinion it fits naturally into laboratory practice, where the question often is how much effort is required to reach a certain goal (e.g., hitting a target within a 5% tolerance) at a certain success rate (e.g., in 90% of cases).

We described some of this in our paper last year and also provided the equations (https://www.researchgate.net/publication/353340371_Sequential_learning_to_accelerate_discovery_of_alkali-activated_binders).

I hope this makes sense. You can of course draw some of the same conclusions from error rates; they are just not normalized like success rates and could thus be a bit more difficult to communicate or to compare between different data sets.

All just opinions - apologies if I went completely off topic and missed the point 😬


sgbaird commented Aug 22, 2022

@BAMcvoelker thanks for the comments!

The HGD can be used to benchmark the success of a method rather than the error.

What is HGD?

  • What if your task cannot be expressed as an error, for instance if you are just interested in a certain shape or ordering of your spectrum and can accept a large offset? One example from our practice is when we are looking for good (successful) materials with zero-shot learning or transfer learning. Here it's not important whether the prediction had low error, but rather whether the predicted best material really was the best of the batch.

Good point. I thought about comparing to a normalized distribution (maybe I'll add this as a kwarg), and I'm planning to use the Wasserstein metric instead of MAE as a more robust comparison of the discrete spectra, though technically Wasserstein might already normalize; I'll need to check on that. For "was really the best of the batch," something I'm planning is to do repeat validation runs for the "best predicted sample" to verify whether it was actually the best or just a result of measurement noise.

  • Because success can, for instance, be defined as falling within a certain acceptable margin of error. This often-realistic assumption makes it surprisingly easy to be successful just by chance. Here is an example: if you have 2000 possible spectra and you allow a 10% deviation from the exact solution, the odds of being successful in the very first round are about 1 to 9 (i.e., 10% vs. 90%). (This may seem surprising and counterintuitive when you think about drawing just one sample out of 2000 possible spectra, but the math works out.)

Counting the number of samples that fall within some constraint seems attractive from a materials discovery perspective. It's closely related to @ardunn's comment in materialsproject/matbench#150 (comment) and would probably make sense to implement here; a rough sketch is below.
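A minimal sketch of that kind of within-tolerance success metric, tracked over a campaign. The error values and the `tolerance` threshold are made up for illustration:

```python
import numpy as np

# Errors vs. the target (MAE, Wasserstein, etc.) after each iteration of a campaign
errors = np.array([0.42, 0.31, 0.28, 0.12, 0.09, 0.11, 0.04])
tolerance = 0.10  # acceptable deviation from the target (assumed)

hits = errors <= tolerance                        # which observations count as "successes"
cumulative_success = np.maximum.accumulate(hits)  # has the target been hit by iteration i?
n_within = hits.sum()                             # how many samples fall within the constraint

print("hit by iteration:", cumulative_success.astype(int))
print("samples within tolerance:", n_within)
```

Averaging `cumulative_success` over many repeat campaigns gives the success-rate-vs.-draws curve discussed above.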

Thanks again for the feedback!

@iterateccvoelker

Thank you for your comments and for the discussion. I find what you are doing truly interesting, and I hope to read more in the future.

All my comments are aimed at a scenario where you have access to the "complete" information, i.e. you know the whole space and you "only" run virtual experiments to compare different algorithms. Online benchmarking is another nut we haven't started to crack yet.

Forgot to mention: HGD (Hypergeometric Distribution).

For completeness, I've attached an example I found on my hard drive: Success rate vs. draws comparing random process (RP) and different models.

HGD_curve_Max_4.pdf


sgbaird commented Aug 26, 2022

@BAMcvoelker thanks!

For the attached figure, what does "100% required experiments" mean?

100% success rate means all samples within the target threshold $f_{(c,90)}$ were found, right? (meaning within 10% of the desired amount?)


sgbaird commented Aug 26, 2022

Note to self: the Wasserstein distance uses the CDF, which assumes a probability distribution (i.e., it normalizes), so I won't resolve any differences in brightness for the same RGB value that way. Maybe something from https://github.com/cjekel/similarity_measures would help, or I could remove the normalization from scipy.stats.wasserstein_distance (https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/stats/_stats_py.py#L8881-L8894). The former would be better because then I wouldn't need to validate it myself. The discrete Fréchet distance seems promising and has a relationship with the Wasserstein distance.
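A quick sketch of the normalization issue and one possible unnormalized alternative. The spectra below are made up, and the unnormalized variant is just an illustrative integral of the cumulative-intensity difference, not scipy's implementation:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.integrate import cumulative_trapezoid

wavelengths = np.linspace(400, 700, 301)               # nm, common grid (assumed)
base = np.exp(-0.5 * ((wavelengths - 550) / 30) ** 2)  # toy spectrum
bright = 2.0 * base                                    # same shape, double the intensity

# scipy normalizes the weights to a probability distribution, so a pure
# brightness difference is invisible:
d_norm = wasserstein_distance(wavelengths, wavelengths, base, bright)
print(f"normalized Wasserstein: {d_norm:.4f}")  # ~0

# Unnormalized variant: integrate |difference of cumulative intensities|,
# which does grow with the brightness mismatch.
cum_a = cumulative_trapezoid(base, wavelengths, initial=0)
cum_b = cumulative_trapezoid(bright, wavelengths, initial=0)
d_unnorm = np.trapz(np.abs(cum_a - cum_b), wavelengths)
print(f"unnormalized variant:   {d_unnorm:.4f}")
```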

@iterateccvoelker

I think there is a general misunderstanding here. If I understand you correctly, you are aiming at a sort of generative approach, where statistical methods are applied to find (generate) reasonable solutions in an infinite possibility space.

I come from a field where the expert (materials scientist) formulates a finite solution space in terms of a certain number of material recipes. Our goal is simply to find the best solution among the given candidates. Therefore, "100% required experiments" means that 100% of the (given) recipes are validated. In other words: if you need to create 100 material recipes and validate all of them in the lab to find the given target, you need 100% of the possible experiments. This would be the worst possible result and is not the case with any of the tested methods. For example, if we consider a success rate of 90% (where the target is reached in 90% of many randomly generated cases), the worst-performing method (GPR = Gaussian Process Regression) requires the validation of about 50% of the given materials. Of course, you can only express the required experiments in relative terms if you have a finite solution space, which doesn't seem to be the case for you.
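A minimal sketch of that kind of virtual-experiment benchmark for a random-process baseline over a finite candidate set. The candidate scores, the 5% tolerance, and the 90% success-rate target are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_candidates = 100                       # finite solution space (assumed)
scores = rng.random(n_candidates)        # hypothetical property values, higher is better
targets = scores >= 0.95 * scores.max()  # "success" = within 5% of the best recipe

def draws_until_success(rng):
    """Number of random validations needed before a target recipe is found."""
    order = rng.permutation(n_candidates)
    return 1 + int(np.argmax(targets[order]))

n_campaigns = 10_000
draws = np.array([draws_until_success(rng) for _ in range(n_campaigns)])

# Required experiments (as a fraction of the space) for a 90% success rate:
required = np.quantile(draws, 0.90) / n_candidates
print(f"random baseline needs ~{required:.0%} of the experiments "
      f"to hit the target in 90% of campaigns")
```

Replacing the random permutation with the ranking produced by a model (e.g., GPR) yields the corresponding curve for that method.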

Sorry for misunderstanding that. It took some time to get that right. Now I'm a bit smarter - thanks for that!


sgbaird commented Aug 26, 2022

@BAMcvoelker that makes sense regarding the difference between a continuous and a finite solution space. Oftentimes, formulation-based materials discovery problems have a technically continuous search space (aside from experimental limitations in measurement and dosage resolution). Thanks for clarifying! I think this could still be relevant, and I'll keep it in mind. Favoring the ability to find many good or near-optimal solutions (rather than focusing exclusively on finding the "best" solution) is important from a materials discovery perspective, especially since the notion of a single "best" can often be limiting.

@sparks-baird sparks-baird locked and limited conversation to collaborators Sep 10, 2022
@sgbaird sgbaird converted this issue into discussion #58 Sep 10, 2022
