
Evaluation measures duplicated or not present / no measure for imbalanced data available #27

Open · amueller opened this issue Jun 5, 2019 · 23 comments

amueller commented Jun 5, 2019

Related: #20

Currently no measure is computed that's useful for highly imbalanced classes.
Take, for example, the sick dataset:
https://www.openml.org/t/3021

In particular, I would like to see the "mean" measures computed (they are also helpful for comparison with D3M, cc @joaquinvanschoren).

On the other hand, the "weighted" measures are not computed, but they seem to be duplicates of the measure without the prefix, which is also weighted by class size:
https://www.openml.org/a/evaluation-measures/mean-weighted-f-measure
https://www.openml.org/a/evaluation-measures/f-measure

Though that's not entirely clear from the documentation. If the f-measure documentation is actually accurate (which I don't think it is), that would be worse, because it would be unclear for which class the f-measure is reported.
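
For concreteness, a minimal scikit-learn sketch of how the two averaging schemes differ on imbalanced data (illustrative only; this is not how the OpenML evaluation engine computes its measures):

```python
# A minimal sketch (not OpenML code) of how the two averaging schemes differ.
from sklearn.metrics import f1_score

# Heavily imbalanced ground truth; the classifier mostly predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4

# "Weighted": per-class F1 averaged with weights proportional to class support,
# so the majority class dominates (~0.95 here).
print(f1_score(y_true, y_pred, average="weighted"))

# "Macro"/"mean": unweighted mean of per-class F1, so the minority class
# counts just as much (~0.66 here).
print(f1_score(y_true, y_pred, average="macro"))
```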

joaquinvanschoren (Contributor) commented Jun 5, 2019 via email

amueller (Author) commented Jun 5, 2019

Yes, I agree with your conclusion. Let's just remove the weighted one and fix the docs.

Do you have comments on computing the other one, the mean f-measure?

joaquinvanschoren (Contributor) commented Jun 5, 2019 via email

amueller (Author) commented Jun 5, 2019

@joaquinvanschoren why is it in the drop-down then? ;)

janvanrijn (Member) commented:

> do you feel like adding this to the evaluation engine?

Not sure if adding an additional "unweighted" version would be a great idea, as these tables already put a massive load on our storage. I am open to updates in the API / evaluation engine that make this more convenient, though.

joaquinvanschoren (Contributor) commented:

@janvanrijn: That would work!

amueller (Author) commented Jun 5, 2019

I'm not sure I follow. What are the entries in the drop-down based on if not the things in the evaluation engine?

janvanrijn (Member) commented:

I would presume this list:
https://www.openml.org/api/v1/evaluationmeasure/list
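
For reference, a quick way to fetch that list (a sketch; the /json/ path variant and the shape of the response are assumptions, since the plain path returns XML by default):

```python
# Sketch: inspect which measures the server currently lists as "known".
import requests

resp = requests.get("https://www.openml.org/api/v1/json/evaluationmeasure/list")
resp.raise_for_status()
print(resp.json())  # the exact nesting of the JSON response may differ
```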

amueller (Author) commented Jun 5, 2019

Well, OK, that's a response from the backend server, right? So that's generated from the database? Shouldn't there be some synchronization between the metrics in the database and the metrics computed by the evaluation engine?

joaquinvanschoren (Contributor) commented:

The API returns a list of all measures known to OpenML:
https://www.openml.org/api/v1/evaluationmeasure/list

But indeed, not all of those are returned all the time (some are apparently never returned).

I could add a check for every measure to see if any of the runs contains that measure. I think I didn't do this before since it's not exactly cheap...

amueller (Author) commented Jun 5, 2019

I think it would be more helpful to:

  1. Have a list of what the evaluation engine computes
  2. Only show the things in the drop-down menu that are available for that particular run

I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.

amueller (Author) commented Jun 5, 2019

Also, @joaquinvanschoren, what's the definition of "known" here? Is it "it's in this database"?

joaquinvanschoren (Contributor) commented Jun 5, 2019

> Only show the things in the drop-down menu that are available for that particular run

You mean for that particular task?

> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.

It's a great time to suggest which one you'd like :).

> Also, @joaquinvanschoren, what's the definition of "known" here? Is it "it's in this database"?

Yes...

janvanrijn (Member) commented:

> Have a list of what the evaluation engine computes

Probably a mapping between task types and what an evaluation engine computes would be better. Also, officially, there can be multiple evaluation engines.
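
A hypothetical sketch of what such a mapping could look like (the task-type and measure names are illustrative only, not the actual evaluation engine configuration):

```python
# Hypothetical mapping from task type to the measures an evaluation engine
# computes for it; names are illustrative, not the real configuration.
ENGINE_MEASURES = {
    "Supervised Classification": [
        "predictive_accuracy",
        "area_under_roc_curve",
        "f_measure",
    ],
    "Supervised Regression": [
        "mean_absolute_error",
        "root_mean_squared_error",
    ],
}

def measures_for(task_type: str) -> list[str]:
    """Return the measures an engine computes for a given task type."""
    return ENGINE_MEASURES.get(task_type, [])
```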

amueller (Author) commented Jun 5, 2019

> You mean for that particular task?

Yes, sorry.

> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.

Macro F1 would be good for D3M; otherwise I'd probably prefer macro-averaged recall and/or macro-averaged AUC.

> Also, @joaquinvanschoren, what's the definition of "known" here? Is it "it's in this database"?

> Yes...

That seems... kinda circular? So that's just an arbitrary list? Alright...
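
For illustration, a small scikit-learn sketch of the measures suggested above, macro-averaged recall and macro-averaged one-vs-rest AUC, on made-up multi-class data (not OpenML code):

```python
# Illustrative only: macro-averaged recall and one-vs-rest macro AUC on a
# small, imbalanced 3-class toy example.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 2])   # class 0 dominates
y_pred = np.array([0, 0, 0, 1, 1, 0])
y_score = np.array([                     # predicted class probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.2, 0.3],
])

# Unweighted mean of per-class recall.
print(recall_score(y_true, y_pred, average="macro"))

# One-vs-rest AUC per class, then the unweighted mean over classes.
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))
```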

joaquinvanschoren (Contributor) commented:

As Jan suggested, the API could compute the macro-averaged precision, recall, F1, and AUC on the fly based on the per-class scores and return them.

amueller (Author) commented Jun 5, 2019

Not sure what "on the fly" means here.

joaquinvanschoren (Contributor) commented:

Note: for this to show up in the old frontend I'd need to finish the new indexer (which works on top of the API rather than on the database).

> Not sure what "on the fly" means here.

As Jan explained, computing these in advance would add many millions of rows to the database. The API could instead get the per-class scores, compute the macro-averages, and then return them in its response.

amueller (Author) commented Jun 5, 2019

@joaquinvanschoren OK, but then we couldn't show them on the website, right? There are hundreds of runs on a given dashboard, and that would never finish in time.

joaquinvanschoren (Contributor) commented:

It would slow down the response from the API, yes. That in turn may slow down the website.

Hard to say which is faster. Computing them on the fly means the SQL query stays equally fast, but the extra computations may slow down the final response. Adding them to the database may slow down the SQL query a bit, but keeps the response writing equally fast.

amueller (Author) commented Jun 5, 2019

I don't know how much slower the database would get from adding them, but computing them on the fly doesn't seem feasible to me. For a medium-sized dataset this could easily take a second per run, and there might be 10,000 runs to render. How many instances of the evaluation server do we run in parallel?

joaquinvanschoren (Contributor) commented Jun 5, 2019

Oh, but we wouldn't compute these from the predictions. We already store the per-class scores for all runs in the database. It would just be a matter of fetching them and computing the average.
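
A sketch of that idea, assuming the per-class F1 scores for a run are already at hand (the class labels, values, and variable names below are made up, not the actual database schema):

```python
# Hypothetical per-class F1 scores as they might be stored for one run;
# the class labels and values are invented for illustration.
per_class_f1 = {"negative": 0.98, "sick": 0.41}

# Macro ("mean") F1 is just the unweighted mean over classes; no need to
# touch the predictions at all.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(macro_f1)  # 0.695
```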

amueller (Author) commented Jun 5, 2019

Oh, right, my bad.
