Evaluation measures duplicated or not present / no measure for imbalanced data available #27
IIRC we just use the WEKA evaluation class in the evaluation engine, which
by *default* computes the weighted average for all class-specific measures.
Hence, if you look at f-measure, you actually see the weighted average. I
agree that this is confusing.
To check, let's take this run: https://www.openml.org/r/9199162
The 'large' number that you see with the F-measure is the one also used on
the task page. And if you compute the weighted F-measure you can see that
this is indeed the value you expect: 0.9917 * (3541/3772) + 0.8625 * (231/3772) ≈ 0.9838
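The same check as a quick Python sketch, using the per-class F-measures and class counts quoted above:

```python
# Weighted-average F-measure for run 9199162, reproduced from the
# per-class scores and class counts quoted above.
per_class_f = [0.9917, 0.8625]
class_counts = [3541, 231]
total = sum(class_counts)  # 3772

weighted_f = sum(f * n / total for f, n in zip(per_class_f, class_counts))
print(round(weighted_f, 4))  # 0.9838 -- the value reported as plain 'f_measure'
```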
If you check the API: https://www.openml.org/api/v1/run/9199162
You can see that it returns the weighted score, but simply called 'f_measure'.
The evaluation measure documentation is clearly wrong, and
https://www.openml.org/a/evaluation-measures/f-measure doesn't say anything
about weighting.
What to do...
Changing the naming in the API/database would be a very big change. It's
probably best to fix the documentation, explaining that the non-prefixed
versions compute the weighted average and removing the confusing
'mean-weighted-f-measure'?
Thoughts?
Yes, I agree with your conclusion. Let's just remove the weighted one and fix the docs. Do you have comments on computing the other one, the mean f-measure?
AFAIK we don't compute the mean f-measure in the backend; you'd need to grab the per-class scores and average them yourself, I'm afraid.
@janvanrijn: do you feel like adding this to the evaluation engine?
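In the meantime, a minimal sketch of that do-it-yourself averaging, assuming the per-class F-measures have already been pulled out of the run's evaluation listing (the helper name and input format are just illustrative):

```python
# Minimal sketch: macro (unweighted) average over per-class scores.
# Assumes the per-class values were already extracted from the API response.
def macro_average(per_class_scores):
    """Unweighted mean over classes, i.e. the 'mean' f-measure discussed here."""
    return sum(per_class_scores) / len(per_class_scores)

# With the two per-class F-measures from the run above:
print(macro_average([0.9917, 0.8625]))  # 0.9271, vs. ~0.9838 for the weighted score
```

On an imbalanced task, the gap between those two numbers is exactly the point of this issue.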
@joaquinvanschoren why is it in the drop-down then? ;)
Not sure if adding an additional 'unweighted' version would be a great idea, as these tables already put a massive load on our storage. I am open to updates in the API / evaluation engine that make this more convenient, though.
@janvanrijn: That would work!
I'm not sure I follow. What are the entries in the drop-down based on if not the things in the evaluation engine?
I would presume this list:
Well, OK, that's a response from the backend server, right? So that's generated from the database? Shouldn't there be some synchronization between the metrics in the database and the metrics computed by the evaluation engine?
The API returns a list of all measures known to OpenML. But indeed, not all of those are returned all the time (some are never, apparently). I could add a check for every measure to see if any of the runs contains that measure. I think I didn't do this before since it's not exactly cheap...
I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
Also, @joaquinvanschoren, what's the definition of "known" in this? Is it "it's in this database"?
You mean for that particular task?
It's a great time to suggest which one you'd like :).
Yes...
Probably, a mapping between task types and what an evaluation engine computes. Also, officially, there can be multiple evaluation engines.
yes, sorry
Macro F1 would be good for D3M; otherwise I'd probably prefer macro-average recall and/or macro-average AUC.
That seems... kinda circular? So that's just an arbitrary list? Alright...
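For concreteness, a sketch of those suggested measures in scikit-learn terms, on made-up toy data (illustrative only, not how the OpenML evaluation engine computes anything):

```python
# Toy illustration of the suggested macro-averaged measures (not OpenML code).
import numpy as np
from sklearn.metrics import f1_score, recall_score, roc_auc_score

# Made-up, imbalanced binary example just to show the calls.
y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred  = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.6, 0.9, 0.4])  # P(class 1)

print(f1_score(y_true, y_pred, average="macro"))      # macro F1 (unweighted mean over classes)
print(recall_score(y_true, y_pred, average="macro"))  # macro recall == balanced accuracy
print(roc_auc_score(y_true, y_score))                 # AUC; for multi-class scores use
                                                      # average="macro", multi_class="ovr"
```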
As Jan suggested, the API could compute the macro-averaged precision, recall, F1, and AUC on the fly based on the per-class scores and return them.
not sure what "on the fly" means here. |
Note: for this to show up in the old frontend I'd need to finish the new indexer (which works on top of the API rather than on the database).
As Jan explained, computing these in advance would add many millions of rows to the database. The API could instead get the per-class scores, compute the macro-averages, and then return them in its response.
@joaquinvanschoren OK, but then we couldn't show them on the website, right? There are hundreds of runs on a given dashboard and that would never finish in time.
It would slow down the response from the API, yes. That in turn may slow down the website. Hard to say what is faster. Computing them on the fly means that the SQL query is equally fast but the extra computations may slow down the final response. Adding them to the database may slow down the SQL query a bit but keeps the response writing equally fast.
I don't know how much slower the database would get if we added them there, but on the fly doesn't seem feasible to me. For a medium-sized dataset this could easily take a second per run, and there might be 10,000 runs to render. How many instances of the evaluation server do we run in parallel?
Oh, but we wouldn't compute these from the predictions. We already store the per-class scores for all runs in the database. It would just be a matter of fetching them and computing the average.
Oh, right, my bad.
Related: #20
Currently no measure is computed that's useful for highly imbalanced classes.
Take for example sick:
https://www.openml.org/t/3021
I would like to see the "mean" measures be computed in particular (they also are helpful for comparison with D3M, cc @joaquinvanschoren).
On the other hand, the "weighted" measures are not computed but seem to be duplicates of the measure without prefix, which is also weighted by class size:
https://www.openml.org/a/evaluation-measures/mean-weighted-f-measure
https://www.openml.org/a/evaluation-measures/f-measure
Though that's not entirely clear from the documentation. If the f-measure documentation is actually accurate (which I don't think it is), that would be worse because it's unclear for which class the f-measure is reported.