
Evaluation measures duplicated or not present / no measure for imbalanced data available #27

Open · amueller opened this issue Jun 5, 2019 · 23 comments

amueller commented Jun 5, 2019

Related: #20

Currently no measure is computed that's useful for highly imbalanced classes.
Take, for example, the sick dataset:
https://www.openml.org/t/3021

In particular, I would like to see the "mean" measures computed (they are also helpful for comparison with D3M, cc @joaquinvanschoren).

On the other hand, the "weighted" measures are not computed, but they seem to be duplicates of the measure without the prefix, which is also weighted by class size:
https://www.openml.org/a/evaluation-measures/mean-weighted-f-measure
https://www.openml.org/a/evaluation-measures/f-measure

Though that's not entirely clear from the documentation. If the f-measure documentation is actually accurate (which I don't think it is), that would be worse, because it would be unclear for which class the f-measure is reported.
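
For concreteness, a minimal scikit-learn sketch of how the two averaging schemes differ on imbalanced data (illustrative only; this is not how the OpenML evaluation engine computes its measures):

```python
# A minimal sketch (not OpenML code) of how the two averaging schemes differ.
from sklearn.metrics import f1_score

# Heavily imbalanced ground truth; the classifier mostly predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4

# "Weighted": per-class F1 averaged with weights proportional to class support,
# so the majority class dominates (~0.95 here).
print(f1_score(y_true, y_pred, average="weighted"))

# "Macro"/"mean": unweighted mean of per-class F1, so the minority class
# counts just as much (~0.66 here).
print(f1_score(y_true, y_pred, average="macro"))
```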

joaquinvanschoren (Contributor) commented Jun 5, 2019 via email

amueller (Author) commented Jun 5, 2019

Yes, I agree with your conclusion. Let's just remove the weighted one and fix the docs.

Do you have comments on computing the other one, the mean f-measure?

joaquinvanschoren (Contributor) commented Jun 5, 2019 via email

amueller (Author) commented Jun 5, 2019

@joaquinvanschoren why is it in the drop-down then? ;)

janvanrijn (Member) commented:

> do you feel like adding this to the evaluation engine?

Not sure if adding an additional "unweighted" version would be a great idea, as these tables already put a massive load on our storage. I am open to updates in the API / evaluation engine that make this more convenient, though.

joaquinvanschoren (Contributor) commented:

@janvanrijn: That would work!

amueller (Author) commented Jun 5, 2019

I'm not sure I follow. What are the entries in the drop-down based on if not the things in the evaluation engine?

janvanrijn (Member) commented:

I would presume this list:
https://www.openml.org/api/v1/evaluationmeasure/list
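
For reference, a quick way to fetch that list (a sketch; the /json/ path variant and the shape of the response are assumptions, since the plain path returns XML by default):

```python
# Sketch: inspect which measures the server currently lists as "known".
import requests

resp = requests.get("https://www.openml.org/api/v1/json/evaluationmeasure/list")
resp.raise_for_status()
print(resp.json())  # the exact nesting of the JSON response may differ
```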

amueller (Author) commented Jun 5, 2019

Well, OK, that's a response from the backend server, right? So that's generated from the database? Shouldn't there be some synchronization between the metrics in the database and the metrics computed by the evaluation engine?

joaquinvanschoren (Contributor) commented:

The API returns a list of all measures known to OpenML:
https://www.openml.org/api/v1/evaluationmeasure/list

But indeed, not all of those are returned all the time (some are apparently never returned).

I could add a check for every measure to see if any of the runs contains that measure. I think I didn't do this before since it's not exactly cheap...

amueller (Author) commented Jun 5, 2019

I think it would be more helpful to:

  1. Have a list of what the evaluation engine computes
  2. Only show the things in the drop-down menu that are available for that particular run

I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.

amueller (Author) commented Jun 5, 2019

Also, @joaquinvanschoren, what's the definition of "known" here? Is it "it's in this database"?

joaquinvanschoren (Contributor) commented Jun 5, 2019

> Only show the things in the drop-down menu that are available for that particular run

You mean for that particular task?

> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.

It's a great time to suggest which one you'd like :).

> Also, @joaquinvanschoren, what's the definition of "known" here? Is it "it's in this database"?

Yes...

janvanrijn (Member) commented:

> Have a list of what the evaluation engine computes

Probably a mapping between task types and what an evaluation engine computes would be better. Also, officially, there can be multiple evaluation engines.
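
A hypothetical sketch of what such a mapping could look like (the task-type and measure names are illustrative only, not the actual evaluation engine configuration):

```python
# Hypothetical mapping from task type to the measures an evaluation engine
# computes for it; names are illustrative, not the real configuration.
ENGINE_MEASURES = {
    "Supervised Classification": [
        "predictive_accuracy",
        "area_under_roc_curve",
        "f_measure",
    ],
    "Supervised Regression": [
        "mean_absolute_error",
        "root_mean_squared_error",
    ],
}

def measures_for(task_type: str) -> list[str]:
    """Return the measures an engine computes for a given task type."""
    return ENGINE_MEASURES.get(task_type, [])
```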

amueller (Author) commented Jun 5, 2019

> You mean for that particular task?

Yes, sorry.

> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.

Macro F1 would be good for D3M; otherwise I'd probably prefer macro-averaged recall and/or macro-averaged AUC.

> Also, @joaquinvanschoren, what's the definition of "known" here? Is it "it's in this database"?

> Yes...

That seems... kinda circular? So that's just an arbitrary list? Alright...
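
For illustration, a small scikit-learn sketch of the measures suggested above, macro-averaged recall and macro-averaged one-vs-rest AUC, on made-up multi-class data (not OpenML code):

```python
# Illustrative only: macro-averaged recall and one-vs-rest macro AUC on a
# small, imbalanced 3-class toy example.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 2])   # class 0 dominates
y_pred = np.array([0, 0, 0, 1, 1, 0])
y_score = np.array([                     # predicted class probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.2, 0.3],
])

# Unweighted mean of per-class recall.
print(recall_score(y_true, y_pred, average="macro"))

# One-vs-rest AUC per class, then the unweighted mean over classes.
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))
```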

joaquinvanschoren (Contributor) commented:

As Jan suggested, the API could compute the macro-averaged precision, recall, F1, and AUC on the fly based on the per-class scores and return them.

amueller (Author) commented Jun 5, 2019

Not sure what "on the fly" means here.

joaquinvanschoren (Contributor) commented:

Note: for this to show up in the old frontend I'd need to finish the new indexer (which works on top of the API rather than on the database).

> Not sure what "on the fly" means here.

As Jan explained, computing these in advance would add many millions of rows to the database. The API could instead get the per-class scores, compute the macro-averages, and then return them in its response.

amueller (Author) commented Jun 5, 2019

@joaquinvanschoren OK, but then we couldn't show them on the website, right? There are hundreds of runs on a given dashboard, and that would never finish in time.

joaquinvanschoren (Contributor) commented:

It would slow down the response from the API, yes. That in turn may slow down the website.

Hard to say which is faster. Computing them on the fly means the SQL query stays equally fast, but the extra computations may slow down the final response. Adding them to the database may slow down the SQL query a bit, but keeps the response writing equally fast.

amueller (Author) commented Jun 5, 2019

I don't know how much slower the database would get from adding them, but computing them on the fly doesn't seem feasible to me. For a medium-sized dataset this could easily take a second per run, and there might be 10,000 runs to render. How many instances of the evaluation server do we run in parallel?

joaquinvanschoren (Contributor) commented Jun 5, 2019

Oh, but we wouldn't compute these from the predictions. We already store the per-class scores for all runs in the database. It would just be a matter of fetching them and computing the average.
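
A sketch of that idea, assuming the per-class F1 scores for a run are already at hand (the class labels, values, and variable names below are made up, not the actual database schema):

```python
# Hypothetical per-class F1 scores as they might be stored for one run;
# the class labels and values are invented for illustration.
per_class_f1 = {"negative": 0.98, "sick": 0.41}

# Macro ("mean") F1 is just the unweighted mean over classes; no need to
# touch the predictions at all.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(macro_f1)  # 0.695
```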

amueller (Author) commented Jun 5, 2019

Oh, right, my bad.
