Unsupervised models and prediction #99
Most unsupervised learning does not consider prediction part of the task. I suggest not approaching these algorithms from a supervised "fit then predict" mindset at all; it does not do them justice.
Thanks a lot for your feedback!
I agree that a post-hoc nearest-neighbors approach to enable prediction does not feel right and does not do the algorithms justice. However, I'm mainly interested in unsupervised outlier detection algorithms that can be used for prediction fairly naturally. As far as I can tell, this includes most of the neighbor-based approaches, where you can simply compare a new point to the points in the training set. Edit: I also wouldn't say, for example, that prediction disturbs the meaning of the LOF score.
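To illustrate what "compare a new point to the points in the training set" can look like, here is a minimal sketch of a neighbor-based detector with a fit/score interface. The class and method names are illustrative, not taken from OutlierDetection.jl, ELKI, or any other library:

```python
import math

class KNNDetector:
    """Sketch of a neighbor-based outlier detector that supports
    scoring previously unseen points. Illustrative only."""

    def __init__(self, k=3):
        self.k = k
        self.train = []

    def fit(self, points):
        # "Training" just memorizes the training points.
        self.train = list(points)
        return self

    def score(self, point):
        # Outlier score: distance to the k-th nearest training point.
        dists = sorted(math.dist(point, p) for p in self.train)
        return dists[self.k - 1]

detector = KNNDetector(k=2).fit([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
low = detector.score((0.5, 0.5))    # close to the training data -> small score
high = detector.score((10.0, 10.0)) # far from the training data -> large score
```

Note that scoring a new point this way never alters the fitted model, which is why (as discussed above) it does not disturb the meaning of the training-set scores, even though the two sets of scores are computed under different conditions.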
It doesn't disturb the meaning, but the results of prediction will not agree with the results of the original algorithm.
And in many cases it will even be better to depart from, e.g., LOF and try to cleanly transfer the key idea into an open-world "train-test split" setting, rather than hacking this into the original closed-world setting. This makes things both a lot easier and more efficient than trying to stay close to the original LOF.
@davnn for implementing a set of outlier detection algorithms in Julia, it may be worth trying to implement the generalized pattern introduced here: Schubert, E., Zimek, A. & Kriegel, H.-P. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min Knowl Disc 28, 190–237 (2014). https://doi.org/10.1007/s10618-012-0300-z

This avoids redundancy when implementing many related detectors. It's not possible to do this very elegantly in Java right now, because generics cannot be primitives, and most of the time we are dealing with primitive values (e.g., a local density estimate). There is a partial implementation of this pattern in ELKI for the parallelized versions, but that code has to work around a lot of Java limitations. For a Julia implementation, however, this may be an interesting starting point: approach it as a multi-stage mapping process, with the stages shared across different detectors.
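One way to read that multi-stage idea is roughly: (1) derive a context (e.g. a neighborhood) per point, (2) map each context to a local model value (e.g. a density estimate), (3) compare each point's model to those of its context. The sketch below is an illustrative reading under those assumptions, not the paper's exact formalism, and the LOF-like formulas at the end are simplified:

```python
import math

def knn_context(points, k):
    """Stage 1: for each point, the indices of its k nearest neighbors."""
    def neighbors(i):
        order = sorted(range(len(points)),
                       key=lambda j: math.dist(points[i], points[j]))
        return [j for j in order if j != i][:k]
    return [neighbors(i) for i in range(len(points))]

def local_model(points, context, estimate):
    """Stage 2: map each point's context to a local model value."""
    return [estimate(points, i, ctx) for i, ctx in enumerate(context)]

def compare(models, context, combine):
    """Stage 3: compare each point's model to its neighbors' models."""
    return [combine(models[i], [models[j] for j in ctx])
            for i, ctx in enumerate(context)]

# A simplified LOF-like detector assembled from the shared stages:
def mean_knn_dist(points, i, ctx):
    return sum(math.dist(points[i], points[j]) for j in ctx) / len(ctx)

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
ctx = knn_context(points, k=2)
density = [1.0 / d for d in local_model(points, ctx, mean_knn_dist)]
# Score: ratio of mean neighbor density to own density (high = outlying).
scores = compare(density, ctx, lambda m, ms: sum(ms) / len(ms) / m)
```

The point of the pattern is that only `mean_knn_dist` and the final `combine` step are detector-specific; stages 1–3 are shared, so a different detector swaps in different stage functions rather than reimplementing the whole pipeline.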
Thanks for the hint! I've already used that paper as a reference for the different approaches a couple of times ;). I'll investigate how those generalizations could be implemented further down the line. |
If you used that paper as a reference, I would appreciate it if it were included in the references list of the package. ;-)
Hi, I would like to use ELKI to validate our OutlierDetection.jl algorithms. It appears that, for unsupervised algorithms, there is no facility to learn a model from one dataset and predict on another, right? Has this been done somewhere previously?