
Unsupervised models and prediction #99

Open
davnn opened this issue Jan 22, 2022 · 7 comments

Comments


davnn commented Jan 22, 2022

Hi, I would like to use ELKI to validate our OutlierDetection.jl algorithms. It appears that for unsupervised algorithms there is no facility to learn a model from data and predict on another dataset, right? Has this been done somewhere previously?


kno10 commented Jan 22, 2022

Most unsupervised learning does not consider prediction as part of the task.
For the simplest models, such as k-means or kNN, prediction is possible.
But for most it is not possible in a consistent way. This includes common models such as the local outlier factor.
E.g., sklearn implements a "predict" function for LOF, but it will "predict" different values than the local outlier factor would usually assign to these points. Similarly, affinity propagation is a closed-world approach, and "predicting" using the nearest neighbor (as done in sklearn, if I recall correctly) yields worse results and ignores much of the algorithm's idea; instead it mixes it with a k-means-like approach (and then the algorithm ends up performing not much better than k-means).
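The mismatch is easiest to see with the simpler kNN-distance score rather than full LOF: "predicting" on a training point against the entire training set finds the point itself at distance zero, so the scores disagree with the in-sample ones. A minimal sketch in plain NumPy (not sklearn's actual implementation):

```python
import numpy as np

def knn_score_insample(X, k):
    # In-sample score: distance to the k-th nearest neighbor, excluding self.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    D_sorted = np.sort(D, axis=1)   # column 0 is the self-distance 0
    return D_sorted[:, k]           # skip self, take the k-th real neighbor

def knn_score_predict(X_train, X_query, k):
    # Naive "predict": distance to the k-th nearest among ALL training points.
    D = np.linalg.norm(X_query[:, None] - X_train[None, :], axis=-1)
    return np.sort(D, axis=1)[:, k - 1]

X = np.array([[0.0], [1.0], [2.0], [10.0]])
insample = knn_score_insample(X, k=2)     # [2. 1. 2. 9.]
refit = knn_score_predict(X, X, k=2)      # [1. 1. 1. 8.] -- every training
# point finds itself at distance 0, so all "predicted" scores shrink
```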

I suggest not approaching these algorithms from a supervised "fit then predict" mindset at all; it does not do them justice.


davnn commented Jan 22, 2022

Thanks a lot for your feedback!

I suggest not approaching these algorithms from a supervised "fit then predict" mindset at all; it does not do them justice.

I agree that a post-hoc nearest-neighbors approach to enable prediction does not feel right and does not do the algorithms justice. However, I'm mainly interested in unsupervised outlier detection algorithms that can be used for prediction fairly naturally. As far as I can tell, this includes most of the neighbor-based approaches, where you can simply compare a query point to the points in the training set.
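For instance, a plain kNN-distance detector predicts naturally by scoring each query point against the fitted training points only. A sketch in Python/NumPy for brevity (the names are hypothetical, not OutlierDetection.jl's actual API):

```python
import numpy as np

class KNNDetector:
    """Outlier score: distance to the k-th nearest training point."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X):
        # the unsupervised "fit" just stores the reference data
        self.X_ = np.asarray(X, dtype=float)
        return self

    def predict(self, X):
        # score query points against the training points only
        Q = np.asarray(X, dtype=float)
        D = np.linalg.norm(Q[:, None] - self.X_[None, :], axis=-1)
        return np.sort(D, axis=1)[:, self.k - 1]

detector = KNNDetector(k=1).fit([[0.0], [1.0], [2.0]])
scores = detector.predict([[1.5], [8.0]])  # -> [0.5, 6.0]
```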

Edit: I also wouldn't say, for example, that the prediction disturbs the meaning of the LOF score.


kno10 commented Jan 22, 2022

It doesn't disturb the meaning, but the results of prediction will not agree with the results of the original algorithm.
So it is an extension: there is more than one way to do this, and it is not part of the published work.


kno10 commented Jan 22, 2022

And in many cases it will even be better to depart from, e.g., LOF and try to cleanly transfer the key idea into an open-world "train-test split" setting, rather than hack prediction into the original closed-world setting. This makes things both a lot easier and more efficient than trying to stay close to the original LOF.
The problem is, there is no public data set suitable for evaluating any of this in a meaningful way.


kno10 commented Jan 25, 2022

@davnn for implementing a set of outlier detection algorithms in Julia, it may be worth trying to implement the generalized pattern introduced here:

Schubert, E., Zimek, A. & Kriegel, HP. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min Knowl Disc 28, 190–237 (2014). https://doi.org/10.1007/s10618-012-0300-z

This avoids redundancy across the implementations. It is not possible to do this very elegantly in Java right now, because generics cannot be primitives, and most of the time we will be dealing with primitive values (e.g., a local density estimate). There is a partial implementation of this pattern in ELKI for the parallelized versions, but the code has to work around a lot of Java limitations. For a Julia implementation, though, this may be an interesting starting point: approach it as a multi-stage mapping process, with the stages shared across different detectors.
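A rough sketch of that multi-stage view, using a simplified LOF without the reachability distance (stage names and function boundaries are my own, not the paper's terminology or ELKI's API; Python/NumPy stands in for Julia):

```python
import numpy as np

def neighborhoods(X, k):
    # Stage 1 (context): indices of the k nearest neighbors, excluding self.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return np.argsort(D, axis=1)[:, 1:k + 1], D

def local_density(D, nn):
    # Stage 2 (model): inverse mean neighbor distance as a density estimate.
    return 1.0 / np.take_along_axis(D, nn, axis=1).mean(axis=1)

def compare_to_reference(dens, nn):
    # Stage 3 (reference): ratio of the neighbors' mean density to own density.
    return dens[nn].mean(axis=1) / dens

# Stage 4 (normalization) would rescale the scores, e.g. to [0, 1]; omitted.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
nn, D = neighborhoods(X, k=2)
scores = compare_to_reference(local_density(D, nn), nn)
# the isolated point at 10.0 receives by far the largest score
```

Swapping out stage 2 or 3 gives other neighbor-based detectors (e.g., a plain kNN-distance score) while the neighborhood stage is shared.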


davnn commented Jan 25, 2022

Thanks for the hint! I've already used that paper as a reference for the different approaches a couple of times ;). I'll investigate how those generalizations could be implemented further down the line.


kno10 commented Feb 6, 2022

If you used that paper as a reference, I would appreciate if it were included in the references list of the package. ;-)
