
Add PAM algorithm to K-Medoids #73

Merged: 33 commits merged into scikit-learn-contrib:master on Nov 26, 2020

Conversation

@TimotheeMathieu (Contributor)

Following issue #46, here is an implementation of the PAM algorithm.

  • I only implemented the swap part of the algorithm, as we do not need the greedy initialization (the build part) because we already have initialization algorithms available.
  • The PAM algorithm is rather slow; at least, it is slower than the method used until now.
  • I did not find the name of the method used until now. For now I have named it "fast"; if any of you have a better idea, tell me.
  • When used for plot_kmedoids_digits.py, PAM gives the same results as the "fast" method, it is just slower.

PROBLEM:
PAM is very slow: on plot_kmedoids_digits.py, PAM runs in 2min30s against 3s for the "fast" implementation on my computer. Maybe we should implement the FastPAM algorithm from https://arxiv.org/abs/1810.05691, which is used for instance in R (https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/pam, see the pamonce argument).
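
For reference, here is a rough sketch of how this timing comparison could be reproduced (illustrative only; it uses the method names of this PR, "pam" and "fast", and the same preprocessing as the digits example):

import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn_extra.cluster import KMedoids

# Scale the digits data and reduce it to 10 PCA components before clustering.
reduced_data = PCA(n_components=10).fit_transform(scale(load_digits().data))

for method in ("fast", "pam"):
    start = time.perf_counter()
    KMedoids(metric="manhattan", n_clusters=10, method=method).fit(reduced_data)
    print(method, round(time.perf_counter() - start, 2), "s")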

@rth (Contributor) commented Nov 7, 2020

Thanks for looking into this @TimotheeMathieu !

The PAM algorithm is rather slow; at least, it is slower than the method used until now.

Indeed, it's around 1000x slower than the current method on the example:

method    wall_time (s)
pam           16.112913
fast           0.025839

As far as I understand, it's expected that PAM is slower, but I think the current implementation might also be improved performance-wise (more on that in the following comment).

For now I have named it "fast"; if any of you have a better idea, tell me.

Since there is the FastPAM paper, we would probably need a more explicit name; I can't think of anything right now.

We would need to add tests that check that the output is the same between methods, and also ideally check that the inertia is smaller for the problematic cases mentioned in #44 (comment).

Maybe @kno10 would have some suggestions on how to improve this implementation, or might even be willing to contribute the FasterPAM implementation (in a separate PR) :)

@rth (Contributor) commented Nov 7, 2020

I ran a quick profiling of the example:

import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
reduced_data = PCA(n_components=10).fit_transform(data)
n_digits = len(np.unique(digits.target))

KMedoids(metric="manhattan", n_clusters=n_digits, method="pam").fit(reduced_data)

The output is below (profiling screenshots not reproduced here).

So most of the time is indeed spent computing the cost function, and in particular in the min function, which we call 600k times.

I think re-implementing the critical parts in Cython would probably help significantly.

@jeremiedbb might also have some insights. Maybe there are Cython functions in scikit-learn we could reuse.
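
For readers who want to reproduce this kind of profile, here is one possible sketch using the standard-library profiler (the screenshots above appear to come from a different tool, so the exact presentation and numbers will differ):

import cProfile
import pstats

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn_extra.cluster import KMedoids

reduced_data = PCA(n_components=10).fit_transform(scale(load_digits().data))

# Profile a single PAM fit and print the 15 functions with the highest cumulative time.
profiler = cProfile.Profile()
profiler.enable()
KMedoids(metric="manhattan", n_clusters=10, method="pam").fit(reduced_data)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)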

@kno10 (Contributor) commented Nov 7, 2020

The original PAM does not use compute_inertia.

Computing this naively supposedly adds another factor of k to the run time, making it n²k². Correct PAM is n²k, and FasterPAM is n² (all per iteration).

Assuming that you used k=10, a speedup of 10 should hence be easily possible by following the PAM algorithm more closely and computing the change in inertia directly, not the inertia itself. FasterPAM then removes redundant work in computing this change, because only one of the k centers can be (second) closest.
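
To make "computing the change in inertia directly" concrete, here is a rough numpy sketch of the per-swap cost delta (illustrative names, not the code of this PR). With the nearest and second-nearest medoid distances cached, each candidate swap costs O(n), which gives the O(n²k) per-iteration complexity mentioned above:

import numpy as np

def swap_delta(D, j, nearest_medoid_idx, dist_nearest, dist_second, removed_idx):
    """Change in total deviation if the medoid at position `removed_idx` in the
    current medoid list is swapped with the non-medoid point `j` (sketch only).

    D is the full distance matrix, `nearest_medoid_idx` holds, for every point,
    the position of its closest medoid in the medoid list, and `dist_nearest` /
    `dist_second` cache the distances to the closest and second-closest medoids.
    """
    d_j = D[:, j]                                # distance of every point to the candidate
    removed = nearest_medoid_idx == removed_idx  # points currently served by the removed medoid

    # Points losing their medoid are reassigned to the candidate or to their second choice.
    delta = np.sum(np.minimum(d_j[removed], dist_second[removed]) - dist_nearest[removed])
    # All other points change only if the candidate is closer than their current medoid.
    delta += np.sum(np.minimum(d_j[~removed] - dist_nearest[~removed], 0.0))
    return delta

Evaluating all k(n-k) candidate swaps with such a delta and applying the best improving one is essentially the SWAP phase of PAM, without ever recomputing the full inertia.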

@TimotheeMathieu (Contributor Author) commented Nov 7, 2020

@kno10 do you have a reference for the PAM algorithm? The reference I used was the original article (Clustering by means of Medoids) and the Wikipedia page. The Wikipedia page indeed states that PAM's complexity is O(n^2 k^2), which is the algorithm I followed; your version of PAM seems to already be an improvement over the basic PAM algorithm.

Maybe we could keep the basic PAM for completeness's sake, and because it can still be used on small examples, but also implement one of the faster PAM algorithms to make it more usable?

@kno10 (Contributor) commented Nov 7, 2020

Kaufman, L. and Rousseeuw, P.J. (2008). Partitioning Around Medoids (Program PAM). In Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw). doi:10.1002/9780470316801.ch2
Cf. pages 102-103.

Wikipedia states (emphasis added):

The runtime complexity of the original PAM algorithm per iteration of (3) is O(k(n-k)²), by only computing the change in cost. A naive implementation recomputing the entire cost function every time will be in O(n²k²).

Recomputing inertia is the naive approach. Feel free to make this more explicit in the Wikipedia article.

@rth (Contributor) commented Nov 8, 2020

Maybe we could keep the basic PAM for completeness's sake, and because it can still be used on small examples, but also implement one of the faster PAM algorithms to make it more usable?

+1 to have both versions.

@TimotheeMathieu (Contributor Author)

@kno10 thank you for the reference.
I implemented the O(kn^2) PAM algorithm from "Partitioning Around Medoids (Program PAM)" in Finding Groups in Data; however, it is not very fast on the digits dataset. I don't really know why; theoretically it should be faster than the naive implementation, so maybe I made a mistake in my code.

@kno10 (Contributor) commented Nov 8, 2020

Well, the code is all but optimized. Lots of unnecessary data structures, copies, masking, redundancies in the computation of the cases, etc.

The name "fast" for the k-means-style approach is also a bad idea, because it is not FastPAM. It is known in the classic FLP literature as "alternate".

Oh, and don't forget initialization! Your implementation chooses a very poor starting point, so it will also take many additional swaps to optimize - several times longer. You can roughly estimate that about k iterations are necessary from a bad starting condition to reach the quality of the BUILD initialization strategy. BUILD is also an integral part of PAM. (With FasterPAM, because it iterates much faster, a random initialization is possible - BUILD is slower than iterating k times with FasterPAM.)
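
For reference, a rough numpy sketch of the BUILD strategy over a precomputed distance matrix (illustrative only, not the implementation in this PR):

import numpy as np

def build_init(D, k):
    """BUILD-style greedy initialization over a precomputed distance matrix D (sketch)."""
    n = D.shape[0]
    # First medoid: the point with the smallest total distance to all others.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    # Distance from every point to its closest medoid chosen so far.
    d_nearest = D[:, medoids[0]].copy()
    for _ in range(1, k):
        # Gain of adding candidate c: sum over points o of max(d_nearest[o] - D[o, c], 0).
        gains = np.maximum(d_nearest[:, None] - D, 0.0).sum(axis=0)
        gains[medoids] = -np.inf  # never re-pick a current medoid
        best = int(np.argmax(gains))
        medoids.append(best)
        d_nearest = np.minimum(d_nearest, D[:, best])
    return np.array(medoids)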

@TimotheeMathieu (Contributor Author)

OK. I give up on this PR.

@kno10 (Contributor) commented Nov 9, 2020

No, don't! I am not a contributor here (as I use Python primarily for scripting, not for performance-oriented code), and I do think this project needs to eventually have the PAM algorithm.
I did not mean to be rude with "all but optimized", but I'll try to flag some lines above to give you a more precise critique.

@TimotheeMathieu (Contributor Author)

OK, but please keep in mind that contributors work for free and do not all code for a living; please try to use a less aggressive tone.
I also do not write performance-oriented code, but I try to learn.

@rth (Contributor) commented Nov 21, 2020

BTW @kno10 if you are interested in improving the KMedoids documentation (in a separate PR) that would also be very welcome.

@kno10 (Contributor) commented Nov 21, 2020

Maybe make "build" the default, because of quality - "heuristic" does not fare very well, does it? If it was designed close to the Park and Jun approach, I'd suggest naming it that way; "heuristic" is a very unspecific name.
For completeness, you may also want to allow a "build only" use (as suggested in the PAM paper); the alternating approach may improve the quality only very little over build.

@TimotheeMathieu (Contributor Author) commented Nov 21, 2020

I am not necessarily convinced that 'build' should be the default initialization, because it is very non-robust: if there is an outlier, build will choose this outlier as a medoid, and the algorithm would then have difficulties escaping this outlier. In my opinion, one of the big advantages of k-medoids is its robustness, and with build it is not robust (I added a comment about this in the docstring of KMedoids for people who want to use KMedoids on a dataset corrupted by outliers).

This is what I witnessed when trying to use 'build' with outliers; if I am mistaken, please say so and I will remove the comment from the docstring.

Inline review thread on the test data setup:

rng = np.random.RandomState(seed)
X_cc, y_cc = make_blobs(
    n_samples=100,
    centers=np.array([[-1, -1], [1, 1]]),
    ...

Review comment (Contributor):

Maybe we can reduce the cluster_std to 0.2 or some such value? Currently it's at 1.0, and that's why there is an overlap between clusters and we detect some fraction of points in the wrong cluster. Or alternatively, comment that the 0.8 value in the test is due to the data and not to KMedoids itself.

Reply (Contributor Author):

I added a comment to explain. I prefer to keep the clusters not so separable, in order to test the stability of the method at the same time.

@kno10 (Contributor) commented Nov 22, 2020

I am not necessarily convinced that 'build' should be the default initialization, because it is very non-robust: if there is an outlier, build will choose this outlier as a medoid, and the algorithm would then have difficulties escaping this outlier. In my opinion, one of the big advantages of k-medoids is its robustness, and with build it is not robust (I added a comment about this in the docstring of KMedoids for people who want to use KMedoids on a dataset corrupted by outliers).

I understand that an "outlier cluster" may be "undesirable" from a user's intuition.
But do you have an example where it actually is the worse solution, by the objective function?
I doubt that choosing worse starting conditions satisfies some statistical notion of "robustness".

From my understanding and experiments, the initialization should have rather little effect on the average result quality because of the way SWAP operates. Essentially, the initial solution (or a very similar one) chosen by BUILD would often be reached by SWAP from many random initializations anyway. BUILD just speeds this up, usually saving on the order of k iterations of runtime.

The reason why FasterPAM does not use BUILD anymore is that BUILD is much more expensive than the improved swapping. In the experiments, random initialization was then just as good (with the additional benefit of being randomized).

@TimotheeMathieu (Contributor Author)

Indeed, the solution is not worse by the objective function... when you compute it on the dataset with the outlier.

But do you have an example where it actually is the worse solution, by the objective function?

The idea in robust statistics is that outliers should not count, so we would not compute the objective function on the outliers, and in that case, yes, it is easy to find examples.

I doubt that choosing worse starting conditions satisfies some statistical notion of "robustness".

There are several notions of statistical robustness; let me present one of the most widespread informally. Let your data come from a mixture of three Gaussians: two Gaussians are informative data and one is outlier data (see the figure). The outlier Gaussian blob contains only a small number of points. The idea is a sort of continuity of the estimator, i.e. a small change in the law of the data (here, for instance, with respect to the total variation distance between distributions) should not have a big impact on the result of the estimation. Here this means the estimation should return something similar to what it returns when there are no outliers.
For more references on this subject there are numerous books, but a very well-known one is Robust Statistics: The Approach Based on Influence Functions by Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel.

Here is a picture of what happens in practice with KMedoids PAM (notice that SWAP does not succeed in recovering the two inlier Gaussian blobs); there are 97 inliers and 3 outliers. The colors are the clusters found by the algorithm, which is not what I wanted: I wanted to recover the two Gaussian blobs and ignore the outliers (here the outliers are in yellow). In practice, you know for instance that you are searching for two clusters, but there are outliers caused by sensor error, human error...
[figure: clustering found by KMedoids PAM on 97 inliers and 3 outliers; outliers shown in yellow]
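
A hypothetical reconstruction of this scenario (the exact data behind the figure is not given, so the parameter values below are guesses; it assumes the 'build' init and 'pam' method discussed in this PR):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids

rng = np.random.RandomState(0)
# 97 inlier points in two Gaussian blobs, plus 3 far-away outliers.
X_inliers, _ = make_blobs(
    n_samples=97, centers=np.array([[-1, -1], [1, 1]]), cluster_std=0.2, random_state=0
)
X_outliers = rng.uniform(low=8, high=9, size=(3, 2))
X = np.vstack([X_inliers, X_outliers])

# Compare the medoids found with 'build' against another initialization.
for init in ("build", "k-medoids++"):
    km = KMedoids(n_clusters=2, init=init, method="pam", random_state=0).fit(X)
    print(init, km.cluster_centers_)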

rth mentioned this pull request on Nov 26, 2020
@rth (Contributor) commented Nov 26, 2020

OK, so since the documentation will be updated in #81, this PR LGTM.

The only remaining comment I have is #73 (comment)

@rth (Contributor) left a review comment:

Thanks @TimotheeMathieu !

rth changed the title from "Add pam algorithm" to "Add PAM algorithm to K-Medoids" on Nov 26, 2020
rth merged commit d099f66 into scikit-learn-contrib:master on Nov 26, 2020
@rth (Contributor) commented Nov 26, 2020

There is a failure in one of the CI jobs on master:

method = 'alternate', init = 'k-medoids++'

    def test_kmedoid_results(method, init):
        expected = np.hstack([np.zeros(50), np.ones(50)])
        km = KMedoids(n_clusters=2, init=init, method=method)
        km.fit(X_cc)
        # This test use data that are not perfectly separable so the
        # accuracy is not 1. Accuracy around 0.85
>       assert (np.mean(km.labels_ == expected) > 0.8) or (
            1 - np.mean(km.labels_ == expected) > 0.8
        )
E       assert (0.53 > 0.8 or (1 - 0.53) > 0.8)

which is consistent with known issues of the alternate method. We need to fix the random_state in the test.
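
A minimal sketch of what such a fix could look like (the actual change in the test may differ): pass a fixed random_state to the estimator so the 'alternate' run is reproducible.

# Hypothetical sketch of the fix: seed the estimator so the test is deterministic.
km = KMedoids(n_clusters=2, init=init, method=method, random_state=0)
km.fit(X_cc)
assert (np.mean(km.labels_ == expected) > 0.8) or (
    1 - np.mean(km.labels_ == expected) > 0.8
)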

@kasparg commented Nov 26, 2020

👏 Hi folks, just commenting here that this PR is a great improvement and a lifesaver!

@kno10 (Contributor) commented Feb 1, 2021

I have published a Python wrapper for a Rust port of PAM and FasterPAM:
https://pypi.org/project/kmedoids/

On 20newsgroups_vectorized, cosine distance, k=20, Google Colab:

Package                      Algorithm        Runtime [ms]     Loss
sklearn                      cosine               13262.62        —
scikit-learn-extra 0.1.0b2   Alternating           2427.86     5194.06
scikit-learn-extra 0.1.0b2   Alt. Heuristic         668.57     5137.16
kmedoids 0.1.5               Alternating           1705.96     5194.06
kmedoids 0.1.5               FasterPAM              344.97     4996.85
kmedoids 0.1.5               PAM                  33883.32     4994.73
kmedoids 0.1.5               BUILD                 4640.23     4995.06
kmedoids 0.1.5               random init               6.14     5751.77

Averages of 20 iterations. PAM and BUILD are deterministic; all the others are randomly seeded (except "heuristic").
On other data sets, PAM was worse than FasterPAM; this is a give-or-take based on starting conditions and local optima. The best of FasterPAM with 20 restarts was 4994.81, but I intend to rerun this benchmark with shuffled data, as there may well be some order dependency in the 20news data. It will be worth studying whether the data order affects which local optima can be found with random initialization.

I am a bit surprised by how much runtime difference the "heuristic" initialization makes here; yet the quality is still considerably worse than the PAM algorithms. In the FasterPAM experiments, initialization had little effect on quality, so choosing the fastest (random) was reasonable. And you can see that BUILD alone finds better results than Alternating.
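
A minimal usage sketch of that package (from memory; please check the package documentation for the exact API and signatures):

import kmedoids
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics.pairwise import cosine_distances

# Precompute the full cosine dissimilarity matrix, then run FasterPAM with k=20.
X, _ = fetch_20newsgroups_vectorized(return_X_y=True)
diss = cosine_distances(X)
result = kmedoids.fasterpam(diss, 20)
print(result.loss, result.medoids)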

kno10 mentioned this pull request on Feb 2, 2021