-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PAM algorithm to K-Medoids #73
Add PAM algorithm to K-Medoids #73
Conversation
Thanks for looking into this @TimotheeMathieu !
Indeed, it's around 1000x times slower than the current method on the example,
As far as I understand it's expected that PAM is slower, but I think the current implementation might also be improved performance wise (more on that in the following comment).
Since there is FastPAM paper, we would probably need a more explicit name. Can't think of anything right now. We would need to add tests that check that the output is the same between methods, and also ideally check that the inertia is smaller for the problematic cases mentionned in #44 (comment) Maybe @kno10 would have some suggestions how to improve this implementation, or might even be willing to contribute the FastPAM implementation (in a separate PR) :) |
I run a quick profiling of the example, from sklearn_extra.cluster import KMedoids
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
digits = load_digits()
data = scale(digits.data)
reduced_data = PCA(n_components=10).fit_transform(data)
n_digits = len(np.unique(digits.target))
KMedoids(metric="manhattan", n_clusters=n_digits, method="pam").fit(reduced_data) So most of the time is indeed spent computing the cost function and in particular the min function which we call 600k times. I think re-implementing the critical parts in Cython would probably help significantly. @jeremiedbb might also have some insights. Maybe there are Cython functions in scikit-learn we could reuse. |
The original PAM does not use Computing this naively supposedly adds another factor of k to the run time, making it n²k². Correct PAM is n²k, and FasterPAM is n² (all per iteration). Assuming that you used k=10, a speedup of 10 should hence be easily possible by following the PAM algorithm more closely and computing the change in inertia directly, not the intertia. FasterPAM then removes redundant work in computing this change, because only one of the k centers can be (second) closest. |
@kno10 do you have a reference for PAM algorithm ? The ref I used was the original article (Clustering by means of Medoids,) and the Wikipedia page. The Wikipedia page states that indeed, PAM's complexity is O(n^2k^2), which is the algorithm I followed, your version of PAM seems to already be an improvement over the basic PAM algorithm. Maybe we could keep the basic PAM for completness's sake and because it can still be used on small examples but also implement one of the faster pam algorithm to make it more usable? |
Kaufman, L. and Rousseeuw, P.J. (2008). Partitioning Around Medoids (Program PAM). In Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw). doi:10.1002/9780470316801.ch2 Wikipedia states (emphasis added):
Recomputing inertia is the naive approach. Feel free to make this more explicit in the Wikipedia article. |
+1 to have both versions. |
@kno10 thank you for the reference. |
Well, the code is all but optimized. Lots of unnecessary data structures, copies, masking, redundancies in the computation of the cases, etc. The name "fast" for the k-means-style approach is also a bad idea, because it is not FastPAM. Its known in classic FLP literature als "alternate". Oh, and don't forget initialization! Your implementation chooses a very poor starting point, so that will also take many additional swaps to optimize - several times longer. You can roughly estimate there are about k iterations necessary from a bad starting condition to achieve the quality of the BUILD initialization strategy. BUILD is also an integral part of PAM. (With FasterPAM, because it iterates much faster, a random initialization is possible - BUILD is slower than iterating k times with FasterPAM). |
Ok. I give up this PR. |
No, don't! I am not a contributor here (as I use python primarily for scripting, not for performance oriented code), and I do think this project needs to eventually have the PAM algorithm. |
Ok, but please keep in mind that contributor work for free, they do not all code for a living, please try to use a less aggressive tone. |
BTW @kno10 if you are interested in improving the KMedoids documentation (in a separate PR) that would also be very welcome. |
Co-authored-by: Roman Yurchak <[email protected]>
Maybe make "build" the default, because of quality - "heuristic" does not fare very well, does it? If it was designed close to the Park and Jun approach, I'd suggest to name it this way. "heuristic" is a very unspecific name. |
I am not necessarily convinced that 'build' should be the default initialization because it is very non-robust : if there is an outlier, build will choose this outlier as a medoid and the algorithm would then have some difficulties to escape this outliers. In my opinion, one of the big advantages of k-medoids is its robustness and with build it is not robust (I added this comment in the doctring of It is what I witnessed when trying to use 'build' with outliers, if I am mistaken, please say so and I will remove the comment in the docstring. |
rng = np.random.RandomState(seed) | ||
X_cc, y_cc = make_blobs( | ||
n_samples=100, | ||
centers=np.array([[-1, -1], [1, 1]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can reduce the cluster_std
to 0.2 some such value? Currently it's at 1.0 and that's why there is an overlap between clusters and we detect some fraction of points in the wrong cluster. Or alternatively comment that the 0.8 value in test is due to the data and not the the kmedoids itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented to explain. I prefer to keep the clusters not so separable in order to test the stability of the method at the same time.
I understand that an "outlier cluster" may be "undesirable" from a user intuition. From my understanding and experiments, the initialization should have rather little effect on the average result quality because of the way SWAP operates. Essentially, the initial solution (or a very similar solution) chosen by BUILD would often be reached by SWAP from many random initialization. BUILD just speeds this up, reducing the runtime usually by around k-few iterations. The reason why FasterPAM does not use BUILD anymore is because its much more expensive than the improved swapping. In the experiments random initialization then was just as good (with the additional benefit of being randomized). |
OK, so since the documentation will be updated in #81 this PR LGTM. The only remaining comment I have is #73 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @TimotheeMathieu !
There is a failure in one of the CI jobs on master,
which is consistent with known issues of the alternate method. We need to fix the random_state in the test. |
👏 Hi folks, just commenting here that this PR is great improvement and a lifesaver! |
I have published a Python wrapper for a Rust port of PAM and FasterPAM: On 20newsgroups_vectorized, cosine distance, k=20, Google Colab:
Averages of 20 iterations. PAM and BUILD are deterministic; all others are random seeded (except "heuristic"). I am a bit surprised how much runtime difference the "heuristic" initialization makes here; yet the quality is still considerably worse than the PAM algorithms. In the FasterPAM experiments, initialization had little effect on quality, so choosing the fastest (random) was reasonable. And you can see that BUILD alone finds better results than Alternating. |
Following issue #46 here is an implementation of PAM algorithm.
plot_kmedoids_digits.py
, PAM gives the same results as "fast" method, it is just slower.PROBLEM:
PAM is very slow: on
plot_kmedoids_digits.py
PAM runs in 2min30s against 3s for the "fast" implementation on my computer. Maybe we should implement the algorithm FastPAM from https://arxiv.org/abs/1810.05691 which is used for instance in R (https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/pam see argumentpamonce
)