feat: method for clustering new data kmeans added #238

davidbp · 2022-10-28T15:20:54Z

Some applications require training a Kmeans with a few datapoints but then using the fitted model with a large amount of data. Currently there is no method in the package that, given a fitted model and an array, finds the cluster labels for the new data.

alyst

In the end it is just one line of code (see the Distances.pairwise() comment below), but it would be better to have it in the package than to let users rediscover it.
As suggested in the code review comments, please make it more generic, supporting any ClusteringResult subtype and any AbstractMatrix.
I suggest calling it assign_clusters(), although it could potentially also be StatsAPI.predict().

Also, please adjust your code formatting, especially spaces after commas and around operators.
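
For readers following along, the one-liner mentioned above could look roughly like this sketch. It assumes Distances.jl is installed and that data and centers are stored column-wise (one point per column). The helper name `nearest_center` is hypothetical and is not the API discussed in this PR.

```julia
using Distances

# Hypothetical helper: index of the closest center for every column of X.
function nearest_center(X::AbstractMatrix, centers::AbstractMatrix;
                        distance::SemiMetric = SqEuclidean())
    # D[k, n] is the distance between center k and point n
    D = pairwise(distance, centers, X; dims=2)
    return [argmin(col) for col in eachcol(D)]
end

centers = [0.0 10.0;
           0.0 10.0]          # two centers: (0, 0) and (10, 10)
X = [1.0  9.0 -1.0;
     0.5 11.0  0.0]           # three points to label
labels = nearest_center(X, centers)
```

Note that this materializes the full centers-by-points distance matrix, which is the memory cost debated later in the review.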

src/kmeans.jl

test/kmeans.jl

Co-authored-by: Alexey Stukalov <[email protected]>

codecov-commenter · 2022-10-28T21:40:34Z

Codecov Report

Base: 95.18% // Head: 95.15% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (ca29e80) compared to base (82821e8).
Patch coverage: 92.85% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #238      +/-   ##
==========================================
- Coverage   95.18%   95.15%   -0.03%     
==========================================
  Files          16       16              
  Lines        1328     1342      +14     
==========================================
+ Hits         1264     1277      +13     
- Misses         64       65       +1

Impacted Files	Coverage Δ
src/utils.jl	`96.55% <92.85%> (-3.45%)`	⬇️


alyst

Thank you for your fixes! We still need some tweaks, especially the generic assign_clusters() implementation (see the specific comments).

src/utils.jl

alyst · 2022-11-07T07:57:16Z

src/utils.jl

+- `X`: Input data to be clustered.
+- `R`: Fitted clustering result.
+"""
+function assign_clusters(


There seems to be a misunderstanding about how the generic assign_clusters() should be implemented.
In src/utils.jl (here) you should define the generic assign_clusters() method, which should throw a "not implemented" exception, something like:

assign_clusters(X::AbstractMatrix, R::ClusteringResult; kwargs...) = error("assign_clusters(X, R::$(typeof(R))) not implemented")

Your current implementation can only work with R::KmeansResult, first because it uses R.centers, which may not be available for other ClusteringResult subtypes, and second because assigning a point to a cluster based on the distance to the cluster center is valid only for specific clustering types. You should move the best distance-based code you have here back to src/kmeans.jl, where you originally put it, and give it the more specific signature:

assign_clusters(X::AbstractMatrix, R::KmeansResult; distance::SemiMetric = SqEuclidean())

In the end we will have two implementations of the assign_clusters() method: the generic one and the k-means one. The latter is selected automatically for R::KmeansResult because its signature is more specific. For any clustering other than k-means, the generic method throws the "not implemented" exception.

Please let me know if you have any questions regarding this logic.
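
The dispatch logic described in this comment can be illustrated with a small self-contained sketch. All names here (`ToyClusteringResult`, `ToyKmeansResult`, `ToyOtherResult`, `toy_assign`) are hypothetical stand-ins rather than Clustering.jl types, and the squared Euclidean distance is written by hand to avoid external dependencies.

```julia
# Toy stand-ins for the package types (hypothetical names, illustration only)
abstract type ToyClusteringResult end

struct ToyKmeansResult <: ToyClusteringResult
    centers::Matrix{Float64}   # one center per column
end

struct ToyOtherResult <: ToyClusteringResult end

# Generic fallback: selected when no more specific method exists
toy_assign(X::AbstractMatrix, R::ToyClusteringResult; kwargs...) =
    error("toy_assign(X, R::$(typeof(R))) not implemented")

# k-means method: more specific in R, so dispatch prefers it
function toy_assign(X::AbstractMatrix, R::ToyKmeansResult)
    sqdist(x, c) = sum(abs2, x .- c)
    return [argmin([sqdist(x, c) for c in eachcol(R.centers)]) for x in eachcol(X)]
end

R = ToyKmeansResult([0.0 10.0; 0.0 10.0])
X = [1.0 9.0; 0.5 11.0]
kmeans_labels = toy_assign(X, R)     # resolved to the k-means method

# Any other result type falls through to the generic method and errors
fallback_failed = try
    toy_assign(X, ToyOtherResult())
    false
catch err
    err isa ErrorException
end
```

Keyword arguments do not take part in dispatch in Julia, so the choice between the two methods depends only on the type of `R`.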

davidbp · 2023-04-09

Hopefully the new changes address this with a "fallback" implementation (in utils.jl) that throws a not-implemented error:

function assign_clusters(
    X::AbstractMatrix{T},
    R::ClusteringResult;
    distance::SemiMetric = SqEuclidean(),
    pairwise_computation::Bool = true
) where {T}
    if !(typeof(R) <: KmeansResult)
        throw(MethodError(assign_clusters, "NotImplemented: assign_clusters not implemented for R of type $(typeof(R))"))
    end
end

and a specific k-means implementation (in kmeans.jl) that does the computation.

alyst · 2022-11-07T08:23:38Z

src/utils.jl

+    Threads.@threads for n in axes(X, 2)
+        min_dist = typemax(T)
+        cluster_assignment = 0
+
+        for k in axes(R.centers, 2)
+            dist = distance(@view(X[:, n]), @view(R.centers[:, k]))
+            if dist < min_dist
+                min_dist = dist
+                cluster_assignment = k
+            end
+        end
+        cluster_assignments[n] = cluster_assignment
+    end


I have seen your benchmarks (thank you!). I'm still not sure what kind of BLAS you use, or how the numbers change as the number of features or samples grows. Anyway, I still think this is out of scope for Clustering.jl and should be addressed by Distances.jl. I suggest you post your benchmark results in Distances.jl via issues or discussions (with a reference to this PR); I suspect other people may have come across the same issue.

I agree that in some cases the low-memory-footprint method should be preferred, but we cannot make it the default. I am also not a fan of implicit multi-threading: the user might already be calling assign_clusters() from multi-threaded code, and your Threads.@threads for loop would interfere with the intended thread allocation.
Ideally, the problem should be addressed in Distances.jl, and assign_clusters() could pass a keyword argument through to Distances.pairwise() to specify the preferred implementation.

For now, to avoid blocking this PR, please use the pairwise()-based implementation. We should be able to address your particular situation in later PRs once we get feedback from the Distances.jl community.

davidbp · 2023-04-09

Distances.jl does not implement finding the indices of the closest vectors to a set of query vectors. We can either import NearestNeighbors.jl for this or simply add the method I suggested. I have added a boolean flag to choose the implementation, but maybe a string would be better, so that future implementations can be added under sensible names that tell the user what happens underneath.
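
To make the trade-off in this thread concrete, here is a dependency-free sketch of the two strategies, selected by name rather than by a boolean flag. It uses a `Symbol` where the comment above proposes a string. Every name (`assign_by`, `:full_matrix`, `:running_min`, and the helpers) is hypothetical and not part of the PR, and the loop is deliberately single-threaded in line with the review feedback.

```julia
sqeuclidean(x, c) = sum(abs2, x .- c)

# Builds the full k-by-n distance matrix, then takes the argmin of each column.
function assign_full_matrix(X, centers)
    D = [sqeuclidean(view(X, :, n), view(centers, :, k))
         for k in axes(centers, 2), n in axes(X, 2)]
    return [argmin(view(D, :, n)) for n in axes(D, 2)]
end

# Keeps only the running minimum per point, so no distance matrix is stored.
function assign_running_min(X, centers)
    labels = Vector{Int}(undef, size(X, 2))
    for n in axes(X, 2)
        best, best_k = Inf, 0
        for k in axes(centers, 2)
            d = sqeuclidean(view(X, :, n), view(centers, :, k))
            if d < best
                best, best_k = d, k
            end
        end
        labels[n] = best_k
    end
    return labels
end

# Select the implementation by a descriptive name instead of a Bool.
function assign_by(method::Symbol, X, centers)
    method === :full_matrix && return assign_full_matrix(X, centers)
    method === :running_min && return assign_running_min(X, centers)
    throw(ArgumentError("unknown method: $method"))
end

centers = [0.0 10.0; 0.0 10.0]
X = [1.0 9.0 -1.0; 0.5 11.0 0.0]
labels_full = assign_by(:full_matrix, X, centers)
labels_loop = assign_by(:running_min, X, centers)
```

Both strategies return the same labels; they differ only in memory use and in how much work is delegated to a vectorized distance computation.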

alyst · 2022-11-07T08:29:05Z

test/kmeans.jl

@@ -204,4 +204,11 @@ end
    end
 end

+@testset "get cluster assigments" begin


Please also add a testset to test/utils.jl (a new file that should be included from runtests.jl before all others) that checks that assign_clusters(.., R) throws the "not implemented" exception for an arbitrary ClusteringResult object other than KmeansResult, e.g. for KmedoidsResult.

davidbp · 2023-04-09

I've added a test covering the case where assign_clusters has no implementation for a non-k-means ClusteringResult.
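
A test of this kind might look like the following sketch. It relies only on the Test standard library and on hypothetical stand-ins (`DemoResult`, `DemoMedoids`, `demo_assign`); in the actual test file the package's own types and assign_clusters() would take their place.

```julia
using Test

# Hypothetical stand-ins for a result type without a dedicated method
abstract type DemoResult end
struct DemoMedoids <: DemoResult end

# Generic fallback that reports the missing implementation
demo_assign(X::AbstractMatrix, R::DemoResult; kwargs...) =
    error("demo_assign(X, R::$(typeof(R))) not implemented")

@testset "fallback throws for unsupported results" begin
    X = rand(3, 5)
    @test_throws ErrorException demo_assign(X, DemoMedoids())
end
```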

feat: method for clustering new data kmeans added

82e9ab2

alyst requested changes Oct 28, 2022


davidbp and others added 6 commits October 28, 2022 22:00

Update test/kmeans.jl

9cccfb2

Co-authored-by: Alexey Stukalov <[email protected]>

Update src/kmeans.jl

32cc28e

Co-authored-by: Alexey Stukalov <[email protected]>

Update src/kmeans.jl

86bc032

Co-authored-by: Alexey Stukalov <[email protected]>

refactor: update docstring

f33eedd

refactor: move assign clusters to utils

546805d

fix: inputs to cluster_assignment

a11a636

refactor: remove added lines

ca29e80

davidbp requested a review from alyst October 29, 2022 21:20

alyst requested changes Nov 7, 2022


davidbp mentioned this pull request Mar 4, 2023

Trees for non-Metrics? KristofferC/NearestNeighbors.jl#75

Closed

davidbp added 6 commits April 9, 2023 18:31

test: add not implemented check

48a8c0e

test: add test with pairwise and without it

30c98f9

refactor: use specific kmeans method within kmeans

022c2f7

fix: runtest bad import

467b4be

refactor: separate kwargs from other args in method description

7d4b7d4

refactor: rewrite kwargs description in kmeans and utils

d51e5d8

davidbp requested a review from alyst April 11, 2023 10:26
