feat: method for clustering new data kmeans added #238
There's some misunderstanding of how the generic `assign_clusters()` should be implemented.

In `src/utils.jl` (here) you should define the generic `assign_clusters()` method, which should throw a "not implemented" exception.

Your current implementation can only work with `R::KmeansResult`, e.g. because it uses `R.centers`, which might not be available for any other `ClusteringResult` descendant, but also because assigning a point to a cluster based on the distance to its center is valid only for specific clustering types. You should move the best distance-based code you have here back to `src/kmeans.jl`, where you originally put it, and give it a more specific signature there.

So in the end we will have two implementations of the `assign_clusters()` method: the generic one and the k-means one, which would be automatically selected for `R::KmeansResult` because its signature is more specific. For any clustering other than k-means, the "not implemented" exception would be thrown by the generic method.

Please let me know if you have any questions regarding this logic.
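The dispatch logic described above can be sketched as follows. This is a minimal, self-contained illustration and not the actual Clustering.jl code: `ToyKmeansResult` and `ToyOtherResult` are hypothetical stand-ins for the package's result types, the abstract type is redeclared locally so the snippet runs on its own, and the choice of `ErrorException` for the "not implemented" case is an assumption.

```julia
# Toy stand-ins (hypothetical; NOT the real Clustering.jl definitions).
abstract type ClusteringResult end

struct ToyKmeansResult <: ClusteringResult
    centers::Matrix{Float64}   # d×k matrix, one cluster center per column
end

struct ToyOtherResult <: ClusteringResult end

# Generic fallback (would live in src/utils.jl): makes no assumption
# about the fields of R and only reports that no method exists.
function assign_clusters(X::AbstractMatrix{<:Real}, R::ClusteringResult)
    error("assign_clusters() is not implemented for $(typeof(R))")
end

# k-means method (would live in src/kmeans.jl): the signature is more
# specific, so Julia's dispatch selects it for ToyKmeansResult.
function assign_clusters(X::AbstractMatrix{<:Real}, R::ToyKmeansResult)
    size(X, 1) == size(R.centers, 1) ||
        throw(DimensionMismatch("X and the centers have different feature counts"))
    # For every point (column of X), pick the index of the closest center.
    return map(eachcol(X)) do x
        argmin([sum(abs2, x .- c) for c in eachcol(R.centers)])
    end
end
```

Calling `assign_clusters(X, ToyKmeansResult(centers))` runs the distance-based method, while `assign_clusters(X, ToyOtherResult())` falls through to the generic method and raises the error.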
Hopefully the new PRs address this with a "fallback" implementation that returns "not implemented" (in `utils.jl`) and a specific k-means implementation (in `kmeans.jl`) that does the computation.
I have seen your benchmarks (thank you!). I'm still not sure what kind of BLAS you use, and how the numbers change as the number of features or the number of samples grows. Anyway, I still think it is out of `Clustering.jl` scope and should be addressed by Distances.jl. I suggest you show your benchmark results in Distances.jl via issues or discussions (making a reference to this PR); I suspect other people may have come across the same issue.

I agree that in some cases the low-memory-footprint method should be preferred, but we cannot make it the default. I am also not a fan of implicit multi-threading: the user might already be calling `assign_clusters()` from multi-threaded code, and your `Threads.@threads for` would interfere with the anticipated thread allocation.

Ideally, the problem should be addressed in Distances.jl, and `assign_clusters()` could pass a keyword argument through to `Distances.pairwise()` to specify the preferred implementation.

For now, to avoid blocking this PR, please use the `pairwise()`-based implementation. We should be able to address your particular situation in later PRs once we get feedback from the Distances.jl community.
Distances.jl does not implement `find ids of the closest vectors to some query vectors`. We can either import `NearestNeighbors.jl` for this or simply add the method I suggested. I have added a boolean flag to choose the implementation, but maybe using a string would be better? That way, future implementations could be added with sensible names that tell the user what will happen underneath.
Please also add a testset to `test/utils.jl` (it would be a new file that should be included from `runtests.jl` before all others) testing that `assign_clusters(.., R)` throws a "not implemented" exception for an arbitrary `ClusteringResult` object other than `KmeansResult`, e.g. for `KmedoidsResult`.
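Such a testset could be shaped roughly like this. It is a self-contained sketch: `ToyKmedoidsResult` is a hypothetical stand-in for a non-k-means result type, the fallback method is redefined locally, and `ErrorException` is an assumed exception type that the real test would replace with whatever the fallback actually throws.

```julia
using Test

# Toy stand-ins (hypothetical; the real test would use the package's own types).
abstract type ClusteringResult end
struct ToyKmedoidsResult <: ClusteringResult end

# Generic fallback under test.
assign_clusters(X::AbstractMatrix{<:Real}, R::ClusteringResult) =
    error("assign_clusters() is not implemented for $(typeof(R))")

@testset "assign_clusters() generic fallback" begin
    X = rand(3, 5)  # 5 points with 3 features each
    @test_throws ErrorException assign_clusters(X, ToyKmedoidsResult())
end
```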
I've added a test to cover the case where `assign_clusters` does not have an implementation for a non-k-means `ClusteringResult`.