feat: method for clustering new data kmeans added #238

Open
wants to merge 14 commits into
base: master
from
5 changes: 4 additions & 1 deletion src/Clustering.jl
Original file line number Diff line number Diff line change
@@ -65,7 +65,10 @@ module Clustering
Hclust, hclust, cutree,

# MCL
mcl, MCLResult
mcl, MCLResult,

# utils
assign_clusters

## source files

46 changes: 45 additions & 1 deletion src/kmeans.jl
@@ -1,5 +1,5 @@
# K-means algorithm

using Distances
#### Interface

# C is the type of centers, an (abstract) matrix of size (d x k)
@@ -391,3 +391,47 @@ function repick_unused_centers(X::AbstractMatrix{<:Real}, # in: the data matrix
tcosts = min(tcosts, ds)
end
end


"""
assign_clusters(X::AbstractMatrix{<:Real}, R::KmeansResult; kwargs...) -> Vector{Int}

Assign the samples specified as the columns of `X` to the corresponding clusters from `R`.

# Arguments
- `X`: Input data to be clustered.
- `R`: Fitted clustering result.

# Keyword arguments
- `distance`: `SemiMetric` used to compute distances between vectors and cluster centroids.
- `pairwise_computation`: Boolean specifying whether to compute and store pairwise distances.

"""
function assign_clusters(
    X::AbstractMatrix{T},
    R::KmeansResult;
    distance::SemiMetric = SqEuclidean(),
    pairwise_computation::Bool = true) where {T}

    if pairwise_computation
        Xdist = pairwise(distance, X, R.centers, dims=2)
        cluster_assignments = partialsortperm.(eachrow(Xdist), 1)
    else
        cluster_assignments = zeros(Int, size(X, 2))
        Threads.@threads for n in axes(X, 2)
            min_dist = typemax(T)
            cluster_assignment = 0

            for k in axes(R.centers, 2)
                dist = distance(@view(X[:, n]), @view(R.centers[:, k]))
                if dist < min_dist
                    min_dist = dist
                    cluster_assignment = k
                end
            end
            cluster_assignments[n] = cluster_assignment
        end
    end

    return cluster_assignments
end
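
The two branches of the function above should produce the same labels and differ only in memory use. The following is a minimal plain-Julia sketch of that idea, not the PR's code: `sqeuclid`, `assign_pairwise`, and `assign_streaming` are hypothetical names, squared Euclidean distance is hard-coded instead of using Distances.jl, and threading is left out.

```julia
# Squared Euclidean distance between two vectors (stand-in for SqEuclidean()).
sqeuclid(a, b) = sum(abs2, a .- b)

# Variant 1: build the full n x k distance matrix, then take each row's argmin.
function assign_pairwise(X::AbstractMatrix, C::AbstractMatrix)
    D = [sqeuclid(x, c) for x in eachcol(X), c in eachcol(C)]
    return [argmin(row) for row in eachrow(D)]
end

# Variant 2: scan the centers for each sample, keeping only the running minimum.
function assign_streaming(X::AbstractMatrix, C::AbstractMatrix)
    labels = zeros(Int, size(X, 2))
    for n in axes(X, 2)
        best = Inf
        for k in axes(C, 2)
            d = sqeuclid(view(X, :, n), view(C, :, k))
            if d < best
                best = d
                labels[n] = k
            end
        end
    end
    return labels
end

C = [0.0 10.0; 0.0 10.0]            # two centers as columns: (0, 0) and (10, 10)
X = [1.0 9.0 0.5; 1.0 11.0 -0.5]    # three samples as columns

labels_pairwise = assign_pairwise(X, C)
labels_streaming = assign_streaming(X, C)
```

The first variant allocates an n x k matrix, which is convenient for small inputs; the second needs only the output vector, which matters when the number of samples is large.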
30 changes: 29 additions & 1 deletion src/utils.jl
@@ -1,6 +1,6 @@
# Common utilities

##### common types
using Distances

"""
ClusteringResult
@@ -70,3 +70,31 @@ function updatemin!(r::AbstractArray, x::AbstractArray)
end
return r
end


"""
assign_clusters(X::AbstractMatrix{<:Real}, R::ClusteringResult; kwargs...) -> Vector{Int}

Assign the samples specified as the columns of `X` to the corresponding clusters from `R`.

# Arguments
- `X`: Input data to be clustered.
- `R`: Fitted clustering result.

# Keyword arguments
- Cluster specific keyword arguments. For example, see the `assign_clusters` method in
[`kmeans`](@ref) for the description of optional `kwargs`.

"""
function assign_clusters(
Member

There seems to be a misunderstanding about how the generic assign_clusters() should be implemented.
In src/utils.jl (here) you should define the generic assign_clusters() method, which should throw a "not implemented" exception, something like:

assign_clusters(X::AbstractMatrix, R::ClusteringResult; kwargs...) =
    error("assign_clusters(X, R::$(typeof(R))) not implemented")

Your current implementation can only work with R::KmeansResult: it uses R.centers, which might not be available for other ClusteringResult subtypes, and assigning a point to a cluster based on the distance to its center is only valid for specific clustering types. You should move the best distance-based code you have here back to src/kmeans.jl, where you originally put it, and give it the more specific signature:

assign_clusters(X::AbstractMatrix, R::KmeansResult; distance::SemiMetric = SqEuclidean())

In the end we will have two implementations of assign_clusters(): the generic one and the k-means one. The k-means method is selected automatically for R::KmeansResult because its signature is more specific; for any other clustering, the generic method throws the "not implemented" exception.

Please let me know if you have any questions regarding this logic.
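
The dispatch pattern described in this comment can be sketched in a self-contained way. Everything named here is a hypothetical stand-in rather than a Clustering.jl definition: `ToyResult` plays the role of `ClusteringResult`, `ToyKmeans` and `ToyKmedoids` are its subtypes, `toy_assign` plays the role of `assign_clusters`, and squared Euclidean distance is hard-coded.

```julia
# Hypothetical stand-ins for ClusteringResult and two of its subtypes.
abstract type ToyResult end

struct ToyKmeans <: ToyResult
    centers::Matrix{Float64}   # d x k, one center per column
end

struct ToyKmedoids <: ToyResult
    medoids::Vector{Int}
end

# Generic fallback: reached for any result type without its own method.
toy_assign(X::AbstractMatrix, R::ToyResult) =
    error("toy_assign(X, R::$(typeof(R))) not implemented")

# More specific method: dispatch selects it whenever R is a ToyKmeans.
function toy_assign(X::AbstractMatrix, R::ToyKmeans)
    # Index of the nearest center (squared Euclidean) for each column of X.
    return [argmin([sum(abs2, x .- c) for c in eachcol(R.centers)])
            for x in eachcol(X)]
end

centers = [0.0 10.0; 0.0 10.0]       # two centers: (0, 0) and (10, 10)
X = [1.0 9.0 0.5; 1.0 11.0 -0.5]     # three points as columns

labels = toy_assign(X, ToyKmeans(centers))

# The fallback raises an ErrorException for the subtype without a method.
fallback_failed = try
    toy_assign(X, ToyKmedoids([1, 2]))
    false
catch err
    err isa ErrorException
end
```

No type check is needed inside either method: the choice between the fallback and the specific implementation is made entirely by the argument types.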

Author

Hopefully the latest changes address this with a "fallback" implementation (in utils.jl) that throws a not-implemented error:

function assign_clusters(
    X::AbstractMatrix{T}, 
    R::ClusteringResult;
    distance::SemiMetric = SqEuclidean(),
    pairwise_computation::Bool = true) where {T} 

    if !(typeof(R) <: KmeansResult)
        throw(MethodError(assign_clusters,
              "NotImplemented: assign_clusters not implemented for R of type $(typeof(R))"))
    end

end

and a specific kmeans implementation (in kmeans.jl) that does the computation

    X::AbstractMatrix{T},
    R::ClusteringResult;
    distance::SemiMetric = SqEuclidean(),
    pairwise_computation::Bool = true) where {T}

    if !(typeof(R) <: KmeansResult)
        throw(MethodError(assign_clusters,
              "NotImplemented: assign_clusters not implemented for R of type $(typeof(R))"))
    end

end
11 changes: 11 additions & 0 deletions test/kmeans.jl
@@ -204,4 +204,15 @@ end
end
end

@testset "get cluster assignments" begin
Member

Please also add a testset to test/utils.jl (a new file that should be included from runtests.jl before all the others) checking that assign_clusters(.., R) throws the "not implemented" exception for an arbitrary ClusteringResult object other than KmeansResult, e.g. for KmedoidsResult.
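
A test of this kind might look roughly like the sketch below. It uses only the `Test` standard library; `DemoResult`, `DemoMedoids`, and `demo_assign` are hypothetical stand-ins, not Clustering.jl names. One detail to watch: a fallback written with `error(...)` raises `ErrorException`, so the exception type passed to `@test_throws` has to match whatever the fallback actually throws.

```julia
using Test

# Hypothetical stand-ins for ClusteringResult and a non-k-means result type.
abstract type DemoResult end
struct DemoMedoids <: DemoResult end

# Fallback that reports a missing implementation via error().
demo_assign(X::AbstractMatrix, R::DemoResult) =
    error("demo_assign(X, R::$(typeof(R))) not implemented")

@testset "fallback throws for unsupported result types" begin
    X = rand(3, 4)
    # error() raises ErrorException, so that is the type to expect here.
    @test_throws ErrorException demo_assign(X, DemoMedoids())
end
```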

Author

@davidbp, Apr 9, 2023

I've added a test covering the case where assign_clusters has no implementation for a non-k-means ClusteringResult.

X = rand(5, 100)
R = kmeans(X, 10; maxiter=200)
reassigned_clusters = assign_clusters(X, R; pairwise_computation=true);
@test R.assignments == reassigned_clusters

reassigned_clusters2 = assign_clusters(X, R; pairwise_computation=false);
@test R.assignments == reassigned_clusters2

end

end
3 changes: 2 additions & 1 deletion test/runtests.jl
@@ -6,7 +6,8 @@ using SparseArrays
using StableRNGs
using Statistics

tests = ["seeding",
tests = ["utils",
"seeding",
"kmeans",
"kmedoids",
"affprop",
12 changes: 12 additions & 0 deletions test/utils.jl
@@ -0,0 +1,12 @@
using Test
using Clustering
using Distances

@testset "get cluster assignments not implemented method" begin

    X = rand(10,5)
    dist = pairwise(SqEuclidean(), X, dims=2)
    R = kmedoids!(dist, [1, 2, 3])

    @test_throws MethodError assign_clusters(X, R);
end