Add HDBSCAN from HorseML.jl #273

Open

wants to merge 47 commits into base: master

Changes from 22 commits (of 47 total)
23bbed1
[add function or file]add hdbscan
MommaWatasu Mar 22, 2024
fa44398
[test]add test for hdbscan
MommaWatasu Mar 23, 2024
4e64fdf
[fix]change Int64 to Int
MommaWatasu Mar 23, 2024
e294565
[fix]change all Int64 into Int
MommaWatasu Mar 23, 2024
b851997
[change]change usage and remove extra code
MommaWatasu Mar 23, 2024
7822a7c
[test]update test
MommaWatasu Mar 23, 2024
8901cfe
[docs]add docs for HDBSCAN
MommaWatasu Mar 23, 2024
6de5d02
[fix]export HdbscanCluster
MommaWatasu Mar 23, 2024
85c5644
[docs]merge docs of HDBSCAN with DBSCAN.md
MommaWatasu Apr 14, 2024
edcc70a
[clean]refactoring HDBSCANGraph
MommaWatasu Apr 15, 2024
939ce65
[docs]fix docs
MommaWatasu Apr 26, 2024
2f67d07
[clean]refactoring
MommaWatasu Apr 26, 2024
61463f2
[clean]refactoring
MommaWatasu Apr 27, 2024
09ed174
[test]update test
MommaWatasu Apr 27, 2024
6039c0c
[fix]change isnothing into ===
MommaWatasu Apr 27, 2024
3cc7689
[fix]remove println
MommaWatasu Apr 27, 2024
1798148
[docs]update docs
MommaWatasu Apr 27, 2024
a0a819e
[docs]fix docs
MommaWatasu Apr 27, 2024
bf38eb6
[fix]fix docstring
MommaWatasu Apr 27, 2024
380acf1
[fix]add isnoise to the list of exprted function
MommaWatasu Apr 27, 2024
6fdebe6
core_distances() tweaks
alyst Apr 27, 2024
ff98806
Hdbscan: simplify results generation
alyst Apr 27, 2024
64a202a
remove heappush!: unused
alyst Apr 27, 2024
661edbc
hdbscan test: small tweaks
alyst Apr 27, 2024
082c5e6
fixup hdbscan assignments
alyst Apr 27, 2024
c9db368
hdbscan: further opt core_dists
alyst Apr 27, 2024
c205575
hdbscan: optimize edges generation
alyst Apr 27, 2024
bdf1aec
HDBSCANGraph -> HdbscanGraph
alyst Apr 27, 2024
9aa9841
HdbscanEdge
alyst Apr 27, 2024
c9b9374
HdbscanGraph: rename edges to adj_edges
alyst Apr 27, 2024
092ac40
MSTEdge remove no-op expand() method
alyst Apr 27, 2024
c972caa
refactor HdbscanMSTEdge
alyst Apr 27, 2024
57b05e6
hdbscan: fix graph vertices, remove add_edges!
alyst Apr 27, 2024
f616795
hdbscan_minspantree(): refactor
alyst Apr 27, 2024
0f6e992
prune_clusters!(): cleanup
alyst Apr 27, 2024
e01609b
hdbscan: fix 1.0 compat
alyst Apr 27, 2024
de6b83a
[docs]fix docstring
MommaWatasu Apr 27, 2024
38212ca
[clean]rename and refactoring
MommaWatasu Apr 27, 2024
159858d
[test]add test for unionfind
MommaWatasu Apr 27, 2024
a03c224
hdbscan_minspantree: fix edges sorting
alyst Apr 28, 2024
3a577ea
hdbscan_clusters(): fix cost type
alyst Apr 28, 2024
744af22
hdbscan_clusters: improve MST iteration
alyst Apr 28, 2024
8803573
Merge branch 'master' of https://github.com/MommaWatasu/Clustering.jl
MommaWatasu Apr 29, 2024
b94de25
[clean]rename the result structure
MommaWatasu May 1, 2024
7078223
[hotfix]apply hot fix
MommaWatasu May 2, 2024
b0791d9
[test]add test about min_size
MommaWatasu May 2, 2024
dc5dd40
[add function or file]support ClusteringResult
MommaWatasu May 2, 2024
46 changes: 46 additions & 0 deletions docs/source/dbscan.md
@@ -46,3 +46,49 @@ dbscan
DbscanResult
DbscanCluster
```

# HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is similar to DBSCAN, but uses a hierarchical tree to find clusters. The algorithm was proposed in:

> Ricardo J. G. B. Campello, Davoud Moulavi & Joerg Sander
> *Density-Based Clustering Based on Hierarchical Density Estimates*, 2013

## [Algorithm](@id hdbscan_algorithm)
The main procedure of HDBSCAN:
1. calculate the *mutual reachability distance*
2. generate a *minimum spanning tree*
3. build hierarchy
4. extract the target cluster

### Calculate the *mutual reachability distance*
DBSCAN counts the number of points within a certain distance of each point. HDBSCAN works the other way around: first, compute the pairwise distances. Next, define the *core distance* ``core_{ncore}(p)`` of a point ``p`` as the distance to its ncore-th closest point. The mutual reachability distance between points `a` and `b` is then defined as ``d_{mreach-ncore}(a, b) = \max\{ core_{ncore}(a), core_{ncore}(b), d(a, b) \}``, where ``d(a, b)`` is the Euclidean distance (or the specified metric) between `a` and `b`.
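
For illustration, a minimal sketch of this step (assuming `Distances.jl` for the pairwise distances; `mutual_reachability` and its arguments are illustrative, not part of the package API):

```julia
using Distances

# Sketch: mutual reachability distances for a d×n point matrix X.
# The self-distance 0 is counted among the neighbors, mirroring the implementation in this PR.
function mutual_reachability(X::AbstractMatrix, ncore::Integer)
    dists = pairwise(Euclidean(), X, dims=2)                           # n×n pairwise distances
    core = [partialsort(dists[:, i], ncore) for i in axes(dists, 2)]   # ncore-th smallest per column
    n = size(dists, 1)
    return [max(core[i], core[j], dists[i, j]) for i in 1:n, j in 1:n]
end
```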

### Generate a *minimum spanning tree* (MST)
Conceptually, consider the data as a weighted graph with the data points as vertices and an edge between every pair of points, weighted by the mutual reachability distance of those points.

Then walk the list of edges in ascending order of weight: if the two points an edge connects are not yet in the same connected component of the tree, add the edge to the MST. After this processing, all data points are connected by a spanning tree of minimum total weight.

In practice this is very expensive, since the full graph has ``O(n^{2})`` edges.
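
A minimal Prim-style sketch over a dense mutual-reachability matrix (illustrative only; the implementation in this PR works on an adjacency-list graph and keeps candidate edges in a sorted list):

```julia
# Sketch: Prim's algorithm on a dense n×n mutual-reachability matrix `mrd`.
# Returns the MST as a vector of (i, j, weight) edges.
function mst_prim(mrd::AbstractMatrix)
    n = size(mrd, 1)
    intree = falses(n); intree[1] = true
    best = Vector{Float64}(mrd[:, 1])   # cheapest known connection of each vertex to the tree
    parent = fill(1, n)
    edges = Tuple{Int,Int,Float64}[]
    for _ in 2:n
        j = argmin([intree[v] ? Inf : best[v] for v in 1:n])  # cheapest vertex not yet in the tree
        push!(edges, (parent[j], j, best[j]))
        intree[j] = true
        for v in 1:n                                          # relax edges out of the new vertex
            if !intree[v] && mrd[j, v] < best[v]
                best[v] = mrd[j, v]
                parent[v] = j
            end
        end
    end
    return edges
end
```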

### Build hierarchy
The next step is to build the hierarchy based on the MST. This step is simple: repeatedly unite the two clusters joined by the minimum-weight remaining edge, until all clusters are united into one, as sketched below.
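
A sketch of this merging pass (assuming the MST is available as a vector of `(i, j, weight)` tuples; `build_hierarchy` and the plain relabeling array are illustrative, the PR uses a union-find structure instead):

```julia
# Sketch: merge clusters along the MST edges in order of increasing weight.
# labels[p] holds the current cluster id of point p; each merge creates a fresh id.
function build_hierarchy(n::Integer, mst_edges)
    labels = collect(1:n)
    next_id = n
    merges = Tuple{Int,Int,Float64}[]                # (cluster a, cluster b, merge distance)
    for (i, j, w) in sort(mst_edges, by = e -> e[3])
        a, b = labels[i], labels[j]
        next_id += 1
        push!(merges, (a, b, w))
        labels[(labels .== a) .| (labels .== b)] .= next_id   # relabel both merged clusters
    end
    return merges                                    # n-1 merges: the dendrogram skeleton
end
```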

### Extract the target cluster
First we need a different measure than distance to consider the persistence of clusters; instead we will use ``\lambda = \frac{1}{\mathrm{distance}}``.
For a given cluster we can then define ``\lambda_{\mathrm{birth}}`` and ``\lambda_{\mathrm{death}}`` to be the lambda values at which the cluster split off and became its own cluster, and at which (if at all) it split into smaller clusters, respectively.
In turn, for each point ``p`` in a given cluster we can define ``\lambda_{p}`` as the lambda value at which that point ‘fell out of the cluster’, which lies somewhere between ``\lambda_{\mathrm{birth}}`` and ``\lambda_{\mathrm{death}}``: the point either falls out of the cluster at some point during the cluster’s lifetime, or leaves when the cluster splits into two smaller clusters.
Now, for each cluster compute the stability as

``\sum_{p \in \mathrm{cluster}} (\lambda_{p} - \lambda_{\mathrm{birth}})``.

Declare all leaf nodes to be selected clusters, then work up through the tree in reverse topological order. If the sum of the stabilities of the child clusters is greater than the stability of the cluster itself, set the cluster's stability to that sum and leave it unselected. If, on the other hand, the cluster's stability is greater than the sum of its children's stabilities, declare the cluster selected and unselect all of its descendants. Once the root node is reached, the current set of selected clusters is the flat clustering, and it is returned as the result.
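
A compact sketch of this selection rule (the `stability` and `children` vectors are illustrative; nodes are assumed to be ordered so that children precede their parents):

```julia
# Sketch: bottom-up selection of flat clusters from the hierarchy.
function select_clusters(stability::Vector{Float64}, children::Vector{Vector{Int}})
    selected = trues(length(stability))        # leaf nodes start out selected
    function unselect!(i)                      # unselect the whole subtree below node i
        for c in children[i]
            selected[c] = false
            unselect!(c)
        end
    end
    for i in eachindex(stability)
        isempty(children[i]) && continue       # leaves keep their own stability
        child_sum = sum(stability[c] for c in children[i])
        if child_sum > stability[i]
            stability[i] = child_sum           # the parent inherits its children's stability
            selected[i] = false
        else
            unselect!(i)                       # keep this cluster, drop all of its descendants
        end
    end
    return findall(selected)
end
```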

## Interface
The implementation of the *HDBSCAN* algorithm is provided by the [`hdbscan`](@ref) function:
```@docs
hdbscan
HdbscanResult
HdbscanCluster
isnoise
```
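
For example (a minimal usage sketch; the synthetic data and parameter values below are illustrative):

```julia
using Clustering

points = randn(2, 500)             # d×n matrix: 500 random 2-D points, one per column
result = hdbscan(points, 10, 15)   # 10 core neighbors, clusters of at least 15 points

result.assignments                 # cluster index assigned to each point
[length(c) for c in result.clusters]   # number of points in each cluster
```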
4 changes: 4 additions & 0 deletions src/Clustering.jl
@@ -40,6 +40,9 @@ module Clustering
# dbscan
DbscanResult, DbscanCluster, dbscan,

# hdbscan
HdbscanResult, HdbscanCluster, hdbscan, isnoise,

# fuzzy_cmeans
fuzzy_cmeans, FuzzyCMeansResult,

@@ -83,6 +86,7 @@ module Clustering
include("kmedoids.jl")
include("affprop.jl")
include("dbscan.jl")
include("hdbscan.jl")
include("mcl.jl")
include("fuzzycmeans.jl")

303 changes: 303 additions & 0 deletions src/hdbscan.jl
@@ -0,0 +1,303 @@
# HDBSCAN Graph
# edges[i] is the list of edges adjacent to the i-th vertex
# the second element of HDBSCANEdge is the mutual reachability distance.
const HDBSCANEdge = Tuple{Int, Float64}
struct HDBSCANGraph
edges::Vector{Vector{HDBSCANEdge}}
HDBSCANGraph(nv::Integer) = new([HDBSCANEdge[] for _ in 1 : nv])
end

function add_edge!(G::HDBSCANGraph, v1::Integer, v2::Integer, dist::Number)
push!(G.edges[v1], (v2, dist))
push!(G.edges[v2], (v1, dist))
end

Base.getindex(G::HDBSCANGraph, i::Int) = G.edges[i]

struct MSTEdge
    v1::Int
    v2::Int
    dist::Float64
end
expand(edge::MSTEdge) = (edge.v1, edge.v2, edge.dist)
Base.isless(edge1::MSTEdge, edge2::MSTEdge) = edge1.dist < edge2.dist

"""
HdbscanCluster
A cluster generated by the [hdbscan](@ref) method, part of [HdbscanResult](@ref).
- `points`: vector of points that belong to the cluster
- `stability`: stability of the cluster (-1 for noise clusters)

The stability quantifies how "reasonable" the cluster is: the higher the stability, the better the cluster.
Points assigned to the noise cluster do not belong to any cluster and can be ignored.
See also: [isnoise](@ref)
"""
mutable struct HdbscanCluster
parent::Int
children::Vector{Int}
points::Vector{Int}
λp::Vector{Float64}
stability::Float64
children_stability::Float64
function HdbscanCluster(points::Union{Vector{Int}, Nothing})
noise = points === nothing
return new(0, Int[], noise ? Int[] : points, Float64[], noise ? -1 : 0, noise ? -1 : 0)
end
end

Base.length(c::HdbscanCluster) = size(c.points, 1)

"""
isnoise(c::HdbscanCluster)
Returns whether the cluster `c` represents the noise points.
"""
isnoise(c::HdbscanCluster) = c.stability == -1
isstable(c::HdbscanCluster) = c.stability != 0
function increment_stability(c::HdbscanCluster, λbirth)
c.stability += sum(c.λp) - length(c.λp) * λbirth
end

"""
HdbscanResult
Result of the [hdbscan](@ref) clustering.
- `clusters`: vector of clusters
- `assignments`: vector of cluster indices for each point
"""
struct HdbscanResult
clusters::Vector{HdbscanCluster}
assignments::Vector{Int}
end

"""
hdbscan(points::AbstractMatrix, ncore::Integer, min_cluster_size::Integer;
        metric=Euclidean())
Cluster `points` using Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN) algorithm.
Refer to [HDBSCAN algorithm](@ref hdbscan_algorithm) description for the details on how the algorithm works.
# Parameters
- `points`: the *d*×*n* matrix, where each column is a *d*-dimensional point
- `ncore::Integer`: number of *core* neighbors of point, see [HDBSCAN algorithm](@ref hdbscan_algorithm) for a description
- `min_cluster_size::Integer`: minimum number of points in the cluster
- `metric` (defaults to `Euclidean()`): the points distance metric to use.
"""
function hdbscan(points::AbstractMatrix, ncore::Int, min_cluster_size::Int; metric=Euclidean())
if min_cluster_size < 1
throw(DomainError(min_cluster_size, "The `min_cluster_size` must be greater than or equal to 1"))
end
n = size(points, 2)
dists = pairwise(metric, points; dims=2)
#calculate core distances for each point
    core_dists = core_distances(dists, ncore)
#calculate mutual reachability distance between any two points
mrd = hdbscan_graph(core_dists, dists)
#compute a minimum spanning tree by prim method
mst = hdbscan_minspantree(mrd, n)
#build a HDBSCAN hierarchy
hierarchy = hdbscan_clusters(mst, min_cluster_size)
#extract the target cluster
prune_cluster!(hierarchy)
#generate the list of cluster assignment for each point
clusters = HdbscanCluster[]
    assignments = fill(0, n) # cluster index of each point (0 = noise/unassigned)
for (i, j) in enumerate(hierarchy[2n-1].children)
clu = hierarchy[j]
push!(clusters, clu)
assignments[clu.points] .= i
end
# add the cluster of all unassigned (noise) points
push!(clusters, HdbscanCluster(findall(==(0), assignments)))
return HdbscanResult(clusters, assignments)
end

# calculate the core (k-th nearest) distances of the points
function core_distances(dists::AbstractMatrix, k::Integer)
core_dists = Array{Float64}(undef, size(dists, 1))
for i in axes(dists, 2)
core_dists[i] = partialsort!(dists[:, i], k)
end
return core_dists
end

function hdbscan_graph(core_dists::AbstractVector, dists::AbstractMatrix)
n = size(dists, 1)
    graph = HDBSCANGraph(n) # one adjacency list per point
for i in 1 : n-1
for j in i+1 : n
c = max(core_dists[i], core_dists[j], dists[i, j])
add_edge!(graph, i, j, c)
end
end
return graph
end

function hdbscan_minspantree(graph::HDBSCANGraph, n::Integer)
    # keep `h` sorted in descending order so that `pop!` returns the minimum-weight edge
    function heapput!(h, v)
        idx = searchsortedlast(h, v, rev=true)
        insert!(h, idx + 1, v)
    end
end

minspantree = Vector{MSTEdge}(undef, n-1)

marked = falses(n)
nmarked = 1
marked[1] = true

h = MSTEdge[]

for (i, c) in graph[1]
heapput!(h, MSTEdge(1, i, c))
end

while nmarked < n
i, j, c = expand(pop!(h))

marked[j] && continue
minspantree[nmarked] = MSTEdge(i, j, c)
marked[j] = true
nmarked += 1

for (k, c) in graph[j]
marked[k] && continue
heapput!(h, MSTEdge(j, k, c))
end
end
return minspantree
end

function hdbscan_clusters(mst::AbstractVector{MSTEdge}, min_size::Integer)
n = length(mst) + 1
cost = 0
uf = UnionFind(n)
clusters = [HdbscanCluster(min_size > 1 ? Int[i] : Int[]) for i in 1:n]
sort!(mst)

for i in 1 : n-1
j, k, c = expand(mst[i])
cost += c
λ = 1 / cost
#child clusters
c1 = group(uf, j)
c2 = group(uf, k)
#reference to the parent cluster
clusters[c1].parent = clusters[c2].parent = n+i
nc1, nc2 = isnoise(clusters[c1]), isnoise(clusters[c2])
if !(nc1 || nc2)
#compute stability
increment_stability(clusters[c1], λ)
increment_stability(clusters[c2], λ)
#unite cluster
unite!(uf, j, k)
#create parent cluster
points = members(uf, group(uf, j))
push!(clusters, HdbscanCluster(points))
elseif !(nc1 && nc2)
if nc2 == true
(c1, c2) = (c2, c1)
end
#record the lambda value
append!(clusters[c2].λp, fill(λ, length(clusters[c1])))
#unite cluster
unite!(uf, j, k)
#create parent cluster
points = members(uf, group(uf, j))
push!(clusters, HdbscanCluster(points))
else
#unite the noise cluster
unite!(uf, j, k)
#create parent cluster
points = members(uf, group(uf, j))
if length(points) < min_size
push!(clusters, HdbscanCluster(Int[]))
else
push!(clusters, HdbscanCluster(points))
end
end
end
@assert length(clusters) == 2n - 1
return clusters
end

function prune_cluster!(hierarchy::Vector{HdbscanCluster})
for i in 1 : length(hierarchy)-1
if isnoise(hierarchy[i])
c = hierarchy[i]
push!(hierarchy[c.parent].children, i)
hierarchy[c.parent].children_stability += c.stability
else
c = hierarchy[i]
if c.stability > c.children_stability
push!(hierarchy[c.parent].children, i)
hierarchy[c.parent].children_stability += c.stability
else
append!(hierarchy[c.parent].children, c.children)
hierarchy[c.parent].children_stability += c.children_stability
end
end
end
end

# Below are utility functions for building hierarchical trees
heappush!(h, v) = insert!(h, searchsortedfirst(h, v), v)

# Union-Find
# structure for managing disjoint sets
# This structure tracks which sets the elements of a set belong to,
# and allows us to efficiently determine whether two elements belong to the same set.
mutable struct UnionFind
    parent::Vector{Int} # parent[root] is the negative of the tree size
label::Dict{Int, Int}
next_id::Int

function UnionFind(nodes::Integer)
if nodes <= 0
throw(ArgumentError("invalid argument for nodes: $nodes"))
end

        parent = -ones(Int, nodes)
label = Dict([(i, i) for i in 1 : nodes])
new(parent, label, nodes)
end
end

# label of the set which element `x` belong to
group(uf::UnionFind, x) = uf.label[root(uf, x)]
# all elements that have the specified label
members(uf::UnionFind, x::Int) = collect(keys(filter(n->n.second == x, uf.label)))

# root of element `x`
# The root is the representative element of the set
function root(uf::UnionFind, x::Integer)
if uf.parent[x] < 0
return x
else
return uf.parent[x] = root(uf, uf.parent[x])
end
end

# whether element `x` and `y` belong to the same set
function issame(uf::UnionFind, x::Integer, y::Integer)
return root(uf, x) == root(uf, y)
end

function Base.size(uf::UnionFind, x::Integer)
return -uf.parent[root(uf, x)]
end

function unite!(uf::UnionFind, x::Integer, y::Integer)
x = root(uf, x)
y = root(uf, y)
if x == y
return false
end
if uf.parent[x] > uf.parent[y]
x, y = y, x
end
# unite smaller tree(y) to bigger one(x)
uf.parent[x] += uf.parent[y]
uf.parent[y] = x
    uf.next_id += 1
    # assign the fresh label to every member of both merged groups
    for i in vcat(members(uf, uf.label[x]), members(uf, uf.label[y]))
        uf.label[i] = uf.next_id
    end
return true
end
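
# Usage sketch for the UnionFind structure (illustrative, not part of the PR's sources):
#   uf = UnionFind(5)
#   unite!(uf, 1, 2); unite!(uf, 2, 3)
#   issame(uf, 1, 3)            # true: 1, 2 and 3 are now in the same set
#   members(uf, group(uf, 1))   # the points of that set (order not guaranteed)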
19 changes: 19 additions & 0 deletions test/hdbscan.jl
@@ -0,0 +1,19 @@
using Test
using Clustering

@testset "HDBSCAN" begin
# make moons for test
upper_x = [i for i in 0:π/50:π]
lower_x = [i for i in π/2:π/50:3/2*π]
upper_y = sin.(upper_x) .+ rand(50)./10
lower_y = cos.(lower_x) .+ rand(51)./10
data = hcat([lower_x; upper_x], [lower_y; upper_y])'
#test for main function
@test_throws DomainError hdbscan(data, 5, 0)
@test_nowarn result = hdbscan(data, 5, 3)

# tests for result
result = @inferred(hdbscan(data, 5, 3))
@test isa(result, HdbscanResult)
@test sum(length, result.clusters) == 101
end