Add HDBSCAN from HorseML.jl #273

Open: wants to merge 47 commits into base: master

47 commits
23bbed1
[add function or file]add hdbscan
MommaWatasu Mar 22, 2024
fa44398
[test]add test for hdbscan
MommaWatasu Mar 23, 2024
4e64fdf
[fix]change Int64 to Int
MommaWatasu Mar 23, 2024
e294565
[fix]change all Int64 into Int
MommaWatasu Mar 23, 2024
b851997
[change]change usage and remove extra code
MommaWatasu Mar 23, 2024
7822a7c
[test]update test
MommaWatasu Mar 23, 2024
8901cfe
[docs]add docs for HDBSCAN
MommaWatasu Mar 23, 2024
6de5d02
[fix]export HdbscanCluster
MommaWatasu Mar 23, 2024
85c5644
[docs]merge docs of HDBSCAN with DBSCAN.md
MommaWatasu Apr 14, 2024
edcc70a
[clean]refactoring HDBSCANGraph
MommaWatasu Apr 15, 2024
939ce65
[docs]fix docs
MommaWatasu Apr 26, 2024
2f67d07
[clean]refactoring
MommaWatasu Apr 26, 2024
61463f2
[clean]refactoring
MommaWatasu Apr 27, 2024
09ed174
[test]update test
MommaWatasu Apr 27, 2024
6039c0c
[fix]change isnothing into ===
MommaWatasu Apr 27, 2024
3cc7689
[fix]remove println
MommaWatasu Apr 27, 2024
1798148
[docs]update docs
MommaWatasu Apr 27, 2024
a0a819e
[docs]fix docs
MommaWatasu Apr 27, 2024
bf38eb6
[fix]fix docstring
MommaWatasu Apr 27, 2024
380acf1
[fix]add isnoise to the list of exprted function
MommaWatasu Apr 27, 2024
6fdebe6
core_distances() tweaks
alyst Apr 27, 2024
ff98806
Hdbscan: simplify results generation
alyst Apr 27, 2024
64a202a
remove heappush!: unused
alyst Apr 27, 2024
661edbc
hdbscan test: small tweaks
alyst Apr 27, 2024
082c5e6
fixup hdbscan assignments
alyst Apr 27, 2024
c9db368
hdbscan: further opt core_dists
alyst Apr 27, 2024
c205575
hdbscan: optimize edges generation
alyst Apr 27, 2024
bdf1aec
HDBSCANGraph -> HdbscanGraph
alyst Apr 27, 2024
9aa9841
HdbscanEdge
alyst Apr 27, 2024
c9b9374
HdbscanGraph: rename edges to adj_edges
alyst Apr 27, 2024
092ac40
MSTEdge remove no-op expand() method
alyst Apr 27, 2024
c972caa
refactor HdbscanMSTEdge
alyst Apr 27, 2024
57b05e6
hdbscan: fix graph vertices, remove add_edges!
alyst Apr 27, 2024
f616795
hdbscan_minspantree(): refactor
alyst Apr 27, 2024
0f6e992
prune_clusters!(): cleanup
alyst Apr 27, 2024
e01609b
hdbscan: fix 1.0 compat
alyst Apr 27, 2024
de6b83a
[docs]fix docstring
MommaWatasu Apr 27, 2024
38212ca
[clean]rename and refactoring
MommaWatasu Apr 27, 2024
159858d
[test]add test for unionfind
MommaWatasu Apr 27, 2024
a03c224
hdbscan_minspantree: fix edges sorting
alyst Apr 28, 2024
3a577ea
hdbscan_clusters(): fix cost type
alyst Apr 28, 2024
744af22
hdbscan_clusters: improve MST iteration
alyst Apr 28, 2024
8803573
Merge branch 'master' of https://github.com/MommaWatasu/Clustering.jl
MommaWatasu Apr 29, 2024
b94de25
[clean]rename the result structure
MommaWatasu May 1, 2024
7078223
[hotfix]apply hot fix
MommaWatasu May 2, 2024
b0791d9
[test]add test about min_size
MommaWatasu May 2, 2024
dc5dd40
[add function or file]support ClusteringResult
MommaWatasu May 2, 2024
46 changes: 46 additions & 0 deletions docs/source/dbscan.md
@@ -46,3 +46,49 @@ dbscan
DbscanResult
DbscanCluster
```

# HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is similar to DBSCAN, but uses a hierarchical tree to find clusters. The algorithm was proposed in:

> Ricardo J. G. B. Campello, Davoud Moulavi & Joerg Sander
> *Density-Based Clustering Based on Hierarchical Density Estimates*, 2013

## [Algorithm](@id hdbscan_algorithm)
The main procedure of HDBSCAN:
1. calculate the *mutual reachability distance*
2. generate a *minimum spanning tree*
3. build hierarchy
4. extract the target cluster

### Calculate the *mutual reachability distance*
DBSCAN counts the number of points within a fixed distance of each point. HDBSCAN works the other way around. First, calculate the pairwise distances. Next, define the *core distance* ``core_{ncore}(p)`` as the distance from point ``p`` to its ncore-th closest point. The mutual reachability distance between points ``a`` and ``b`` is then defined as ``d_{mreach-ncore}(a, b) = max\{ core_{ncore}(a), core_{ncore}(b), d(a, b) \}``, where ``d(a, b)`` is the Euclidean distance between ``a`` and ``b`` (or another specified metric).
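For illustration, the computation described above can be sketched in plain Julia (the actual implementation uses `pairwise()` from Distances.jl; `mutual_reachability` is just an example name, not part of the package API):

```julia
# Sketch: mutual reachability distances for d×n points (columns are points).
function mutual_reachability(points::AbstractMatrix, ncore::Integer)
    n = size(points, 2)
    # pairwise Euclidean distances (the self-distance 0 is included)
    d = [sqrt(sum(abs2, points[:, i] .- points[:, j])) for i in 1:n, j in 1:n]
    # core distance: distance to the ncore-th closest point
    core = [partialsort(d[:, i], ncore) for i in 1:n]
    # mutual reachability distance for every pair of points
    return [max(core[i], core[j], d[i, j]) for i in 1:n, j in 1:n]
end

pts = [0.0 0.0 3.0;
       0.0 1.0 0.0]          # three 2-D points as columns
mutual_reachability(pts, 2)  # 3×3 symmetric matrix
```

Note that the core distance dominates the plain distance for points in sparse regions, which is what pushes noise points away from dense clusters.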

### Generate a *minimum spanning tree* (MST)
Conceptually what we will do is the following: consider the data as a weighted graph with the data points as vertices and an edge between any two points with weight equal to the mutual reachability distance of those points.

Then, scan the list of edges in ascending order of distance and check whether the points connected by each edge are already in the tree. If not, add the edge to the MST. After this processing, all data points are connected into a spanning tree of minimum total weight.

In practice, this is very expensive since there are ``n^{2}`` edges.

### Build hierarchy
The next step is to build the hierarchy based on the MST. This step is very simple: repeatedly merge the two clusters connected by the minimum-weight remaining edge, until all clusters are united into one.

### Extract the target cluster
First we need a different measure than distance to consider the persistence of clusters; instead we will use ``\lambda = \frac{1}{\mathrm{distance}}``.
For a given cluster we can then define the values ``\lambda_{\mathrm{birth}}`` and ``\lambda_{\mathrm{death}}``: the lambda value when the cluster split off and became its own cluster, and the lambda value (if any) when the cluster split into smaller clusters, respectively.
In turn, for each point ``p`` in a given cluster we can define the value ``\lambda_{p}`` as the lambda value at which that point "fell out of the cluster". This value lies somewhere between ``\lambda_{\mathrm{birth}}`` and ``\lambda_{\mathrm{death}}``, since the point either falls out of the cluster at some point during the cluster's lifetime, or leaves when the cluster splits into two smaller clusters.
Now, for each cluster compute the stability as

``\sum_{p \in \mathrm{cluster}} (\lambda_{p} - \lambda_{\mathrm{birth}})``.

Declare all leaf nodes to be selected clusters. Now work up through the tree in reverse topological order. If the sum of the stabilities of the child clusters is greater than the stability of the cluster, then we set the cluster stability to be the sum of the child stabilities. If, on the other hand, the cluster's stability is greater than the sum of its children's, then we declare the cluster to be a selected cluster and unselect all its descendants. Once we reach the root node we call the current set of selected clusters our flat clustering and return it as the result.
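As a toy numeric illustration of the stability rule (the values are invented for the example and are not package code):

```julia
# Stability of a cluster: sum of (λp - λbirth) over its points.
stability(λs, λbirth) = sum(λ - λbirth for λ in λs)

# A parent cluster is born at λ = 0.5 and splits into two children at λ = 1.0,
# so all four of its points "leave" the parent at λ = 1.0:
parent = stability(fill(1.0, 4), 0.5)    # 2.0
# The children are born at λ = 1.0; their points fall out later:
child1 = stability([1.2, 1.5], 1.0)      # 0.7
child2 = stability([2.0, 2.5], 1.0)      # 2.5

# The children are kept (and the parent's stability is replaced by their sum)
# because together they are more stable than the parent:
child1 + child2 > parent                 # true: 3.2 > 2.0
```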

## Interface
The implementation of the *HDBSCAN* algorithm is provided by the [`hdbscan`](@ref) function:
```@docs
hdbscan
HdbscanResult
HdbscanCluster
isnoise
```
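Based on the docstring added in this PR, usage of the interface would look roughly like the following (a sketch assuming the PR branch is loaded; the exact clusters depend on the random data):

```julia
using Clustering

# 100 random 2-D points as matrix columns
points = randn(2, 100)

# core distance from 5 neighbors, clusters of at least 10 points
result = hdbscan(points, 5, 10)

result.assignments                       # cluster index of each point (0 = noise)
[c.stability for c in result.clusters]   # stability per cluster (-1 = noise cluster)
```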
5 changes: 5 additions & 0 deletions src/Clustering.jl
@@ -40,6 +40,9 @@ module Clustering
# dbscan
DbscanResult, DbscanCluster, dbscan,

# hdbscan
HdbscanResult, HdbscanCluster, hdbscan, isnoise,

# fuzzy_cmeans
fuzzy_cmeans, FuzzyCMeansResult,

@@ -78,11 +81,13 @@ module Clustering

include("utils.jl")
include("seeding.jl")
include("unionfind.jl")

include("kmeans.jl")
include("kmedoids.jl")
include("affprop.jl")
include("dbscan.jl")
include("hdbscan.jl")
include("mcl.jl")
include("fuzzycmeans.jl")

232 changes: 232 additions & 0 deletions src/hdbscan.jl
@@ -0,0 +1,232 @@
# HDBSCAN Graph edge: target vertex and mutual reachability distance.
const HdbscanEdge = Tuple{Int, Float64}

# HDBSCAN Graph
struct HdbscanGraph
adj_edges::Vector{Vector{HdbscanEdge}}

HdbscanGraph(nv::Integer) = new([HdbscanEdge[] for _ in 1 : nv])
end

# Edge of the minimum spanning tree for HDBScan algorithm
const HdbscanMSTEdge = NamedTuple{(:v1, :v2, :dist), Tuple{Int, Int, Float64}}

mutable struct HdbscanNode
parent::Int
children::Vector{Int}
points::Vector{Int}
λp::Vector{Float64}
stability::Float64
children_stability::Float64

function HdbscanNode(points::Vector{Int}; λp::Vector{Float64}=Float64[], children::Vector{Int}=Int[], children_stability::Float64=0.0)
noise = isempty(λp)
return new(0, children, points, λp, noise ? -1 : 0, children_stability)
end
end

cluster_len(c::HdbscanNode) = length(c.points)

"""
HdbscanCluster

A cluster generated by the [`hdbscan`](@ref) method, part of [`HdbscanResult`](@ref).
- `points`: vector of the points that belong to the cluster
- `stability`: stability of the cluster (-1 for noise clusters)

The stability measures how "reasonable" the cluster is: the higher the stability, the better the cluster.
Points in the noise cluster do not belong to any real cluster and may be ignored.
See also: [`isnoise`](@ref)
"""
struct HdbscanCluster
points::Vector{Int}
stability::Float64
end

"""
isnoise(c::HdbscanCluster)

Returns whether the cluster `c` is the noise cluster.
"""
isnoise(c::HdbscanCluster) = c.stability == -1
isnoise(c::HdbscanNode) = c.stability == -1
isstable(c::HdbscanNode) = c.stability != 0
function compute_stability!(c::HdbscanNode, λbirth)
c.stability = sum(c.λp) - length(c.λp) * λbirth
return c.stability
end

"""
HdbscanResult

Result of the [`hdbscan`](@ref) clustering.
- `clusters`: vector of the clusters
- `assignments`: vector of cluster indices for each point (`0` for noise points)
- `counts`: number of points in each cluster

See also: [`HdbscanCluster`](@ref)
"""
struct HdbscanResult <: ClusteringResult
clusters::Vector{HdbscanCluster}
assignments::Vector{Int}
counts::Vector{Int}
end

"""
hdbscan(points::AbstractMatrix, ncore::Integer, min_cluster_size::Integer;
        metric=Euclidean())

Cluster `points` using the Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN) algorithm.
Refer to the [HDBSCAN algorithm](@ref hdbscan_algorithm) description for the details on how the algorithm works.

# Parameters
- `points`: the *d*×*n* matrix, where each column is a *d*-dimensional point
- `ncore::Integer`: the number of *core* neighbors of a point, see the [HDBSCAN algorithm](@ref hdbscan_algorithm) description
- `min_cluster_size::Integer`: the minimum number of points in a cluster
- `metric` (defaults to `Euclidean()`): the point distance metric to use
function hdbscan(points::AbstractMatrix, ncore::Integer, min_cluster_size::Integer; metric=Euclidean())
if min_cluster_size < 1
throw(DomainError(min_cluster_size, "The `min_cluster_size` must be greater than or equal to 1"))
end
n = size(points, 2)
dists = pairwise(metric, points; dims=2)
# calculate core (ncore-th nearest) distance for each point
core_dists = [partialsort!(dists[:, i], ncore) for i in axes(dists, 2)]

# calculate the mutual reachability distance between any two points
mrd = hdbscan_graph(core_dists, dists)
# compute a minimum spanning tree using Prim's algorithm
mst = hdbscan_minspantree(mrd)
# build the HDBSCAN hierarchy
tree = hdbscan_build_tree(mst, min_cluster_size)
# extract the target clusters
extract_clusters!(tree)
#generate the list of cluster assignment for each point
clusters = HdbscanCluster[]
counts = Int[]
assignments = fill(0, n) # cluster index of each point
for (i, j) in enumerate(tree[end].children)
clu = tree[j]
push!(clusters, HdbscanCluster(clu.points, clu.stability))
push!(counts, length(clu.points))
assignments[clu.points] .= i
end
# add the cluster of all unassigned (noise) points
noise_points = findall(==(0), assignments)
isempty(noise_points) || push!(clusters, HdbscanCluster(noise_points, -1))
return HdbscanResult(clusters, assignments, counts)
end

function hdbscan_graph(core_dists::AbstractVector, dists::AbstractMatrix)
n = size(dists, 1)
graph = HdbscanGraph(n)
for i in axes(dists, 2)
i_dists = view(dists, :, i)
i_core = core_dists[i]
for j in i+1:n
c = max(i_core, core_dists[j], i_dists[j])
# add reciprocal edges
push!(graph.adj_edges[i], (j, c))
push!(graph.adj_edges[j], (i, c))
end
end
return graph
end

function hdbscan_minspantree(graph::HdbscanGraph)
# put the edge to the heap (sorted from largest to smallest distances)
function heapput!(h, v::HdbscanMSTEdge)
idx = searchsortedlast(h, v, by=e -> e.dist, rev=true)
insert!(h, idx + 1, v)
end

# initialize the edges heap by putting all edges of the first node (the root)
heap = HdbscanMSTEdge[]
for (i, c) in graph.adj_edges[1]
heapput!(heap, (v1=1, v2=i, dist=c))
end
@assert issorted(heap, by=e -> e.dist, rev=true)

# build the tree
n = length(graph.adj_edges)
minspantree = Vector{HdbscanMSTEdge}()
inmst = falses(n) # if the graph node is in MST
inmst[1] = true # root

while length(minspantree) < n-1
# get the edge with the smallest distance from the heap
(i, j, c) = pop!(heap)

# add j-th node to MST if not there
inmst[j] && continue
push!(minspantree, (v1=i, v2=j, dist=c))
inmst[j] = true

# add adjacent edges of j to the heap
for (k, c) in graph.adj_edges[j]
inmst[k] || heapput!(heap, (v1=j, v2=k, dist=c))
end
end
return sort!(minspantree, by=e -> e.dist)
end

function hdbscan_build_tree(minspantree::AbstractVector{HdbscanMSTEdge}, min_size::Integer)
n = length(minspantree) + 1
cost = 0.0
uf = UnionFind(n)
clusters = [HdbscanNode(Int[i], λp=(min_size==1) ? Float64[Inf] : Float64[]) for i in 1:n]

for (i, (j, k, c)) in enumerate(minspantree)
cost += c
λ = 1 / cost
# ids of the child clusters
c1 = set_id(uf, j)
c2 = set_id(uf, k)
# reference to the parent cluster
clusters[c1].parent = clusters[c2].parent = n+i
# unite the clusters
unite!(uf, j, k)
points = items(uf, set_id(uf, j))
if length(points) < min_size
push!(clusters, HdbscanNode(points))
else
nc1, nc2 = isnoise(clusters[c1]), isnoise(clusters[c2])
if nc1 && nc2
push!(clusters, HdbscanNode(points, λp=fill(λ, length(points))))
elseif nc1 || nc2
# ensure that c1 is the non-noise cluster
if nc1
c1, c2 = c2, c1
end
push!(clusters, HdbscanNode(points, λp=[clusters[c1].λp..., fill(λ, cluster_len(clusters[c2]))...]))
else
# compute the stabilities of the children
c1_stability = compute_stability!(clusters[c1], λ)
c2_stability = compute_stability!(clusters[c2], λ)
# unite the two clusters
push!(clusters, HdbscanNode(points, λp=fill(λ, length(points)), children=[c1, c2], children_stability=c1_stability+c2_stability))
end
end
end
compute_stability!(clusters[end], 1/cost)
@assert length(clusters) == 2n - 1
return clusters
end

function extract_clusters!(tree::Vector{HdbscanNode})
for i in eachindex(tree)
i == length(tree) && continue
c = tree[i]
isempty(c.children) && continue
parent = tree[c.parent]
children_stability = sum(tree[child].stability for child in c.children)
if children_stability > c.stability
filter!(x->x!=i, parent.children)
append!(parent.children, c.children)
end
end
c = tree[end]
children_stability = sum(tree[child].stability for child in c.children)
if children_stability <= c.stability
c.children = Int[length(tree)] # select the root itself as the single flat cluster
end
end
58 changes: 58 additions & 0 deletions src/unionfind.jl
@@ -0,0 +1,58 @@
# Union-Find
# structure for managing disjoint sets
# This structure tracks which set each element belongs to,
# and allows us to efficiently determine whether two elements belong to the same set.
mutable struct UnionFind
parent::Vector{Int} # parent[root] is the negative of the size
label::Dict{Int, Int}
next_id::Int

function UnionFind(nodes::Int)
if nodes <= 0
throw(ArgumentError("invalid argument for nodes: $nodes"))
end

parent = fill(-1, nodes)
label = Dict(i => i for i in 1:nodes)
new(parent, label, nodes)
end
end

# label of the set which element `x` belongs to
set_id(uf::UnionFind, x::Int) = uf.label[root(uf, x)]
# all elements that have the specified label
items(uf::UnionFind, x::Int) = [k for (k, v) in pairs(uf.label) if v == x]

# root of element `x`
# The root is the representative element of the set
function root(uf::UnionFind, x::Int)
if uf.parent[x] < 0
return x
else
return uf.parent[x] = root(uf, uf.parent[x])
end
end

function setsize(uf::UnionFind, x::Int)
return -uf.parent[root(uf, x)]
end

function unite!(uf::UnionFind, x::Int, y::Int)
x = root(uf, x)
y = root(uf, y)
if x == y
return false
end
if uf.parent[x] > uf.parent[y]
x, y = y, x
end
# attach the smaller tree (y) under the bigger one (x)
members = vcat(items(uf, uf.label[x]), items(uf, uf.label[y]))
uf.parent[x] += uf.parent[y]
uf.parent[y] = x
# relabel every member of the merged set
# (relabeling only `y` and the members of `x`'s set would leave
# the other members of `y`'s set with a stale label)
uf.next_id += 1
for i in members
uf.label[i] = uf.next_id
end
return true
end
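A quick demonstration of the union-find semantics (the definitions below are condensed and lightly adapted from this file, so the snippet is self-contained):

```julia
# Minimal union-find with set labels, condensed from unionfind.jl above.
mutable struct UF
    parent::Vector{Int}      # parent[root] stores the negated set size
    label::Dict{Int,Int}     # current set label of every element
    next_id::Int
    UF(n::Int) = new(fill(-1, n), Dict(i => i for i in 1:n), n)
end

# representative element of x's set, with path compression
root(uf::UF, x::Int) = uf.parent[x] < 0 ? x : (uf.parent[x] = root(uf, uf.parent[x]))
set_id(uf::UF, x::Int) = uf.label[root(uf, x)]
items(uf::UF, id::Int) = [k for (k, v) in uf.label if v == id]

function unite!(uf::UF, x::Int, y::Int)
    x, y = root(uf, x), root(uf, y)
    x == y && return false
    uf.parent[x] > uf.parent[y] && ((x, y) = (y, x))  # keep x the bigger tree
    members = vcat(items(uf, uf.label[x]), items(uf, uf.label[y]))
    uf.parent[x] += uf.parent[y]   # attach the smaller tree under x
    uf.parent[y] = x
    uf.next_id += 1                # fresh label for the merged set
    for i in members
        uf.label[i] = uf.next_id
    end
    return true
end

uf = UF(4)
unite!(uf, 1, 2)   # {1, 2} get a fresh set label
unite!(uf, 3, 4)   # {3, 4} likewise
unite!(uf, 1, 3)   # merge both pairs
sort(items(uf, set_id(uf, 1)))   # [1, 2, 3, 4]
```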