Add HDBSCAN from HorseML.jl #273

MommaWatasu · 2024-03-22T16:35:12Z

Add HDBSCAN

I found this issue. I wanted my code to be useful, so I compiled it into hdbscan.jl so that it works as is.
I changed some of my original code to bring it closer to the code of this repo.

Abstract of functions and structures

HDBSCANGraph: used to build the minimum spanning tree
HDBSCANCluster: used to build clusters based on the minimum spanning tree
HDBSCANResult: used to return the result
hdbscan!: the main function that performs HDBSCAN

As I wrote in the comments, many utility functions are just my own conversions from numpy, so I don't know much about them.

Usage

This is the signature of the main function.

hdbscan!(points::AbstractMatrix, k::Int64, min_cluster_size::Int64; gen_mst::Bool=true, mst=nothing)

Parameters

points: the d×n matrix, where each column is the d-dimensional coordinates of a point
k: the "core distance" of point A is defined as the distance between point A and its k-th nearest neighbor
min_cluster_size: minimum number of points in a cluster
gen_mst: whether to generate the minimum spanning tree or not
mst: when mst is specified and gen_mst is false, a new MST won't be generated
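
To make the role of k concrete, here is a minimal sketch of the core-distance computation for a d×n points matrix. It is not the code from this PR: `core_distances` is a hypothetical helper name, Euclidean distance is assumed, and the point itself is assumed not to count as its own neighbor (conventions differ between implementations).

```julia
# Sketch only: `core_distances` is a hypothetical helper, not part of hdbscan.jl.
# `points` is d×n (one point per column); Euclidean distance is assumed.
function core_distances(points::AbstractMatrix{<:Real}, k::Integer)
    n = size(points, 2)
    1 <= k < n || throw(ArgumentError("k must satisfy 1 <= k < n, got k=$k, n=$n"))
    core = Vector{Float64}(undef, n)
    dists = Vector{Float64}(undef, n)
    for i in 1:n
        # distances from point i to every point (including itself)
        for j in 1:n
            dists[j] = sqrt(sum(abs2, view(points, :, i) .- view(points, :, j)))
        end
        # dists[i] == 0 is the point itself, so the k-th neighbor
        # is the (k+1)-th smallest entry
        core[i] = partialsort!(dists, k + 1)
    end
    return core
end

# Four 1-dimensional points at x = 0, 1, 3, 7 (a 1×4 matrix)
points = [0.0 1.0 3.0 7.0]
core1 = core_distances(points, 1)  # distance to the nearest neighbor
core2 = core_distances(points, 2)  # distance to the second-nearest neighbor
```

A larger k makes the core distance of isolated points grow faster than that of points in dense regions, which is why k acts as a smoothing parameter for the density estimate.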

Example

I checked that the following code works:

# include hdbscan.jl before running this code
using CSV
using DataFrames
using Plots
# load the sample points as a matrix (the path is specific to the author's machine)
data = CSV.read("/home/watasu/Documents/code/HorseML.jl/test/data/clustering2.csv", DataFrame) |> Matrix
result = hdbscan!(data, 5, 3)
plot(title = "Clustering by HDBSCAN")
labels = result.labels
# draw one scatter series per label, starting from label -1
for i in -1 : maximum(labels)
    X = data[findall(labels .== i), :]
    plot!(X[:, 1], X[:, 2], st=:scatter)
end
plot!()

codecov-commenter · 2024-03-22T16:40:07Z

Codecov Report

Attention: Patch coverage is 97.22222%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.56%. Comparing base (b4df21a) to head (dc5dd40).
Report is 9 commits behind head on master.

Files	Patch %	Lines
src/hdbscan.jl	97.39%	3 Missing ⚠️
src/unionfind.jl	96.55%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #273      +/-   ##
==========================================
+ Coverage   95.40%   95.56%   +0.15%     
==========================================
  Files          20       22       +2     
  Lines        1503     1647     +144     
==========================================
+ Hits         1434     1574     +140     
- Misses         69       73       +4

alyst · 2024-03-22T17:08:54Z

@MommaWatasu Thanks for the PR! I think it would be a useful addition to the package. I will try to review it soon. Meanwhile -- it looks like there are no unit tests for the method. Could you please add some?

MommaWatasu · 2024-03-23T00:48:38Z

I added some tests for hdbscan, but I'm sorry that I wasn't able to write unit tests for the utility functions.
If you want to write tests for them, you should check scipy for their specification.
They come from:

erf: scipy.special.erf
logpdf: I think the original code is this

alyst · 2024-03-23T10:13:31Z

@MommaWatasu Thanks for adding unit tests. Generally we don't test internal utility functions, we only test public API, but aim to cover both meaningful examples (small datasets with nontrivial clusterings) and corner cases (e.g. single data point).
As for the utility functions: erf/erfc are available in Distributions.jl, which Clustering.jl already depends on, so you should use those instead. pdf/logpdf are also declared there.

MommaWatasu · 2024-03-23T14:46:48Z

I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated the tests. I also added simple docs, since the Documentation workflow failed without them.

alyst · 2024-03-23T19:58:07Z

> I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated the tests.

Great, erf is provided by SpecialFunctions.jl, but as we want to be conservative on the number of package dependencies, it is convenient that we don't need it.

alyst · review, changes requested Apr 13, 2024

Thanks again for the PR!
The first iteration looks good -- we don't need much refactoring, mostly some method renames and object field tweaks for clarity.

Of the bigger things:

check the dimensions of the points matrix
use the Distances.jl API to calculate the point distances (see the other methods, e.g. clustering_quality(), for how we do it); it is worth precalculating all pairwise distances and passing them to the methods that calculate core distances and build the graph. We can also allow the user to specify the metric as a kwarg to hdbscan, which would be a useful generalization.
add more tests to check that the point assignments to the clusters are correct
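
The first two points can be illustrated with a small dependency-free sketch. It is only an illustration under stated assumptions: `pairwise_dists`, `core_dists_from`, `euclidean_dist` and `manhattan` are hypothetical names, and `metric` is a plain function here, whereas in the package the metric would be a Distances.jl metric object and the matrix would presumably come from that package's `pairwise`.

```julia
# Hypothetical sketch: the names and the function-valued `metric` kwarg are
# illustrative only and do not reflect the Clustering.jl API.
euclidean_dist(a, b) = sqrt(sum(abs2, a .- b))

# Validate the d×n input, then precompute all pairwise distances once.
function pairwise_dists(points::AbstractMatrix{<:Real}; metric=euclidean_dist)
    d, n = size(points)
    d >= 1 || throw(ArgumentError("points must have at least one dimension"))
    n >= 2 || throw(ArgumentError("need at least 2 points, got $n"))
    D = zeros(Float64, n, n)
    for j in 1:n, i in 1:j-1
        # the matrix is symmetric, so each pair is computed only once
        D[i, j] = D[j, i] = metric(view(points, :, i), view(points, :, j))
    end
    return D
end

# Core distances read off the precomputed matrix: the k-th neighbor of
# point i, assuming the point itself (distance 0) is not counted.
core_dists_from(D::AbstractMatrix, k::Integer) =
    [partialsort(view(D, :, i), k + 1) for i in 1:size(D, 2)]

# Three 2-dimensional points: (0,0), (3,4), (0,1)
points = [0.0 3.0 0.0;
          0.0 4.0 1.0]
D = pairwise_dists(points)

# Swapping the metric only changes how D is filled
manhattan(a, b) = sum(abs, a .- b)
D1 = pairwise_dists(points; metric=manhattan)
core1 = core_dists_from(D1, 1)
```

Because both the core distances and the graph edges are derived from the same matrix, computing it once avoids recomputing every pairwise distance in each step.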

MommaWatasu · 2024-04-27T15:00:05Z

@alyst
I applied all your suggestions. Is there anything else I have to fix?

alyst · 2024-04-27T21:53:38Z

> I applied all your suggestions. Is there anything else I have to fix?

@MommaWatasu Thank you for adjusting the PR! We are getting close to being able to merge, but we need one more iteration.

Note that I have pushed some adjustments to the code directly to your branch, so please make sure to pull them first.

TODO items:

At the moment, the tests you have do not really check whether the clusters are correct. In fact, all the points are currently assigned to the noise cluster (I think this was also the case before my changes).
Please add test(s) that check that the assignments/clusters are correct. Ideally, we need to test a more complex clustering (more than 2 clusters); it would also be nice to test that changing ncore or min_cluster_size affects the result.
Now that I understand the algorithm a bit better, it looks like HdbscanCluster is an internal structure, and it is a HdbscanTree node rather than the cluster you return to the user. It contains fields like stability, children, etc.: some of them we should not expose to the user, and others you don't really set when you are preparing the resulting clusters. I suggest that you rename this structure to HdbscanNode (a non-exported structure for the algorithm) and create a new one, HdbscanCluster (the exported one returned to the user). If there are any properties of the cluster, such as stability or being noise, please make sure to add the relevant fields to HdbscanCluster and properly initialize them when you are generating the result.
HdbscanResult should inherit from ClusteringResult and support its API (in particular, counts is missing).
Move UnionFind to a separate source file, unionfind.jl. I'm not 100% sure it belongs in Clustering.jl. We could potentially use DataStructures.jl, but this code looks rather compact, so I think I prefer keeping it over depending on another big package.
Clean up the UnionFind terminology. I think set_id/issameset/items would be an optimal choice.
Add unit tests for UnionFind, e.g. that finding the root, issameset, and unite! work as expected.
Add newlines in the docstrings to separate the declaration from the description.
Add newlines that separate struct fields from the inner constructor.
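
For the UnionFind items, a compact disjoint-set sketch using the suggested set_id/issameset/unite! names might look like the following. This is an assumption-laden illustration, not the contents of the PR's unionfind.jl: the field names and the union-by-rank and path-compression choices are mine, and `items` is left out.

```julia
# Hypothetical sketch of a disjoint-set structure; not the PR's unionfind.jl.
struct UnionFind
    parent::Vector{Int}  # parent[i] == i marks a root
    rank::Vector{Int}    # upper bound on the height of the tree rooted at i

    UnionFind(n::Integer) = new(collect(1:n), zeros(Int, n))
end

# Root (set identifier) of the set containing x, with path compression
function set_id(uf::UnionFind, x::Integer)
    root = x
    while uf.parent[root] != root
        root = uf.parent[root]
    end
    # second pass: point every node on the path directly at the root
    while uf.parent[x] != root
        next = uf.parent[x]
        uf.parent[x] = root
        x = next
    end
    return root
end

issameset(uf::UnionFind, x::Integer, y::Integer) = set_id(uf, x) == set_id(uf, y)

# Merge the sets containing x and y; returns false if they were already merged
function unite!(uf::UnionFind, x::Integer, y::Integer)
    rx, ry = set_id(uf, x), set_id(uf, y)
    rx == ry && return false
    if uf.rank[rx] < uf.rank[ry]
        rx, ry = ry, rx  # attach the shallower tree under the deeper one
    end
    uf.parent[ry] = rx
    uf.rank[rx] == uf.rank[ry] && (uf.rank[rx] += 1)
    return true
end

uf = UnionFind(5)
unite!(uf, 1, 2)
unite!(uf, 4, 5)
before = issameset(uf, 2, 4)  # sets {1,2} and {4,5} are still separate
merged = unite!(uf, 2, 5)     # joins them into {1,2,4,5}
after = issameset(uf, 1, 4)
```

The usage at the end doubles as a template for the unit tests requested above: check set_id on fresh singletons, then issameset before and after unite!.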

MommaWatasu · 2024-05-02T02:20:55Z

@alyst I have one thing to apologize for. I found that the clustering result went wrong because of a serious mistake I made in the algorithm. I fixed the algorithm and checked that the result is correct. In addition, all the TODO items are done (but I couldn't add a unit test for ncore because I don't know how it affects the result).

I would appreciate it if you could check for any performance issues regarding the fixed algorithm.

sztal · 2024-08-30T22:21:59Z

Hi, thanks for the effort of bringing HDBSCAN to Clustering.jl! I wonder what is the current status of this issue?

[add function or file]add hdbscan

23bbed1

make PR

MommaWatasu added 3 commits March 23, 2024 09:09

[test]add test for hdbscan

fa44398

there was no test for it

[fix]change Int64 to Int

4e64fdf

test failed on a ubuntu-latest-x86

[fix]change all Int64 into Int

e294565

tests still fails on ubuntu-latest-x86

MommaWatasu added 4 commits March 23, 2024 22:22

[change]change usage and remove extra code

b851997

there were functions for Xmeans

[test]update test

7822a7c

deleted some utility functions

[docs]add docs for HDBSCAN

8901cfe

Documentation workflow failed due to the docs for it

[fix]export HdbscanCluster

6de5d02

I forgot to export it

alyst requested changes Apr 13, 2024

MommaWatasu added 12 commits April 15, 2024 08:54

[docs]merge docs of HDBSCAN with DBSCAN.md

85c5644

there is no need to create new file

[clean]refactoring HDBSCANGraph

edcc70a

add comment and alias

[docs]fix docs

939ce65

hdbscan.md remains

[clean]refactoring

2f67d07

the progress is temporary

[clean]refactoring

61463f2

add comment and improve performance, etc.

[test]update test

09ed174

changes sugges alyst

[fix]change isnothing into ===

6039c0c

error occured with Julia1.10

[fix]remove println

3cc7689

forgot to remove debugging code

[docs]update docs

1798148

add description for detail

[docs]fix docs

a0a819e

fix links

[fix]fix docstring

bf38eb6

add space

[fix]add isnoise to the list of exprted function

380acf1

make isnoise available for user

alyst reviewed Apr 27, 2024 (four review comments on src/hdbscan.jl, all now outdated)

alyst added 13 commits April 27, 2024 13:41

remove heappush!: unused

64a202a

hdbscan test: small tweaks

661edbc

fixup hdbscan assignments

082c5e6

hdbscan: further opt core_dists

c9db368

we don't really need a function, this is one line operation

hdbscan: optimize edges generation

c205575

HDBSCANGraph -> HdbscanGraph

bdf1aec

for consistency

HdbscanEdge

9aa9841

HdbscanGraph: rename edges to adj_edges

c9b9374

and remove Base.getindex()

MSTEdge remove no-op expand() method

092ac40

refactor HdbscanMSTEdge

c972caa

hdbscan: fix graph vertices, remove add_edges!

57b05e6

add_edges!() is used only once

hdbscan_minspantree(): refactor

f616795

prune_clusters!(): cleanup

0f6e992

alyst and others added 12 commits April 27, 2024 15:08

hdbscan: fix 1.0 compat

e01609b

replace eachcol with alternatives

[docs]fix docstring

de6b83a

add newline

[clean]rename and refactoring

38212ca

cleanup UnionFind terminology and move it to the other file

[test]add test for unionfind

159858d

there wes no test for it

hdbscan_minspantree: fix edges sorting

a03c224

hdbscan_clusters(): fix cost type

3a577ea

hdbscan_clusters: improve MST iteration

744af22

Merge branch 'master' of https://github.com/MommaWatasu/Clustering.jl

8803573

[clean]rename the result structure

b94de25

rename HdbscanCluster into HdbscanNode and create new one to expose to the user

[hotfix]apply hot fix

7078223

the algorithm went wrong

[test]add test about min_size

b0791d9

ensure that `min_size` effects properly

[add function or file]support ClusteringResult

dc5dd40

add counts field into HdbscanResult
