Running HDBSCAN on data with duplicates #70

Open
azizkayumov opened this issue Jul 19, 2024 · 0 comments

Comments

Contributor

azizkayumov commented Jul 19, 2024

HDBSCAN's flat clustering results rely on the cluster stability calculation, which is prone to producing erroneous partitions when the data contains duplicate objects. Duplicates can give some data objects a core distance of zero, which makes the mutual reachability distance between those duplicates zero as well.

The current implementation handles this case as follows to avoid a division by zero:

        let info = mst[node - n];
        let lambda = if info.2 > A::zero() {
            A::one() / info.2
        } else {
            A::max_value()
        };

Python HDBSCAN uses infinity instead:

        children = hierarchy[node - num_points]
        left = <np.intp_t> children[0]
        right = <np.intp_t> children[1]
        if children[2] > 0.0:
            lambda_value = 1.0 / children[2]
        else:
            lambda_value = INFTY

When the lambda values are set to infinity, extracting flat clusters from the cluster hierarchy is meaningless: a cluster that contains infinite lambdas gets an unbounded stability, so it is always selected during the cluster stability comparison in the hierarchy.
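To illustrate the point above, here is a minimal sketch of the stability comparison. It is not this crate's code: `stability`, `lambda_from_distance` and `parent_wins` are hypothetical helpers, the stability formula is the usual sum of (lambda_p - lambda_birth) over the cluster's points, and the selection rule is simplified to a single parent with two children.

```rust
/// Lambda for a merge at `distance`, mirroring the Python fallback to infinity.
/// (Hypothetical helper, not part of this crate.)
fn lambda_from_distance(distance: f64) -> f64 {
    if distance > 0.0 {
        1.0 / distance
    } else {
        f64::INFINITY
    }
}

/// Stability of one cluster: sum over its points of (lambda_p - lambda_birth),
/// where lambda_p is the lambda at which point p leaves the cluster.
fn stability(lambda_birth: f64, point_lambdas: &[f64]) -> f64 {
    point_lambdas.iter().map(|l| l - lambda_birth).sum()
}

/// Simplified selection rule (assumed): the parent is kept only if it is
/// at least as stable as its two children combined.
fn parent_wins(parent: f64, left: f64, right: f64) -> bool {
    parent >= left + right
}
```

With two duplicates merged at distance zero, `stability(0.5, &[lambda_from_distance(0.0), lambda_from_distance(0.0)])` is infinite, so `parent_wins` returns false for any finite parent stability, no matter how large.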

This behavior depends on the choice of minPts: as long as no core distance is zero, the flat clustering results should not be affected.
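A small sketch of the minPts dependence, using 1-D points for brevity. The helpers `core_distance` and `mutual_reachability` are hypothetical, and I am assuming the convention where the core distance is the distance to the minPts-th nearest neighbor counting the point itself; other implementations may count differently. `min_pts` must be between 1 and the number of points.

```rust
/// Core distance of point `i`: distance to its `min_pts`-th nearest neighbor,
/// counting the point itself (assumed convention).
fn core_distance(points: &[f64], i: usize, min_pts: usize) -> f64 {
    let mut dists: Vec<f64> = points.iter().map(|p| (p - points[i]).abs()).collect();
    // Distances are finite here, so partial_cmp cannot fail.
    dists.sort_by(|a, b| a.partial_cmp(b).unwrap());
    dists[min_pts - 1]
}

/// Mutual reachability distance: max(core(a), core(b), d(a, b)).
fn mutual_reachability(points: &[f64], a: usize, b: usize, min_pts: usize) -> f64 {
    let d = (points[a] - points[b]).abs();
    d.max(core_distance(points, a, min_pts))
        .max(core_distance(points, b, min_pts))
}
```

For the points `[1.0, 1.0, 1.0, 5.0, 9.0]`, minPts = 3 gives the three duplicates a core distance of zero and a mutual reachability distance of zero between them. Raising minPts to 4 makes both values 4.0, so no lambda is infinite.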

Do you think it would be a good idea to warn users about this behavior? The original Java implementation of HDBSCAN prints a warning message advising the user to increase minPts. The Python version is silent about this (and so is this Rust version), which may lead users to trust the flat clustering results, or confuse them into searching for alternatives.
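One possible shape for such a warning, as a rough sketch only: `zero_core_distance_warning` is a hypothetical function, not an existing API of this crate, and the message wording is mine rather than the Java implementation's. It assumes the core distances are already available as a slice.

```rust
/// Returns a warning message if any core distance is exactly zero,
/// otherwise `None`. (Hypothetical helper; the caller decides how to report it.)
fn zero_core_distance_warning(core_distances: &[f64], min_pts: usize) -> Option<String> {
    let zeros = core_distances.iter().filter(|&&d| d == 0.0).count();
    if zeros == 0 {
        return None;
    }
    Some(format!(
        "{} of {} points have a zero core distance (duplicates?); \
         flat clustering may be unreliable, consider increasing minPts (currently {})",
        zeros,
        core_distances.len(),
        min_pts
    ))
}
```

Returning an `Option<String>` instead of printing directly would let callers choose between logging, printing to stderr, or ignoring the message.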
