HDBSCAN not available #266

Open
lakonis opened this issue Mar 22, 2023 · 6 comments
@lakonis

lakonis commented Mar 22, 2023

Hello, I have the following packages installed under Python 3.7.16:

tensorflow                     2.5.0
numpy                          1.19.5
hdbscan                        0.8.24
pixplot                        0.0.113

Yet pixplot gives me the following output when processing my dataset and metadata CSV:

2023-03-22 17:53:45.147839: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-22 17:53:45.147862: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-22 17:53:46.145373: HDBSCAN not available; using sklearn KMeans
2023-03-22 17:53:49.159517: CUML not available; using umap-learn UMAP
2023-03-22 17:53:49.159901: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-22 17:53:49.161109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-03-22 17:53:49.161125: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-22 17:53:49.161142: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (nicolas-hpeb830g8): /proc/driver/nvidia/version does not exist
2023-03-22 17:53:49.161469: I tensorflow/core/common_runtime/direct_session.cc:361] Device mapping: no known devices.

I don't understand the errors, nor why HDBSCAN is not available.

Thanks for your help!

@pleonard212
Owner

Interesting -- if you start Python and try:

import hdbscan

...do you get no response (which is good!) or an error?

@lakonis
Author

lakonis commented Mar 23, 2023

An error, indeed:

> python                                                                                                
Python 3.7.16 (default, Mar 22 2023, 16:00:53) 
[GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hdbscan
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages/hdbscan/__init__.py", line 1, in <module>
    from .hdbscan_ import HDBSCAN, hdbscan
  File "/home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 21, in <module>
    from ._hdbscan_linkage import (single_linkage,
  File "hdbscan/_hdbscan_linkage.pyx", line 1, in init hdbscan._hdbscan_linkage
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

So it has something to do with numpy.
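The "HDBSCAN not available" log line gives no traceback, which suggests the import failure is being caught somewhere and replaced by the KMeans fallback. A minimal sketch for surfacing the underlying exception is below. The helper name `check_import` is hypothetical and not part of pixplot or hdbscan; it only wraps the standard `importlib.import_module` call.

```python
import importlib
import traceback


def check_import(module_name):
    """Try to import a module and return (ok, detail).

    Hypothetical helper for illustration only. On success, detail is the
    module name plus its __version__ (if it has one); on failure, detail
    is the full traceback text.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Catch broadly: a compiled extension built against a different
        # numpy can raise ValueError rather than ImportError.
        return False, traceback.format_exc()
    version = getattr(module, "__version__", "unknown")
    return True, f"{module_name} {version}"


if __name__ == "__main__":
    # numpy first, then the package whose compiled extension depends on it.
    for name in ("numpy", "hdbscan"):
        ok, detail = check_import(name)
        print(("OK   " if ok else "FAIL ") + detail)
```

Running this in the same environment as pixplot should show which numpy version is actually being imported alongside the hdbscan failure.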

I tried installing different versions of numpy and hdbscan matching pixplot's last release (2020). During those tests I noticed this error:

> pip install hdbscan==0.8.29                                                                                                       
Collecting hdbscan==0.8.29
  Using cached hdbscan-0.8.29-cp37-cp37m-linux_x86_64.whl
Collecting numpy>=1.20
  Using cached numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Requirement already satisfied: scikit-learn>=0.20 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (0.24.2)
Requirement already satisfied: scipy>=1.0 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (1.4.0)
Requirement already satisfied: cython>=0.27 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (0.29.33)
Requirement already satisfied: joblib>=1.0 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from scikit-learn>=0.20->hdbscan==0.8.29) (3.1.0)
Installing collected packages: numpy, hdbscan
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
  Attempting uninstall: hdbscan
    Found existing installation: hdbscan 0.8.26
    Uninstalling hdbscan-0.8.26:
      Successfully uninstalled hdbscan-0.8.26
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.5.0 requires numpy~=1.19.2, but you have numpy 1.21.6 which is incompatible.
pixplot 0.0.113 requires numpy==1.19.5, but you have numpy 1.21.6 which is incompatible.
Successfully installed hdbscan-0.8.29 numpy-1.21.6
WARNING: You are using pip version 22.0.4; however, version 23.0.1 is available.
You should consider upgrading via the '/home/nicolas/.pyenv/versions/3.7.16/bin/python3.7 -m pip install --upgrade pip' command.

pixplot has worked (with "HDBSCAN not available") with numpy==1.19.5 and hdbscan versions 0.8.24 through 0.8.29.
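To make the conflict in the pip output explicit, here is a rough sketch that checks the three numpy constraints quoted in the log against the two numpy versions that appear there. The functions `parse`, `satisfies`, and `conflicts` are hypothetical helpers written for this illustration. They handle only plain dotted numeric versions and the `==`, `>=`, and `~=` operators, not the full version-specifier rules that pip applies.

```python
# numpy constraints exactly as reported in the pip output above.
NUMPY_REQUIREMENTS = {
    "tensorflow 2.5.0": "~=1.19.2",
    "pixplot 0.0.113": "==1.19.5",
    "hdbscan 0.8.29": ">=1.20",
}


def parse(version):
    """Turn '1.19.5' into (1, 19, 5). Numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, spec):
    """Simplified check of one specifier (assumption: ==, >=, ~= only)."""
    for op in ("~=", "==", ">="):
        if spec.startswith(op):
            have = parse(installed)
            wanted = parse(spec[len(op):])
            if op == "==":
                return have == wanted
            if op == ">=":
                return have >= wanted
            # "~=": at least the wanted version, with every component
            # except the last one held fixed (e.g. ~=1.19.2 means 1.19.x).
            return have >= wanted and have[:len(wanted) - 1] == wanted[:-1]
    raise ValueError(f"unsupported specifier: {spec}")


def conflicts(installed, requirements):
    """Return the packages whose numpy constraint is not met."""
    return [pkg for pkg, spec in requirements.items()
            if not satisfies(installed, spec)]


if __name__ == "__main__":
    for candidate in ("1.19.5", "1.21.6"):
        print(candidate, "conflicts with:",
              conflicts(candidate, NUMPY_REQUIREMENTS))
```

Under these simplified rules, neither numpy version from the log satisfies all three packages at once, which matches what pip reports: hdbscan 0.8.29 asks for numpy>=1.20, while tensorflow 2.5.0 and pixplot 0.0.113 both hold numpy at 1.19.x.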

@lakonis
Author

lakonis commented Mar 23, 2023

I believe it has something to do with tensorflow, cuda, libcudart.so.11.0, etc. I am not sure I want to go that deep, since I am using pixplot for a dataset of ~1000 images on an Intel GPU, which would involve heavier installations.

However, it seems that hdbscan takes the label/category column into account in the clustering, which is particularly interesting in my case. I believe sklearn KMeans does not; is that correct?

Am I missing anything else without CUML?

HDBSCAN not available; using sklearn KMeans
CUML not available; using umap-learn UMAP

Thank you!

@pleonard212
Owner

CUML is just a library that contains an accelerated implementation of UMAP; no worries there. You're correct that there are some real annoyances around numba and numpy. I'm not sure whether you're on Linux, but there are some notes at the very end of this wiki page that might help:

https://github.com/YaleDHLab/pix-plot/wiki/Ubuntu-20-&-22-with-GPU

@lakonis
Author

lakonis commented Mar 23, 2023

I am on Manjaro Linux, but I have an Intel GPU. Therefore, I am trying this: installing intel-extension-for-tensorflow 1.1.0, but it upgrades everything and breaks pixplot's requirements.

Again, GPU support and speed are not crucial to me. It's rather hdbscan that could improve my clustering, from what I understand. But maybe I am mistaken?

@nabsiddiqui

I was able to get it to work with the following:

pip install hdbscan==0.8.31
pip install 'urllib3<2.0'
pip install https://github.com/yaledhlab/pix-plot/archive/master.zip
