How can we apply the Gower metric to UMAP? #356
Comments
To make that work you would need to write a |
Thanks @lmcinnes! I just have a few questions.
First, does the distance metric compute the distances between all pairs of points in one call and return a matrix, or does it compute a single distance between one pair of points and get called repeatedly to populate the distance matrix for the whole dataset? The existing Python implementations of the Gower distance that I've seen all output a matrix of pairwise distances between all points, rather than a single distance between one pair of points.
Second, for the jit compilation with numba, I believe your examples used njit (nopython jit), which compiles in nopython mode for speed. Would plain jit (or, slower still, jit with forceobj=True) also fit the bill? For some function
Also, I would want a relatively simple but highly generic Gower metric that can easily be reused across multiple datasets (an "out of the box" inference solution that applies to arbitrary mixed datasets). In that case, do you think skipping the customization of the Gower metric would still lead to acceptable results? If not, how would you propose a suitable learning / fitting preprocessing step for tuning the Gower metric on arbitrary mixed datasets? |
One more roadblock I ran into was that when I tried to fit / transform UMAP on a mixed dataset, it failed due to this line:
The issue for me was that
So, even if the Gower metric were suitably defined, it seems that using fit / transform on mixed datasets just isn't supported currently? Could you please confirm whether this is correct? |
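For anyone hitting the same roadblock: the failing line is not shown above, so this is only a guess at a workaround, assuming the failure comes from object-typed (string) columns being rejected when the input is validated as a numeric array. A minimal sketch that integer-codes the categorical columns so the whole array becomes float; `encode_mixed` and `categorical_cols` are hypothetical names, not part of UMAP.

```python
import numpy as np


def encode_mixed(rows, categorical_cols):
    """Return a float64 array in which categorical columns are integer-coded.

    rows: 2-D array-like mixing numbers and category labels.
    categorical_cols: set of column indices to treat as categorical.
    (Both names are illustrative assumptions, not a UMAP API.)
    """
    rows = np.asarray(rows, dtype=object)
    out = np.empty(rows.shape, dtype=np.float64)
    for j in range(rows.shape[1]):
        if j in categorical_cols:
            # Map each distinct label to a code 0..k-1 (sorted label order).
            _, codes = np.unique(rows[:, j].astype(str), return_inverse=True)
            out[:, j] = codes
        else:
            # Numeric columns are cast to float unchanged.
            out[:, j] = rows[:, j].astype(np.float64)
    return out


X = encode_mixed([[1.5, "red"], [2.5, "blue"], [3.5, "red"]], {1})
```

The codes only carry "same or different" information, so they make sense only with a metric that compares those columns by equality (as Gower does), not with Euclidean distance. The sketch also does not store the label-to-code mapping, which would have to be reused for any new data.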
With proper checks in place you could also change the line in a PR. See here for an example of adding a distance metric: lmcinnes/pynndescent#86 |
Thanks for the response, @sleighsoft. So you're suggesting creating a PR to enable passing mixed datasets into the UMAP transformer (with proper checks that such a dataset is still valid for mixed-data metrics)? Also, your example links back directly to this issue. Did you intend to link somewhere else instead? For now, I've actually opted to use FAMD for mixed data analysis instead, due to the above gotchas, although I continue to be interested in UMAP too. |
My mistake, I copied the wrong link. Updated it now. Every contribution is definitely welcome :) |
Thanks for the link! That seems like a great example of implementing a custom distance metric and I will likely refer to it if I end up adapting the Gower metric for UMAP. Interestingly enough though, I saw several references to the Gower metric for UMAP, but did not come across any njit implementation of the Gower metric for use with UMAP. |
Hi @simeng-yang, were you able to successfully implement Gower for UMAP? I'm interested in exploring the same thing, and I'd be very interested to see your implementation before starting from scratch. |
Hey @AdamSpannbauer, because of the above issues (UMAP not being directly suitable for mixed datasets and having non-negligible runtime overhead compared to some simpler methods), I did not make any further progress on this path. Instead, I ended up investigating FAMD (Factor Analysis of Mixed Data), which is a union of linear techniques that can handle both numerical and categorical data. Perhaps you might be interested in taking a look there. However, if you do want to explore a custom implementation of the Gower metric for UMAP, you may wish to refer to these existing standalone Gower metric implementations and try to "refit" them to work with UMAP. You would also have to develop the proper checks to handle mixed datasets with object columns. You can see here for an example of adding a distance metric: lmcinnes/pynndescent#86 (credit to @sleighsoft). I think this would still be a worthwhile endeavor, since mixed datasets are very prevalent in a wide variety of data analysis situations. |
@simeng-yang @lmcinnes |
I'm not sure if this is related, but I am trying this
and I get this error:
Does this mean it doesn't recognize the "precomputed" option or that it doesn't recognize what's in the distance matrix? |
From my rough work, if we let the custom metric be the Gower metric, the distance matrix for all points in the dataset can be computed for both numerical and categorical data. However, it seems this is simplest when we only use the Gower metric for precomputing the distance matrix for the entire dataset, i.e. with
umap.UMAP(metric="precomputed").fit_transform(precomputed_distances)
While it is possible to compute the distance matrix for a dataset beforehand, metric="precomputed" is unsuitable when new data must be embedded later, as inference requires, because it does not allow calling .transform on new data.
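To make the precomputed route concrete, here is a minimal NumPy sketch of a Gower distance matrix, assuming the numeric and categorical columns are supplied as two separate arrays, every feature is weighted equally, and there are no missing values. The function name `gower_matrix` and its arguments are illustrative, not taken from any existing Gower package.

```python
import numpy as np


def gower_matrix(X_num, X_cat):
    """Pairwise Gower distances (equal feature weights, no missing values).

    X_num: (n, p) numeric columns; X_cat: (n, q) categorical columns.
    """
    X_num = np.asarray(X_num, dtype=np.float64)
    X_cat = np.asarray(X_cat)
    # Range of each numeric column; constant columns get range 1 so they
    # contribute a distance of 0 instead of dividing by zero.
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0
    # Numeric part: |x_i - x_j| / range, summed over numeric features.
    diffs = np.abs(X_num[:, None, :] - X_num[None, :, :]) / ranges
    num_part = diffs.sum(axis=2)
    # Categorical part: 1 where the labels differ, 0 where they match.
    cat_part = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    n_features = X_num.shape[1] + X_cat.shape[1]
    return (num_part + cat_part) / n_features


X_num = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_cat = np.array([["a"], ["a"], ["b"]])
precomputed_distances = gower_matrix(X_num, X_cat)
# precomputed_distances can then be passed as in the call shown above:
# umap.UMAP(metric="precomputed").fit_transform(precomputed_distances)
```

The broadcasting builds an n × n × features intermediate array, so this sketch only suits small datasets.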
I think what I would want is to have a metric which can be plugged into umap.UMAP() such that this metric can handle both numerical and categorical features.
From the examples in the docs, it seems the metric is used to compute the distance between each pair of points separately (i.e. such a metric returns distance(point1, point2)).
I'm wondering how one could use the Gower distance metric for both fitting against training data and transforming on test data?
Or is transform for mixed datasets currently still unsupported despite the above?
This is important for me since I'm trying to use UMAP for dimensionality reduction on complex mixed datasets for inference/classification.
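As a rough illustration of the per-pair form described above, the following sketch builds a Gower-style distance(point1, point2) function compiled with numba's njit. It rests on several assumptions: categorical columns are already integer-coded as floats, the numeric ranges are computed once from the training data, and the closure-captured arrays are treated by numba as constants. `make_gower_metric`, `is_categorical` and `ranges` are hypothetical names. It has not been tested inside UMAP, and whether .transform then works for mixed data is exactly the open question here.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba (plain Python speed).
    def njit(func):
        return func


def make_gower_metric(is_categorical, ranges):
    """Build a per-pair Gower distance for fixed column types and ranges."""
    is_cat = np.asarray(is_categorical, dtype=np.bool_)
    rng = np.asarray(ranges, dtype=np.float64)

    @njit
    def gower(x, y):
        total = 0.0
        for k in range(x.shape[0]):
            if is_cat[k]:
                # Categorical feature: 0 if the codes match, 1 otherwise.
                if x[k] != y[k]:
                    total += 1.0
            elif rng[k] > 0.0:
                # Numeric feature: absolute difference scaled by its range.
                total += abs(x[k] - y[k]) / rng[k]
        return total / x.shape[0]

    return gower


gower = make_gower_metric([False, False, True], [2.0, 20.0, 1.0])
a = np.array([1.0, 10.0, 0.0])
b = np.array([3.0, 30.0, 1.0])
d_ab = gower(a, b)
```

The idea would then be to pass the returned function as the metric argument, e.g. umap.UMAP(metric=gower), with the same encoding and ranges applied to any new data before calling .transform.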