[FEA] Use libcudf Dictionary type for CategoricalColumn in Python #8573
Comments
This desire came up again recently in relation to #14138, where it is noted that we implement a lot of "heavyweight" algorithms as a sequence of calls in Python, rather than pushing them down into libcudf. @isVoid's implementation work in #8567 stalled due to some differences in the way libcudf and pandas (and hence cudf) choose to model dictionary-encoded columns. In libcudf, the keys of the dictionary are required to be sorted, and the encoding looks up the value by indexing into the keys array. This restricts dictionary encoding to keys that admit a total order, and (I think) doesn't have a hook for a user-provided comparator. In pandas, categoricals (dictionary-encoded columns) come in two flavours: ordered and unordered.
The latter do not require that the keys admit a total order (or indeed a partial one), and can be applied even in the case where the key type does have a "natural" ordering.
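As a rough illustration (a sketch written for this discussion, not the example originally posted here), an unordered pandas categorical over integer keys keeps the categories in whatever order they are given and rejects order comparisons, even though integers have a natural order:

```python
import pandas as pd

# Integer keys have a natural order, but this categorical is declared unordered.
unordered = pd.Categorical([3, 1, 2, 1], categories=[3, 1, 2], ordered=False)

# The keys stay in the order supplied, not in sorted order.
print(unordered.categories.tolist())
# The codes are positions into that (unsorted) keys array.
print(unordered.codes.tolist())

# Order comparisons are rejected for unordered categoricals.
try:
    unordered < 2
    comparison_allowed = True
except TypeError:
    comparison_allowed = False
print(comparison_allowed)
```

The point of interest for libcudf is that the keys array here is not sorted, so it cannot be handed over as-is to a representation that requires sorted keys.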
Ordered categoricals either use the natural ordering induced by the key type (this matches libcudf), or allow for a user-defined ordering. This enables the user to impose a total order on naturally unordered key types (for example floats), and/or provide one that is different from the natural order.
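A small sketch of the user-defined case (again an illustration with made-up data, not the original example): string keys whose ordered-categorical order differs from their natural lexicographic order.

```python
import pandas as pd

# The order of `categories` defines the ordering; it is not lexicographic.
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Sorting the categorical follows the user-defined order.
sorted_sizes = sizes.sort_values().tolist()

# Sorting the raw strings follows the natural (lexicographic) order instead.
lexicographic = sorted(sizes.tolist())

print(sorted_sizes)
print(lexicographic)
print(sizes.min(), sizes.max())
```

A dictionary representation with sorted keys would store these keys in the lexicographic order, so the user-defined order has to be carried somewhere else.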
AIUI, it was reconciling these differences that caused too many hacks/workarounds on the Python side. In light of this, we should consider whether the libcudf side would need some extensions to support cudf's use case of dictionary encoding, or whether there is a smart way of managing things in a translation layer that doesn't require huge amounts of special-casing.
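To make the translation-layer idea a bit more concrete, here is a minimal pure-Python sketch, assuming the keys do have a natural total order. The helper `to_sorted_dictionary` is hypothetical (it is not a cudf or libcudf API): it re-encodes a pandas-style `(categories, codes)` pair so that the keys are sorted, and returns the user's ordering separately as a rank array.

```python
def to_sorted_dictionary(categories, codes):
    """Hypothetical helper: re-encode so that the keys are sorted.

    Returns (sorted_keys, new_codes, user_rank), where user_rank[i] is the
    position of sorted_keys[i] in the caller's original category order.
    A code of -1 (the pandas convention for a missing value) is passed through.
    """
    # Permutation that sorts the original categories by their natural order.
    order = sorted(range(len(categories)), key=lambda i: categories[i])
    sorted_keys = [categories[i] for i in order]
    # Map each old code to the position of the same key in sorted_keys.
    remap = {old: new for new, old in enumerate(order)}
    new_codes = [remap[c] if c >= 0 else -1 for c in codes]
    # The sort permutation doubles as the rank of each sorted key
    # in the user-defined order.
    user_rank = list(order)
    return sorted_keys, new_codes, user_rank


categories = ["small", "medium", "large"]  # user-defined order
codes = [0, 2, 1, 0, -1]                   # small, large, medium, small, missing

sorted_keys, new_codes, user_rank = to_sorted_dictionary(categories, codes)
print(sorted_keys, new_codes, user_rank)
```

This only covers keys that can be sorted in the first place; unordered categoricals over keys with no total order are exactly the case a sorted-keys representation cannot express without extensions on the libcudf side.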
Another reason Michael's work stalled is that, because it does not map directly to a libcudf type, categorical data in cudf is special-cased all over the place, so the change requires a large amount of work to track. We were hoping that it would be simpler to work on this once we had refactored cudf internals to a point where the categorical logic was better isolated to just the categorical column, or at least more contained in some other way. I'm not opposed to revisiting the work now, but just an FYI that I'd hope this would become substantially easier after we restructure cudf internals around pylibcudf over the next couple of releases.
cuDF Python would like to back the CategoricalColumn with the libcudf Dictionary type. Work has been initiated toward this goal in #8567.