categorical features handled "correctly"? #92
Comments
Bump: does DecisionTree actually handle categorical variables without one-hot encoding? When feeding input with categorical columns through MLJ, an error is thrown. Could this be clarified and/or an example shown? xref: JuliaAI/MLJModels.jl#134
Works fine on the DT.jl side. The issue might be with MLJ.jl; one might need to overload `isless()`.

```julia
using Random, DecisionTree

features, labels = load_data("adult")
# Note that the data here is of type Any.
# I would cast them with string.() for better build performance.
typeof(features)

Random.seed!(1);
# native API
t1 = build_tree(labels, features)

Random.seed!(1);
# SKL API
t2 = fit!(DecisionTreeClassifier(), features, labels)
```
Yes, but if DecisionTree.jl is relying on the fact that strings are ordered (they have the lexicographic order), as @tlienart suggests, then DecisionTree is presumably not applying the standard splitting criterion for unordered factors (that is, any subset of all classes is a candidate, not just subsets that respect the order). If so, this ought to be made clearer in the documentation. Funny, when I reviewed the DecisionTree code a long while back, I seem to remember the algorithm for unordered factors being there. Can you clarify, @bensadeghi?
@ablaom Yes, lexicographic order is used for the splitting criterion: the values of a feature are sorted before being searched through for the best split (via information gain). I'm not sure how to word this in the docs (readme).
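To make the distinction above concrete, here is a minimal sketch (plain Julia, not DecisionTree.jl code) counting the candidate binary splits for a categorical feature with `k` levels. Treating the levels as ordered (as DecisionTree.jl effectively does after lexicographic sorting) yields only `k - 1` threshold splits, whereas the standard CART treatment of an unordered factor considers every nontrivial two-way partition of the levels, `2^(k-1) - 1` splits:

```julia
# Ordered treatment: only splits that respect the sorted order of levels.
ordered_splits(k) = k - 1

# Unordered treatment: any subset of levels vs. its complement
# (each unordered pair of complementary subsets counted once).
unordered_splits(k) = 2^(k - 1) - 1

for k in (2, 4, 8)
    println("k = $k: ordered = $(ordered_splits(k)), unordered = $(unordered_splits(k))")
end
# k = 8 gives 7 ordered splits vs. 127 unordered splits
```

The gap grows exponentially in `k`, which is why the two treatments can produce different trees on the same data.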
@bensadeghi Thanks for that quick response. In this case, then, the claim that the package provides "support for mixed categorical and numerical data" is indeed misleading: you only support ordered data. I suggest one replaces "support for mixed categorical and numerical data" in the readme with: "support for ordered features (encoded as Reals or Strings)".
Thanks @ablaom. I've updated the readme with your input.
@bensadeghi Thanks for that. However, it seems the feature requirements stated for classifiers and regressors are now different. For classification we have "support for ordered features (encoded as Reals or Strings)", which I think is correct. For regressors we have "support for numerical features", which I think should be the same as for classification. I think "features" is commonly interpreted as "inputs"; perhaps there is some confusion with the target?
Does this package "correctly" handle categorical variables (e.g. without conversion to numerical encoding schemes like one-hot or ordinal encoding), as that ability is a distinct advantage of decision trees and their progeny? Issues #61 and #13 are related but it is not clear to me what the current status is. Perhaps if they are supported, I could make a documentation PR for a brief mention on the README.
If so, it would be a good reason for some users to switch from scikit-learn's RF implementation, which still requires numerical encoding.
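For reference, the "numerical encoding" that scikit-learn users typically apply can be sketched in a few lines of plain Julia (no packages; `ordinal_encode` is a hypothetical helper, not part of any library here): each distinct level of a categorical column is mapped to an integer, which is exactly the step this package would let one skip:

```julia
# Ordinal-encode a categorical column: map each distinct level to an Int.
function ordinal_encode(col::AbstractVector)
    levels = sort(unique(col))                      # lexicographic order
    lookup = Dict(l => i for (i, l) in enumerate(levels))
    return [lookup[x] for x in col]
end

col = ["red", "blue", "green", "blue", "red"]
ordinal_encode(col)  # blue => 1, green => 2, red => 3, so [3, 1, 2, 1, 3]
```

Note that this encoding imposes the same lexicographic order on the levels that the thread above discusses, so it does not by itself recover the full unordered-factor splitting either.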