
categorical features handled "correctly"? #92

Open
aprive opened this issue Dec 14, 2018 · 7 comments

aprive commented Dec 14, 2018

Does this package "correctly" handle categorical variables, i.e. without conversion to numerical encoding schemes such as one-hot or ordinal encoding? That ability is a distinct advantage of decision trees and their progeny. Issues #61 and #13 are related, but it is not clear to me what the current status is. If categorical features are supported, I could make a documentation PR adding a brief mention to the README.

If so, it would be a good reason for some users to switch from scikit-learn's RF implementation, which still requires numerical encoding.
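
For readers unfamiliar with the terminology, a minimal sketch of the kind of conversion in question (the column and level names here are made up, not from any package):

# A raw categorical column such as
colors = ["red", "green", "red", "blue"]
# becomes, under one-hot encoding, one indicator column per level:
onehot = [c == level for c in colors, level in unique(colors)]
# 4×3 BitMatrix with columns for "red", "green", "blue"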

tlienart commented Nov 18, 2019

Bump: does DecisionTree actually handle categorical variables without one-hot encoding? When feeding input with categorical columns through MLJ, an error is thrown because the isless operation is not defined, so the answer seems to be no, yet the README seems to say otherwise.

Could this be clarified and/or an example shown?

xref: JuliaAI/MLJModels.jl#134

@bensadeghi
It works fine on the DT.jl side. The issue might be with MLJ.jl, which may need to overload isless().

using Random, DecisionTree

features, labels = load_data("adult")
typeof(features)  # note: the entries are of type Any
# Casting the entries to String gives better build performance:
features = string.(features)
labels   = string.(labels)

Random.seed!(1);
# native API
t1 = build_tree(labels, features)

Random.seed!(1);
# scikit-learn-style API
t2 = fit!(DecisionTreeClassifier(), features, labels)
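
For reference, a minimal sketch of what overloading isless() could look like, assuming the MLJ error comes from a level type with no order defined (MyLevel is an invented stand-in here, not an actual MLJ type):

# Hypothetical: a custom categorical level type with no order defined
struct MyLevel
    name::String
end

# Giving it a lexicographic order via Base.isless lets order-based
# splitting code, such as DecisionTree.jl's, sort and threshold it
Base.isless(a::MyLevel, b::MyLevel) = isless(a.name, b.name)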

ablaom commented Nov 19, 2019

Yes, but if DecisionTree.jl is using the fact that strings are ordered (they have the lexicographical order), as @tlienart suggests, then DecisionTree is presumably not applying the standard splitting criterion for unordered factors (in which any subset of all classes is a candidate split, not just the subsets induced by the order). If so, this ought to be made clearer in the documentation.
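
To make the distinction concrete, here is an illustrative sketch (not DecisionTree.jl internals) of the candidate splits each treatment considers for a factor with levels a, b, c:

levels = ["a", "b", "c"]
k = length(levels)

# Ordered treatment: only threshold splits along the sorted order,
# i.e. k - 1 candidates
ordered_splits = [(levels[1:i], levels[i+1:end]) for i in 1:k-1]
# (["a"], ["b", "c"]) and (["a", "b"], ["c"])

# Unordered (standard CART) treatment: any subset versus its complement,
# i.e. 2^(k-1) - 1 candidates; bitmask m picks the left-hand subset, and
# keeping the last level on the right avoids counting each split twice
unordered_splits = [([levels[j] for j in 1:k if (m >> (j - 1)) & 1 == 1],
                     [levels[j] for j in 1:k if (m >> (j - 1)) & 1 == 0])
                    for m in 1:2^(k-1)-1]
# includes (["b"], ["a", "c"]), which no ordered split can produce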

Funny, when I reviewed the DecisionTree code a long while back, I seem to remember the algorithm for unordered factors being there. Can you clarify, @bensadeghi?

@bensadeghi
@ablaom Yes, lexicographical order is used for the splitting criterion: the feature values are sorted and then searched through for the best split (via information gain).
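
A minimal illustration of the ordered treatment described above: string comparison in Julia is lexicographic, so each candidate split is a threshold in that order, just as for numeric features.

sort(["red", "blue", "green"])  # ["blue", "green", "red"]
"blue" < "green"                # true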

I'm not sure how to word this in the docs (README).

ablaom commented Nov 19, 2019

@bensadeghi Thanks for that quick response.

In that case, the claim that the package provides "support for mixed categorical and numerical data" is indeed misleading. You only support ordered data (encoded as Reals or Strings, I guess). Although users are free to pretend their unordered factors are ordered, the general understanding is that the CART algorithm treats ordered and unordered factors differently.

I suggest replacing "support for mixed categorical and numerical data" in the README with either:

  • "support for ordered features, which can be encoded as Reals or Strings"; or
  • "support for a mixture of numerical and ordered factor data", with an explanation that ordered factors can be encoded as strings (ordered lexicographically) or integers.

@bensadeghi
Thanks @ablaom. I've updated the README with your input.

ablaom commented Feb 13, 2022

@bensadeghi Thanks for that. However, the feature requirements stated for classifiers and regressors now differ:

For classification we have "support for ordered features (encoded as Reals or Strings)" (which I think is correct).

For regressors we have "support for numerical features", which I think should be the same as for classification.

I think "features" is commonly interpreted as "inputs"; perhaps there is some confusion with the target?
