
categorical features handled "correctly"? #92

Open
aprive opened this issue Dec 14, 2018 · 7 comments

aprive commented Dec 14, 2018

Does this package "correctly" handle categorical variables, i.e. without conversion to numerical encoding schemes such as one-hot or ordinal encoding? That ability is a distinct advantage of decision trees and their progeny. Issues #61 and #13 are related, but it is not clear to me what the current status is. If categorical features are supported, I could make a documentation PR adding a brief mention to the README.

If so, it would be a good reason for some users to switch from scikit-learn's RF implementation, which still requires numerical encoding.
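
For readers unfamiliar with the terminology, a minimal sketch of the kind of conversion in question (the column and level names here are made up, not from any package):

# A raw categorical column such as
colors = ["red", "green", "red", "blue"]
# becomes, under one-hot encoding, one indicator column per level:
onehot = [c == level for c in colors, level in unique(colors)]
# 4×3 BitMatrix with columns for "red", "green", "blue"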

tlienart commented Nov 18, 2019

Bump: does DecisionTree actually handle categorical variables without one-hot encoding? When feeding input with categorical columns through MLJ, an error is thrown because the isless operation is not defined, so the answer seems to be no, yet the README seems to say otherwise.

Could this be clarified and/or an example shown?

xref: JuliaAI/MLJModels.jl#134

@bensadeghi
It works fine on the DT.jl side. The issue might be with MLJ.jl, which may need to overload isless().

using Random, DecisionTree

features, labels = load_data("adult")
typeof(features)  # note: the entries are of type Any
# Casting the entries to String gives better build performance:
features = string.(features)
labels   = string.(labels)

Random.seed!(1);
# native API
t1 = build_tree(labels, features)

Random.seed!(1);
# scikit-learn-style API
t2 = fit!(DecisionTreeClassifier(), features, labels)
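
For reference, a minimal sketch of what overloading isless() could look like, assuming the MLJ error comes from a level type with no order defined (MyLevel is an invented stand-in here, not an actual MLJ type):

# Hypothetical: a custom categorical level type with no order defined
struct MyLevel
    name::String
end

# Giving it a lexicographic order via Base.isless lets order-based
# splitting code, such as DecisionTree.jl's, sort and threshold it
Base.isless(a::MyLevel, b::MyLevel) = isless(a.name, b.name)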

ablaom commented Nov 19, 2019

Yes, but if DecisionTree.jl is using the fact that strings are ordered (they have the lexicographical order), as @tlienart suggests, then DecisionTree is presumably not applying the standard splitting criterion for unordered factors (in which any subset of all classes is a candidate split, not just the subsets induced by the order). If so, this ought to be made clearer in the documentation.
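
To make the distinction concrete, here is an illustrative sketch (not DecisionTree.jl internals) of the candidate splits each treatment considers for a factor with levels a, b, c:

levels = ["a", "b", "c"]
k = length(levels)

# Ordered treatment: only threshold splits along the sorted order,
# i.e. k - 1 candidates
ordered_splits = [(levels[1:i], levels[i+1:end]) for i in 1:k-1]
# (["a"], ["b", "c"]) and (["a", "b"], ["c"])

# Unordered (standard CART) treatment: any subset versus its complement,
# i.e. 2^(k-1) - 1 candidates; bitmask m picks the left-hand subset, and
# keeping the last level on the right avoids counting each split twice
unordered_splits = [([levels[j] for j in 1:k if (m >> (j - 1)) & 1 == 1],
                     [levels[j] for j in 1:k if (m >> (j - 1)) & 1 == 0])
                    for m in 1:2^(k-1)-1]
# includes (["b"], ["a", "c"]), which no ordered split can produce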

Funny, when I reviewed the DecisionTree code a long while back, I seem to remember the algorithm for unordered factors being there. Can you clarify, @bensadeghi?

@bensadeghi
@ablaom Yes, lexicographical order is used for the splitting criterion: the feature values are sorted and then searched through for the best split (via information gain).
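
A minimal illustration of the ordered treatment described above: string comparison in Julia is lexicographic, so each candidate split is a threshold in that order, just as for numeric features.

sort(["red", "blue", "green"])  # ["blue", "green", "red"]
"blue" < "green"                # true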

I'm not sure how to word this in the docs (README).

ablaom commented Nov 19, 2019

@bensadeghi Thanks for that quick response.

In that case, the claim that the package provides "support for mixed categorical and numerical data" is indeed misleading. You only support ordered data (encoded as Reals or Strings, I guess). Although users are free to pretend their unordered factors are ordered, the general understanding is that the CART algorithm treats ordered and unordered factors differently.

I suggest replacing "support for mixed categorical and numerical data" in the README with either:

  • "support for ordered features, which can be encoded as Reals or Strings"; or
  • "support for a mixture of numerical and ordered factor data", with an explanation that ordered factors can be encoded as strings (ordered lexicographically) or integers.

@bensadeghi
Thanks @ablaom. I've updated the README with your input.

ablaom commented Feb 13, 2022

@bensadeghi Thanks for that. However, the feature requirements stated for classifiers and regressors now differ:

For classification we have "support for ordered features (encoded as Reals or Strings)" (which I think is correct).

For regressors we have "support for numerical features", which I think should be the same as for classification.

I think "features" is commonly interpreted as "inputs"; perhaps there is some confusion with the target?
