Decision Tree

An implementation of the simplified C4.5 algorithm.

The milksets package is used for loading UCI datasets.

Installation

git clone https://github.com/usualwitch/decision-tree.git
cd decision-tree
pip install .

Usage

from milksets import seeds
from dt import DecisionTree

# Load the UCI seeds dataset, grow a tree, then post-prune it.
X, y = seeds.load()
tr = DecisionTree(X, y, prune_method='Comp', loss='entropy')
tr.postprune()

Algorithms

For details of the tree building and pruning algorithms, see this post.

The general cost-complexity pruning method we employ first builds a tree without pruning and computes the candidate alphas from that tree. We then use cross-validation to select the best alpha and prune the tree with it.
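
A minimal sketch of this selection loop is given below. It is not the package's exact interface: cc_alphas, the alpha argument to postprune, and score are hypothetical names standing in for however the package exposes these steps.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from milksets import seeds
from dt import DecisionTree

X, y = seeds.load()

# Grow an unpruned tree once and read off the candidate alphas.
# cc_alphas() is a hypothetical accessor for the full tree's alphas.
full_tree = DecisionTree(X, y, prune_method='Comp', loss='entropy')
alphas = full_tree.cc_alphas()

# 5-fold stratified cross-validation over the candidate alphas.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_acc = {}
for alpha in alphas:
    fold_acc = []
    for train_idx, test_idx in cv.split(X, y):
        tree = DecisionTree(X[train_idx], y[train_idx], prune_method='Comp', loss='entropy')
        tree.postprune(alpha)                                   # hypothetical: prune at this alpha
        fold_acc.append(tree.score(X[test_idx], y[test_idx]))   # hypothetical accuracy method
    mean_acc[alpha] = np.mean(fold_acc)

best_alpha = max(mean_acc, key=mean_acc.get)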

Experiments

Comparison of C4.5 and CART

Both algorithms use the cost-complexity pruning strategy. We applied 5-fold stratified cross-validation to find the best alpha. The results are listed in the following table.

name         tree   alpha  acc
german       C4.5   0.0    0.700
german       CART   0.0    0.675
german       CART   0.5    0.704
iris         C4.5   0.0    0.430
iris         CART   0.0    0.933
page_blocks  C4.5   0.0    0.721
page_blocks  CART   0.0    0.963
seeds        C4.5   0.0    0.774
seeds        CART   0.0    0.936
wine         C4.5   0.0    0.636
wine         CART   0.0    0.936

In general, CART works better than C4.5. The test error of this simplified C4.5 is quite unstable with respect to the random choice of training and test sets, which would need to be addressed to make it practically applicable.

Different prune methods and loss functions

The experiment results are displayed in the figure below.

The pruning methods correspond to the three rows of the figure:

  • Reduced-error pruning (the error can be replaced by another loss function)
  • Pessimistic pruning (a sketch of its pruning test follows this list)
  • Cost-complexity pruning
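
As a rough sketch of the pessimistic test for a single internal node, following the continuity-corrected formulation described by Quinlan and Mingers rather than this repository's code: a subtree that misclassifies E training examples with L leaves over N examples gets the corrected error E + L/2, the node collapsed to a leaf gets its error plus 1/2, and the subtree is pruned when the leaf's corrected error is within one standard error of the subtree's.

import math

def should_prune(node_errors, subtree_errors, n_leaves, n_samples):
    """Pessimistic pruning test for one internal node (continuity corrected)."""
    subtree_corrected = subtree_errors + 0.5 * n_leaves
    leaf_corrected = node_errors + 0.5
    # Standard error of the subtree's corrected error count, treated as binomial.
    se = math.sqrt(subtree_corrected * (n_samples - subtree_corrected) / n_samples)
    return leaf_corrected <= subtree_corrected + se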

The loss functions correspond to the three columns (a minimal sketch of each follows this list):

  • Empirical entropy
  • Gini impurity
  • 0-1 loss
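
A minimal sketch of how these three quantities can be computed from the labels reaching a node (illustrative, not the package's code):

import numpy as np

def node_losses(y):
    """Empirical entropy, Gini impurity, and 0-1 loss of the labels at one node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log2(p))
    gini = 1.0 - np.sum(p ** 2)
    zero_one = 1.0 - p.max()   # error rate of predicting the majority class
    return entropy, gini, zero_one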

From the figure,

  • Reduced-error pruning and pessimistic pruning are insensitive to the choice of loss function.
  • In C4.5, where entropy is used to build the tree, cost-complexity pruning works poorly with 0-1 loss; using entropy as the loss in postpruning is the more coherent choice.

The average accuracies for each combination of pruning method and loss are listed in the table below.

prune method     loss     acc
cost-complexity  0-1      0.517
cost-complexity  entropy  0.671
cost-complexity  gini     0.601
pessimistic      0-1      0.489
pessimistic      entropy  0.469
pessimistic      gini     0.489
reduced error    0-1      0.654
reduced error    entropy  0.617
reduced error    gini     0.620

The best choice among these is cost-complexity pruning with entropy loss.

Reference

  1. Quinlan, J. R. C4.5: Programs for Machine Learning (1993). https://dl.acm.org/doi/book/10.5555/152181

  2. Mingers, J. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4, 227–243 (1989). https://doi.org/10.1023/A:1022604100933

  3. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. Springer, New York, NY, USA (2001).

License

MIT
