- Each element of a decision tree is called a node.
- The topmost node is called the root node.
- All other nodes, except the bottommost ones, are called decision nodes.
- The bottommost nodes are called leaf nodes.
Depending on the number of features and the possible values each feature can take, there are many possible decision trees for any given application. The job of a decision tree algorithm is to pick the one that does well on the training set and also generalizes well to new data.
First, we pick the feature for the root node.
Then we look at the left child node, decide which feature to split on next, and continue until the leftmost branch is complete.
As with the left branch, we then repeat the process for the right branch and select its features as well.
- How do we choose which feature to split on at each node? We select the feature that gives the maximum purity (a node is pure when all of its examples belong to a single class).
- When do you stop splitting? (The criteria below correspond to hyperparameters in common libraries; see the sketch after this list.)
- When a node is 100% one class.
- When splitting a node would result in the tree exceeding a maximum depth (the depth of a node is the number of hops it takes to reach it from the root node)
- Keeping the maximum depth small prevents the tree from becoming too big.
- Keeping the maximum depth small also makes the tree less prone to overfitting.
- When improvements in purity score are below a threshold.
- When the number of examples in a node is below a threshold.
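These stopping criteria map directly onto hyperparameters in common libraries. As a minimal sketch, assuming scikit-learn is available (the parameter names are scikit-learn's, not part of the course material, and the numbers are arbitrary illustrations):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping criteria above:
#   max_depth             -> stop when splitting would exceed a maximum depth
#   min_impurity_decrease -> stop when the improvement in purity is below a threshold
#   min_samples_split     -> stop when the number of examples in a node is below a threshold
model = DecisionTreeClassifier(
    max_depth=4,
    min_impurity_decrease=0.01,
    min_samples_split=10,
)
```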
We use the entropy function to measure impurity. For a node in which a fraction $p_1$ of the examples belong to class 1 (and $p_0 = 1 - p_1$ to class 0):
- When $p_1 = 0.5$ (a 50/50 mix), the entropy reaches its maximum value of 1.
- If $p_1 = 0$ or $p_1 = 1$, the node is pure and the entropy is 0.
- If the mix is anywhere in between, the entropy lies between 0 and 1.
- The entropy is defined as $H(p_1) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$.
- We take $0 \log_2(0) = 0$ by convention, so the formula is also defined for pure nodes.
note) There's also another impurity measure called the Gini index that is used in decision trees, but we won't need it in this course.
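A minimal sketch of the binary entropy computation described above ($p_1$ is the fraction of class-1 examples in a node; the function name is just for illustration):

```python
import numpy as np

def entropy(p1):
    """Entropy of a node where a fraction p1 of the examples belong to class 1."""
    if p1 == 0 or p1 == 1:
        # Pure node: zero entropy (we take 0 * log2(0) = 0 by convention).
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.5))  # 1.0 -> maximally impure (50/50 mix)
print(entropy(1.0))  # 0.0 -> pure node
```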
We'll decide what feature to split on at each node based on what choice of feature increases the purity the most.
The reduction of entropy is called information gain.
We get different entropies depending on which feature we split on at each node. To find the best split, we take the weighted average of the entropies of the resulting sub-nodes (weighted by the fraction of examples that ends up in each sub-node) and subtract it from the entropy of the current node; the result measures the improvement in purity (written out as a formula below).
- $E(S)$: The current entropy of our node $S$, before any split
- $|S|$: The size, or number of instances, of $S$
- $A$: An attribute of $S$ that has a given set of values (let's say it is a discrete attribute)
- $v$: Stands for value and represents each value of the attribute $A$
- $S_v$: After splitting $S$ using $A$, $S_v$ refers to each of the resulting sub-nodes of $S$ whose instances share the same value $v$ of $A$
- $E(S_v)$: The entropy of a sub-node $S_v$. This should be computed for each value of $A$ (assuming it is a discrete attribute)
- $p(c_i)$: The probability/percentage of class $c_i$ in a node
source
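Putting these symbols together, the weighted-average computation described above corresponds to the standard information-gain formula (a reconstruction consistent with the definitions above; the original source's exact notation may differ):

$$IG(S, A) = E(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, E(S_v), \qquad E(S) = -\sum_{i} p(c_i) \log_2 p(c_i)$$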
Overall process of making a decision tree (a code sketch follows this list):
- Start with all the examples at the root node
- Calculate the IG for all possible features, and pick the one with the highest IG
- Split dataset according to the selected feature, and create left and right branches of the tree
- Keep repeating the splitting process until a stopping criterion is met:
- When a node is 100% one class
- When splitting a node will result in the tree exceeding a maximum depth
- When the IG from additional splits is less than a threshold
- When the number of examples in a node is below a threshold
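A condensed sketch of this recursive procedure (hypothetical code, not from the course; it assumes NumPy arrays with binary 0/1 features and binary 0/1 integer labels):

```python
import numpy as np

def node_entropy(y):
    """Entropy of a set of binary (0/1) labels; 0 for an empty or pure set."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def build_tree(X, y, depth=0, max_depth=4, min_samples=2):
    """Recursively grow a decision tree on binary 0/1 features X and labels y."""
    # Stopping criteria: too few examples, pure node, or maximum depth reached.
    if len(y) < min_samples or len(np.unique(y)) == 1 or depth == max_depth:
        prediction = int(np.bincount(y).argmax()) if len(y) else 0
        return {"leaf": True, "prediction": prediction}

    # Information gain of splitting on each feature; pick the highest.
    def gain(f):
        left, right = y[X[:, f] == 1], y[X[:, f] == 0]
        weighted = (len(left) * node_entropy(left) + len(right) * node_entropy(right)) / len(y)
        return node_entropy(y) - weighted

    best = int(np.argmax([gain(f) for f in range(X.shape[1])]))

    # Split the dataset on the chosen feature and grow the left and right branches.
    left_mask = X[:, best] == 1
    return {
        "leaf": False,
        "feature": best,
        "left": build_tree(X[left_mask], y[left_mask], depth + 1, max_depth, min_samples),
        "right": build_tree(X[~left_mask], y[~left_mask], depth + 1, max_depth, min_samples),
    }
```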
There are many ways to decide when to stop growing the tree, and one of the most common is to set a maximum depth.
In theory, we could use a cross-validation set to pick this parameter, although in practice the open-source libraries have even better ways to choose it.
Another way is to set a threshold on the minimum IG, or on the number of examples in each split.
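For example, a cross-validation sketch for picking the maximum depth (assuming scikit-learn; `X_train`/`y_train` are placeholders and the candidate depths are arbitrary):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Try several maximum depths and keep the one with the best cross-validated score.
search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={"max_depth": [2, 3, 4, 5, 6]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
```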
If a categorical feature can take on $k$ (more than two) values, we can use one-hot encoding to turn it into $k$ binary features, each of which is either 0 or 1.
When splitting on a continuous-valued feature, we consider different threshold values to split on, carry out the usual IG calculation for each candidate, and split on the value that gives the highest IG.
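A rough sketch of that threshold search for a single continuous feature, reusing the `entropy` helper sketched earlier (hypothetical code; `y` is assumed to be binary 0/1 labels, and midpoints between consecutive distinct values are one common choice of candidate thresholds):

```python
import numpy as np

def best_threshold(values, y):
    """Return the threshold on a continuous feature that gives the highest IG."""
    best_gain, best_t = 0.0, None
    sorted_vals = np.unique(values)
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for t in (sorted_vals[:-1] + sorted_vals[1:]) / 2:
        left, right = y[values <= t], y[values > t]
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        weighted = w_left * entropy(np.mean(left)) + w_right * entropy(np.mean(right))
        gain = entropy(np.mean(y)) - weighted
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```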
In the case of regression, a leaf can only predict the average of the target values of the examples that were grouped into it.
The key decision for this type of tree is to choose the features to split on.
When building a regression tree, instead of trying to reduce entropy as in a classification tree, we try to reduce the variance of the target values (written out below).
- $Var(S)$: The current variance of our node $S$, before any split
- $|S|$: The size, or number of instances, of $S$
- $A$: An attribute of $S$ that has a given set of values (let's say it is a discrete attribute)
- $v$: Stands for value and represents each value of the attribute $A$
- $S_v$: After splitting $S$ using $A$, $S_v$ refers to each of the resulting sub-nodes of $S$ whose instances share the same value $v$ of $A$
- $Var(S_v)$: The variance of a sub-node $S_v$. This should be computed for each value of $A$ (assuming it is a discrete attribute)
- $\mu$: The mean of the target values present in a given $S$
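Analogously to information gain, the best split is the one that maximizes the reduction in variance (again a reconstruction consistent with the symbols above):

$$\Delta Var(S, A) = Var(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, Var(S_v), \qquad Var(S) = \frac{1}{|S|} \sum_{x \in S} (x - \mu)^2$$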
A single decision tree can be highly sensitive to small changes in the data; a solution that makes it less sensitive is to build many decision trees, which is called a tree ensemble.
For example, if we change just one example in the training set, the resulting decision tree might have a completely different structure.
Sampling with replacement means randomly picking an example from the dataset and putting it back before drawing the next one; we might get the same example more than once, but that's okay.
Given a training set of size $m$:
For $b = 1$ to $B$:
- Use sampling with replacement to create a new training set of size $m$
- Train a decision tree on the new dataset

We can choose $B$ to be around 64 to 128; making it much larger rarely hurts performance, but the returns diminish quickly.
After we have trained all the random decision trees, we give each new example to all of the trees and take the most common prediction (a majority vote) as the final result.
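A minimal sketch of this bagging-and-voting procedure (assuming scikit-learn's `DecisionTreeClassifier` as the base tree, NumPy arrays for the data, and binary 0/1 labels; the dataset names and `B = 100` are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

B = 100           # number of trees in the ensemble
m = len(X_train)  # size of the original training set

trees = []
for b in range(B):
    # Sampling with replacement: draw m indices, duplicates allowed.
    idx = np.random.choice(m, size=m, replace=True)
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Majority vote: average the 0/1 predictions of all trees and round.
votes = np.array([tree.predict(X_test) for tree in trees])
y_pred = np.round(votes.mean(axis=0)).astype(int)
```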
This particular way of creating a tree ensemble is sometimes called a bagged decision tree. The term comes from treating the training set as a bag of examples that are drawn with replacement (the sampling procedure is known as bootstrap aggregating, or "bagging").
One more change turns it into a random forest. The problem with bagged decision trees is that, even with sampling with replacement, the generated trees often end up choosing very similar splits, especially at and near the root node.
At each node, when choosing a feature to split on, if $n$ features are available, pick a random subset of $k < n$ features and allow the algorithm to choose only from that subset.
When $n$ is large, a typical choice is $k = \sqrt{n}$.
With this change, we now have the random forest algorithm.
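In practice, this is what off-the-shelf implementations do. For example, with scikit-learn (an assumption, not part of the course material), `max_features="sqrt"` restricts each split to a random subset of $\sqrt{n}$ features:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each split limited to a random subset of sqrt(n) features.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```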
Awful joke
Where does a machine learning engineer go camping? A random forest.
The most popular decision tree ensemble implementation is XGBoost, which is based on boosting.
The idea of boosting is that, on each iteration, we make it more likely to pick the examples that the previously trained trees misclassified, which changes the algorithm to the following (sketched in code after this list):
For $b = 1$ to $B$:
- Use sampling with replacement to create a new training set of size $m$. But instead of picking from all examples with equal ($1/m$) probability, make it more likely to pick examples that the previously trained trees misclassified.
- Train a decision tree on the new dataset.
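A rough sketch of this weighted-sampling intuition (hand-rolled for illustration only, assuming scikit-learn trees and binary 0/1 labels; real boosted-tree libraries such as XGBoost use a different and more efficient mechanism, as noted further below):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

B = 100
m = len(X_train)
weights = np.full(m, 1 / m)  # start with equal (1/m) probability per example

trees = []
for b in range(B):
    # Sample with replacement, favouring currently misclassified examples.
    idx = np.random.choice(m, size=m, replace=True, p=weights)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx]))

    # Increase the weight of examples the ensemble-so-far still gets wrong.
    votes = np.array([t.predict(X_train) for t in trees])
    misclassified = np.round(votes.mean(axis=0)) != y_train
    weights = np.where(misclassified, weights * 2, weights)
    weights /= weights.sum()  # renormalize to a probability distribution
```

In practice you would just use a library such as XGBoost, whose highlights are listed next.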
- Open source implementation of boosted trees
- Fast efficient implementation
- Good choice of default splitting criteria and criteria for when to stop splitting
- Built in regularization to prevent overfitting
- Highly competitive algorithm for machine learning competitions e.g. Kaggle competitions
note) Rather than sampling with replacement, XGBoost assigns different weights to different training examples, so it doesn't need to generate many randomly resampled training sets; this is more efficient than sampling with replacement.
Classification example:

```python
from xgboost import XGBClassifier

# Train a boosted-tree classifier and make predictions on the test set.
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
Regression example:

```python
from xgboost import XGBRegressor

# Same workflow, but for regression targets.
model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
Decision Trees and Tree ensembles
- Works well on tabular (structured) data
- Not recommended for unstructured data (images, audio, text)
- Very fast to train and make predictions with
- Small decision trees may be human interpretable
Neural Networks
- Works well on all types of data, including tabular (structured) and unstructured data
- May be slower to train than a decision tree
- Works with transfer learning
- When building a system of multiple models working together, it might be easier to string together multiple neural networks