In supervised learning we are given a data matrix $X$ of $N$ samples together with a target vector $y$ containing one target per sample.
We distinguish two cases:
- Classification: in this setting the targets are classes drawn from a discrete set. The accuracy of the model can then simply be measured as
$\frac{1}{N} \sum_{i=1}^N \mathbb{1}_{\hat y_i = y_i}$
- Regression: here the targets are drawn from a continuous set (as opposed to a discrete one). The metric can be thought of as a distance between the true target and the predicted one, for example the Mean Squared Error (MSE):
$\frac{1}{N} \sum_{i=1}^N (\hat y_i - y_i)^2$
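For concreteness, both metrics are one-liners in NumPy (a minimal sketch; the array names `y_true` and `y_pred` are illustrative):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of correct predictions: (1/N) * sum of 1[y_hat_i == y_i]
    return np.mean(y_true == y_pred)

def mse(y_true, y_pred):
    # Mean squared error: (1/N) * sum of (y_hat_i - y_i)^2
    return np.mean((y_pred - y_true) ** 2)
```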
With this simple algorithm, the only hyperparameter we need is the number of neighbors $k$.
A point is assigned a class by considering its $k$ nearest neighbors (in the training set) and taking the most represented class among them. If two classes are tied, we consider only the $k-1$ nearest neighbors instead.
The value assigned to a point is the mean of the targets of its $k$ nearest neighbors (in the training set).
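A minimal brute-force NumPy sketch covering both variants (the tie-breaking rule above is noted in a comment but left out for brevity; all names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k, classify=True):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k nearest neighbors
    if classify:
        # Most represented class among the k neighbors
        # (a tie could be resolved by recursing with k-1, as described above)
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]
    # Regression: mean of the neighbors' targets
    return y_train[nearest].mean()
```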
Least squares aims to minimize the mean squared error:
$$\frac{1}{N} \sum_{i=1}^N (x_i^T w - y_i)^2$$
where $w$ is the parameter vector.
The polynomial case simply appends each feature raised to the powers up to a specified degree to the input data.
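A minimal sketch of both ideas together: solving the least squares problem on polynomially augmented inputs (the helper names and the use of `np.linalg.lstsq` are illustrative choices):

```python
import numpy as np

def polynomial_features(X, degree):
    # Concatenate each feature raised to powers 1..degree
    return np.hstack([X ** d for d in range(1, degree + 1)])

def least_squares(X, y):
    # Add a bias column and solve min_w ||Xw - y||^2
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```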
In this case we add an $\ell_2$ penalty on the parameter vector (ridge regression); the problem still admits a closed-form solution.
In this case we add an $\ell_1$ penalty on the parameter vector (the Lasso).
This has unfortunately no closed form; two well-known algorithms exist to estimate the parameter vector:
- Least Angle Regression
- Forward stepwise regression
(upcoming)
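Until those sections are written, here is a hedged preview using scikit-learn's implementations of both approaches (the toy dataset and the `alpha` value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoLars

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

# Coordinate descent solver for the lasso objective
lasso = Lasso(alpha=0.1).fit(X, y)
# Least Angle Regression solver for the same objective
lars = LassoLars(alpha=0.1).fit(X, y)
print(lasso.coef_)  # several coefficients driven exactly to zero
print(lars.coef_)
```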
Binary trees can be used as a machine learning model, both as a regressor and as a classifier. We will focus on CART decision trees (which have the benefit of also being able to process categorical data), but other methods exist (ID3, C4.5, ...).
Each node of the tree performs a split of the data on a single variable using a threshold.
The tree is built from a given dataset by recursively:
- Finding the best split to divide the data based on some metric (Gini index for classification and mean squared error for regression), by testing every value of the dataset as a threshold (see the sketch after the Gini definition below).
- Creating two new nodes (left and right) from the data partitioned according to the best split found above.
- Creating a leaf instead when either the maximum depth of the tree is reached or the number of remaining data points falls below the minimum number of samples.
When building a CART classification tree, the Gini index is used. The Gini index is a cost function that evaluates a split, defined as follows:
$$
G = 1 - \sum_{k=1}^{K} p_{k}^2
$$
with $p_k$ the proportion of samples of class $k$ among the $K$ classes present in the node.
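A minimal sketch of the first step of the construction above: the exhaustive threshold search, scored by the size-weighted Gini index of the two children (binary numeric splits only; all names are illustrative):

```python
import numpy as np

def gini(y):
    # G = 1 - sum_k p_k^2, with p_k the class proportions in the node
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Test every value of every feature as a threshold; keep the split
    # that minimizes the size-weighted Gini index of the two children.
    best = (None, None, np.inf)  # (feature, threshold, weighted gini)
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # degenerate split, one child would be empty
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best
```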
GPR uses kernels; the default and most widely used one, the squared exponential kernel (also called RBF), is defined as follows: $$ \kappa_{y}\left(x_{p}, x_{q}\right)=\sigma_{f}^{2} \exp \left(-\frac{1}{2 \ell^{2}}\left(x_{p}-x_{q}\right)^{2}\right)+\sigma_{n}^{2} \delta_{p q} $$
We assume the observations are noisy, with noise variance $\sigma_{n}^{2}$; this is the $\sigma_{n}^{2} \delta_{p q}$ term in the kernel above.
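A minimal NumPy sketch of this kernel and the resulting GP posterior mean for 1-D inputs (hyperparameter values and function names are illustrative; a real implementation would use a Cholesky factorization rather than a direct solve):

```python
import numpy as np

def rbf(Xp, Xq, sigma_f, ell):
    # Noise-free part of the kernel: sigma_f^2 * exp(-(x_p - x_q)^2 / (2 l^2))
    sq = (Xp[:, None] - Xq[None, :]) ** 2  # pairwise squared differences
    return sigma_f ** 2 * np.exp(-sq / (2 * ell ** 2))

def gp_posterior_mean(X, y, X_star, sigma_f=1.0, ell=1.0, sigma_n=0.1):
    # The sigma_n^2 * delta_pq noise term only appears on the training covariance
    K = rbf(X, X, sigma_f, ell) + sigma_n ** 2 * np.eye(len(X))
    K_star = rbf(X_star, X, sigma_f, ell)
    return K_star @ np.linalg.solve(K, y)
```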
The SVC solves the following primal problem:
$$\min_{w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i$$
Subject to:
$$y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i,\ \zeta_i \geq 0,\ i=1, ..., n$$
Dual problem:
$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \sum_i \alpha_i$$
Subject to $$y^T \alpha = 0,\ \text{and } 0 \leq \alpha_i \leq C,\ i=1, ..., n$$
with $Q_{ij} = y_i y_j K(x_i, x_j)$. Decision function:
$$\operatorname{sgn}\left(\sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b\right)$$
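In practice the dual is solved by a library; a quick sketch using scikit-learn's SVC (the toy dataset and the RBF kernel choice are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # not linearly separable

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
# Prediction is the sign of sum_i y_i alpha_i K(x_i, x) + b
print(clf.predict([[0.5, 0.5]]))
```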
The SVR solves the following primal problem:
$$\min_{w, b, \zeta, \zeta^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$
Subject to:
$$y_i - w^T \phi(x_i) - b \leq \varepsilon + \zeta_i,$$
$$w^T \phi(x_i) + b - y_i \leq \varepsilon + \zeta_i^*,$$
$$\zeta_i, \zeta_i^* \geq 0,\ i=1, ..., n$$
Dual problem: $$\min_{\alpha, \alpha^*} \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) - y^T (\alpha - \alpha^*) + \varepsilon \sum_i (\alpha_i + \alpha_i^*)$$
Subject to $$\sum_i (\alpha_i - \alpha_i^*) = 0,\ \text{and } 0 \leq \alpha_i, \alpha_i^* \leq C,\ i=1, ..., n$$
And the prediction is made using:
$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + b$$
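Symmetrically for regression, a quick sketch using scikit-learn's SVR (the `C` and `epsilon` values and the toy data are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(50, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=50)

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
# Prediction is sum over support vectors of (alpha_i - alpha_i^*) K(x_i, x) + b
print(reg.predict([[2.5]]))
```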