Standard ridge regularisation is mathematically equivalent to the following procedure:
- Make many copies of the dataset
- Add uncorrelated noise to each variable (equivalent to adding a multiple of the identity matrix to the covariance matrix of the variables)
- Compute the OLS estimator on this perturbed dataset
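To make the equivalence concrete, here is a minimal sketch on synthetic data (not part of this repo): OLS fitted on many noise-perturbed copies of the dataset approaches the closed-form ridge estimator as the number of copies grows. The sizes `n`, `p`, the noise variance `sigma2` and the penalty scaling `lam = n * sigma2` are illustrative choices, not values used in the experiments below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

sigma2 = 0.5          # variance of the noise injected into each feature
lam = n * sigma2      # matching penalty in the unnormalised X^T X formulation

# Closed-form ridge estimator
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS on many noisy copies of the dataset
K = 5000
X_aug = np.concatenate([X + rng.normal(scale=np.sqrt(sigma2), size=(n, p)) for _ in range(K)])
y_aug = np.tile(y, K)
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.max(np.abs(beta_ridge - beta_aug)))  # small, and shrinks as K grows
```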
In practice, we are sometimes faced with datasets where certain variables will clearly be correlated with nearby variables. It might therefore make sense to link the noise with a kernel function describing the correlation of the noise added to different variables.
In this repo, I present an example of this modified procedure. For a hand-written digit dataset (`sklearn.datasets.load_digits`), I regress a one-hot encoded class label against the pixels in the corresponding images.
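A minimal sketch of this setup (my own illustration, not the repo's code), assuming the standard `sklearn` loader and a simple one-hot encoding of the labels:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data          # shape (1797, 64): flattened 8x8 pixel intensities
labels = digits.target   # shape (1797,): digit classes 0-9

# One-hot encode the labels; each column is regressed on the pixels separately
Y = np.eye(10)[labels]   # shape (1797, 10)
```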
I present the average leave-one-out cross-validation (LOOCV) MSE across 174 non-overlapping samples, each of size 30 (containing 3 instances of each class), for the following estimators:
$\beta=\left(\frac{1}{n-1}X^TX+\lambda I\right)^{-1}\left(\frac{1}{n-1}X^Ty\right)$

$\beta=\left(\frac{1}{n-1}X^TX+\lambda M\right)^{-1}\left(\frac{1}{n-1}X^Ty\right)$
In these formulas:
- $X$ is the feature matrix
- $y$ is the target vector
- $n=29$ is the size of the training dataset (each sample of 30 with one point held out for LOOCV)
- $I$ is the $64 \times 64$ identity matrix
- $M$ is a $64 \times 64$ square matrix (indexed by points in the image), with $M_{i,j} = \exp\left(-\left(\frac{d(i,j)}{\text{lengthscale}}\right)^2\right)$, where $d(i,j)$ is the distance between the two points in the image
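Below is a rough sketch of how I read these formulas, not the repo's actual implementation: the helper names `penalty_matrix` and `fit`, the sample selection, and the values of `lam` and `lengthscale` are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits

def penalty_matrix(lengthscale, side=8):
    # Coordinates of the 64 pixels on the 8x8 image grid
    coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # pairwise pixel distances
    return np.exp(-(d / lengthscale) ** 2)                                # M_{i,j} as defined above

def fit(X, y, penalty, lam):
    n = X.shape[0]                      # training points; the 1/(n-1) scaling matches the formulas above
    S = X.T @ X / (n - 1)
    s = X.T @ y / (n - 1)
    return np.linalg.solve(S + lam * penalty, s)

digits = load_digits()
rng = np.random.default_rng(0)
idx = rng.choice(len(digits.data), size=30, replace=False)  # illustrative sample; the repo uses 3 per class
X_sample, labels = digits.data[idx], digits.target[idx]
y_sample = (labels == 3).astype(float)                      # a single one-hot column (digit 3)

lam, lengthscale = 0.1, 1.5                                 # illustrative values
beta_ridge = fit(X_sample, y_sample, np.eye(64), lam)                     # lambda * I penalty
beta_kernel = fit(X_sample, y_sample, penalty_matrix(lengthscale), lam)   # lambda * M penalty
```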
Results by digit are shown below. Digits for which the second estimator outperforms the first (for at least some penalty values) are:
- 1
- 3
- 4 (for very small regularisation penalty)
- 8
- 9
As the lengthscale approaches zero, the off-diagonal entries of $M$ vanish and the estimator reduces to standard ridge regularisation, but these results suggest that lengthscales slightly above zero can outperform ridge regularisation, depending on the strength of the penalty.
This is what the empirical covariance matrix of the pixels looks like (in the particular one-dimensional order I've arranged them), vs. the penalty matrix added for regularisation.
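A rough sketch of how such a side-by-side view could be produced (assumed, not the repo's plotting code; the lengthscale and penalty values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

X = load_digits().data
cov = np.cov(X, rowvar=False)    # 64x64 empirical covariance of the pixels

# Same kernel construction as in the sketch above, with an illustrative lengthscale
coords = np.array([(i, j) for i in range(8) for j in range(8)], dtype=float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
M = np.exp(-(d / 1.5) ** 2)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(cov)
axes[0].set_title("Empirical pixel covariance")
axes[1].imshow(0.1 * M)
axes[1].set_title("Penalty matrix $\\lambda M$")
plt.show()
```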