Since CS231n, Andrew Ng's new deep learning course, Andrew Ng's CS229, and the Google ML crash course all introduce the basic concepts of ML and DL, I combine them together here. The material from huaxiaozhuan provides good tutorials on common machine learning algorithms.
Model complexity: VC dimension
- Hyperplane: `w^T * x + b = 0`
- Margin (i.e., the distance between the positive and negative support vectors): `2 / ||w||`
- Maximize the margin: `max 2 / ||w||, s.t. y_i * (w^T * x_i + b) >= 1` -> minimize the opposite: `min 1/2 * ||w||^2, s.t. y_i * (w^T * x_i + b) >= 1`. This formula can be solved directly by quadratic programming (QP), but it can also be solved more efficiently via the Lagrangian.
- Lagrangian: `L(w, b, a) = 1/2 * ||w||^2 + sum(a_i * (1 - y_i * (w^T * x_i + b)))` [1]
- Set the partial derivatives of `L` with respect to `w` and `b` equal to 0, which gives `w = sum(a_i * y_i * x_i)` and `0 = sum(a_i * y_i)`. Substituting these back into [1] yields the dual problem `L = max sum(a_i) - 1/2 * sum(sum(a_i * a_j * y_i * y_j * x_i^T.dot(x_j)))`, s.t. `sum(a_i * y_i) = 0` and `a_i >= 0` [2]. The `a` in [2] can be solved by Sequential Minimal Optimization (SMO): select two variables `a_i` and `a_j`, and fix the other `a`.
- KKT: the solution of the primal problem == the solution of the dual problem
- Principle: map a low-dimensional, linearly non-separable space to a high-dimensional, linearly separable space using a function `phi(x)`
- However, replacing `x_i^T.dot(x_j)` with `phi(x_i).dot(phi(x_j))` can be hard to compute, so a kernel performs the non-linear transformation and returns the inner product of the transformed vectors directly.
- Categories
  - Linear kernel: `x_i^T x_j`
  - Polynomial kernel: `(x_i^T x_j)^d`
  - Gaussian (RBF) kernel: `exp(-||x_i - x_j||^2 / (2 * sigma^2))`
- Soft margin: `min 1/2 * ||w||^2 + C * sum(epsilon_i)`, s.t. `y_i * (w^T * x_i + b) >= 1 - epsilon_i` and `epsilon_i >= 0`
- `C` is a regularization parameter:
  - small `C` allows constraints to be easily ignored → large margin
  - large `C` makes constraints hard to ignore → narrow margin
  - Note, there is only one parameter, `C` (see the sketch below)
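A minimal sklearn sketch of how the kernel and `C` interact (the `make_moons` data and the `C` values are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Illustrative non-linearly-separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# RBF kernel maps the data implicitly to a high-dimensional space;
# small C -> wide margin (more violations allowed), large C -> narrow margin
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='rbf', C=C, gamma='scale').fit(X, y)
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"train acc={clf.score(X, y):.2f}")
```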
- Definition: `max(0, 1 - z)`
- SVM: `min 1/2 * ||w||^2 + C * sum(max(0, 1 - y_i * (w^T * x_i + b)))`
- Relationship with NN
  - SGD with hinge loss (see the sketch below)
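A minimal numpy sketch of training a linear SVM with SGD on the hinge loss (a subgradient update in the spirit of Pegasos; the hyperparameters are illustrative assumptions):

```python
import numpy as np

def train_linear_svm_sgd(X, y, C=1.0, lr=0.01, epochs=50):
    """y in {-1, +1}; minimizes 1/2 * ||w||^2 + C * sum(hinge)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            # The regularizer contributes w; the hinge term only
            # contributes when the margin constraint is violated
            grad_w = w - (C * y[i] * X[i] if margin < 1 else 0)
            grad_b = -C * y[i] if margin < 1 else 0.0
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```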
- SVM regression (SVR)
  - Loss function: tube loss, `max(0, |s - y| - epsilon)`
  - When a prediction falls within the epsilon-tube around the target, the loss is zero
- Ranking SVM
- One-class SVM
- Principle
  - Recursive partition of the space with a greedy strategy (i.e., each partition is locally optimal); the decision boundaries are axis-parallel
- Hyperparameter: `max_depth`
- Binary tree, easy to interpret (good for medical analysis)
- Does not need standardization or normalization, and is robust to missing data
- Categories
  - ID3
    - Criterion: maximize information gain. For one feature `a` with 10 distinct values, `IG = Entropy(D) - sum(v in range(10): |D_v| / |D| * Entropy(D_v))` (a numpy sketch follows this list)
    - Multi-branch tree
    - Drawback: prefers features that have many values
  - C4.5
    - Criterion: maximize information gain ratio = information gain / intrinsic value (of the feature)
    - Multi-branch tree
  - CART
    - Criterion: minimize squared error for regression (`min(min(sum((y_i - c_1)^2)) + min(sum((y_i - c_2)^2)))`, where `c_1` and `c_2` are the means of the two parts split by the feature value) and the Gini index for classification (`1 - sum(p_k^2)`)
    - Binary tree: "is or is not" a given value for one feature
    - Relationship between entropy and Gini index: the Taylor expansion of `f(x) = -ln(x)` at `x = 1` gives `f(x) ≈ 1 - x`, so `entropy = -sum(p_i * ln(p_i)) ≈ sum(p_i * (1 - p_i)) = Gini index`
    - For regression, recursively partition on the feature value that minimizes the sum of squared errors (SSE) of the predictions. To avoid overfitting, `|tree size|` is added to the SSE
  - Multi-variate decision tree
    - Criterion: each node is a linear classifier instead of a split on a single (univariate) feature
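A minimal numpy sketch of the ID3 criterion (entropy and information gain for one categorical feature; the toy arrays are illustrative):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # IG = Entropy(D) - sum(|D_v| / |D| * Entropy(D_v))
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy example: a feature with two values
feature = np.array(['a', 'a', 'b', 'b', 'b'])
labels = np.array([1, 1, 0, 0, 1])
print(information_gain(feature, labels))
```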
- Overfitting
  - Pre-pruning
    - Prone to under-fitting
  - Post-pruning
    - For CART, add the `|tree size|` term to the SSE
- Features
  - Continuous feature
    - Convert to a discrete feature via bi-partition (e.g., `> 10 ?`)
    - First sort the values, then choose the median of each two adjacent values as the candidate split condition
  - Categorical feature
    - Decision subset. E.g., `X1: ['A', 'B']`, then subsets: `['A']`, `['B']`, `['A', 'B']`
- Missing data
- Split termination criterion
  - All `y` are the same: all samples belong to the same category
  - All `X` are the same: no features are left, or all samples have the same values on all features, so no decision stump is possible
  - No samples available
- Principle: resample the samples with replacement (bootstrap), and subsample the features at the same time
- Hyperparameters: `number of trees`, `number of features for each split: sqrt(# features)`
- Reduces overfitting and reduces variance
- Principle: use decision trees to fit the residuals, like a Taylor expansion
- Reduces bias
- tutorial 1, tutorial 2
- Gradient boosting performs gradient descent in function space
- Learning rate, i.e., shrinkage: shrinks the contribution of each tree by `learning_rate`
- Expectation Maximization (EM) is MLE or MAP with latent variables; it estimates the parameters iteratively
- Two steps
  - E step: the likelihood `product(p(y|theta)) -> sum(ln p(y|theta)) = sum(ln sum_z p(y,z|theta))` has a `sum` inside the `ln`, so setting the derivative equal to 0 cannot be solved directly. Instead, define `Q(theta, theta_i) = sum(p(z|y,theta_i) * ln p(y,z|theta))`, i.e., compute the expectation of `ln p(y,z|theta)` with weights `p(z|y,theta_i)` (in a GMM, you can think of it as first choosing the cluster), where `theta_i` is the current known estimate and `theta` is the parameter to be estimated
  - M step: `argmax_theta Q(theta, theta_i)`, i.e., find `theta` by setting the partial derivative of `Q` with respect to `theta` equal to 0
- Applications (e.g., Gaussian mixture models; see the sketch below)
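A minimal numpy sketch of EM for a 1-D, two-component GMM (the initialization and data are illustrative; not a robust implementation):

```python
import numpy as np

def gauss(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(y, n_iter=50):
    mu = np.array([y.min(), y.max()])   # illustrative initialization
    sigma = np.ones(2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: responsibilities p(z | y, theta_i) for the 2 clusters
        dens = np.stack([pi[k] * gauss(y, mu[k], sigma[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M step: closed-form argmax of Q(theta, theta_i)
        nk = resp.sum(axis=1)
        mu = (resp * y).sum(axis=1) / nk
        sigma = np.sqrt((resp * (y - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / len(y)
    return mu, sigma, pi

y = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)])
print(em_gmm_1d(y))
```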
- Entropy
  - Information content: `-log p(x)`
- Joint entropy
- Conditional entropy
- Mutual information
  - Measures the (possibly non-linear) relationship between two variables
  - Can be used for feature selection
- Relative entropy (i.e., KL divergence)
  - Measures the difference between two distributions
  - `D(p||q) != D(q||p)`, `D(p||q) >= 0`
- Cross entropy
  - Relationship with KL divergence: `CE(p, q) = H(p) + D(p||q)` (verified numerically below)
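A minimal numpy sketch verifying `CE(p, q) = H(p) + D(p||q)` on two illustrative discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H = -np.sum(p * np.log(p))        # entropy H(p)
KL = np.sum(p * np.log(p / q))    # D(p||q) >= 0, asymmetric
CE = -np.sum(p * np.log(q))       # cross entropy CE(p, q)

assert np.isclose(CE, H + KL)     # CE(p, q) = H(p) + D(p||q)
print(H, KL, CE)
```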
Instead of the learning curve, we usually look at the gap between the training accuracy and the testing accuracy to check whether the model is over-fitting.
1.5.2 Model ensemble, sklearn
- Bagging
  - Random features and random samples
  - Reduces high variance
- Boosting
  - Reduces high bias
  - Difference between GBDT and XGBoost: first-order vs. second-order derivatives in the loss expansion
- Blending
  - Split the data into first-step training data and second-step training data
  - Based on the first-step training data, train different models `[m_1, m_2, ..., m_n]`
  - Based on the second-step training data, first get the predictions `[y_1, y_2, ..., y_n]` using `[m_1, m_2, ..., m_n]`, then fit a new model (usually a simple model such as a linear model) on `([y_1, y_2, ..., y_n], true-y)`
- Stacking
  - The basic idea of stacking is the same as blending
  - Stacking uses cross-fold validation (the out-of-fold predictions are used to train the next layer), while blending uses a holdout validation (part of the training set is used in the first layer, part in the second); see the sketch below
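A minimal sklearn sketch of stacking with out-of-fold predictions (the base models and data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]

# First layer: out-of-fold predictions avoid leaking the training labels
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Second layer: a simple model fit on (out-of-fold predictions, true y)
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_model.score(meta_features, y))
```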
- Numerical
  - Scaling: `MinMaxScaler`, `StandardScaler`
  - Log
  - Statistics
    - E.g., the gap between the price and the average price
  - Discretizing (e.g., `age`): `pd.cut`, `pd.qcut`
- Categorical (enumerated)
  - Boolean (i.e., is it yes or no?)
    - For example, `red, yellow, green` are categorical features; an object that has both the `red` and `green` features can be represented as `[1, 0, 1]`
    - `pd.get_dummies`
  - Histogram
    - For example, given the two features `gender` and `hobbies`, we can calculate the percentage of each hobby within the `male` and `female` categories respectively, and replace the `hobbies` feature of each sample with this percentage
- Time
  - Continuous
    - Duration (e.g., browsing duration)
    - Interval (e.g., the time between the last purchase and now)
  - Discrete
    - The quarter in a year (i.e., 0, 1, 2, 3)
    - The week in a year (i.e., 0, 1, ..., 52)
    - The day in a week (i.e., 0, 1, ..., 6)
    - The hour in a day (i.e., 0, 1, ..., 23)
    - Workday or weekend
- Missing value
  - Additional boolean feature
    - For example, if some samples do not have the `age` feature, an additional feature `is_age_defined` is added (see the pandas sketch below)
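A minimal pandas sketch of these transformations (the toy DataFrame and bin edges are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'age': [23, 45, np.nan, 67, 34],
                   'price': [10.0, 200.0, 55.0, 80.0, 120.0],
                   'color': ['red', 'green', 'red', 'yellow', 'green']})

# Missing value: indicator feature, then impute
df['is_age_defined'] = df['age'].notna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())

# Numerical: scaling, statistics, discretizing
df['price_scaled'] = MinMaxScaler().fit_transform(df[['price']]).ravel()
df['price_gap'] = df['price'] - df['price'].mean()
df['age_bucket'] = pd.cut(df['age'], bins=[0, 30, 50, 100])

# Categorical: one-hot encoding
df = pd.concat([df, pd.get_dummies(df['color'])], axis=1)
print(df.head())
```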
- Univariate
- Selecting the best features based on univariate statistical tests
- For example, chi2, Pearson, mutual information
- Recursive feature elimination
- Selecting features by recursively considering smaller and smaller sets of features
- SelectFromModel
- Lasso with L1 regularization
- Tree-based
- Featuretools for automatic feature engineering.
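A minimal sklearn sketch of these three selection strategies (the toy data, `k`, and number of features to keep are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = np.abs(X)  # chi2 requires non-negative features

# Univariate: keep the k best features by a statistical test
X_uni = SelectKBest(chi2, k=5).fit_transform(X, y)

# Recursive feature elimination with a linear model
X_rfe = RFE(LogisticRegression(max_iter=1000),
            n_features_to_select=5).fit_transform(X, y)

# SelectFromModel: an L1-regularized model induces sparsity
l1 = LogisticRegression(penalty='l1', solver='liblinear')
X_sfm = SelectFromModel(l1).fit_transform(X, y)
print(X_uni.shape, X_rfe.shape, X_sfm.shape)
```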
- Mean squared error (MSE)
  - Linear regression
  - L2 loss
- Logarithmic loss
  - Logistic regression
  - `L(Y, f(x)) = -log f(x)`. The fast.ai wiki gives a detailed explanation of log loss.
- Cross entropy
  - Minimizing cross entropy is equivalent to minimizing KL divergence (relative entropy), since they differ only by the constant `H(p)`
- Max-margin loss
  - `J = max(0, 1 + s - s_c)`
- Exponential loss
  - AdaBoost
- Hinge loss
  - SVM
- Cosine similarity
  - Cosine curve (degree=0, cos_value=1; degree=90, cos_value=0; degree=180, cos_value=-1)
ERM = minimize the loss; it may lead to over-fitting. For example, maximum likelihood estimation (MLE).
SRM = ERM + regularization. For example, maximum a posteriori (MAP).
- MLE: choose the value that maximizes the probability of the observed data, i.e., `argmax_theta P(D|theta)`
- MAP: choose the value that is most probable given the observed data and the prior belief, i.e., `argmax_theta P(theta|D) = argmax_theta P(D|theta) * P(theta)`
- MLE is a special case of MAP, where the prior is uniform
Consider a scenario of predicting the gender of a user (i.e., male or female). This is a classical binary classification problem. Let predicted 1 (i.e., positive) indicate male and predicted 0 (i.e., negative) indicate female.
Confusion matrix:
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | TP | FN |
| Real negative | FP | TN |
Multi-class confusion matrix: the element at row i and column j denotes true class i being classified as class j.
Accuracy = (TP + TN) / (TP + FP + FN + TN); it suffers from the class imbalance problem.
Error = (FP + FN) / (TP + FP + FN + TN) = 1 - accuracy
Precision = TP / (TP + FP). What proportion of the predicted positives is correct?
Recall = TP / (TP + FN). What proportion of all positive observations is picked up?
F1 = 2 * Precision * Recall / (Precision + Recall).
Increasing precision means increasing the classification threshold; meanwhile, increasing recall means decreasing the classification threshold.
False positive rate = FP / (FP + TN)
True positive rate = TP / (TP + FN), i.e., recall.
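A minimal sketch computing these metrics from a confusion matrix (the toy labels are illustrative):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # correct proportion of predicted positives
recall = tp / (tp + fn)      # proportion of real positives picked up
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```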
Which model is better?
Model 1:

| | Predicted 1 | Predicted 0 |
|---|---|---|
| Real 1 | 4 | 2 |
| Real 0 | 3 | 1 |

Model 2:

| | Predicted 1 | Predicted 0 |
|---|---|---|
| Real 1 | 6 | 0 |
| Real 0 | 4 | 0 |

Model 1: accuracy = (4 + 1) / 10 = 50%.
Model 2: accuracy = (6 + 0) / 10 = 60%.
Model 2 has higher accuracy than model 1, but model 2 is useless: it predicts every sample as positive. This is called the accuracy paradox, which means the model with higher accuracy may not have better generalization power.
In general, when TP < FP, the accuracy will always increase when we change the classifier to always output 'negative'; conversely, when TN < FN, the same happens when we change the classifier to always output 'positive' [1].
Recall@k
ROC (Receiver Operating Characteristic) curve: a curve of true positive rate vs. false positive rate at different classification thresholds. The x-axis is the false positive rate, and the y-axis is the true positive rate [3].
Being close to the upper-left point (TPR=1.0, FPR=0.0) indicates a better model. On the diagonal line, TPR = FPR, which corresponds to random guessing.
AUC (Area under the ROC curve): aggregate measure of performance across all possible classification thresholds.
One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. [3]
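A minimal sklearn sketch of the ROC curve and AUC (the toy scores are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)
# AUC = probability that a random positive is ranked above a random negative
print(auc)
```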
References:
[1] https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/
[2] https://www.zhihu.com/question/30643044
[3] https://developers.google.cn/machine-learning/crash-course/classification/true-false-positive-negative
Suppose the logistic regression `p = sigmoid(z)`:

- The input `z`, i.e., `z = WX + b`, is a linear function of `x`.
- The log-odds, i.e., `log(p / (1 - p)) = WX`, is a linear function of the parameters `W`.
- The decision boundary, i.e., `WX = 0`, is linear. This does not mean LR can only handle linearly separable data. Actually, we can convert low-dimensional data to high-dimensional data with the help of feature mapping, and the data become linearly separable in the high-dimensional space.
- Both logistic regression and softmax regression can be modeled by the exponential family distribution.
- It is wrong that linear regression models generate straight lines and nonlinear regression models generate curvature. Actually, both linear and non-linear models can fit curves.
- Linear means linear in the parameters `W` but not in `X`. For example, both `f_1 = w_1*x_1 + w_2*x_2 + b` and `f_2 = w_1*x_1 + w_2*x_2 + w_3*x_2^2 + b` are linear models. The most common operation is adding polynomial terms (i.e., quadratic or cubic terms) to a linear model.
- Example of a non-linear function: `f_3 = w_1 * x^{w_1}`
- A `log` transform can be used to convert the nonlinear function `y = e^b * x_1^{w_1} * x_2^{w_2}` to the linear function `ln y = b + w_1 * ln(x_1) + w_2 * ln(x_2)`
Example precision/recall trade-off at different classification thresholds (from the Google crash course, see the references below):
- Original threshold: Precision = 0.8, Recall = 0.73.
- Raising the threshold: Precision = 0.88, Recall = 0.64.
- Lowering the threshold: Precision = 0.75, Recall = 0.82.
References:
https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
https://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier
https://stats.stackexchange.com/questions/88603/why-is-logistic-regression-a-linear-model
http://cs229.stanford.edu/notes/cs229-notes1.pdf
http://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/
http://statisticsbyjim.com/regression/curve-fitting-linear-nonlinear-regression/
https://developers.google.cn/machine-learning/crash-course/classification/precision-and-recall
Logistic regression application in Meituan:
https://tech.meituan.com/intro_to_logistic_regression.html
- Principle: make sure that the distributions of the validation set and the test set are the same.
- The train set and the dev/test sets may come from slightly different distributions.
- A learning curve shows the training and validation scores as the number of training examples increases.
- Manually check, say, 100 mis-classified validation samples, and count their error types.
- All advanced optimization algorithms are based on exponentially weighted averages (`V_t = beta * V_t-1 + (1-beta) * theta_t`), which approximately average over the last `1/(1-beta)` steps. An example: `V_0 = 0, V_1 = 0.9 * V_0 + 0.1 * theta_1, V_2 = 0.9 * V_1 + 0.1 * theta_2`, ..., therefore `V_2 = 0.1 * theta_2 + (0.1 * 0.9) * theta_1`; the weights on `theta_x` decay exponentially.
- At early steps, bias correction is used, i.e., `V_t = V_t / (1 - beta^t)`. If `t` is large, then `1 - beta^t` approximates `1`, and bias correction no longer has an effect.
- Momentum:
  - `V_dw = beta * V_dw + (1-beta) * dw`
  - `w = w - alpha * V_dw`
- RMSprop:
  - `S_dw = beta * S_dw + (1-beta) * dw ** 2`
  - `w = w - alpha * dw / (S_dw^(1/2) + epsilon)`, where `** 2` is element-wise squaring and `epsilon = 1e-8`
- Adam (a numpy sketch follows this list):
  - `V_dw = beta_1 * V_dw + (1-beta_1) * dw; S_dw = beta_2 * S_dw + (1-beta_2) * dw ** 2`
  - `V_dw_corr = V_dw / (1 - beta_1^t); S_dw_corr = S_dw / (1 - beta_2^t)`
  - `w = w - alpha * V_dw_corr / (S_dw_corr^(1/2) + epsilon)`
  - Defaults: `beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8`
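A minimal numpy sketch of the Adam update (the toy objective `f(w) = ||w||^2` and step count are illustrative):

```python
import numpy as np

def adam_step(w, dw, V, S, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    V = beta1 * V + (1 - beta1) * dw        # momentum term
    S = beta2 * S + (1 - beta2) * dw ** 2   # RMSprop term (element-wise square)
    V_corr = V / (1 - beta1 ** t)           # bias correction
    S_corr = S / (1 - beta2 ** t)
    w = w - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return w, V, S

# Illustrative: minimize f(w) = ||w||^2, so dw = 2w
w = np.array([1.0, -2.0])
V, S = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, V, S = adam_step(w, 2 * w, V, S, t)
print(w)  # converges toward [0, 0]
```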
- Decay the learning rate after each epoch: `learning_rate / (1.0 + num_epoch * decay_rate)`
- Exponential decay: `learning_rate * 0.95^num_epoch`
- First-order derivatives are zero.
- Along one dimension, the saddle point is a local maximum, but along another dimension, it is a local minimum.
- Results: early layers converge more slowly than later layers.
- Reason: the `sigmoid` and `tanh` activations suffer from vanishing gradients, but the `ReLU` activation does not have this problem.
- Solutions:
  - Activation function: `ReLU` or `Leaky ReLU`. `ReLU` can have dying states (caused by, e.g., a large learning rate or a large negative bias), in which both the outputs and the gradients are zero. `Leaky ReLU` solves this problem. Variants of `Leaky ReLU` are `randomized leaky ReLU (RReLU)`, `parametric leaky ReLU (PReLU)`, and the `exponential linear unit (ELU)`. Roughly: `ELU > Leaky ReLU > ReLU > tanh > sigmoid` [5].
  - Weight initialization, i.e., `Xavier` initialization for `sigmoid` and `tanh` (two goals: the output variance of each layer equals the input variance; the gradient variances before and after flowing through a layer are equal), and `He` initialization for `ReLU` and `Leaky ReLU`.
  - Batch Normalization: addresses the vanishing or exploding gradient problem during training [6].
- Implementation [5]
```python
import tensorflow as tf  # TF 1.x API

# Xavier initialization (for sigmoid / tanh)
tf.contrib.layers.fully_connected(inputs, num_outputs,
                                  weights_initializer=tf.contrib.layers.xavier_initializer())

# He initialization (for ReLU / Leaky ReLU)
he_init = tf.contrib.layers.variance_scaling_initializer()  # default is He initialization that only considers fan-in
tf.contrib.layers.fully_connected(inputs, num_outputs, weights_initializer=he_init)

# ELU
tf.contrib.layers.fully_connected(inputs, num_outputs, activation_fn=tf.nn.elu)

# Leaky ReLU
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

tf.contrib.layers.fully_connected(inputs, num_outputs, activation_fn=leaky_relu)
```
- Results: gradients and cost become `NaN`.
- Reason: the product of large weights and activation derivatives during back propagation. It particularly occurs in RNNs.
- Solution: gradient clipping.
- Implementation:
```python
# Pseudo code
if norm(gradients) > max_gradients_norm:
    gradients *= max_gradients_norm / norm(gradients)

# Real code (TF 1.x)
variables = tf.trainable_variables()
gradients = tf.gradients(ys=cost, xs=variables)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_norm)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
optimize = optimizer.apply_gradients(grads_and_vars=zip(clipped_gradients, variables),
                                     global_step=self.global_step)
```
- If the forget gate is on and the input and output gates are off, the LSTM just passes the memory cell gradients through unmodified at each time step [4].
- CEC (Constant Error Carousel) mechanism with gates [7].
References:
[1] https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
[2] https://www.quora.com/Why-is-it-a-problem-to-have-exploding-gradients-in-a-neural-net-especially-in-an-RNN
[3] http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
[4] http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15%20Exploding%20and%20Vanishing%20Gradients.pdf
[5] Hands on machine learning with Scikit-Learn and TensorFlow p278, P281
[6] https://www.zhihu.com/question/38102762
[7] https://www.zhihu.com/question/34878706
- Covariate shift means the data distribution changes; BN reduces how much the distributions of neurons shift. No matter how `z1` and `z2` change, BN keeps their mean and std the same.
- The dimension of `beta` and `gamma` is `(n^l, 1)`, because BN is used to scale the mean and variance of the input `Z` of each neuron.
- He initialization and ELU can reduce vanishing gradients at the beginning of training; BN can address vanishing gradients during training.
- "Batch" means evaluating the mean and standard deviation of the inputs over the current mini-batch.
- BN can reduce the vanishing gradient problem, makes the network less sensitive to weight initialization, and reduces the need for other regularization techniques (such as dropout).
- At test time, use the whole training set's statistics, i.e., exponentially weighted averages of each mini-batch's mean and variance (see the sketch below).
- Set `bias=False` in the layer before BN, since BN's `beta` shift makes a separate bias term redundant.
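A minimal numpy sketch of the BN forward pass, showing the training vs. test behavior (the momentum value is illustrative):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.9, eps=1e-5):
    """Z: (n_features, batch_size); gamma, beta: (n_features, 1)."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)   # per-neuron mini-batch mean
        var = Z.var(axis=1, keepdims=True)   # per-neuron mini-batch variance
        # Exponentially weighted averages used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, running_mean, running_var
```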
- Neural networks essentially learn feature crosses automatically, which is necessary for solving non-linear problems.
- Each neuron is `f(weighted average of the previous layer's outputs)`, where `f` is a non-linear function.
Chain rule with memoization.
4.3 How to choose the number of hidden layers ?
A DNN extracts features layer by layer, giving it a hierarchical architecture. For many problems, one or two hidden layers will work fine. For a complex problem, you can gradually increase the number of hidden layers until overfitting occurs.
4.4 How to set the number of neurons per hidden layer ?
A common strategy is to size the layers to form a funnel (i.e., deeper layers have fewer neurons). The analogy is that many low-level features coalesce into fewer high-level features.
A simple approach is to pick a complex model and use early stopping to prevent overfitting.
References:
[1] Hands on machine learning with Scikit-Learn and TensorFlow p271
Batch gradient descent
- slow
- may converge to a local minimum and yield worse performance
- suffers from GPU memory limitations

Stochastic gradient descent
- fast
- not stable

Mini-batch gradient descent
- as fast as SGD (matrix operations on GPU)
- escapes local minima and is more stable

How to set the mini-batch size
- If the whole training set has fewer than 2,000 samples, use the whole training set
- Otherwise, 64, 128, 256, or 512 are common choices
(Figure omitted: a network with one hidden layer of two neurons; the activation in the figure is a step function.)
Feed in 10 training samples, turn off dropout and L2 regularization, and predict the same 10 samples as the test set. If the cost is approximately 0, the code probably has no mistakes.
- Don't use grid search (it is only useful when the number of parameters is small); try random values. Say we have parameters `w1` and `w2`: grid search will loop over `w1 = 5` with `w2 = 1, 2, 3, 4, 5`, so all five models have the same `w1`. But with random values, the five models can have different `w1`.
- Coarse-to-fine strategy
- Log scale for the learning rate
  - Step 1: suppose that we want to choose the learning rate from the range `[0.0001, ..., 1]`
  - Step 2: take the `log scale` (with `base 10`), then the range becomes `[-4, 0]`
  - Step 3: sample uniformly with `r = -4 * np.random.rand(); alpha = 10**r` (note: `np.random.rand()`, i.e., uniform, not `np.random.randn()`)
At each step, update the parameters by their moving average.
References:
[1] Use it in tensorflow
[2] example
Why are the shape of the L1 norm and L2 norm (aka, weight decay) diamond like and circle like respectively?
See reference [1].
L0 norm is the number of non-zero elements, which has sparse property.
L1 norm is the best convex approximation to L0 norm. We can view L1 norm as a compromise between the L0 and L2 norms, inheriting the sparsity-inducing property from the former and convexity from the latter [2].
The L1 norm, despite being convex, is not everywhere differentiable (unlike the L2 norm) [2]; see the picture in [3].
Traditional gradient descent cannot be used to optimize L1 regularized models, therefore more advanced techniques like proximal gradient or primal-dual methods like ADMM are required [2].
L1 regularization helps performing feature selection.
Since squaring a number punishes large values more than it punishes small values [2], L2 regularization leads to small distributed weights.
Both are used to address over-fitting.
In Bayesian view:
L1 regularization is equivalent to MAP estimation using Laplacian prior.
L2 regularization is equivalent to MAP estimation using Gaussian prior.
Always try L2 regularization first, since it will give you the best result [2].
- Forward propagation computes the cost:

```python
# Suppose a single-hidden-layer neural network.
# W1 and W2 are the parameters for the input X and the hidden neurons.
# Since we add the regularization term, the cost can no longer reach zero.
L2_regularization_cost = 1.0 / m * lambd / 2.0 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
cost = cross_entropy_cost + L2_regularization_cost
```

- Back propagation computes the gradients:

```python
dZ2 = A2 - Y                                       # gradient of the cross-entropy loss
dW2 = 1. / m * np.dot(dZ2, A1.T) + W2 * lambd / m  # the L2 term adds W * lambd / m
db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
dA1 = np.dot(W2.T, dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))           # ReLU activation
dW1 = 1. / m * np.dot(dZ1, X.T) + W1 * lambd / m
db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
```
```python
# From http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
# Elastic net objective:
# 1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1
#     + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
```
References:
[1] https://www.quora.com/Why-does-an-L1-norm-unit-ball-have-diamond-shaped-geometry#!n=12
[2] https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
[3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
[4] https://feature.engineering/regularization/
- Intuition: can't rely on any one feature.
- At test time, do not use dropout.
- Implementation (inverted dropout):

```python
d1 = np.random.rand(a1.shape[0], a1.shape[1]) < keep_prob  # random mask
a1 = np.multiply(a1, d1)   # shut down a fraction (1 - keep_prob) of neurons
a1 /= keep_prob            # inverted dropout: keep the expected activation scale
```
- Data augmentation: horizontally flip
- Early stopping
- Parameter sharing: feature detector that's useful in one part of the image is probably useful in another part of the image
- Sparsity of connections
- Translation invariance
- Filter size is usually `odd`, e.g., `3*3`, `5*5`, so that it has a central point
- Filter kernel: `width * height * number of channels`
- Each filter has a `bias`
- After convolution, we should apply a non-linearity, i.e., `relu(convolution + bias)`
- Role
  - Prevent the image size from **shrinking** after convolution
  - Put more emphasis on the corner pixels (i.e., top-left, top-right, bottom-left, bottom-right); otherwise, each corner pixel is covered by the filter only once
- Types
  - `Valid`: no padding
  - `Same`: output size is the same as the input size, `p = (filter_size - 1) / 2`
- `stride=2` means striding in both the horizontal and vertical directions
- Image size after convolution, padding, and stride (see the helper below)
  - Horizontal direction: `floor((horizontal image size + 2 * padding - filter_size) / stride) + 1`
  - Vertical direction: `floor((vertical image size + 2 * padding - filter_size) / stride) + 1`
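A one-function sketch of this formula (the example sizes are illustrative):

```python
def conv_output_size(n, filter_size, padding=0, stride=1):
    """Output size along one spatial dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - filter_size) // stride + 1

print(conv_output_size(32, 5, padding=2, stride=1))  # 'same' padding: 32
print(conv_output_size(32, 3, padding=0, stride=2))  # 15
```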
- Hyperparameters: filter size and stride
- The pooling layer has hyperparameters but does not have any learnable parameters
- Does not change the number of channels
- Local Response Normalization (LRN): normalizes across channels
- Fixed convolution filter size (3 * 3, stride=1) and pooling size (2 * 2, stride=2)
- VGG16 has almost the same performance as VGG19
- Residual block: `a^(l+2) = g(z^(l+2) + a^(l))`
- In the extreme scenario with strong regularization, `z^(l+2)` equals `0`, and `a^(l+2) = g(0 + a^(l))`; it turns out that the identity function is easy to learn for a residual block. This also mitigates the vanishing gradient problem, which is why ResNet works (a numpy sketch follows).
- Inception module with four parallel branches: `1*1 CONV`; `1*1 CONV -> 3*3 CONV`; `1*1 CONV -> 5*5 CONV`; `MAXPOOL 3*3 (stride=1, same) -> 1*1 CONV`
- The `1*1 CONV` works as a bottleneck layer to reduce computational cost
See colah's great post on Understanding LSTM Networks.
- Learn word embeddings from a large text corpus (or download pre-trained embeddings online).
- Transfer the embeddings to a new task with a smaller training set.
- Optional: continue to fine-tune the word embeddings with the new data.
- If your data size is small, just retrain the last layer; if your data size is large, you can retrain the last two or three layers, or all of the layers.
- Task A and task B share the same low-level features, so transfer learning works.
- You can preprocess the input images with the frozen layers and save the results to disk.
- Data augmentation methods
- Mirroring
- Random cropping
- Color shifting: PCA
- A set of tasks shares the same low-level features
- Needs a large amount of data
- Data denoising
- Dimension reduction
- Usage: in fraud detection, use an auto-encoder to compress all data to dense vectors; then kNN is used to detect outliers
- Reinforcement learning (RL) is an area of machine learning that focuses on how an agent should act in an environment in order to maximize some given reward
- Markov Decision Processes (MDPs)
  - Components
    - Agent: the decision maker
    - Environment
    - State
    - Action
    - Reward
  - Sequential process
    - At each time step, given the environment's state, the agent selects an action to take. Then the environment transitions into a new state, and the agent is given a reward as a consequence of the previous action
  - Goal
    - The goal of an agent in an MDP is to maximize its cumulative rewards (i.e., the expected discounted return of rewards; the discount rate γ makes the agent care more about immediate rewards than future rewards)
- Policy and value function
  - Policy
    - A function that maps a given state to the probabilities of selecting each possible action from that state; let the symbol `π` denote the policy
  - Value
    - State-value function `v_π`: gives us the value of a state under `π`
    - Action-value function `q_π`: gives us the value of an action under `π`. `q_π` is referred to as the Q-function, and the output of the function for any given state-action pair is called a Q-value. The letter "Q" represents the quality of taking a given action in a given state
  - RL goal
    - Reinforcement learning algorithms seek to find a policy (i.e., the optimal policy) that will yield more return to the agent than all other policies
- Bellman optimality equation
- Q-learning
  - Q-function: input a state-action pair, output the Q-value. The letter "Q" represents the quality of taking a given action in a given state.
  - Q-learning learns the optimal policy by learning the optimal Q-values for each state-action pair in a Markov Decision Process
  - Q-table: stores the Q-values for each state-action pair; its dimension is `(#states * #actions)`
  - Steps (a tabular numpy sketch follows this list)
    - Initialize all Q-values in the Q-table to 0
    - For each time step in each episode:
      - Choose an action (using the exploration-exploitation trade-off, i.e., the epsilon-greedy strategy): if `random_num > epsilon`, choose the action via exploitation (the action with the highest Q-value for the current state); else, choose the action via exploration (a randomly chosen action)
      - Update the Q-value function
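A minimal numpy sketch of the tabular Q-learning update (the environment sizes, `alpha`, `gamma`, and `epsilon` are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))     # Q-table: (#states * #actions)
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit
    if np.random.rand() > epsilon:
        return np.argmax(Q[state])       # exploitation: highest Q-value
    return np.random.randint(n_actions)  # exploration: random action

def q_update(state, action, reward, next_state):
    # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```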
- Deep Q-network (DQN): use a neural network to approximate the Q-function
  - Input the state, output the Q-values for each action that can be taken from that state; the loss is the gap between the output Q-values (from the policy network) and the target Q-values (from the target network) given by the Bellman equation
  - Experience replay: store the agent's experiences at each time step in a replay memory data set, which keeps the last `N` experiences. At time `t`, the agent's experience `e_t` is defined as the tuple `e_t = (s_t, a_t, r_t+1, s_t+1)`
    - Train the network on randomly chosen samples from the replay memory to break the correlation between consecutive samples
  - Target network
    - A clone of the policy network whose weights are frozen; after every `x` time steps, copy the policy network's weights to the target network
    - The purpose of the target network is to compute the `max` term in the Bellman equation when calculating the target Q-value
- Clone of the policy network, weights are frozen with policy network’s weights, and after