Improving Deep Neural Networks - Deep Learning Specialization 2
deeplearning.ai by Andrew Ng on Coursera
Full Course Name: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
W1: Practical aspects of Deep Learning
Setting up your Machine Learning Application
Applied ML is a highly iterative process that requires many cycles of idea → code → experiment

Train / Development / Test sets

Bias and Variance
- In DL era, less discussion about trade-off, but bias and variance themselves
- (shown in V3) The trade-off mattered mainly in the pre-DL era, because there was no tool that could reduce bias without increasing variance (or vice versa)

Systematic Way of Improving DL

Regularizing your neural network
L2 Regularization
V1: L2 regularization
- add a regularization part in cost function
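A minimal NumPy sketch of the regularized cost term (function and variable names are illustrative, not from the course code):

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Regularization term (lambda / 2m) * sum of squared Frobenius norms.

    `weights` is a list of weight matrices W[l]; the penalty is added to
    the unregularized cost J.
    """
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# In backprop, each dW[l] then gains an extra (lambda / m) * W[l] term,
# which is why L2 regularization is also called "weight decay".
```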

V2: why L2 regularization prevents overfitting
- slightly increases bias but significantly decreases variance

Dropout Regularization
V3&V4:
- Applied during training only; at test time no units are dropped
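A sketch of inverted dropout for a single layer (the fixed seed and names are for illustration only):

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng=None):
    """Inverted dropout on one layer's activations A (training time only).

    Dividing by keep_prob keeps the expected value of A unchanged, which
    is why nothing special is needed at test time.
    """
    rng = rng or np.random.default_rng(0)   # fixed seed just for the sketch
    mask = rng.random(A.shape) < keep_prob  # keep each unit with prob keep_prob
    return (A * mask) / keep_prob
```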

Others

Setting up your optimization problem (to speed up and debug)
Normalizing Inputs $A^{[0]}$
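Input normalization can be sketched as below; the key point from the lecture is that the same $\mu$ and $\sigma$ computed on the training set must be reused for the test set:

```python
import numpy as np

def normalize_inputs(X):
    """Zero-mean, unit-variance normalization of X (features x examples).

    Returns mu and sigma so the SAME training-set statistics can be
    applied to dev/test data.
    """
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (X - mu) / sigma, mu, sigma
```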

Better Initialization $W^{[L]}$
- an extremely deep NN compounds effects across layers, so $Z^{[l]}$ values can grow or shrink exponentially with depth (exploding/vanishing), which makes learning difficult; careful initialization of $W^{[l]}$ mitigates this
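A sketch of He initialization, $W^{[l]} \sim N(0,\, 2/n^{[l-1]})$, which keeps the variance of $Z^{[l]}$ roughly constant across ReLU layers (layer names and dict keys are illustrative):

```python
import numpy as np

def init_he(layer_dims, rng=None):
    """He initialization for a deep ReLU network.

    layer_dims = [n_x, n_1, ..., n_L]; scaling by sqrt(2 / n^[l-1])
    counteracts exploding/vanishing activations in deep nets.
    """
    rng = rng or np.random.default_rng(0)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = (rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
                           * np.sqrt(2.0 / layer_dims[l - 1]))
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```

For tanh layers, Xavier initialization with `sqrt(1 / n^[l-1])` is the usual variant.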

Debug of Gradient Checking
to verify that backpropagation is implemented correctly
V4: numerical approx of gradients

V5: gradient checking

V6: practical notes on applying gradient checking
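The centered (two-sided) difference from V4 has $O(\varepsilon^2)$ error versus $O(\varepsilon)$ for the one-sided version. A scalar sketch (real checks flatten all parameters into one vector and compare norms):

```python
def grad_check(f, grad_f, theta, eps=1e-7):
    """Relative difference between the analytic gradient and the
    two-sided numerical approximation (f(t+eps) - f(t-eps)) / (2*eps).

    A result around 1e-7 suggests backprop is correct; around 1e-3
    or larger suggests a bug.
    """
    numeric = (f(theta + eps) - f(theta - eps)) / (2 * eps)
    analytic = grad_f(theta)
    return abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
```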

W2: Optimization algorithms (to speed up training)
Mini-batch Gradient Descent
V1&V2: mini-batch gradient descent
- split huge training set into small batches
- compared to batch gradient descent, mini-batch GD makes progress without processing the entire training set, so each descent step is faster but the cost oscillates more
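Splitting the training set can be sketched as below (X stored features-by-examples, as in the course convention; names are illustrative):

```python
import numpy as np

def make_minibatches(X, Y, batch_size=64, rng=None):
    """Shuffle examples (columns) and split into mini-batches.

    X is (features, m), Y is (1, m); the last batch may be smaller
    than batch_size when batch_size does not divide m.
    """
    rng = rng or np.random.default_rng(0)
    m = X.shape[1]
    perm = rng.permutation(m)           # shuffle so batches are i.i.d.
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```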

Gradient Descent with Momentum
V3&V4: exponentially weighted moving averages

V5: bias correction in early phase of weighted average

V6: gradient descent with momentum
- uses an exponentially weighted average of past gradients to reduce oscillation, especially in mini-batch gradient descent (batch GD oscillates much less)
- can be interpreted physically: the gradient acts like acceleration and the beta term like friction on a ball rolling toward the minimum
- offers little benefit with small learning rates or on simple datasets
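A single-parameter sketch of the momentum update (default hyperparameters are the common choices, not prescriptions):

```python
def momentum_step(w, dw, v, beta=0.9, lr=0.01):
    """One gradient-descent-with-momentum update.

    v is the exponentially weighted average of past gradients:
    v = beta * v + (1 - beta) * dw, then w is moved along v.
    """
    v = beta * v + (1 - beta) * dw
    return w - lr * v, v
```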

RMSprop
V7: RMSprop
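RMSprop keeps an exponentially weighted average of *squared* gradients and divides the step by its root, damping directions that oscillate. A scalar sketch:

```python
def rmsprop_step(w, dw, s, beta2=0.999, lr=0.01, eps=1e-8):
    """One RMSprop update; eps prevents division by zero when s is tiny."""
    s = beta2 * s + (1 - beta2) * dw ** 2
    return w - lr * dw / (s ** 0.5 + eps), s
```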

Adam Optimization
V8: Adam optimization = momentum + RMSprop
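Combining the two previous updates with bias correction gives Adam; a scalar sketch (t is the iteration count, starting at 1):

```python
def adam_step(w, dw, v, s, t, beta1=0.9, beta2=0.999, lr=0.001, eps=1e-8):
    """One Adam update = momentum (v) + RMSprop (s), with bias
    correction so early estimates of v and s are not biased toward 0."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return w - lr * v_hat / (s_hat ** 0.5 + eps), v, s
```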

Learning Rate Decay
V9: learning rate decay in training process
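One common schedule from the lecture, $\alpha = \alpha_0 / (1 + \text{decay\_rate} \cdot \text{epoch})$:

```python
def decayed_lr(alpha0, decay_rate, epoch):
    """Hyperbolic learning rate decay over epochs.

    Other options covered include exponential decay (alpha0 * k**epoch)
    and staircase schedules.
    """
    return alpha0 / (1 + decay_rate * epoch)
```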

Understanding Local Optima
V10: problems of local optima

W3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
Hyperparameter Tuning
V1: tuning process
- use random search for higher efficiency, not grid search
- from coarse to fine

V2: appropriate scale for searching
- choose the scale to search efficiently
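For a hyperparameter like the learning rate, "appropriate scale" means sampling uniformly in the exponent, not the value; a sketch:

```python
import math
import random

def sample_log_scale(low, high, rng=None):
    """Sample uniformly on a log scale, e.g. a learning rate in
    [1e-4, 1]: r ~ Uniform(log10(low), log10(high)), return 10**r."""
    rng = rng or random.Random(0)
    r = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** r
```

Linear-scale sampling would waste ~90% of samples between 0.1 and 1; log-scale spends equal effort on each decade.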

V3: tuning practice

Batch Normalization
V1: normalizing activations in NN
- applies the same logic as input normalization, but at each hidden layer, by normalizing Z (or, more rarely, A)
- however, we don't want every layer's activations distributed as $N(0,1)$ like the input $X$, so learnable parameters beta and gamma tune the mean and variance
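The forward pass of batch norm can be sketched as below (gamma and beta would be learned by gradient descent alongside W):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch norm on pre-activations Z (features x batch size):
    normalize each feature over the mini-batch, then rescale and
    shift with the learnable gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
```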

V2: add in batch norm

V3: why works
- speeds up gradient descent, for the same reason as input normalization
- makes deeper layers' weights more robust to changes in earlier layers' weights, so each layer learns more independently
- has a slight regularization effect when combined with mini-batches (the noisy per-batch mean/variance acts like noise injection)

V4: batch norm at test time
- estimate mean and variance with an exponentially weighted average across training mini-batches, then use those fixed estimates at test time (so even a single test example can be processed)

Multi-class Classification
V1: softmax regression
- use softmax as last layer
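A numerically stable softmax sketch (classes along rows, examples along columns, matching the course convention):

```python
import numpy as np

def softmax(z):
    """Softmax over the class axis. Subtracting the per-column max does
    not change the result but avoids overflow in exp for large logits."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```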

V2: training a softmax classifier

Introduction to Programming Frameworks
V1: Deep learning frameworks

V2: TensorFlow
- only the cost function (forward prop) is defined by hand; the backward pass is derived automatically via automatic differentiation

