Improving Deep Neural Networks - Deep Learning Specialization 2
deeplearning.ai by Andrew Ng on Coursera
Full Course Name: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
W1: Practical aspects of Deep Learning
Setting up your Machine Learning Application
Applied ML is a highly iterative process that requires many cycles of idea → code → experiment

Train / Development / Test sets

Bias and Variance
- In DL era, less discussion about trade-off, but bias and variance themselves
- (shown in V3) The trade-off mattered mainly in the pre-DL era, because there was no tool that could reduce bias without increasing variance (or vice versa)

Systematic Way of Improving DL

Regularizing your neural network
L2 Regularization
V1: L2 regularization
- add a regularization part in cost function
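A minimal NumPy sketch of the regularized cost term (function and variable names are illustrative, not from the course code):

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Regularization term (lambda / 2m) * sum of squared Frobenius norms.

    `weights` is a list of weight matrices W[l]; the penalty is added to
    the unregularized cost J.
    """
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# In backprop, each dW[l] then gains an extra (lambda / m) * W[l] term,
# which is why L2 regularization is also called "weight decay".
```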

V2: why L2 regularization prevents overfitting
- slightly increases bias but significantly decreases variance

Dropout Regularization
V3&V4:
- Applied during training only; at test time no units are dropped
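A sketch of inverted dropout for a single layer (the fixed seed and names are for illustration only):

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng=None):
    """Inverted dropout on one layer's activations A (training time only).

    Dividing by keep_prob keeps the expected value of A unchanged, which
    is why nothing special is needed at test time.
    """
    rng = rng or np.random.default_rng(0)   # fixed seed just for the sketch
    mask = rng.random(A.shape) < keep_prob  # keep each unit with prob keep_prob
    return (A * mask) / keep_prob
```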

Others

Setting up your optimization problem (to speed up and debug)
Normalizing Inputs $A^{[0]}$
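Input normalization can be sketched as below; the key point from the lecture is that the same $\mu$ and $\sigma$ computed on the training set must be reused for the test set:

```python
import numpy as np

def normalize_inputs(X):
    """Zero-mean, unit-variance normalization of X (features x examples).

    Returns mu and sigma so the SAME training-set statistics can be
    applied to dev/test data.
    """
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (X - mu) / sigma, mu, sigma
```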

Better Initialization $W^{[L]}$
- an extremely deep NN compounds effects across layers, so $Z^{[l]}$ values can grow or shrink exponentially with depth (exploding/vanishing), which makes learning difficult; careful initialization of $W^{[l]}$ mitigates this
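A sketch of He initialization, $W^{[l]} \sim N(0,\, 2/n^{[l-1]})$, which keeps the variance of $Z^{[l]}$ roughly constant across ReLU layers (layer names and dict keys are illustrative):

```python
import numpy as np

def init_he(layer_dims, rng=None):
    """He initialization for a deep ReLU network.

    layer_dims = [n_x, n_1, ..., n_L]; scaling by sqrt(2 / n^[l-1])
    counteracts exploding/vanishing activations in deep nets.
    """
    rng = rng or np.random.default_rng(0)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = (rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
                           * np.sqrt(2.0 / layer_dims[l - 1]))
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```

For tanh layers, Xavier initialization with `sqrt(1 / n^[l-1])` is the usual variant.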

Debug of Gradient Checking
to verify that backpropagation is implemented correctly
V4: numerical approx of gradients

V5: gradient checking

V6: practical notes on applying gradient checking
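The centered (two-sided) difference from V4 has $O(\varepsilon^2)$ error versus $O(\varepsilon)$ for the one-sided version. A scalar sketch (real checks flatten all parameters into one vector and compare norms):

```python
def grad_check(f, grad_f, theta, eps=1e-7):
    """Relative difference between the analytic gradient and the
    two-sided numerical approximation (f(t+eps) - f(t-eps)) / (2*eps).

    A result around 1e-7 suggests backprop is correct; around 1e-3
    or larger suggests a bug.
    """
    numeric = (f(theta + eps) - f(theta - eps)) / (2 * eps)
    analytic = grad_f(theta)
    return abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
```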

W2: Optimization algorithms (to speed up training)
Mini-batch Gradient Descent
V1&V2: mini-batch gradient descent
- split huge training set into small batches
- compared to batch gradient descent, mini-batch GD makes progress without processing the entire training set, so each descent step is faster but the cost oscillates more
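Splitting the training set can be sketched as below (X stored features-by-examples, as in the course convention; names are illustrative):

```python
import numpy as np

def make_minibatches(X, Y, batch_size=64, rng=None):
    """Shuffle examples (columns) and split into mini-batches.

    X is (features, m), Y is (1, m); the last batch may be smaller
    than batch_size when batch_size does not divide m.
    """
    rng = rng or np.random.default_rng(0)
    m = X.shape[1]
    perm = rng.permutation(m)           # shuffle so batches are i.i.d.
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```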

Gradient Descent with Momentum
V3&V4: exponentially weighted moving averages

V5: bias correction in early phase of weighted average

V6: gradient descent with momentum
- uses an exponentially weighted average of past gradients to reduce oscillation, especially in mini-batch gradient descent (batch GD oscillates much less)
- can be interpreted physically: the gradient acts like acceleration and the beta term like friction on a ball rolling toward the minimum
- offers little benefit with small learning rates or on simple datasets
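A single-parameter sketch of the momentum update (default hyperparameters are the common choices, not prescriptions):

```python
def momentum_step(w, dw, v, beta=0.9, lr=0.01):
    """One gradient-descent-with-momentum update.

    v is the exponentially weighted average of past gradients:
    v = beta * v + (1 - beta) * dw, then w is moved along v.
    """
    v = beta * v + (1 - beta) * dw
    return w - lr * v, v
```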

RMSprop
V7: RMSprop
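RMSprop keeps an exponentially weighted average of *squared* gradients and divides the step by its root, damping directions that oscillate. A scalar sketch:

```python
def rmsprop_step(w, dw, s, beta2=0.999, lr=0.01, eps=1e-8):
    """One RMSprop update; eps prevents division by zero when s is tiny."""
    s = beta2 * s + (1 - beta2) * dw ** 2
    return w - lr * dw / (s ** 0.5 + eps), s
```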

Adam Optimization
V8: Adam optimization = momentum + RMSprop
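Combining the two previous updates with bias correction gives Adam; a scalar sketch (t is the iteration count, starting at 1):

```python
def adam_step(w, dw, v, s, t, beta1=0.9, beta2=0.999, lr=0.001, eps=1e-8):
    """One Adam update = momentum (v) + RMSprop (s), with bias
    correction so early estimates of v and s are not biased toward 0."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return w - lr * v_hat / (s_hat ** 0.5 + eps), v, s
```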

Learning Rate Decay
V9: learning rate decay in training process
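One common schedule from the lecture, $\alpha = \alpha_0 / (1 + \text{decay\_rate} \cdot \text{epoch})$:

```python
def decayed_lr(alpha0, decay_rate, epoch):
    """Hyperbolic learning rate decay over epochs.

    Other options covered include exponential decay (alpha0 * k**epoch)
    and staircase schedules.
    """
    return alpha0 / (1 + decay_rate * epoch)
```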

Understanding Local Optima
V10: problems of local optima

W3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
Hyperparameter Tuning
V1: tuning process
- use random search for higher efficiency, not grid search
- from coarse to fine

V2: appropriate scale for searching
- choose the scale to search efficiently
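For a hyperparameter like the learning rate, "appropriate scale" means sampling uniformly in the exponent, not the value; a sketch:

```python
import math
import random

def sample_log_scale(low, high, rng=None):
    """Sample uniformly on a log scale, e.g. a learning rate in
    [1e-4, 1]: r ~ Uniform(log10(low), log10(high)), return 10**r."""
    rng = rng or random.Random(0)
    r = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** r
```

Linear-scale sampling would waste ~90% of samples between 0.1 and 1; log-scale spends equal effort on each decade.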

V3: tuning practice

Batch Normalization
V1: normalizing activations in NN
- applies the same logic as input normalization, but at each hidden layer, by normalizing Z (or, more rarely, A)
- however, we don't want every layer's activations distributed as $N(0,1)$ like the input $X$, so learnable parameters beta and gamma tune the mean and variance
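The forward pass of batch norm can be sketched as below (gamma and beta would be learned by gradient descent alongside W):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch norm on pre-activations Z (features x batch size):
    normalize each feature over the mini-batch, then rescale and
    shift with the learnable gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
```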

V2: add in batch norm

V3: why works
- speeds up gradient descent, for the same reason as input normalization
- makes deeper layers' weights more robust to changes in earlier layers' weights, so each layer learns more independently
- has a slight regularization effect when combined with mini-batches (the noisy per-batch mean/variance acts like noise injection)

V4: batch norm at test time
- estimate mean and variance with an exponentially weighted average across training mini-batches, then use those fixed estimates at test time (so even a single test example can be processed)

Multi-class Classification
V1: softmax regression
- use softmax as last layer
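A numerically stable softmax sketch (classes along rows, examples along columns, matching the course convention):

```python
import numpy as np

def softmax(z):
    """Softmax over the class axis. Subtracting the per-column max does
    not change the result but avoids overflow in exp for large logits."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```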

V2: training a softmax classifier

Introduction to Programming Frameworks
V1: Deep learning frameworks

V2: TensorFlow
- only the cost function (forward prop) is defined by hand; the backward pass is derived automatically via automatic differentiation

