Improving Deep Neural Networks - Deep Learning Specialization 2
deeplearning.ai by Andrew Ng on Coursera
Full Course Name: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
W1: Practical aspects of Deep Learning
Setting up your Machine Learning Application
Applied ML is a highly iterative process that requires many idea -> code -> experiment cycles
Train / Development / Test sets
Bias and Variance
- In the DL era there is less discussion of the bias-variance trade-off; bias and variance are instead addressed separately
- (shown in V3) The trade-off mattered mainly in the pre-DL era, when there was no tool that could reduce bias without increasing variance (or vice versa); with bigger networks and more data, each can now be reduced fairly independently
Systematic Way to Improve DL
Regularizing your neural network
L2 Regularization
V1: L2 regularization
- add a regularization part in cost function
V2: why L2 regularization prevents overfitting
- it slightly increases bias but significantly decreases variance, by pushing weights toward zero so the network behaves more simply
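A minimal sketch of the L2-regularized cost (function and variable names are mine, not from the course; `weights` is assumed to be a list of the layer matrices $W^{[l]}$):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the L2 penalty (lambda / 2m) * sum_l ||W^[l]||_F^2 to the base cost.

    lambd is the regularization strength, m the number of training examples.
    """
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_term
```

Note that only the weight matrices are penalized; the bias vectors are conventionally left out.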
Dropout Regularization
V3&V4:
- applied during training only; do not use dropout at test time
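A sketch of inverted dropout for one layer's activations (names are my own). Dividing by `keep_prob` keeps the expected value of the activations unchanged, which is what lets you skip dropout entirely at test time:

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng=np.random.default_rng(0)):
    """Inverted dropout on activations A, used during training only.

    Each unit survives with probability keep_prob; the survivors are
    scaled up by 1/keep_prob so E[A] stays the same. At test time,
    simply skip this step.
    """
    D = rng.random(A.shape) < keep_prob  # boolean dropout mask
    return (A * D) / keep_prob
```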
Others
Setting up your optimization problem (to speed up and debug)
Normalizing Inputs $A^{[0]}$
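A minimal sketch of input normalization (names assumed; `X` has shape features x m, matching the course's column-per-example convention). The key practical point: compute mu and sigma on the training set and reuse them for dev/test:

```python
import numpy as np

def normalize_inputs(X):
    """Zero-center and unit-scale each feature of X (shape: n_x x m).

    Returns mu and sigma as well, so the *same* training-set statistics
    can be applied to the dev and test sets.
    """
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / (sigma + 1e-8), mu, sigma
```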
Better Initialization $W^{[L]}$
- in a very deep NN, the product of many weight matrices can make $Z^{[L]}$ explode or vanish, which makes training difficult; scaling the initial weights by the layer's fan-in (e.g. Xavier/He initialization) keeps activations on a stable scale
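A sketch of He initialization (variance $2/n^{[l-1]}$, suited to ReLU layers); the dict keys and function name are my own:

```python
import numpy as np

def initialize_he(layer_dims, rng=np.random.default_rng(1)):
    """He initialization: W^[l] ~ N(0, 2 / n^[l-1]), b^[l] = 0.

    Scaling by the fan-in keeps Z = W A + b on a stable scale layer
    after layer, avoiding exploding/vanishing activations.
    """
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])
        ) * np.sqrt(2.0 / layer_dims[l - 1])
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```

For tanh layers, Xavier initialization swaps the factor 2 for 1.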
Debugging with Gradient Checking
to verify that backpropagation is implemented correctly
V4: numerical approx of gradients
V5: gradient checking
V6: gradient checking implementation notes
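The two pieces above can be sketched as follows (names are mine): a two-sided numerical gradient, and the relative-error check used to compare it against the backprop gradient:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-7):
    """Two-sided difference approximation of dJ/dtheta (error is O(eps^2),
    vs O(eps) for the one-sided version)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        grad.flat[i] = (J(plus) - J(minus)) / (2 * eps)
    return grad

def relative_error(grad_backprop, grad_numeric):
    """||a - b|| / (||a|| + ||b||); roughly < 1e-7 suggests backprop is
    correct, > 1e-3 suggests a bug."""
    num = np.linalg.norm(grad_backprop - grad_numeric)
    return num / (np.linalg.norm(grad_backprop) + np.linalg.norm(grad_numeric))
```

Remember to run this only for debugging (it is far too slow for training), and to include the regularization term in J if the cost has one.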
W2: Optimization algorithms (to speed up training)
Mini-batch Gradient Descent
V1&V2: mini-batch gradient descent
- split huge training set into small batches
- compared to batch gradient descent, mini-batch gradient descent makes progress without first processing the entire training set, so each step is faster but the cost oscillates as it decreases
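A sketch of the batch-splitting step (names are mine; X and Y use the course's column-per-example layout). Shuffling before splitting ensures each mini-batch is a random sample:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, rng=np.random.default_rng(2)):
    """Shuffle the m training columns of X, Y and cut them into
    mini-batches of `batch_size` (the last batch may be smaller)."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```

Typical sizes are powers of two (64, 128, 256, ...); batch_size = m recovers batch GD, batch_size = 1 is stochastic GD.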
Gradient Descent with Momentum
V3&V4: exponentially weighted moving averages
V5: bias correction in early phase of weighted average
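The exponentially weighted average and its bias correction can be sketched as (function name mine): $v_t = \beta v_{t-1} + (1-\beta) x_t$, with the early values divided by $1 - \beta^t$ to undo the bias toward the $v_0 = 0$ start:

```python
def ewma(values, beta=0.9):
    """Exponentially weighted moving average with bias correction.

    Without the 1 - beta^t correction, early averages are dragged
    toward zero because v starts at 0.
    """
    v, out = 0.0, []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))
    return out
```

On a constant series the corrected average recovers the constant exactly from the very first step, which is the point of the correction.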
V6: gradient descent with momentum
- uses an exponentially weighted average of gradients to damp oscillation, especially in mini-batch gradient descent (batch GD oscillates far less)
- it can be read physically: the gradient acts as acceleration on a ball rolling downhill, while $\beta$ acts as friction on its velocity
- of negligible help with a small learning rate or a simple dataset
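A one-parameter sketch of the momentum update (names mine; in practice `v` is kept per parameter and bias correction is often skipped for momentum alone):

```python
def momentum_update(W, dW, v, beta=0.9, lr=0.01):
    """One gradient-descent-with-momentum step.

    v accumulates an exponentially weighted average of the gradients,
    damping the oscillating components across mini-batches; W then
    moves along the smoothed direction.
    """
    v = beta * v + (1 - beta) * dW
    return W - lr * v, v
```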
RMSprop
V7: RMSprop
Adam Optimization
V8: Adam optimization = momentum + RMSprop
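The combination can be sketched as one update (names mine): momentum's average of gradients ($v$) and RMSprop's average of squared gradients ($s$), both bias-corrected with the step counter $t$:

```python
def adam_update(W, dW, v, s, t, beta1=0.9, beta2=0.999, lr=0.001, eps=1e-8):
    """One Adam step = momentum (v) + RMSprop (s), bias-corrected by t."""
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return W - lr * v_hat / (s_hat ** 0.5 + eps), v, s
```

The defaults beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are the ones recommended in the course; usually only the learning rate needs tuning.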
Learning Rate Decay
V9: learning rate decay in training process
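One common decay schedule from the lecture, $\alpha = \alpha_0 / (1 + \text{decay\_rate} \cdot \text{epoch})$, as a one-liner (name mine; exponential and staircase decay are alternatives):

```python
def decayed_lr(alpha0, decay_rate, epoch):
    """1/t decay: shrink the learning rate as training progresses so the
    later mini-batch steps wander in a smaller region around the minimum."""
    return alpha0 / (1 + decay_rate * epoch)
```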
Understanding Local Optima
V10: problems of local optima
W3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
Hyperparameter Tuning
V1: tuning process
- use random search for higher efficiency, not grid search
- from coarse to fine
V2: appropriate scale for searching
- choose the scale to search efficiently
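A sketch of scale-appropriate random sampling (names mine), using the learning rate as the standard example: sampling the *exponent* uniformly spreads trials evenly across decades, whereas sampling the value uniformly would spend ~90% of trials in the top decade:

```python
import numpy as np

def sample_log_scale(low, high, n, rng=np.random.default_rng(3)):
    """Draw n values uniformly on a log scale between low and high,
    e.g. learning rates in [1e-4, 1e-1]."""
    r = rng.uniform(np.log10(low), np.log10(high), size=n)
    return 10.0 ** r
```

The same trick applies to 1 - beta for the momentum/EWMA parameter (sample 1 - beta on a log scale, since beta = 0.9 vs 0.999 differ enormously in effect).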
V3: tuning practice
Batch Normalization
V1: normalizing activations in NN
- it applies the same logic as input normalization, but at every layer, normalizing $Z$ (or, more rarely, $A$)
- however, we don't want every layer's activations pinned to $N(0,1)$ like the input $X$, so learnable parameters $\gamma$ and $\beta$ tune the mean and variance
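The training-time forward step can be sketched as (names mine; `Z` has shape units x m, one column per example in the mini-batch):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch norm on pre-activations Z: normalize each unit over the
    mini-batch, then rescale with learned gamma, beta so the layer can
    choose any mean/variance rather than being stuck at N(0, 1)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
```

With gamma = sqrt(var + eps) and beta = mu, the layer can even undo the normalization entirely, so batch norm never restricts what the layer can represent.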
V2: add in batch norm
V3: why works
- speeds up gradient descent, for the same reason input normalization does
- makes each layer's weights more robust to shifts in the earlier layers' activations, so each layer learns somewhat independently
- has a slight regularization effect when combined with mini-batches, because the per-batch mean and variance add noise
V4: batch norm at test time
- estimate the mean and variance with an exponentially weighted average across training mini-batches, then use those fixed estimates at test time (so even a single example can be processed)
Multi-class Classification
V1: softmax regression
- use softmax as last layer
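A sketch of the softmax layer (name mine; classes along rows, examples along columns); subtracting the per-column max before exponentiating is the standard numerical-stability trick:

```python
import numpy as np

def softmax(Z):
    """Softmax over classes (rows) for each example (column): each
    column becomes a probability distribution summing to 1."""
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```

Unlike sigmoid (which handles one class at a time), softmax normalizes across all C classes jointly, which is why it is used as the last layer for multi-class classification.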
V2: training a softmax classifier
Introduction to Programming Frameworks
V1: Deep learning frameworks
V2: TensorFlow
- you only define the cost function (forward prop) by hand; the framework derives the backward pass automatically