Improving Deep Neural Networks - Deep Learning Specialization 2

deeplearning.ai by Andrew Ng on Coursera

Full Course Name: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

W1: Practical aspects of Deep Learning

Setting up your Machine Learning Application

Applied ML is a highly iterative process: you cycle through Idea → Code → Experiment many times to find good hyperparameter settings.

Train / Development / Test sets

  • fit models on the training set, compare models and tune hyperparameters on the development set, and use the test set only for the final unbiased evaluation
  • with very large datasets the dev/test fractions can be far smaller than the traditional 60/20/20 split; the dev and test sets should come from the same distribution

Bias and Variance

  • In the DL era there is less discussion of the bias-variance trade-off; bias and variance are mostly analyzed and addressed separately
    • (shown in V3) The trade-off mattered in the pre-DL era because there were few tools that could reduce bias without increasing variance (or vice versa); with bigger networks and more data, each can be tackled largely on its own

Systematic Way to Improve DL

  • basic recipe: if the model has high bias (it underfits the training set), try a bigger network, training longer, or a different architecture; if it has high variance (large gap between training and dev error), get more data, add regularization, or change the architecture, as sketched below
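A minimal sketch of the recipe as a diagnostic helper; the function name, error values, and thresholds are hypothetical placeholders, not from the course:

```python
def basic_recipe(train_err, dev_err, bayes_err=0.0, tol=0.01):
    """Turn training/dev errors into the suggested next step (toy thresholds)."""
    advice = []
    if train_err - bayes_err > tol:   # high bias: poor fit even on the training set
        advice.append("bigger network / train longer / other architecture")
    if dev_err - train_err > tol:     # high variance: large train-dev gap
        advice.append("more data / regularization / other architecture")
    return advice or ["looks fine"]

print(basic_recipe(train_err=0.15, dev_err=0.16))  # -> high bias advice
print(basic_recipe(train_err=0.01, dev_err=0.11))  # -> high variance advice
```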

Regularizing your neural network

L2 Regularization

V1: L2 regularization

  • add a regularization term to the cost function: $J = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\lVert W^{[l]} \rVert_F^2$, which adds $\frac{\lambda}{m}W^{[l]}$ to each $dW^{[l]}$ in backprop (hence the name "weight decay")

V2: Why it can prevent overfitting

  • it may slightly increase bias but significantly decreases variance: a large $\lambda$ shrinks the weights, which keeps $Z^{[l]}$ in the roughly linear region of tanh and makes the network effectively simpler, as in the sketch below
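A minimal numpy sketch of how the L2 penalty enters the cost and the gradients; the helper name and the list format for weights/grads are assumptions for illustration:

```python
import numpy as np

def l2_cost_and_grads(weights, cross_entropy_cost, grads, lambd, m):
    """Add (lambda / 2m) * sum ||W^[l]||_F^2 to the cost and
    (lambda / m) * W^[l] to each dW^[l]."""
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    cost = cross_entropy_cost + l2_penalty
    reg_grads = [dW + (lambd / m) * W for dW, W in zip(grads, weights)]
    return cost, reg_grads
```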

Dropout Regularization

V3&V4:

  • applied at training time only; at test time no units are dropped, and because inverted dropout divides by keep_prob during training, no rescaling is needed at test time (see the sketch below)
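A minimal sketch of inverted dropout on one layer's activations; the helper name and keep_prob value are illustrative:

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob=0.8, training=True):
    """Zero out units with probability 1 - keep_prob and divide by keep_prob,
    so the expected activation matches test time (where nothing is dropped)."""
    if not training:
        return a                      # no dropout and no rescaling at test time
    mask = (np.random.rand(*a.shape) < keep_prob)
    return a * mask / keep_prob
```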

Other Regularization Methods

  • data augmentation (e.g. flips, crops, small distortions of the training images) and early stopping (stop training when dev-set error starts to rise)

Setting up your optimization problem (to speed up and debug)

Normalizing Inputs $A^{[0]}$

  • subtract the mean and divide by the variance of the training set, and reuse the same $\mu$ and $\sigma^2$ on the dev/test sets; a more symmetric cost surface lets gradient descent use a larger learning rate and converge faster
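A small sketch with toy data in the course's (n_features, m_examples) layout; the array names are illustrative:

```python
import numpy as np

X_train = np.random.randn(3, 1000) * 5.0 + 2.0
X_test = np.random.randn(3, 200) * 5.0 + 2.0

mu = X_train.mean(axis=1, keepdims=True)
sigma2 = X_train.var(axis=1, keepdims=True)

# Reuse the training-set mu and sigma^2 on the test set.
X_train_norm = (X_train - mu) / np.sqrt(sigma2 + 1e-8)
X_test_norm = (X_test - mu) / np.sqrt(sigma2 + 1e-8)
```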

Better Initialization $W^{[l]}$

  • in a very deep network the values $Z^{[l]}$ are products of many weight matrices, so activations and gradients can grow exponentially large or vanish, which makes training difficult; a partial remedy is careful initialization, e.g. He initialization ($\text{Var}(w) = 2/n^{[l-1]}$) for ReLU or Xavier initialization ($1/n^{[l-1]}$) for tanh, as sketched below
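A minimal sketch of He/Xavier initialization; the function name and dictionary layout are assumptions for illustration:

```python
import numpy as np

def init_weights(layer_dims, activation="relu"):
    """He initialization (variance 2/n_prev) for ReLU, Xavier (1/n_prev) for tanh."""
    scale = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        params[f"W{l}"] = np.random.randn(n_curr, n_prev) * np.sqrt(scale / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = init_weights([5, 4, 3, 1])   # e.g. a 3-layer network
```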

Gradient Checking for Debugging

to verify that backpropagation is implemented correctly

V4: numerical approximation of gradients: the two-sided difference $\frac{f(\theta+\varepsilon) - f(\theta-\varepsilon)}{2\varepsilon}$ approximates $f'(\theta)$ with error $O(\varepsilon^2)$, much more accurate than the one-sided version

V5: gradient checking: reshape all parameters into a single vector $\theta$, compute $d\theta_{\text{approx}}$ numerically, and check $\frac{\lVert d\theta_{\text{approx}} - d\theta \rVert_2}{\lVert d\theta_{\text{approx}} \rVert_2 + \lVert d\theta \rVert_2}$ (around $10^{-7}$ is great; around $10^{-3}$ usually means a bug)

V6: gradient checking implementation notes: use it only for debugging (it is far too slow for training), include the regularization term in $J$, turn dropout off while checking, and re-run the check after some training in case a bug only shows up away from the initial weights; a sketch follows
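A minimal sketch of the check on a flattened parameter vector; the helper name and the toy cost function are illustrative:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta against a two-sided numerical
    approximation of dJ/dtheta, one component at a time."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(dtheta_approx - dtheta)
    denom = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / denom     # ~1e-7 is great, ~1e-3 probably a bug

# Toy check on J(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
print(grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta))
```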

W2: Optimization algorithms (to speed up training)

Mini-batch Gradient Descent

V1&V2: mini-batch gradient descent

  • split huge training set into small batches
  • compared to batch gradient descent, mini-batch gradient descent makes progress without processing the entire training set first, so each step is faster, but the cost oscillates rather than decreasing monotonically

  • typical mini-batch sizes are powers of two such as 64, 128, 256, or 512; a size of 1 gives stochastic gradient descent and a size of $m$ gives batch gradient descent (see the sketch below)
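A minimal sketch of shuffling and splitting the training set; the function name is an assumption, and the (n, m) layout follows the course convention:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the (n, m) training set and split it into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```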

Gradient Descent with Momentum

V3&V4: exponentially weighted moving averages: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$, which averages over roughly the last $\frac{1}{1-\beta}$ values

V5: bias correction for the early phase of the weighted average: divide by $1 - \beta^t$, since $v_t$ starts from zero and underestimates the average at first

V6: gradient descent with momentum

  • uses a weighted average of the gradients to reduce oscillation, especially in mini-batch gradient descent (batch GD oscillates much less)
  • it is the exponentially weighted average idea applied to gradients, and can be pictured physically as a ball rolling downhill: the gradient acts as acceleration and $\beta$ as friction
  • it helps little with a very small learning rate or a very simple dataset; a minimal update sketch follows
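A minimal sketch of one momentum step for a single parameter matrix; the function name is illustrative, and v_dW is assumed to start as zeros of the same shape as W:

```python
def momentum_update(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step (beta ~ 0.9 is a common default)."""
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v_dW
    return W, v_dW
```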

RMSprop

V7: RMSprop

  • keeps an exponentially weighted average of the squared gradients, $s_{dW} = \beta_2 s_{dW} + (1-\beta_2)(dW)^2$, and updates $W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \varepsilon}$, which damps updates in directions that oscillate strongly and allows a larger learning rate

Adam Optimization

V8: Adam optimization = momentum + RMSprop, both with bias correction; common defaults are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$, with the learning rate $\alpha$ still tuned (see the sketch below)
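A minimal sketch of one Adam step for a single parameter; the function name is illustrative, and v and s are assumed to start as zeros:

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum term v, RMSprop term s, bias-corrected by the
    step counter t (starting at 1)."""
    v = beta1 * v + (1 - beta1) * dW               # momentum (1st moment)
    s = beta2 * s + (1 - beta2) * np.square(dW)    # RMSprop (2nd moment)
    v_corr = v / (1 - beta1 ** t)                  # bias correction
    s_corr = s / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v, s
```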

Learning Rate Decay

V9: decay the learning rate as training progresses so later steps take smaller, less noisy updates, e.g. $\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \cdot \text{epoch}}$; exponential and staircase schedules are also common (see below)
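Two common schedules as a short sketch; the function names and numbers are illustrative:

```python
def lr_inverse_decay(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

def lr_exponential_decay(alpha0, base, epoch):
    return alpha0 * base ** epoch          # e.g. base = 0.95

print(lr_inverse_decay(0.2, 1.0, epoch=3))       # 0.05
print(lr_exponential_decay(0.2, 0.95, epoch=3))  # ~0.171
```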

Understanding Local Optima

V10: the problem of local optima: in high-dimensional parameter spaces, points with zero gradient are far more likely to be saddle points than bad local optima; the practical issue is plateaus, which slow learning (momentum, RMSprop, and Adam help here)

W3: Hyperparameter tuning, Batch Normalization and Programming Frameworks

Hyperparameter Tuning

V1: tuning process

  • sample hyperparameter combinations at random rather than on a grid: random search tries more distinct values of the important hyperparameters
  • from coarse to fine

  • rough priority: the learning rate $\alpha$ matters most; then the momentum term $\beta$, the number of hidden units, and the mini-batch size; then the number of layers and the learning-rate decay

V2: appropriate scale for searching

  • choose the scale to search efficiently

  • e.g. sample the learning rate on a log scale ($\alpha = 10^r$ with $r$ uniform in $[-4, 0]$ covers $10^{-4}$ to $1$ evenly), and sample $\beta$ through $1 - \beta$ on a log scale so values near 0.999 get enough resolution (see the sketch below)
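A short sketch of log-scale sampling; the specific ranges are the ones quoted above and are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning rate: uniform on a log scale between 1e-4 and 1.
r = rng.uniform(-4, 0)
alpha = 10 ** r

# Momentum beta: sample 1 - beta on a log scale between 0.001 and 0.1,
# so beta lies between 0.9 and 0.999 with resolution where it matters.
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r
```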

V3: tuning in practice: with limited compute, babysit a single model and adjust it day by day (the "panda" approach); with enough compute, train many models in parallel and pick the best (the "caviar" approach)

Batch Normalization

V1: normalizing activations in NN

  • it applies the same idea as input normalization to every hidden layer, normalizing $Z^{[l]}$ (normalizing $A^{[l]}$ is less common)
  • however, we do not always want every layer's units to follow $N(0, 1)$ the way the input $X$ does, so learnable parameters $\gamma^{[l]}$ and $\beta^{[l]}$ rescale and shift the normalized values: $\tilde{z} = \gamma z_{\text{norm}} + \beta$

V2: adding batch norm to a network

  • insert the normalization between computing $Z^{[l]}$ and applying the activation function; because the mean is subtracted, the bias $b^{[l]}$ becomes redundant and can be dropped in favor of $\beta^{[l]}$

V3: why it works

  • it speeds up gradient descent for the same reason input normalization does
  • it makes deeper layers more robust to shifts in the distribution of earlier layers' outputs (it reduces internal covariate shift), so each layer can learn somewhat more independently
  • it has a slight regularization effect when combined with mini-batches, because each mini-batch's mean and variance add a little noise to the activations

V4: batch norm at test time

  • estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across the mini-batches seen during training, and use these running estimates at test time, so even a single test example can be processed without forming a mini-batch (see the sketch below)

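A minimal sketch of the batch-norm forward pass covering both training and test time; the function name, shapes, and momentum value are assumptions for illustration:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running_mu, running_var,
                      momentum=0.9, eps=1e-8, training=True):
    """Normalize Z (shape: n_units x batch_size) per unit, then scale/shift with
    gamma and beta. The running mean/variance are exponentially weighted
    averages used instead of batch statistics at test time."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        running_mu = momentum * running_mu + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var       # fixed estimates at test time
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, running_mu, running_var
```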

Multi-class Classification

V1: softmax regression

  • use softmax as the last layer: $a^{[L]}_i = \frac{e^{z^{[L]}_i}}{\sum_{j=1}^{C} e^{z^{[L]}_j}}$, which turns the logits into a probability distribution over the $C$ classes

V2: training a softmax classifier

  • the loss is the cross-entropy $\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$, and with a softmax output layer backprop starts from $dZ^{[L]} = \hat{y} - y$ (see the sketch below)
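A short numerically stable sketch of softmax and the cross-entropy loss; function names and the (C, m) layout are assumptions for illustration:

```python
import numpy as np

def softmax(Z):
    """Stable softmax over classes (rows), for Z of shape (C, m)."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=0, keepdims=True)

def cross_entropy(Y_hat, Y):
    """Average cross-entropy for one-hot labels Y and predictions Y_hat, shape (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat + 1e-12)) / m

# With a softmax output layer, backprop starts from dZ = Y_hat - Y.
```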

Introduction to Programming Frameworks

V1: deep learning frameworks: choose based on ease of programming (development and deployment), running speed, and whether the framework is truly open (open source with good governance)

V2: TensorFlow

  • you only define the cost function (the forward computation) by hand; the framework derives the backward pass automatically through automatic differentiation (see the sketch below)

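The course videos use the older TF1 session/placeholder style; here is a minimal sketch of the same idea (minimize the course's toy cost $w^2 - 10w + 25$, letting TensorFlow derive the gradients) written against the TensorFlow 2 API:

```python
import tensorflow as tf

w = tf.Variable(0.0, dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def cost_fn():
    return w ** 2 - 10 * w + 25      # forward computation only; minimum at w = 5

for _ in range(500):
    with tf.GradientTape() as tape:
        cost = cost_fn()
    grads = tape.gradient(cost, [w])          # backward pass derived automatically
    optimizer.apply_gradients(zip(grads, [w]))

print(w.numpy())   # ~5.0
```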
