Sequence Models - Deep Learning Specialization 5
deeplearning.ai by Andrew Ng on Coursera
W1: Recurrent Neural Networks
Building Sequence Models
Notation:
Model Architecture:
- Why doesn't a standard network work well?
- Inputs and outputs can have different lengths in different samples
- It doesn't share features learned across different positions in the text
- A CNN learns from one part of the image and generalizes to other parts: each filter represents one kind of learned feature, and convolution applies it across the whole image
- An RNN is likewise a 'filter' sweeping through the sequence data (see the forward-pass sketch after this list)
- The size of a one-hot encoded input is too large to handle
- A uni-directional RNN gets information from past steps only
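As a concrete picture of the 'shared filter' idea above, here is a minimal sketch (mine, not from the course) of a uni-directional RNN forward pass in NumPy; the weight names $W_{ax}$, $W_{aa}$, $W_{ya}$ follow the course notation, while the function names and sizes are assumptions.

```python
# Minimal sketch of a single-layer RNN forward pass in NumPy.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_forward(x_seq, a0, Wax, Waa, Wya, ba, by):
    """x_seq: list of input vectors x<t>, each of shape (n_x, 1)."""
    a = a0                      # a<0>, shape (n_a, 1)
    y_hats = []
    for x_t in x_seq:
        # The same weights are reused at every time step ("filter" sweeping the sequence).
        a = np.tanh(Waa @ a + Wax @ x_t + ba)   # a<t>
        y_hats.append(softmax(Wya @ a + by))    # y_hat<t>
    return y_hats, a

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 4, 5, 3, 6
params = dict(Wax=rng.normal(size=(n_a, n_x)), Waa=rng.normal(size=(n_a, n_a)),
              Wya=rng.normal(size=(n_y, n_a)), ba=np.zeros((n_a, 1)), by=np.zeros((n_y, 1)))
x_seq = [rng.normal(size=(n_x, 1)) for _ in range(T)]
y_hats, aT = rnn_forward(x_seq, np.zeros((n_a, 1)), **params)
```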
Types of RNN
Language Model and Sequence Generation
Purpose: estimate the probability of a sentence
Training the model:
Sampling Novel Sequences: after training, sample from the model to get a sense of what it has learned (see the sketch below)
Character-level Language Model: can handle unknown words but much slower
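A hedged sketch of the sampling loop, assuming the trained model is wrapped in a step function `rnn_step(x, a)` that returns the softmax distribution over the vocabulary and the new hidden state; the function name and `eos_index` are assumptions.

```python
# Sketch of sampling a novel sequence from a trained RNN language model.
import numpy as np

def sample_sequence(rnn_step, vocab_size, eos_index, max_len=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros((vocab_size, 1))          # x<1> = zero vector
    a = None                               # let rnn_step initialize a<0> internally
    indices = []
    for _ in range(max_len):
        y_hat, a = rnn_step(x, a)          # y_hat: softmax probabilities over the vocab
        idx = rng.choice(vocab_size, p=y_hat.ravel())  # sample instead of taking argmax
        indices.append(idx)
        if idx == eos_index:
            break
        x = np.zeros((vocab_size, 1))      # feed the sampled word back as the next input
        x[idx] = 1
    return indices
```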
Addressing Vanishing Gradients with GRU / LSTM
RNNs also have an exploding gradient problem, but it is easier to handle, e.g. with gradient clipping (a sketch follows)
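A minimal sketch of gradient clipping by value (clipping by global norm is another common variant); `gradients` is a hypothetical dict of NumPy arrays.

```python
# Clip every gradient element into [-max_value, max_value] in place.
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    for g in gradients.values():
        np.clip(g, -max_value, max_value, out=g)
    return gradients
```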
Vanishing Gradient:
- As in a very deep feed-forward network, in an RNN unrolled over many time steps the gradients reaching the earlier steps are too small to meaningfully update those parameters
- In practice this means the outputs at later steps are hard to influence strongly from earlier steps; in other words, basic RNNs tend not to be good at capturing long-range dependencies
- It can be understood as the network having only 'short-term' memory
- Sentence example: was or were?
- "The cat, which … [long parenthesis], was full."
- "The cats, which … [long parenthesis], were full."
Gated Recurrent Unit (GRU): simplified from LSTM
- Basic idea:
- A conventional RNN carries past information forward through repeated linear weighting, so after many time steps the information from early steps has been 'weighted' so many times that almost nothing of it is left
- The GRU uses a gate to decide, element by element, whether to update the memory cell: the old value is replaced by new information only when the update gate $\Gamma_u$ says the new information is significant enough
- Compared to the LSTM, the GRU merges the LSTM's separate update and forget gates into $\Gamma_u$ and $1-\Gamma_u$
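For reference, the full GRU equations in the course's notation (the simplified version drops the relevance gate $\Gamma_r$):

$$
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\,[\Gamma_r * c^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\!\left(W_u\,[c^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_u\right) \\
\Gamma_r &= \sigma\!\left(W_r\,[c^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_r\right) \\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) * c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= c^{\langle t \rangle}
\end{aligned}
$$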
Long Short-Term Memory (LSTM): more general than the GRU
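In the course's notation, the LSTM keeps a separate forget gate $\Gamma_f$ (instead of $1-\Gamma_u$) and adds an output gate $\Gamma_o$:

$$
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\!\left(W_u\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_u\right) \\
\Gamma_f &= \sigma\!\left(W_f\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_f\right) \\
\Gamma_o &= \sigma\!\left(W_o\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_o\right) \\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + \Gamma_f * c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= \Gamma_o * \tanh\!\left(c^{\langle t \rangle}\right)
\end{aligned}
$$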
Bidirectional RNN
Condition of application: the entire input sequence must be available before producing the result (e.g., not suited to real-time speech recognition)
Deep RNN
W2: Natural Language Processing & Word Embeddings
Introduction to Word Embeddings
A one-hot representation treats each word as a thing unto itself, which makes it hard for the algorithm to generalize across words
Word Embedding: a featurized representation in which words with similar properties end up with similar vectors
Transfer Learning with Word Embedding:
- Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings.
- Transfer the embeddings to a new task with a smaller training set (maybe ~100k words).
- Optional, if the dataset in step 2 is large enough: continue to fine-tune the word embeddings with the new data.
Difference between Encoding and Embedding:
- An encoding in face recognition is produced by a network that can take any image as input and compute its features
- An embedding in NLP is defined only for a fixed vocabulary, so it cannot handle unknown words in the input
Analogy Reasoning with Word Vectors:
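A hedged sketch of analogy completion ("man is to woman as king is to ?") by maximizing cosine similarity; `embeddings` is a hypothetical dict mapping word → vector.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Find the word d maximizing sim(e_d, e_b - e_a + e_c), excluding a, b, c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = cosine_similarity(vec, target)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. complete_analogy("man", "woman", "king", embeddings) -> ideally "queen"
```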
Embedding Matrix Notation:
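As I recall the course notation: the embedding matrix $E$ has shape (embedding dimension × vocabulary size), and the embedding of word $j$ is its $j$-th column, selected with the one-hot vector $o_j$ (in practice a lookup is used instead of the matrix product):

$$ e_j = E\, o_j $$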
Learning Word Embeddings: Word2vec & GloVe
Word2Vec (Context and Target Pair):
The goal is to learn the word embedding matrix, not to make accurate predictions
- The main problem is that the softmax over the whole vocabulary is computationally expensive
Negative Sampling: similar to skip-grams but more efficient, turning the softmax into a set of binary classification problems (see the sketch below)
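A hedged sketch of the negative-sampling objective for one positive (context, target) pair plus $k$ sampled negative targets; `E_context` and `Theta_target` are hypothetical parameter matrices with words as columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(context_idx, target_idx, negative_idxs,
                           E_context, Theta_target):
    e_c = E_context[:, context_idx]                        # context embedding e_c
    # Positive pair: push sigma(theta_t . e_c) toward 1.
    loss = -np.log(sigmoid(Theta_target[:, target_idx] @ e_c))
    # k negative pairs: push sigma(theta_j . e_c) toward 0.
    for j in negative_idxs:
        loss -= np.log(sigmoid(-Theta_target[:, j] @ e_c))
    return loss
```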
GloVe Word Vectors:
Applications Using Word Embeddings
Sentiment Classification:
With word embeddings, even a moderately sized labeled training set can build a good model (see the sketch below)
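A hedged sketch of the simple averaging model: average the embeddings of the words in the review, then apply one softmax layer. Names and shapes are assumptions; an RNN over the sequence handles word order better.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict_sentiment(words, embeddings, W, b):
    """Average the word embeddings, then apply a softmax layer.
    embeddings: dict word -> vector; W, b: trained softmax parameters."""
    avg = np.mean([embeddings[w] for w in words if w in embeddings], axis=0)
    return softmax(W @ avg + b)   # probabilities over sentiment classes (e.g. 1-5 stars)
```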
Debiasing Word Embeddings:
W3: Sequence Models & Attention Mechanism
Seq2Seq with Encoder + Decoder Architecture
Difference between Language Model and Seq2seq:
Picking the Most Likely Sentence:
- Why not greedy search, which picks the single highest-probability word at each step?
- The result is biased toward frequent words: 'is going to visit' is more common than 'is visiting' but is a worse translation
- The final sequence is not necessarily the sequence with the highest overall probability
- Why not consider the probability of every whole sequence?
- It is computationally expensive: a 10-word sequence drawn from a 10,000-word vocabulary has $10{,}000^{10}$ combinations
Beam Search: an approximation algorithm; it does not guarantee the highest-probability output. With beam width = 3 as an example (see the sketch after these steps):
- select the 3 words with the highest probability for the 1st position
- feed these 3 partial sequences into the 2nd step and keep the 3 continuations with the highest probability among the $3 \times 10{,}000 = 30{,}000$ candidates
- repeat the previous step until the end of the sequence
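A hedged sketch of the procedure above, assuming a hypothetical `step_log_probs(prefix)` that returns $\log P(\text{next word} \mid x, \text{prefix})$ over the vocabulary; length normalization (covered next) is omitted.

```python
import numpy as np

def beam_search(step_log_probs, vocab_size, eos, beam_width=3, max_len=20):
    beams = [([], 0.0)]                      # (prefix, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)   # shape (vocab_size,)
            # Expand each beam with every word; keep only the best candidates overall.
            for w in range(vocab_size):
                candidates.append((prefix + [w], score + log_p[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])
```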
Improving Beam Search: length normalization
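The length-normalized objective, with $\alpha$ (typically around 0.7) softening the normalization:

$$ \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\!\left(y^{\langle t \rangle} \mid x,\ y^{\langle 1 \rangle}, \ldots, y^{\langle t-1 \rangle}\right) $$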
Error Analysis on Beam Search:
- Compare the model's probability of the human translation, $P(y^* \mid x)$, with that of the algorithm's output, $P(\hat{y} \mid x)$: if $P(y^* \mid x) > P(\hat{y} \mid x)$, beam search is at fault; otherwise the RNN model is at fault
Bleu Score: bilingual evaluation understudy
Evaluates the 'accuracy' of a model when multiple equally good answers exist, serving as a substitute for having humans evaluate each output
Attention Model
Addresses the problem of long sentences, which would otherwise require the network to 'remember' the entire sentence before generating the output sequence. Instead, at each output time step the model focuses only on the relevant nearby words; how wide 'nearby' is, is learned by gradient descent and adapted to each input sequence.
The attention weights adapt to the input because each $\alpha^{\langle t, t' \rangle}$ is determined by both the decoder (output) LSTM state at step $t-1$ and the encoder (input) LSTM activation at step $t'$
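In the course's notation, the attention weights are a softmax over scores $e^{\langle t, t' \rangle}$ produced by a small network from $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$, and the context fed to the decoder is the weighted sum of encoder activations:

$$ \alpha^{\langle t, t' \rangle} = \frac{\exp\!\left(e^{\langle t, t' \rangle}\right)}{\sum_{t'=1}^{T_x} \exp\!\left(e^{\langle t, t' \rangle}\right)}, \qquad c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle}\, a^{\langle t' \rangle} $$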
Speech Recognition - Audio Data
Approach 1: Attention Model:
Approach 2: CTC Model (Connectionist Temporal Classification): the output has the same length as the long audio input; repeated characters not separated by a 'blank' token are collapsed (e.g. 'ttt_h_eee' → 'the')