Sequence Models - Deep Learning Specialization 5
deeplearning.ai by Andrew Ng on Coursera
W1: Recurrent Neural Networks
Building Sequence Models
Notation:
Model Architecture:
- Why doesn't a standard network work well?
- Inputs and outputs can have different lengths in different samples
- It doesn't share features learned across different positions in the text
- A CNN learns from one part of the image and generalizes to other parts: each filter represents one kind of learned feature, and convolution applies it across the whole image
- An RNN is likewise a 'filter' sweeping through the sequence data (see the forward-pass sketch after this list)
- The size of a one-hot encoded input is too large to handle
- A uni-directional RNN gets information from past steps only
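As a concrete picture of the 'shared filter' idea above, here is a minimal sketch (mine, not from the course) of a uni-directional RNN forward pass in NumPy; the weight names $W_{ax}$, $W_{aa}$, $W_{ya}$ follow the course notation, while the function names and sizes are assumptions.

```python
# Minimal sketch of a single-layer RNN forward pass in NumPy.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_forward(x_seq, a0, Wax, Waa, Wya, ba, by):
    """x_seq: list of input vectors x<t>, each of shape (n_x, 1)."""
    a = a0                      # a<0>, shape (n_a, 1)
    y_hats = []
    for x_t in x_seq:
        # The same weights are reused at every time step ("filter" sweeping the sequence).
        a = np.tanh(Waa @ a + Wax @ x_t + ba)   # a<t>
        y_hats.append(softmax(Wya @ a + by))    # y_hat<t>
    return y_hats, a

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 4, 5, 3, 6
params = dict(Wax=rng.normal(size=(n_a, n_x)), Waa=rng.normal(size=(n_a, n_a)),
              Wya=rng.normal(size=(n_y, n_a)), ba=np.zeros((n_a, 1)), by=np.zeros((n_y, 1)))
x_seq = [rng.normal(size=(n_x, 1)) for _ in range(T)]
y_hats, aT = rnn_forward(x_seq, np.zeros((n_a, 1)), **params)
```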
Types of RNN
Language Model and Sequence Generation
Purpose: estimate the probability of a sentence
Training the model:
Sampling Novel Sequences: after training, sample from the model to get a sense of what it has learned (see the sketch below)
Character-level Language Model: can handle unknown words but much slower
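A hedged sketch of the sampling loop, assuming the trained model is wrapped in a step function `rnn_step(x, a)` that returns the softmax distribution over the vocabulary and the new hidden state; the function name and `eos_index` are assumptions.

```python
# Sketch of sampling a novel sequence from a trained RNN language model.
import numpy as np

def sample_sequence(rnn_step, vocab_size, eos_index, max_len=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros((vocab_size, 1))          # x<1> = zero vector
    a = None                               # let rnn_step initialize a<0> internally
    indices = []
    for _ in range(max_len):
        y_hat, a = rnn_step(x, a)          # y_hat: softmax probabilities over the vocab
        idx = rng.choice(vocab_size, p=y_hat.ravel())  # sample instead of taking argmax
        indices.append(idx)
        if idx == eos_index:
            break
        x = np.zeros((vocab_size, 1))      # feed the sampled word back as the next input
        x[idx] = 1
    return indices
```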
Addressing Vanishing Gradients with GRU / LSTM
RNNs also have an exploding gradient problem, but it is easier to handle, e.g. with gradient clipping (a sketch follows)
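A minimal sketch of gradient clipping by value (clipping by global norm is another common variant); `gradients` is a hypothetical dict of NumPy arrays.

```python
# Clip every gradient element into [-max_value, max_value] in place.
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    for g in gradients.values():
        np.clip(g, -max_value, max_value, out=g)
    return gradients
```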
Vanishing Gradient:
- As in a very deep feed-forward network, in an RNN unrolled over many time steps the gradients reaching the earlier steps are too small to meaningfully update those parameters
- In practice this means the outputs at later steps are hard to influence strongly from earlier steps; in other words, basic RNNs tend not to be good at capturing long-range dependencies
- It can be understood as the network having only 'short-term' memory
- Sentence example: was or were?
- "The cat, which … [long parenthesis], was full."
- "The cats, which … [long parenthesis], were full."
Gated Recurrent Unit (GRU): simplified from LSTM
- Basic idea:
- A conventional RNN carries past information forward through repeated linear weighting, so after many time steps the information from early steps has been 'weighted' so many times that almost nothing of it is left
- The GRU uses a gate to decide, element by element, whether to update the memory cell: the old value is replaced by new information only when the update gate $\Gamma_u$ says the new information is significant enough
- Compared to the LSTM, the GRU merges the LSTM's separate update and forget gates into $\Gamma_u$ and $1-\Gamma_u$
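For reference, the full GRU equations in the course's notation (the simplified version drops the relevance gate $\Gamma_r$):

$$
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\,[\Gamma_r * c^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\!\left(W_u\,[c^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_u\right) \\
\Gamma_r &= \sigma\!\left(W_r\,[c^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_r\right) \\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) * c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= c^{\langle t \rangle}
\end{aligned}
$$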
Long Short-Term Memory (LSTM): more general than the GRU
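In the course's notation, the LSTM keeps a separate forget gate $\Gamma_f$ (instead of $1-\Gamma_u$) and adds an output gate $\Gamma_o$:

$$
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\!\left(W_u\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_u\right) \\
\Gamma_f &= \sigma\!\left(W_f\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_f\right) \\
\Gamma_o &= \sigma\!\left(W_o\,[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}] + b_o\right) \\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + \Gamma_f * c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= \Gamma_o * \tanh\!\left(c^{\langle t \rangle}\right)
\end{aligned}
$$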
Bidirectional RNN
Condition of application: the entire input sequence must be available before producing the result (e.g., not suited to real-time speech recognition)
Deep RNN
W2: Natural Language Processing & Word Embeddings
Introduction to Word Embeddings
A one-hot representation treats each word as a thing unto itself, which makes it hard for the algorithm to generalize across words
Word Embedding: a featurized representation in which words with similar properties end up with similar vectors
Transfer Learning with Word Embedding:
- Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings.
- Transfer the embeddings to a new task with a smaller training set (maybe ~100k words).
- Optional, if the dataset in step 2 is large enough: continue to fine-tune the word embeddings with the new data.
Difference between Encoding and Embedding:
- An encoding in face recognition is produced by a network that can take any image as input and compute its features
- An embedding in NLP is defined only for a fixed vocabulary, so it cannot handle unknown words in the input
Analogy Reasoning with Word Vectors:
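A hedged sketch of analogy completion ("man is to woman as king is to ?") by maximizing cosine similarity; `embeddings` is a hypothetical dict mapping word → vector.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Find the word d maximizing sim(e_d, e_b - e_a + e_c), excluding a, b, c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = cosine_similarity(vec, target)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. complete_analogy("man", "woman", "king", embeddings) -> ideally "queen"
```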
Embedding Matrix Notation:
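As I recall the course notation: the embedding matrix $E$ has shape (embedding dimension × vocabulary size), and the embedding of word $j$ is its $j$-th column, selected with the one-hot vector $o_j$ (in practice a lookup is used instead of the matrix product):

$$ e_j = E\, o_j $$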
Learning Word Embeddings: Word2vec & GloVe
Word2Vec (Context and Target Pair):
The goal is to learn the word embedding matrix, not to make accurate predictions
- The main problem is that the softmax over the whole vocabulary is computationally expensive
Negative Sampling: similar to skip-grams but more efficient, turning the softmax into a set of binary classification problems (see the sketch below)
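A hedged sketch of the negative-sampling objective for one positive (context, target) pair plus $k$ sampled negative targets; `E_context` and `Theta_target` are hypothetical parameter matrices with words as columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(context_idx, target_idx, negative_idxs,
                           E_context, Theta_target):
    e_c = E_context[:, context_idx]                        # context embedding e_c
    # Positive pair: push sigma(theta_t . e_c) toward 1.
    loss = -np.log(sigmoid(Theta_target[:, target_idx] @ e_c))
    # k negative pairs: push sigma(theta_j . e_c) toward 0.
    for j in negative_idxs:
        loss -= np.log(sigmoid(-Theta_target[:, j] @ e_c))
    return loss
```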
GloVe Word Vectors:
Applications Using Word Embeddings
Sentiment Classification:
With word embeddings, even a moderately sized labeled training set can build a good model (see the sketch below)
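A hedged sketch of the simple averaging model: average the embeddings of the words in the review, then apply one softmax layer. Names and shapes are assumptions; an RNN over the sequence handles word order better.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict_sentiment(words, embeddings, W, b):
    """Average the word embeddings, then apply a softmax layer.
    embeddings: dict word -> vector; W, b: trained softmax parameters."""
    avg = np.mean([embeddings[w] for w in words if w in embeddings], axis=0)
    return softmax(W @ avg + b)   # probabilities over sentiment classes (e.g. 1-5 stars)
```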
Debiasing Word Embeddings:
W3: Sequence Models & Attention Mechanism
Seq2Seq with Encoder + Decoder Architecture
Difference between Language Model and Seq2seq:
Picking the Most Likely Sentence:
- Why not greedy search, which picks the single highest-probability word at each step?
- The result is biased toward frequent words: 'is going to visit' is more common than 'is visiting' but is a worse translation
- The final sequence is not necessarily the sequence with the highest overall probability
- Why not consider the probability of every whole sequence?
- It is computationally expensive: a 10-word sequence drawn from a 10,000-word vocabulary has $10{,}000^{10}$ combinations
Beam Search: an approximation algorithm; it does not guarantee the highest-probability output. With beam width = 3 as an example (see the sketch after these steps):
- select the 3 words with the highest probability for the 1st position
- feed these 3 partial sequences into the 2nd step and keep the 3 continuations with the highest probability among the $3 \times 10{,}000 = 30{,}000$ candidates
- repeat the previous step until the end of the sequence
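A hedged sketch of the procedure above, assuming a hypothetical `step_log_probs(prefix)` that returns $\log P(\text{next word} \mid x, \text{prefix})$ over the vocabulary; length normalization (covered next) is omitted.

```python
import numpy as np

def beam_search(step_log_probs, vocab_size, eos, beam_width=3, max_len=20):
    beams = [([], 0.0)]                      # (prefix, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)   # shape (vocab_size,)
            # Expand each beam with every word; keep only the best candidates overall.
            for w in range(vocab_size):
                candidates.append((prefix + [w], score + log_p[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])
```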
Improving Beam Search: length normalization
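The length-normalized objective, with $\alpha$ (typically around 0.7) softening the normalization:

$$ \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\!\left(y^{\langle t \rangle} \mid x,\ y^{\langle 1 \rangle}, \ldots, y^{\langle t-1 \rangle}\right) $$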
Error Analysis on Beam Search:
- Compare the model's probability of the human translation, $P(y^* \mid x)$, with that of the algorithm's output, $P(\hat{y} \mid x)$: if $P(y^* \mid x) > P(\hat{y} \mid x)$, beam search is at fault; otherwise the RNN model is at fault
Bleu Score: bilingual evaluation understudy
Evaluates the 'accuracy' of a model when multiple equally good answers exist, serving as a substitute for having humans evaluate each output
Attention Model
Addresses the problem of long sentences, which would otherwise require the network to 'remember' the entire sentence before generating the output sequence. Instead, at each output time step the model focuses only on the relevant nearby words; how wide 'nearby' is, is learned by gradient descent and adapted to each input sequence.
The attention weights adapt to the input because each $\alpha^{\langle t, t' \rangle}$ is determined by both the decoder (output) LSTM state at step $t-1$ and the encoder (input) LSTM activation at step $t'$
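In the course's notation, the attention weights are a softmax over scores $e^{\langle t, t' \rangle}$ produced by a small network from $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$, and the context fed to the decoder is the weighted sum of encoder activations:

$$ \alpha^{\langle t, t' \rangle} = \frac{\exp\!\left(e^{\langle t, t' \rangle}\right)}{\sum_{t'=1}^{T_x} \exp\!\left(e^{\langle t, t' \rangle}\right)}, \qquad c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle}\, a^{\langle t' \rangle} $$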
Speech Recognition - Audio Data
Approach 1: Attention Model:
Approach 2: CTC Model (Connectionist Temporal Classification): the output has the same length as the long audio input; repeated characters not separated by a 'blank' token are collapsed (e.g. 'ttt_h_eee' → 'the')