Machine Learning - Andrew Ng @ Coursera
Week 1: Introduction
Application of ML
- Database mining
- large datasets from the growth of automation/web
- Applications that can't be programmed by hand
- handwriting recognition, NLP, computer vision
- Self-customizing programs
- recommendations
- Understanding human learning (brain, real AI)
Definition of ML
Tom Mitchell: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
Supervised Learning: “right answers” are given
- Regression
- Classification
Unsupervised Learning: no “right answers” given; find structure in the data
- Social network analysis / news grouping / ..
- Cocktail party problem: separate the overlapping sounds
- easily implemented in Octave:
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
- Octave is commonly used for prototyping machine learning algorithms before moving to an industrial implementation
Model and Cost Function
Model Representation
Training a Model
Cost Function: the target we are going to minimize
- Squared error function for regression (also used in DL): $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Parameter Learning
Gradient Descent: a general algorithm used all over the place in ML
- the gradient $\frac{\partial}{\partial\theta_j}J(\theta)$ represents how much $J(\theta)$ changes when $\theta_j$ changes by one unit
- since we want to decrease $J(\theta)$, we update each $\theta_j$ along the negative gradient, with step size controlled by the learning rate $\alpha$: $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
- update all parameters simultaneously, so that the cost function stays the same for all parameters within a single round of updates
Gradient Descent of Linear Regression
- the squared error cost of linear regression is always convex, so there are no local optima
- Batch Gradient Descent: every update uses all training examples (in contrast to mini-batch / stochastic variants); a sketch follows below
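A minimal Octave sketch of batch gradient descent for linear regression, assuming X is the m×(n+1) design matrix with a leading column of ones, y is m×1, and alpha / num_iters are hyperparameters chosen by the user:
% Vectorized batch gradient descent for linear regression (save as gradientDescent.m)
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    theta = theta - (alpha / m) * (X' * (X * theta - y));  % simultaneous update of all theta_j
  end
end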
Linear Algebra Review
Matrix: a rectangular ($m \times n$) array of numbers. Vector: an $n \times 1$ matrix. Scalar: a single-value object, not a matrix or vector.
Properties:
- not commutative
- associative
- Identity matrix $I$: $A \cdot I = I \cdot A = A$
Inverse: $A \cdot A^{-1} = A^{-1} \cdot A = I$ (only square matrices can have an inverse)
- A matrix without an inverse is called a singular or degenerate matrix
Transpose: $(A^T)_{ij} = A_{ji}$
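A minimal Octave sketch illustrating the properties above with small example matrices (A and B are arbitrary illustrative values):
A = [1 2; 3 4];  B = [0 1; 1 0];
A * B                        % generally differs from B * A (not commutative)
isequal((A*B)*B, A*(B*B))    % associativity: returns 1
A * eye(2)                   % identity: A * I = A
inv(A)                       % inverse (pinv(A) also works and tolerates singular matrices)
A'                           % transpose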
Week 2: Linear Regression with Multiple Variables
Multivariate Linear Regression
Expression: $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$
- $\theta$ and $x$ are (n+1)-dimensional vectors, where we keep $x_0 = 1$ (this is the expression for a single sample)
Gradient Descent Optimization 1: Features Scaling
- If all features are on a similar scale, gradient descent can converge more quickly
- Mean normalization is commonly used: $x_i := \frac{x_i - \mu_i}{s_i}$
  - where $s_i$ can be the standard deviation or the range (max − min)
- target is to make every feature fall approximately within $-1 \le x_i \le 1$ (or a similarly small range)
Gradient Descent Optimization 2: Learning Rate
- plot $J(\theta)$ versus the number of iterations to check convergence: if $\alpha$ is too small, convergence is slow; if $\alpha$ is too large, $J(\theta)$ may not decrease every iteration or may even diverge
Feature Engineering:
- Combine features, e.g. use area instead of separate length and width
- Transform to polynomial regression
  - create new features $x_2 = x_1^2$ and $x_3 = x_1^3$, where $x_1$ is the original feature
  - normalize $x_1$, $x_2$ and $x_3$ to a similar scale
  - train the model as multivariate linear regression
Computing Parameters Analytically
Normal Equation: an analytical alternative to gradient descent for linear regression
- $\theta = (X^T X)^{-1} X^T y$, using the $m \times (n+1)$ design matrix $X$ (an Octave sketch follows this list)
- pros: no need for feature scaling or a learning rate; no iteration needed (fast when n is not too large, since computing $(X^T X)^{-1}$ is roughly $O(n^3)$)
- if $X^T X$ is non-invertible
  - the pseudo-inverse function (pinv in Octave) can still handle it
  - reasons
    - redundant, linearly dependent features: delete them
    - too many features (like $m \le n$): delete some or use regularization
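A minimal Octave sketch of the normal equation, assuming X is the m×(n+1) design matrix with a leading column of ones and y is the m×1 label vector:
theta = pinv(X' * X) * X' * y;   % pinv also copes with a non-invertible X'X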
Week 3: Logistic Regression
Classification and Representation
Logistic Regression Representation: $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
- exactly a node in DL –> a linear combination plus sigmoid activation (or ReLU in DL)
- outputs the probability of the positive class: $h_\theta(x) = P(y = 1 \mid x; \theta)$
Decision Boundary:
- property of hypothesis, not of dataset
Logistic Regression Model
Cost Function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$
Gradient Descent:
- same as linear regression but with the hypothesis changed
- feature scaling also helps gradient descent converge faster
Advanced algorithms other than gradient descent:
- Conjugate gradient / BFGS / L-BFGS
- pros: no need to manually pick the learning rate / usually faster convergence
- cons: complex
- Call integrated tools to apply advanced algorithms
% Create a function returning both the cost and the gradient
function [J, grad] = costFunction(theta, X, y)
  J = [...code to compute J(theta)...];
  grad = zeros(size(theta));
  for j = 1:length(theta)
    grad(j) = [...code to compute the partial derivative of J(theta) w.r.t. theta(j)...];
  end
end
% Set the optimization options
% 'GradObj','on' tells fminunc that the function also returns the gradient
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
% Call fminunc to find a local minimum of an unconstrained multivariable function
% @(t) wraps costFunction so that theta (t) is the variable being optimized
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initialTheta, options);
% For constrained multivariable function use fmincon
Multiclass Classification
One-vs-All:
- Train one hypothesis per class, each returning the probability of belonging to that class
- Take the max over all hypotheses to output the class with the highest probability
PS: softmax can also be used here, but it requires replacing the per-class sigmoid activations with a single softmax activation
Solving the Problem of Overfitting
Overfitting: high variance
- Caused by an overly complicated hypothesis built from too many features
- Solution: reduce features manually or automatically (may lose useful information) / regularization
Regularization on Cost Function: add a penalty term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ to the cost
- smaller parameter values correspond to a simpler hypothesis
- conventionally the penalty does not include $\theta_0$, i.e. the sum starts at $j = 1$
Gradient Descent with Regularization:
- $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ for $j \ge 1$ ($\theta_0$ is updated without the regularization term); change the form of $h_\theta(x)$ according to linear or logistic regression
- For linear regression, the Normal Equation can also be used: $\theta = (X^T X + \lambda L)^{-1} X^T y$, where $L$ is the identity matrix with its top-left entry set to 0
Week 4: Neural Networks: Representation
Non-linear Model
Why we need non-linear model: non-linear decision boundary is more flexible to fit various cases/datasets
Why we need neural networks: logistic regression with polynomial features can produce non-linear decision boundaries, but it is a costly approach –> we need another way to build non-linear models
- polynomial LR has to add all possible combinations of features at different orders/degrees, and the number of such terms explodes when there are already many features
Neural Networks
(refer to Deep Learning Specialization)
Applications
Building XNOR logic: non-linear decision boundary
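A minimal Octave sketch of the XNOR network, assuming the example weights from the lecture (hidden units computing AND and (NOT x1) AND (NOT x2), combined with OR at the output):
g = @(z) 1 ./ (1 + exp(-z));                              % sigmoid
xnor = @(x1, x2) g(-10 + 20*g(-30 + 20*x1 + 20*x2) ...    % a1 = x1 AND x2
                       + 20*g( 10 - 20*x1 - 20*x2));      % a2 = (NOT x1) AND (NOT x2)
% round(xnor(0,0)) -> 1, round(xnor(0,1)) -> 0, round(xnor(1,1)) -> 1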
Week 5: Neural Networks: Learning
Cost Function and Backpropagation
Cost function: $J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + (1 - y_k^{(i)})\log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l}\sum_{i}\sum_{j}\left(\Theta_{j,i}^{(l)}\right)^2$ (bias terms excluded from the penalty)
Backpropagation algorithm: (refer to better explanation in deep learning specialization)
Backpropagation in Practice
Unroll and reshape theta and gradient:
- because fminunc only handles theta as a single vector
% Unroll to vector
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ]
deltaVector = [ D1(:); D2(:); D3(:) ]
% Reshape to matrix
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
Gradient checking:
- Use 2-sided difference to estimate gradient, then compare it with backprop gradient
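A minimal Octave sketch of two-sided-difference gradient checking, assuming costFunc is the (unrolled) cost function and theta the unrolled parameter vector:
epsilon = 1e-4;
numgrad = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus = theta;   thetaPlus(i)  = thetaPlus(i)  + epsilon;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
  numgrad(i) = (costFunc(thetaPlus) - costFunc(thetaMinus)) / (2 * epsilon);
end
% numgrad should match the backprop gradient to several decimal places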
Random initialization: symmetry breaking
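A minimal sketch of random initialization for symmetry breaking, assuming one layer with 10 incoming units plus bias feeding 10 units, and an illustrative INIT_EPSILON:
INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * 2 * INIT_EPSILON - INIT_EPSILON;   % values in (-INIT_EPSILON, INIT_EPSILON)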
Flow of training:
- Randomly initialize the weights
- Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
Week 6: Advice for Applying Machine Learning
Evaluating a Learning Algorithm
Evaluating a Hypothesis: 0.7 train / 0.3 test split
- Linear regression: test-set cost $J_{test}(\theta)$
- Logistic regression: misclassification test error (not the cost)
Model Selection & Hyperparameter Tuning: 0.6 train / 0.2 validation / 0.2 test split
- Parameters are fit to the training set
- Hyperparameters are fit to the validation set so the model predicts new data well
- Finally, use test set to report the generalization ability of selected model with hyperparameters and trained parameters
Bias vs. Variance
Regularization and Bias-Variance:
Machine Learning Diagnostic with Learning Curves: to understand bias-variance of your model
What to Do Next: based on the diagnostic result of bias-variance
- Fixes for high variance (overfitting)
- getting more training examples
- trying smaller sets of features
- increasing lambda of regularization
- smaller neural network
- Fixes for high bias (underfitting)
- adding features
- adding polynomial features
- decreasing lambda of regularization
- larger neural network
Case Study: Building a Spam Classifier
Brainstorming before starting helps you save plenty of time
Recommended Approach:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
Error Analysis (on the CV set):
- Manually examine the examples the model got wrong in the cross validation set and try to spot a trend in where most of the errors were made
Single numerical evaluation metric:
- used to test whether an improvement approach works or not, such as accuracy
Handling Skewed Data
Skewed Data: positive and negative samples are extremely imbalanced, e.g. 99.5% without cancer vs 0.5% with cancer
Error Metrics Analysis for Skewed Classes:
- Set y=1 as rare class, and if both precision and recall are high, then the model performs well
Precision-Recall Trade-Off: use the $F_1$ score instead of looking at precision and recall separately
- By tuning the threshold of logistic regression, precision and recall move in opposite directions
$F_1$ Score: $F_1 = 2\,\frac{P \cdot R}{P + R}$ takes both precision (P) and recall (R) into consideration
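A minimal Octave sketch of computing precision, recall and F1 from 0/1 prediction and label vectors (pred and yval are assumed variable names):
tp = sum((pred == 1) & (yval == 1));
fp = sum((pred == 1) & (yval == 0));
fn = sum((pred == 0) & (yval == 1));
prec = tp / (tp + fp);
rec  = tp / (tp + fn);
F1   = 2 * prec * rec / (prec + rec);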
Using Large Data Sets
When does more data help?
- Step 1 (Human Performance Level): Given the input $x$, can a human expert confidently predict $y$?
  - If YES –> go to Step 2
  - If NO –> $x$ does not carry enough information for prediction –> add in more features
- Step 2: Is the model complex enough to achieve low training error (low bias)?
- If YES –> more data will help (it is hard to overfit very large dataset)
- If NO –> add in more parameters
Week 7: SVM
Large Margin Classification
From logistic regression to SVM:
- Compared to LR:
  - SVM uses hinge loss in the cost function, which considers the margin between the 2 groups, instead of the probability difference between prediction and true label used in LR
  - The hypothesis of SVM is a discriminant function that predicts 0 or 1 directly, while that of LR is a sigmoid with a specific threshold
- If C is very large, the algorithm tends to choose theta so that the cost part equals zero, leaving only the regularization part
- If we have outlier examples that we don't want to affect the decision boundary, we can reduce C
SVM controls the decision margin between classes:
- From the definition of the vector inner product, $\theta^T x^{(i)} = p^{(i)}\,\|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto the vector $\theta$
- Therefore the SVM cost function trains $\theta$ so that the projection term $p^{(i)}\|\theta\|$ is at least 1 for positive samples (or at most −1 for negative ones) while keeping $\|\theta\|$ as small as possible (the regularization term), which forces a large margin
Kernels
Use SVM as non-linear classifier with kernel tricks
Gaussian kernel: turns the original features into similarities (relative distances) between samples (a sketch follows this list)
- Replaces the features with a similarity $f_i = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$ to each landmark $l^{(i)}$, using every training sample as a landmark
- Hyperparameter $\sigma^2$: the higher it is, the more slowly similarity falls off with distance, so samples farther from the landmark are still treated as similar to it
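A minimal Octave sketch of the Gaussian kernel similarity between a sample x and a landmark l (sigma is an assumed hyperparameter value):
gaussianKernel = @(x, l, sigma) exp(-sum((x - l).^2) / (2 * sigma^2));
% e.g. gaussianKernel([1; 2], [0; 0], 1.0)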
SVMs in Practice
LR or SVM (n = number of features, m = number of training samples):
- n larger than m (n ≈ 10k vs m ≈ 10~1k)
  - LR or SVM w/o kernel
- n small, m intermediate (n ≈ 1~1k vs m ≈ 10~10k)
  - SVM with Gaussian kernel
- n small, m large (n ≈ 1~1k vs m ≈ 50k+)
  - add more features, then LR or SVM w/o kernel
Hyperparameters:
- C: larger C, less regularization (C plays the role of $1/\lambda$)
- Kernel (similarity function):
  - No kernel (linear kernel): when n is large and m is small
  - Gaussian kernel: when n is small and m is large
    - $\sigma^2$: larger $\sigma^2$, features vary more smoothly –> higher bias, lower variance
    - features need scaling before performing the kernel transformation
- Others rarely used
Multi-class classification: one-vs-all, training K classifiers for K classes
Week 8: Unsupervised Learning
Clustering
K-means:
- Cost function: $J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right) = \frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - \mu_{c^{(i)}}\right\|^2$
  - also called the distortion of the training examples
% One K-means iteration (X is m-by-n, mu is K-by-n, c is m-by-1)
% Cluster assignment: minimize J over c(1)..c(m), holding mu(1)..mu(K) fixed
for i = 1:m
  [~, c(i)] = min(sum((mu - X(i,:)).^2, 2));   % index (1..K) of the closest centroid
end
% Move centroids: minimize J over mu(1)..mu(K), holding c(1)..c(m) fixed
for k = 1:K
  mu(k,:) = mean(X(c == k, :), 1);             % mean of points assigned to cluster k
end
- 1st loop: the cluster-assignment step
- 2nd loop: the move-centroid step
Random initialization: run K-means 50-1000 times with different random initializations and keep the result with the lowest J, to avoid local optima
- Randomly pick K distinct training examples
- Set $\mu_1, \dots, \mu_K$ equal to these examples
Choosing the number of clusters K:
- Follow the application purpose of running clustering
- Elbow method (less used)
Application:
- Compress an image from 24-bit to 4-bit color
  - find the 16 dominant colors in the image (the centroids)
  - store these 16 colors using 16 × 24 bits
  - replace the color of each pixel with the index (1~16, i.e. 4 bits) of its centroid color
Bonus: Drawbacks of K-Means
- k-means assumes the variance of the distribution of each attribute (variable) is spherical
- all variables have the same variance
- the prior probability for all k clusters is the same, i.e., each cluster has roughly equal number of observations
PCA
Try the model without PCA first, and use PCA only when the model trained on the raw data cannot satisfy you
Motivation:
- Data Compression
- combine highly correlated variables
- remove features with less ability to distinguish samples
- Visualization
Optimization target: find a direction onto which to project the data so as to minimize the projection error
Procedure of PCA:
- Feature scaling to zero mean and comparable scale, for example $x_j := \frac{x_j - \mu_j}{s_j}$
- Compute covariance matrix
Sigma = (1/m) * X' * X
- Compute eigenvectors of covariance matrix
[U,S,V] = svd(Sigma)
- Take the first k columns of U matrix
Ureduce = U(:,1:k)
- Compute Z
Z = X * Ureduce
Choosing K: pick the smallest K such that enough variance is retained, e.g. $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$ (99% of the variance retained),
where $S_{ii}$ comes from the diagonal matrix S in [U,S,V] = svd(Sigma)
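A minimal Octave sketch of choosing k from the singular values, assuming Sigma was computed as above and a 99% retained-variance target:
[U, S, V] = svd(Sigma);
s = diag(S);
retained = cumsum(s) / sum(s);      % variance retained for each candidate k
k = find(retained >= 0.99, 1);      % smallest k retaining at least 99% of the variance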
Applications:
- Reconstruction from compressed representation
- Speed up supervised learning
- PCA should be defined by training set, and then apply to validation and test set
- useful for image dataset
Do NOT use PCA to prevent overfitting
- Overfitting is a problem of model, not of dataset
- Regularization is the way to improve model, and it takes labels into account
- PCA necessarily removes some information from the dataset, and it only considers the features, ignoring the labels
Week 9: Anomaly Detection & Recommender Systems
Anomaly Detection (Semi-supervised)
The ET drift annotation system in GF is a kind of anomaly detection: it uses the last 50 samples as the training set for a single feature, and highlights an anomaly if a new sample falls outside 3 sigma (probability density threshold at 3 sigma). Strictly speaking, this detection is an unsupervised version because it uses unlabeled data as the training set.
Supervised vs. Semi-supervised
- Supervised learning uses large number of both positive and negative samples
- algorithm is able to get a sense of what positive samples look like
- the future positive samples are similar to the ones in training set
- Semi-supervised learning uses a very small number of positive samples with a large number of negative samples
- it is impossible for the algorithm to learn from the positive samples what positive samples look like
- the future anomalies may look nothing like any samples in training set
Target: return the probability density $p(x)$ under the fitted Gaussian distribution, where $p(x) < \varepsilon$ indicates $x$ is an anomaly
Train / Validation / Test split:
- Full set example: 10k normal samples + 20 anomalous samples
- Train set: 6k normal
- Validation set: 2k normal + 10 anomalous
- Test set: 2k normal + 10 anomalous
Procedure and Metrics:
- Fit parameters with the training set
  - Calculate $\mu_j$ and $\sigma_j^2$ for each feature
- Using the validation set, calculate $p(x)$ and flag samples with $p(x) < \varepsilon$ as anomalies (see the sketch after this list)
  - Use Precision / Recall / $F_1$ score to evaluate the model and tune $\varepsilon$
- Using the test set, evaluate generalization ability
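A minimal Octave sketch of the univariate-Gaussian anomaly detection model, assuming X is the m×n training matrix, Xval the validation matrix, and epsilon a threshold tuned on the validation set:
mu = mean(X);                            % 1-by-n feature means
sigma2 = var(X, 1);                      % 1-by-n feature variances (1/m normalization)
p = prod((1 ./ sqrt(2*pi*sigma2)) .* exp(-((Xval - mu).^2) ./ (2*sigma2)), 2);
anomalies = p < epsilon;                 % flag validation samples with low density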
Feature Engineering: for model improvement
- Transform features to be closer to a Gaussian distribution, e.g. by using $\log(x)$, $\log(x + c)$, $\sqrt{x}$, $x^{1/3}$, etc.
- Manually choose / add / create features that might take on unusually large or small values in the event of an anomaly (could be based on your understanding of the application)
- Otherwise, use multivariate Gaussian distribution version
Anomaly Detection with Multivariate Gaussian Distribution:
- Original AD treats each feature as independent, modeling them separately, which means a sample is only detected as an anomaly when some single feature has an extremely low probability density
- Graphically, the original algorithm generates (pink) contour ellipses with axes parallel to the feature axes, while the multivariate version covers more general cases by generating (blue) ellipses with rotated axes, taking the correlation between features into account
- Instead of a product of per-feature Gaussians, it models $p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$ with a full covariance matrix $\Sigma$
Comparison between the Two Algorithms:
- Original: computationally cheaper and works even when m is small, but feature correlations must be captured by manually created features
- Multivariate: captures correlations automatically, but is more expensive (computes $\Sigma^{-1}$) and needs $m > n$ (ideally $m \ge 10n$) so that $\Sigma$ is invertible
Recommendation
Special category of ML: feature learning (automatically select good features)
Content Based Recommendation: w/o feature learning because features are known already
Collaborative Filtering Recommendation:
Alternate between learning the features (content) from all users' ratings and learning the parameters (user preferences) based on that content.
All users help learn the movie features, and then, using these features, the algorithm predicts better results for every user. The learnt features are hard to interpret, but they can be used to find related movies by computing the distance $\|x^{(i)} - x^{(j)}\|$ between feature vectors.
- Given $x^{(1)},\dots,x^{(n_m)}$, estimate $\theta^{(1)},\dots,\theta^{(n_u)}$
- Given $\theta^{(1)},\dots,\theta^{(n_u)}$, estimate $x^{(1)},\dots,x^{(n_m)}$
- Combine both and minimize them simultaneously (random initial values + gradient descent)
Vectorization with Low Rank Matrix Factorization: the full matrix of predicted ratings is $X\Theta^T$
Mean Normalization for User w/o Any Rating:
- Normalize each movie's ratings by subtracting the movie's mean rating before running the algorithm, and add the mean back when making predictions
Week 10: Large Scale Machine Learning
Gradient Descent with Large Datasets
Training a model on a larger m gives smaller variance. But when m is extremely large, like 100M, each step of gradient descent is computationally expensive because it sums and averages over 100M values.
Stochastic Gradient Descent: 1 sample per update
- Randomly “shuffle” the dataset
- Iterate over $i = 1, \dots, m$: $\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ for every $j$
- Repeat the above 1-10 times (epochs), ending up NEAR the global minimum
Mini-batch Gradient Descent: b samples per update (often the fastest, thanks to vectorization)
- Randomly “shuffle” the dataset
- Pick a batch size, e.g. $b = 10$, and iterate over batches: $\theta_j := \theta_j - \frac{\alpha}{b}\sum_{k=i}^{i+b-1}\left(h_\theta(x^{(k)}) - y^{(k)}\right)x_j^{(k)}$
- Repeat the above 1-10 times (epochs); a sketch follows this list
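A minimal Octave sketch of mini-batch gradient descent for linear regression, assuming X is m×(n+1), y is m×1, and alpha, b (batch size) and num_epochs are chosen hyperparameters:
m = length(y);
perm = randperm(m);                       % shuffle the dataset
X = X(perm, :);  y = y(perm);
for epoch = 1:num_epochs
  for i = 1:b:m
    idx = i:min(i+b-1, m);                % indices of the current mini-batch
    grad = X(idx,:)' * (X(idx,:) * theta - y(idx)) / numel(idx);
    theta = theta - alpha * grad;
  end
end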
Gradient Check: for Stochastic GD
- Every (say) 1000 iterations, plot the cost averaged over the last 1000 samples, computed with the current $\theta$
- the averaging window is normally set to 500-2000 samples
Choosing Learning Rate: normally keep $\alpha$ constant; alternatively decrease it over iterations, e.g. $\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}$
Online Learning
Handle an unlimited continuous stream of data, so training is train-and-drop: each example is used once and never re-used. This method also adapts to changes in user preference because it keeps training on fresh data.
Predicted Click-Through Rate (CTR): return 10 results out of 100 phones when a user searches
- $x$ = features of the phone, how many words in the user's query match the name of the phone, how many words match the description, etc.
- $y = 1$ if the user clicks on the link
- Generate 10 samples per search action
- Learn $p(y = 1 \mid x; \theta)$
- Drop these 10 samples and continue gradient descent with the next 10 samples
Map Reduce and Data Parallelism
Great technique to handle even larger machine learning problems: split jobs onto more than one core or computer
MapReduceable: the model can be expressed as computing sums of functions over the training set
- Linear regression / logistic regression
- Neural Network (each core computes the forward and backward propagation of a subset, then combine)
Week 11: Case Study – Photo OCR
Pipeline of Photo OCR
Divide a large project into stages and assign manpower corresponding to the workload
- Text detection (2D sliding windows)
- Character segmentation (1D sliding windows for character split)
- Character classification (LR or SVM)
Artificial Data Synthesis
Artificially Create More Data: Distortion of Data
- Should represent types of noise / distortion in test set or real applications
- background noise
- bad cellphone connection
- Purely random / meaningless noise does not help
How much work would it be to get 10x as much data as we currently have?
- Quantitatively understand the effort of getting more data, and then make better decision
- Approaches:
- Artificial data synthesis
- Collect/label it yourself
- “Crowd Source” like Amazon Mechanical Turk
Ceiling Analysis for Pipeline
Prioritize your work in a pipeline