Applied Machine Learning - Microsoft Certificate in Data Science 9a

1. Time Series and Forecasting

Introduction to Time Series

Finance / stock / currency exchange rate / sales forecast / temperature / heart rate / Semicon ET and inline long-term trend / …

The Nature of Time Series Data

Time Series vs. Random or Independent Noise

Autocorrelation: the value at time $t$ is correlated with the values at the following times

Regular Reporting: Some algorithms can only work with regular reporting

Regular vs. Irregular Reporting

Decomposition – STL Package (Investigating)

Components of Time Series Data

STL Package Procedure

Start with time series data $X$
Use Loess to find a general trend $T$
Use Moving Average Smoothing on $X-T$ to find the fine-grained trend $C$
Get seasonal / periodic component $S = X-T-C$
Get final trend $V$ by smoothing the nonseasonal trend $X-S$ with Loess
Get remainder $R = X-S-V$

Lowess / Loess Regression

General trend $T$ for smoothing time series data

Idea: fit local polynomial models and merge them together (local for flexibility, polynomial for smoothness)

Step 1: Define the window width $m$ and do a local regression on the $m$ nearest neighbors

Step 2: Choose a weight function that gives higher weights to points nearer the center

Step 3: Fit a quadratic (second-order Taylor) polynomial regression using the weights from Step 2

Step 4: Replace $x_{0}$ with $\widehat{x_{0}}$, the value of the fitted regression at $t_{i}=t_{0}$

Step 5: Repeat the above for every $t$ to get each $\widehat{x}$, then connect the points to get the general trend
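
The five steps can be sketched in plain Python. This is a minimal illustration rather than the STL package's implementation: it assumes a tricube weight function and, to stay short, fits a local straight line instead of the quadratic from Step 3. The names `tricube` and `loess` are hypothetical, and it assumes distinct time points with `m` of at least 3.

```python
def tricube(u):
    """Tricube weight: 1 at the center, 0 once |u| reaches 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess(t, x, m):
    """Smooth x(t) with a weighted local linear fit on the m nearest neighbors."""
    fitted = []
    for t0 in t:
        # Step 1: pick the m points closest to t0
        idx = sorted(range(len(t)), key=lambda i: abs(t[i] - t0))[:m]
        d_max = max(abs(t[i] - t0) for i in idx)
        # Step 2: nearer points get larger weights
        w = [tricube((t[i] - t0) / d_max) for i in idx]
        # Step 3 (simplified): weighted least squares for a straight line
        sw = sum(w)
        t_bar = sum(wi * t[i] for wi, i in zip(w, idx)) / sw
        x_bar = sum(wi * x[i] for wi, i in zip(w, idx)) / sw
        var = sum(wi * (t[i] - t_bar) ** 2 for wi, i in zip(w, idx))
        cov = sum(wi * (t[i] - t_bar) * (x[i] - x_bar) for wi, i in zip(w, idx))
        slope = cov / var if var > 0 else 0.0
        # Step 4: the smoothed value is the fitted line evaluated at t0
        fitted.append(x_bar + slope * (t0 - t_bar))
    return fitted  # Step 5: one fitted value per time point


t = list(range(10))
x = [2 * ti + 1 for ti in t]  # a pure linear trend
trend = loess(t, x, m=5)
```

On a series that is already a straight line, the fit reproduces the input; a larger `m` gives a smoother trend on noisy data.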

Adjusting Window

Moving Average Smoothing

Fine-grained trend $C$ for smoothing time series data with clear periodicity (after extracting the general trend)

Procedure of Moving Average Smoothing
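
As a rough sketch of the idea, a centered moving average replaces each point with the mean of the window around it. The function name `moving_average` is hypothetical, and the sketch assumes an odd window length and leaves the edges without a full window as `None`.

```python
def moving_average(x, window):
    """Centered moving average; positions without a full window are None."""
    half = window // 2
    out = [None] * len(x)
    for i in range(half, len(x) - half):
        out[i] = sum(x[i - half:i + half + 1]) / window
    return out


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
# A window equal to the period averages the periodic pattern away
periodic = moving_average([0, 1, 2] * 4, window=3)
```

Choosing the window equal to the period flattens the seasonal pattern, which is why the smoothed series can serve as a fine-grained trend.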

Stationary Remainder / Time Series

Second-order stationarity conditions:

Constant Mean
Constant Variance
An autocovariance that does not depend on time

Technique 1: Boxplots of the data points binned at a higher level of the time hierarchy (e.g. by month)

Technique 2: Boxplots of the data points binned at a higher level of the time hierarchy (e.g. by month)
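
A numeric counterpart to the binned boxplots, as a minimal sketch: split the series into consecutive bins and compare the mean and variance of each bin. Roughly equal values across bins are consistent with the first two stationarity conditions. The helper name `binned_summary` and the bin size are assumptions for illustration.

```python
from statistics import mean, pvariance


def binned_summary(x, bin_size):
    """Return (mean, variance) for each consecutive bin of bin_size points."""
    bins = [x[i:i + bin_size] for i in range(0, len(x) - bin_size + 1, bin_size)]
    return [(mean(b), pvariance(b)) for b in bins]


stable = binned_summary([1, 3, 1, 3, 1, 3, 1, 3], bin_size=4)
trending = binned_summary([1, 2, 3, 4, 5, 6, 7, 8], bin_size=4)
```

The first series has the same mean and variance in every bin, while the trending series shows a shifting mean, which is a sign of non-stationarity.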

Autocorrelation and Autocovariance

The same idea as correlation (normalized covariance), which describes the (linear) relationship between Feature X and Feature Y, but applied to a series and a lagged copy of itself

Auto = self

ACF = Autocorrelation Function
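
A minimal sketch of the sample ACF: the autocovariance at lag $k$ divided by the variance (the autocovariance at lag 0). The function name `acf` is hypothetical, and the sketch uses the common convention of dividing every lag by $n$.

```python
def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n  # autocovariance at lag 0 (the variance)
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / n / c0
        for k in range(max_lag + 1)
    ]


r = acf([1, 2, 3, 4, 5], max_lag=1)
```

The value at lag 0 is always 1, since a series is perfectly correlated with itself.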

Working with Time Series

Introduces models for different types of time series data so that we can do forecasting

The remainder can also have a time series pattern, which needs to be carefully modeled and removed; the residual that is left (the prediction error) should then be normally distributed over time

Note that a successful STL decomposition should look as follows:

histogram of remainder is close to normal distribution

boxplot of remainder at seasonal level (like month) is stable

Moving Average Models MA(q)

Microsoft announces one piece of news every day, and its stock is affected by today’s news and the news of the last 2 days

The model has only a short memory of the previous noise

ACF: sharp cutoff after lag $q$; this identifies whether your data can be modeled as $MA(q)$ and with what order $q$
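
The cutoff can be checked on simulated data. The sketch below assumes an $MA(2)$ process with illustrative coefficients 0.6 and 0.3 and standard normal noise; the function names, the coefficients, the seed and the sample size are all hypothetical choices.

```python
import random


def simulate_ma(thetas, n, seed=0):
    """Simulate x_t = e_t + theta_1 * e_{t-1} + ... + theta_q * e_{t-q}."""
    rng = random.Random(seed)
    q = len(thetas)
    e = [rng.gauss(0, 1) for _ in range(n + q)]
    return [
        e[t] + sum(th * e[t - j - 1] for j, th in enumerate(thetas))
        for t in range(q, n + q)
    ]


def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / n / c0
        for k in range(max_lag + 1)
    ]


x = simulate_ma([0.6, 0.3], n=20000)
r = acf(x, max_lag=4)  # lags 1 and 2 are clearly nonzero, lags 3 and 4 are near zero
```

With $q=2$, only the first two lags share noise terms with the current value, so the sample ACF beyond lag 2 stays close to zero.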

Autoregression Models AR(p)

Today’s value is slightly different from a combination of the last $p$ days’ values

ACF: exponential decay; cannot identify the order $p$

Partial Autocorrelation

The correlation at a given lag that is not accounted for by the lags in between

Comparison between ACF and PACF of AR(1): the ACF decays exponentially, while the PACF cuts off after lag 1
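
One standard way to obtain partial autocorrelations from autocorrelations is the Durbin-Levinson recursion. The sketch below applies it to the theoretical ACF of an $AR(1)$ process with an assumed coefficient of 0.7, where the ACF at lag $k$ is $0.7^{k}$; the function name `pacf_from_acf` is hypothetical.

```python
def pacf_from_acf(r):
    """Partial autocorrelations for lags 1..len(r)-1, given r with r[0] == 1."""
    pacf = []
    prev = []  # prev[j] holds the coefficient phi_{k-1, j+1}
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # phi_{k, j} = phi_{k-1, j} - phi_kk * phi_{k-1, k-j}
        cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf


ar1_acf = [0.7 ** k for k in range(5)]  # decays exponentially
ar1_pacf = pacf_from_acf(ar1_acf)       # only lag 1 is nonzero
```

The ACF keeps decaying because each value inherits correlation through the intermediate lags; once those are accounted for, only the direct lag-1 dependence remains.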

Auto-Regressive Moving Average Model ARMA(p,q)

Used when both the ACF and the PACF show slow decay

Auto-Regressive Integrated Moving Average Model ARIMA(p,d,q)

Differencing

Non-stationary time series can have stationary differences

Higher order trends can be turned into stationary models through repeated differencing
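
A small sketch of differencing, with a hypothetical helper `difference`: one pass removes a linear trend, and two passes remove a quadratic trend.

```python
def difference(x, d=1):
    """Apply first differencing (x_t - x_{t-1}) d times."""
    for _ in range(d):
        x = [b - a for a, b in zip(x, x[1:])]
    return x


linear = [3 * t + 2 for t in range(6)]
quadratic = [t * t for t in range(6)]
d1 = difference(linear)          # constant after one difference
d2 = difference(quadratic, d=2)  # constant after two differences
```

Each pass shortens the series by one point. The number of passes needed is the $d$ in ARIMA.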

Model Details

Exponentially Weighted Moving Averages Model EWMA / Simple Exponential Smoothing Model SES

Most widely used for business applications / forecasting
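
Simple exponential smoothing follows the recursion $s_{t} = \alpha x_{t} + (1-\alpha) s_{t-1}$, so older observations receive exponentially smaller weights. A minimal sketch, assuming the smoothed series starts at the first observation (the name `ses` and that initialization are illustrative choices):

```python
def ses(x, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    s = [x[0]]  # assumption: initialize with the first observation
    for value in x[1:]:
        s.append(alpha * value + (1 - alpha) * s[-1])
    return s


levels = ses([1, 2, 3], alpha=0.5)
forecast = levels[-1]  # the one-step-ahead forecast is the last smoothed level
```

A larger `alpha` reacts faster to recent changes; a smaller `alpha` gives a smoother, slower-moving forecast.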

Forecasting in Context

Reference

Time Series Analysis (TSA) in Python - Linear Models to GARCH

2. Spatial Data Analysis

Mobile marketing / smart watch data / oil exploration / real estate pricing / transportation network / crime data …

Introduction to Spatial Data

Types of Spatial Data

Points (location only)
Polygons
Pixels / Raster (location + count/density shown as colors)

Types of Distance

Euclidean distance (physical distance; use a built-in tool to calculate it, since the Earth is round and the shortest path over the surface is a great-circle arc)
Driving / Walking distance
Adapted to the local area (like the same building)

Distance Matrix
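
For the physical distance on a round Earth, one common choice is the haversine (great-circle) formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; the function name `haversine_km` is hypothetical, and a mapping library's built-in distance tool would normally be used instead.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


quarter_equator = haversine_km(0, 0, 0, 90)  # a quarter of the way around the equator
```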

Visualize the relationships between different features and overlay multiple features in one plot in various ways, such as bubble size or fill color

Kernel Density Estimation KDE

Go-to method for density / event rate $\lambda$ estimation
“Nonparametric”, meaning that no distributional form is assumed: a bump (kernel) is placed on each point and the bumps are summed
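
A one-dimensional sketch of the "bump on each point" idea, assuming a Gaussian kernel and a hand-picked bandwidth (the name `gaussian_kde` and the sample points are illustrative):

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a density function that sums one Gaussian bump per data point."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


density = gaussian_kde([0.0, 1.0, 5.0], bandwidth=0.5)
# Numerically integrate the estimate; a density should integrate to about 1
area = sum(density(-5 + 0.01 * i) for i in range(1501)) * 0.01
```

The bandwidth controls the width of each bump: too small gives a spiky estimate, too large blurs separate clusters together.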

K-Nearest Neighbour

Localized technique of probability estimation

Classification by majority vote
Regression by averaging the neighbours' values
Take care:
- scale sensitive: consider normalization
- selection of K and weight of distance
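
A minimal sketch of both uses, assuming plain Euclidean distance and unweighted votes (the helper names and the toy data are hypothetical). Features are used as given, so in practice they would be normalized first.

```python
import math
from collections import Counter


def nearest(train, query, k):
    """Return the labels (or values) of the k training points closest to query."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    return [label for _, label in ranked[:k]]


def knn_classify(train, query, k):
    """Classification: majority vote among the k nearest neighbours."""
    return Counter(nearest(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k):
    """Regression: average of the k nearest neighbours' values."""
    values = nearest(train, query, k)
    return sum(values) / len(values)


labeled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
valued = [((0,), 1.0), ((1,), 2.0), ((10,), 100.0)]
label = knn_classify(labeled, (0.5, 0.5), k=3)
value = knn_regress(valued, (0.4,), k=2)
```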

Working with Spatial Data

Spatial Poisson Processes

Probability estimation of occurrence count in an area in a period, based on Poisson distribution, which is a discrete probability distribution
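
For a homogeneous process with rate $\lambda$ per unit area per unit time, the expected count in a region is $\mu = \lambda \cdot area \cdot duration$ and $P(N=k) = \mu^{k} e^{-\mu} / k!$. A small sketch, with hypothetical parameter names:

```python
import math


def poisson_pmf(k, rate, area=1.0, duration=1.0):
    """P(N = k) when events occur at `rate` per unit area per unit time."""
    mu = rate * area * duration  # expected count in the region and period
    return mu ** k * math.exp(-mu) / math.factorial(k)


p_none = poisson_pmf(0, rate=2.0)                       # no events at all
total = sum(poisson_pmf(k, rate=2.0) for k in range(50))  # probabilities sum to about 1
```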

Variogram

Estimate the (label) covariance between samples as the spatial distance between them changes, which is just like the ACF and PACF for time series
- Input data is labeled
- Considers the overall data in the dataset
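
A common estimator is the empirical semivariogram, $\gamma(h) = \frac{1}{2N(h)} \sum (z_{i} - z_{j})^{2}$, summed over the $N(h)$ pairs whose separation falls in distance bin $h$. The sketch below assumes equal-width bins and Euclidean distance; the function name and the toy data are hypothetical.

```python
import math


def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Semivariance per distance bin; a bin without pairs yields None."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # Bin b covers separations in [b * bin_width, (b + 1) * bin_width)
            b = int(math.dist(coords[i], coords[j]) / bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


gamma = empirical_semivariogram([(0, 0), (1, 0), (2, 0)], [0, 1, 2], bin_width=1, n_bins=3)
```

Semivariance that grows with distance, as in this toy example, means nearby samples are more alike than distant ones.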

Reference
- Semi-Variogram: Nugget, Range and Sill
- Estimation and Modeling of Spatial Correlations (about the second-order stationary assumption)

Kriging / Gaussian Process / Spatial Regression

Overall technique of probability estimation based on the covariance $k$ provided by the variogram

$k$ can be modelled by an arbitrary covariance function in the variogram stage

Interpolation method for estimating the property at unsampled locations, so that a complete map can be obtained

Spatial Data in Context

3. Text Analytics

Summarizing text / comparing texts / classification

Introduction to Text Analytics

Word Frequency

Frequency plot
Cumulative plot to examine the cleaned up dataset

Stemming

The approach here (Porter’s Algorithm) is only for English

connection, connected, connective, connecting –> connect

Porter’s Algorithm
- V is one or more vowels (A, E, I, O, U)
- C is one or more consonants
- All words are of the following form
  - [C](VC){m}[V], where square brackets mark optional parts and (VC){m} means VC repeated m times
- For each word, we check whether it satisfies a rule's condition, and shorten or lengthen it accordingly
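
The count $m$ in [C](VC){m}[V] can be computed by collapsing a word into runs of vowels and consonants. The sketch below treats only A, E, I, O, U as vowels, as in the list above, and adds a toy rule set that is an assumption for illustration, not the full set of Porter rules (the names `measure` and `toy_stem` are hypothetical).

```python
def measure(word):
    """Number of VC repetitions m in the form [C](VC){m}[V]."""
    pattern = ""
    for ch in word.lower():
        kind = "V" if ch in "aeiou" else "C"
        if not pattern or pattern[-1] != kind:
            pattern += kind  # collapse runs, e.g. "tree" -> "CV"
    return pattern.count("VC")


def toy_stem(word):
    """Strip a few suffixes when the remaining stem is long enough (m > 1)."""
    for suffix in ("ion", "ive", "ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and measure(stem) > 1:
            return stem
    return word


stems = [toy_stem(w) for w in ("connection", "connected", "connective", "connecting")]
```

The condition on $m$ keeps short words such as "sing" from being cut down to a single letter.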

Feature Hashing (Dimensionality Reduction)

Fast and space-efficient way of vectorizing features, by applying a hash function to the features and using their hash values as indices directly.

Also called the hashing trick (by analogy with the kernel trick).

Wiki: Feature hashing
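
A minimal sketch of the idea for a bag of words: each token is hashed and the hash value, modulo the vector length, is used directly as the index to increment. It assumes SHA-256 as the hash function so that results are repeatable across runs; the function name and the vector length are illustrative.

```python
import hashlib


def hash_features(tokens, n_features):
    """Count tokens into a fixed-length vector indexed by their hash values."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vec[index] += 1
    return vec


vector = hash_features("the cat sat on the mat".split(), n_features=8)
```

No vocabulary has to be stored, which saves space, but different tokens can collide on the same index; a longer vector reduces collisions.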

Working with Text

Calculating Word Importance by TF-IDF

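TF = Term Frequency (the number of times you see a word in a document)

IDF = Inverse Document Frequency $= \log\left(\frac{\#Documents}{\#Documents\ in\ which\ the\ word\ appears}\right)$

TF-IDF $= TF\cdot \log\left(\frac{\#Documents}{\#Documents\ in\ which\ the\ word\ appears}\right)$

TF-IDF is a key factor used in search engines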

TF-IDF is high when
- the term appears many times in few documents
TF-IDF is low when
- the term appears in almost all documents
- the term does not appear often
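
A small sketch of the calculation on tokenized documents, using raw counts for TF and the natural logarithm (the function name `tf_idf` and the toy documents are hypothetical):

```python
import math


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per document."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return [
        {term: doc.count(term) * math.log(n_docs / doc_freq[term]) for term in set(doc)}
        for doc in docs
    ]


docs = [["apple", "apple", "the"], ["the", "pear"], ["the", "plum"]]
scores = tf_idf(docs)
```

Here "the" appears in every document, so its score is zero, while "apple" appears twice in a single document and scores highest.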

Introduction to Natural Language Processing

Text Analytics in Context

4. Image Analysis

Photographs / Security cameras / Check reader / Medical images / Art work analysis …

Introduction to Image Analysis

Read / Plot Image

misc functions from scipy (output a numpy array with rows and columns as the image size; the image functions in scipy.misc were deprecated and removed in later SciPy releases)
imshow from matplotlib.pyplot
glob.glob for multiple images reading

Image Properties

Examine the distribution of gray scale
- Histogram (ideal image has nearly uniform distribution)
- CDF (ideal image has a straight line)
Adaptive Histogram Equalization to improve contrast
- The histogram equalization algorithm attempts to adjust the pixel values in the image to create a more uniform distribution
- exposure.equalize_adapthist from skimage
  - Before and After Equalization
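
To show the core idea, the sketch below implements plain global histogram equalization on a 2-D list of integer gray levels. This is the simpler, non-adaptive variant, not what `exposure.equalize_adapthist` does (that function works on local regions). Each pixel is mapped through the normalized CDF so that the output CDF is close to a straight line. The function name `equalize` is hypothetical.

```python
def equalize(img, levels=256):
    """Global histogram equalization of a 2-D list of integer gray levels."""
    hist = [0] * levels
    for row in img:
        for v in row:
            hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest level present
    span = total - cdf_min
    if span == 0:  # constant image: nothing to stretch
        return [row[:] for row in img]
    return [[round((cdf[v] - cdf_min) / span * (levels - 1)) for v in row] for row in img]


stretched = equalize([[50, 51, 52, 53]])  # a low-contrast strip
```

The four nearly identical gray levels are spread evenly over the full 0 to 255 range, which is the contrast improvement the histogram and CDF checks look for.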

Image Manipulation

Resize by misc.imresize from scipy (removed in later SciPy releases)
Rotate by interpolation.rotate from scipy.ndimage (also exposed directly as scipy.ndimage.rotate)

Blurring and Denoising

Pre-whitening together with denoising can improve the Sobel edge detection result –> clearer edges

The reason may be that it covers and removes the unnecessary / meaningless portion of the image, which also happens in time series analysis when computing the cross-correlation function of two series

Pre-whitening to add noise
Denoising by gaussian_filter / median_filter from scipy.ndimage.filters

Working with Images

Feature Extraction

Sobel Edge Detection

Detecting edges by looking for single-direction gradients within a selected area

See also: Viola and Jones Method in Course 7
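
A pure-Python sketch of the Sobel operator on a 2-D list of gray levels: two 3x3 kernels estimate the horizontal and vertical gradients, and their magnitude marks the edges. Border pixels are simply left at zero here, which is an assumption of this sketch; the function name is hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to left-right changes
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to top-bottom changes


def sobel_magnitude(img):
    """Gradient magnitude of a 2-D list; border pixels are left at 0."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = sum(SOBEL_X[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out


step = [[0, 0, 0, 10, 10, 10] for _ in range(5)]  # dark left half, bright right half
edges = sobel_magnitude(step)
```

The response is large only in the two columns next to the vertical step and zero in the flat regions on either side.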

Segmentation

Remove noise or unwanted portion

Simplest way –> threshold (remove the points under or over the threshold)

Harris Corner Detection

Compute the Q matrix in the E function, which represents an ellipse, and detect a corner when Q has 2 large eigenvalues, which correspond to short principal axes of the ellipse
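
A minimal sketch of the eigenvalue test, assuming Q is built by summing products of central-difference gradients over a small unweighted window (the function name, window size and toy image are hypothetical, and gradients on the image border are skipped):

```python
import math


def structure_tensor_eigenvalues(img, r, c, half=1):
    """Eigenvalues (small, large) of Q summed over a window around (r, c)."""
    rows, cols = len(img), len(img[0])
    a = b = d = 0.0  # Q = [[a, b], [b, d]]
    for i in range(max(1, r - half), min(rows - 1, r + half + 1)):
        for j in range(max(1, c - half), min(cols - 1, c + half + 1)):
            ix = (img[i][j + 1] - img[i][j - 1]) / 2  # central difference in x
            iy = (img[i + 1][j] - img[i - 1][j]) / 2  # central difference in y
            a += ix * ix
            b += ix * iy
            d += iy * iy
    mean = (a + d) / 2
    spread = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean - spread, mean + spread


# 7x7 image with a bright square in the lower-right corner region
img = [[1 if r >= 3 and c >= 3 else 0 for c in range(7)] for r in range(7)]
corner = structure_tensor_eigenvalues(img, 3, 3)  # both eigenvalues large
edge = structure_tensor_eigenvalues(img, 5, 3)    # one large, one zero
flat = structure_tensor_eigenvalues(img, 1, 1)    # both zero
```

A flat patch gives two zero eigenvalues, an edge gives one large and one zero, and only a corner gives two large ones.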
