0%

## Overview & Learning Map

supervised learning

• regression
• classification
• linear model
• non-linear model
• deep learning
• SVM, decision tree, K-NN…
• structured learning

semi-supervised learning
transfer learning
unsupervised learning
reinforcement learning

## The Next Step for Machine Learning

• Anomaly Detection (让机器知道”我不知道“)
• Explainable AI (为什么知道)
• Life-long Learning
• Meta-learning / Learn to learn
• Few-shot / Zero-shot Learning
• 增强式学习
• Network Compression
• 如果训练资料和测试资料很不一样

## Regression

### Step1: Model 模型

$x_i$表示attribute of input x，也就是$x$的各种属性feature 特征
$w_i$表示weight权重
$b$表示bias偏差

### Step2: Goodness of Function

$L(f)=L(w,b)=\sum_{n}(\hat{y}^n-(b+w\cdot x^n_{cp}))^2$，得到Estimation Error 估测误差，以下便用这个函数估价

### Step3: Best Function

Pick the “Best” Function, $f^{\ast}=arg \min L(f)$ 或 $w^{\ast},b^{\ast}=arg \min L(w,b)$

$w^{n}=w^{n-1}-\eta \left . \frac{\partial L}{\partial w}\right |_ {w=w^{n-1},b=b^{n-1}},b^{n}=b^{n-1}-\eta \left . \frac{\partial L}{\partial b}\right |_ {w=w^{n-1},b=b^{n-1}}$

### Model Selection

A more complex model does not always lead to better performance on testing data. This is Overfitting.

### Regularization

$L=\sum_{n}(\hat{y}^n-(b+\sum w_ix_i))^2+\lambda (w_i)^2$，这里也就是我们希望参数越小越好，这里也是不需要$b$也就是bias的。

If some noises corrupt input $x_i$ when testing. A smoother function has less influence.
Training error: larger $\lambda$, considering the training error less.
We prefer smooth function, but don’t be too smooth.

## Bias and Variance

### Review

Error due to “bias” and “variance”.

### Estimator

assume the mean of $x$ is $\mu$
assume the variance of $x$ is $\sigma^2$

$m=\frac{1}{N}\sum_{n}x^n\neq \mu$
$E[m]=E[\frac{1}{N}\sum_{n}x^n]=\frac{1}{N}\sum_{n}E[x^n]=\mu$，$m$的期望就是$\mu$，这个是unbiased的
$Var[m]=\frac{\sigma^2}{N}$，这个Variance depends on the number of samples，如果数量较多那么就会比较集中，如果较少那么就会比较分散了

$m=\frac{1}{N}\sum_{n}x^n\neq \mu,s^2=\frac{1}{N}\sum_{n}(x^n-m)^2$

### Variance

Model 比较简单那么可能获得一个Small Variance的情况，而比较复杂的Model 可能得到一个Large Variance

### Bias

$E[f^{\ast}]=\bar{f}$
Bias: If we average all the $f^{\ast}$, is it close to $\hat{f}$

### Large Bias

Diagnosis：

• If your model cannot even fit the training examples, then you have large bias.
• If you can fit the training data, but large error on testing data, then you probably have large variance

• Add more features as input.
• A more complex model.

### Large variance

• More data
这个可以自己生成许多训练数据

• Regularization
这时调整了function space，可能会影响到bias，需要调整weight来取得平衡

### Model Selection

• There is usually a trade-off between bias and variance.
• Select a model that balances two kinds of error to minimize total

• Cross Validation
将Training Set分成Training Set和Validation Set，然后用Validation Set来选择Model

• N-fold Cross Validation
分成多份，将其中一份当成Validation Set

Gradient:$\triangledown L=\begin{bmatrix}\frac{\partial L(\theta_1)}{\partial \theta_1}\\ \frac{\partial L(\theta_2)}{\partial \theta_2}\end{bmatrix}$
$\theta^{n}=\theta^{n-1}-\eta \triangledown L(\theta^{n-1})$

### Tip 1: Tuning your learning rate

• Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
就是说经过几次计算后越来越接近最优解，就要调小Learning Rate，比如$\eta^t=\eta/\sqrt{t+1}$这个压子
• Learning rate cannot be one-size-fits-all
Give different parameters different learning rates.

• Divide the learning rate of each parameter by the root mean square the its previous derivatives.
Adagrad: $w^{t+1}=w^{t}-\frac{\eta^{t}}{\sigma^{t}}g^{t},\sigma^{t}=\sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^2},\eta^t=\eta/\sqrt{t+1}$
$\sigma^{t}$：root mean square of the previous derivatives of parameter w，就是前面所有的微分的平方的平均值再开根号
约分一下得到$w^{t+1}=w^{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^2}}g^{t}$

Adagrad呢，上面$g^{t}$越大，变化越大，而下面的$\sqrt{\sum_{i=0}^{t}(g^{i})^2}$，会导致$g^{t}$越大，变化越小

### Tip 2: Stochastic Gradient Descent

Loss函数$L=\sum_{n}\left (\hat{y}^n-(b+\sum w_ix_i^n) \right )^2$

### Tip 3: Feature Scaling

Make different features have the same scaling.

$R$个向量组$x^i$，For each dimension $i$，都算mean: $m_i$,standard deviation: $\sigma_i$

### Theory

Taylor Series

Very slow at the plateau

## Classification: Probabilistic Generative Model

### Ideal Alternatives

Function(Model): x->f(x):g(x)>0?class=1:class2
Loss function: $L(f)=\sum_{n}\delta (f(x^n)\neq \hat{y}^n)$, The number of times f get incorrect results on training data
Find the best function: Perceptron, SVN

### Generative model

$P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$
$P(x)=P(x|C_1)P(C_1)+P(x|C_2)P(C_2)$

### Prior

Assume the points are sampled from a Gaussian distribution.
Find the Gaussion distribution behind them

• Gaussian Distribution:
$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\}$
Input: vector x, output: probability of sampling x
The shape of the function determines by mean $\mu$ and convariance matrix $\Sigma$
这部分有空可以再详细看下

How to find $\mu$ and $\Sigma$ ?

• Maximum Likelihood
我们假设这些向量由一个高斯分布generate，要找一个Gussian$(\mu^{\ast},\Sigma^{\ast})$ with maximum likelihood.
$L(\mu,\Sigma)=f_{\mu,\Sigma}(x^1)f_{\mu,\Sigma}(x^2)…f_{\mu,\Sigma}(x^n)$.
$\mu^{\ast},\Sigma^{\ast}=arg\max_{\mu,\Sigma}L(\mu,\Sigma)$.
$\mu^{\ast}=\frac{1}{n}\sum {x^i}(average x),\Sigma^{\ast}=\frac{1}{n}\sum (x^i-\mu^{\ast})(x^i-\mu^{\ast})^{T}$

### Modifying Model

Bernoulli distributions

Naive Bayes Classifier

### Posterior Probability

$P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}=\frac{1}{1+\frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}}$
Let $z=\ln \frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$
$=\frac{1}{1+\exp(-z)}=\sigma(z)$ Sigmoid function

$P(x|C_1)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\{-\frac{1}{2}(x-\mu^1)^{T}\Sigma^{-1}(x-\mu^1)\}$.
$P(x|C_2)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\{-\frac{1}{2}(x-\mu^2)^{T}\Sigma^{-1}(x-\mu^2)\}$.
$P(C_1)=\frac{N_1}{N_1+N_2},P(C_2)=\frac{N_2}{N_1+N_2}$.

## Classification: Logistic Regression

$f_{w,b}(x)=\sigma(\sum_{i}w_ix_i+b)$ output是0到1之间的数

$w_{i+1}=w_{i}-\eta \sum_{n} -(\hat{y}-f_{w,b}(x^n))x_i^n$和Linear regression的式子一样的

### Discriminative v.s. Generative

Benefit of generative model

• With the assumption of probability distribution, less training data is needed.
• With the assumption of probability distribution, more robust to the noise.
• Priors and class-dependent probabilities can be estimated from different sources.

Generative model因为有假设，所以可能会无视一些数据，受数据影响较小，所以可能数据量小时表现会比较好，数据量大的时候就比Discriminative Model 表现较差了

### Multi-class Classification

$y_i=e^{z_i}/\sum_{n}e^{z_n}$

$\hat{y}$几类就设置成几维，然后自己属于哪一维，那一维就为1，其他为0

### Limitation of Logistic Regression

• Feature Transformation
就把Logistics Regression接起来用，用几次以后当成转换后的值，在做Logistic Regression。
我们把每一个Logistic Regression叫做一个Neuron，串起来就是Neuron Network 类神经网络，就到了 Deep Learning

## Deep Learning

### Ups and downs of Deep Learning

• 1958: Perceptron (linear model)
• 1969: Perceptron has limitation
• 1980s: Multi-layer perceptron
• Do not have significant difference from DNN today
• 1986: Backpropagation
• Usually more than 3 hidden layers is not helpful
• 1989: 1 hidden layer is “good enough”, why deep?
• 2006: RBM initialization (breakthrough)
• 2009: GPU
• 2011: Start to be popular in speech recognition
• 2012: win ILSVRC image competition

### Three Step of Deep Learning

Neural Network: Different connection leads to different network structures.
Network parameter $\theta$ :all the weights and biases in the “neurons”.

### Fully Connect Feedforward Network

input layer - hidden layers - output layer

Deep = Many hidden layers

Matrix Operation: 其实就是input vector 来进行matrix operation （每一层集合所有的weights 和 biases，每一层最后有一个activation function），最后得到output vector

feature extractor replacing feature engineering

## Backpropagation

### Chain Rule

Forward pass: Compute $\frac{\partial z}{\partial w}$ for all parameters.
Backward pass: Compute $\frac{\partial C}{\partial z}$ for all activation function inputs z, Compute $\frac{\partial C}{\partial z}$ from the output layer.

### Keras

Interface of TensorFlow or Theano
Easy to learn and use(still have some flexibility)

### Mini-batch

one epoch

• Randomly initialize network parameters
• Pick the 1st batch, Update parameters once
• Pick the 2nd batch, Update parameters once
• Until all mini-batches have been picked

Repeat the above process

## Tip for Deep Learning

### Bad Results on Training Data

• New activation function

#### Rectified Linear Unit(ReLU)

Reason:

1. Fast to compute
2. Biological reason
3. Infinite sigmoid with different biases
ReLU variant
Leaky ReLU:$\sigma(z)=\left\{\begin{matrix}a=z, && z\geqslant 0\\ a=0.01z, && z<0\end{matrix}\right.$
Parametric ReLU: $\sigma(z)=\left\{\begin{matrix}a=z, && z\geqslant 0\\ a=\alpha z, && z<0\end{matrix}\right., \alpha also learned by gradient descent$

#### Maxout

ReLU is a special class of Maxout
Learnable activation function.

Activation function in maxout network can be any piecewise linear convex function
How many pieces depending on how many elements in a group.

Maxout - Training

Different thin and linear network for different examples.

#### RMSProp

RMSProp是这样计算的 Root Mean Square of the gradients with previous gradients being decayed
$w^{1}\leftarrow w^{0}-\frac{\eta}{\sigma^{0}}g^{0}, \sigma^{0}=g^{0}$
$w^{2}\leftarrow w^{1}-\frac{\eta}{\sigma^{1}}g^{1}, \sigma^{1}=\sqrt{\alpha (\sigma^{0})^2+(1-\alpha)(g^{1})^{2}}$
$w^{n+1}\leftarrow w^{n}-\frac{\eta}{\sigma^{n}}g^{n}, \sigma^{n}=\sqrt{\alpha (\sigma^{n-1})^2+(1-\alpha)(g^{n})^{2}}$

#### Momentum

Movement: movement of last step minus gradient at present

Step:
Start at point $\theta^{0}$
Movement $v^{0}=0$
Compute gradient at $\theta^{0}$
Movement $v^{1}=\lambda v^{0}-\eta \triangledown L(\theta^{0})$
Move to $\theta^{1}=\theta^{0}+v^{1}$
Compute gradient at $\theta^{1}$
Movement $v^{2}=\lambda v^{1}-\eta \triangledown L(\theta^{1})$
Move to $\theta^{2}=\theta^{1}+v^{2}$

Movement not just based on gradient, but previous movement.
$v^{i}$ is actually the weighted sum of all the previous gradient: $\triangledown L(\theta^{0}),\triangledown L(\theta^{1}),…,\triangledown L(\theta^{i-1})$

### Bad Results on Testing Data

#### Regularization

New loss function to be minimized

• Find a set of weight not only minimizing original cost but also close to zero
$L’(\theta)=L(\theta)+\lambda \frac {1}{2} \begin{Vmatrix}\theta\end{Vmatrix}_ {2},\theta=\{w_1,w_2,…\},\begin{Vmatrix}\theta\end{Vmatrix}_ {2}=(w_1)^2+(w_2)^2+…$.
就是找一个尽量平滑的函数嘛，后面那个Regularization term就是使函数平滑
这里是$\theta$的L2 norm，也叫作L2 regularization
做Regularization一般不会考虑biases，因为只是平滑函数而已

$Update: w^{t+1}=w^{t}-\eta \frac{\partial L’}{\partial w}=(1-\eta \lambda)w^{t}-\eta \frac{\partial L}{\partial w}$.

$L’(\theta)=L(\theta)+\lambda \frac {1}{2} \begin{Vmatrix}\theta\end{Vmatrix}_ {1},\theta=\{w_1,w_2,…\},\begin{Vmatrix}\theta\end{Vmatrix}_ {1}=|w_1|+|w_2|+…$.
$Gradient:\frac{\partial L’}{\partial w}=\frac{\partial L}{\partial w}+\lambda sgn(w)$.
$Update: w^{t+1}=w^{t}-\eta \frac{\partial L’}{\partial w}=w^{t}-\eta \frac{\partial L}{\partial w}-\eta \lambda sgn(w^{t})$.
L1每次都减去一个固定的值

#### Dropout

Each time before updating the parameters.
Each neuron has $p\%$ to dropout, the structure of the network is changed.
Using the new network for training.
For each mini-batch, we resample the dropout neurons.

Testing
No dropout
If the dropout rate at training is $p\%$, all the weights times $1-p\%$.
Assume that the dropout rate is $50\%$.
If a weight $w=1$ by training, set $w=0.5$ for testing.

Dropout is a kind of ensemble.
Using one mini-batch to train one network.
Some parameters in the network are shared.

## Convolutional Neural Network

### Why CNN for image

Some patterns are much smaller that the whole image
A neuron does not have to see the whole image to discover the pattern, Connecting to small region with less parameters.

The same patterns appear in different regions

Subsampling the pixels will not change the object.

### The whole CNN

repeat : convolution - max pooling
then flatten - fully connected feedforward network

• Some patterns are much smaller that the whole image.(Convolution)
• The same patterns appear in different regions.(Convolution)
• Subsampling the pixels will not change the object.(Max Pooling)

### CNN - Convolution

Do the same process for every filter

Filter 增加一个RGB维度

### CNN v.s. Fully connected network

Less parameters，减去一些连接的weights
Even less parameters，shared weights

### CNN - Max Pooling

Each filter is a channel.
The number of the channel is the number of filters.

### CNN in Keras

Only modified the network structure and input format(vector -> 3-D tensor)

Convolution Layer

MaxPooling

Flatten + Fully connected layer

### What does CNN learn?

Degree of the activation of the k-th filter: $a^k=\sum_{i} \sum_{j} a^{k}_ {ij}$

Find an image maximizing the output of neuron :$x^{\ast}=arg \max_{x} a^{j}$
Each figure corresponds to a neuron

$x^{\ast}=arg \max_{x} y^{i}$

$x^{\ast} = arg \max_{x} (y^{i}-\sum_{i,j}|x_{ij}|)$

Deep Dream
CNN exaggerates what it sees

Deep Style
Given a photo, make its style like famous painting

output 也就是 content像原相片，style也就是filter之间的correlation像风格相片

## Why Deep Learning ?

### Modularization

Each basic classifier can have sufficient training example

Sharing by the following classifiers as module

The most basic classifiers
Use 1st layer as module to build classifiers
Use 2nd layer as module…

The modularization is automatically learned from data.

### Universality Theorem

Any continuous function f
can be realized by a network with one hidden layer(given enough hidden neurons)
Yes, shallow network can represent any function.
However, using deep structure is more effective.

### End-to-end Learning

What each function should do is learned automatically

Do deep nets really need to be deep ? Yes !

## Semi-supervised Learning

### Introduction

Supervised learning:$\{(x^r,\hat{y}^r)\}^{R}_ {r=1}$

• E.g. $x^r$:image, $\hat{y}^{r}$:class labels

Semi-supervised leaning: $\{(x^r,\hat{y}^r)^{R}_ {r=1}\}, \{x^u\}^{R+U}_ {u=R}$

• A set of unlabeled data, usually $U>>R$
• Transductive learning: unlabeled data is the testing data
• Inductive learning: unlabeled data is not the testing data

Why semi-supervised learning

• Collecting data is easy, but collecting “labelled” data is expensive
• We do semi-supervised learning in our lives

Why semi-supervised learning helps?

• The distribution of the unlabeled data tell us something.
就是这些input的分布会影响结果的判断
• Usually with some assumptions 通常伴随着一些假设

### Semi-supervised Learning for Generative Model

Supervised Generative Model
Given labelled training examples $x^r\in C_1,C_2$

• looking for most likely prior probability $P(C_i)$ and class-dependent probability $P(x|C_i)$
• $P(x|C_i)$ is a Gaussian parameterized by $\mu^i$ and $\Sigma$

Semi-supervised Generative Model

• Initialization:$\theta = \{P(C_1),P(C_2),\mu^{1},\mu^{2},\Sigma\}$
• Step1(E): compute the posterior probability of unlabeled data $P_{\theta}(C_1|x^{u})$, Depending on model $\theta$
• Step2(M): update model
$P(C_1)=\frac{N_1+\Sigma_{x^u}P(C_1|x^u)}{N}$.
$N$: total number of examples
$N_1$: number of examples belonging to $C_1$
$\mu^{1}=\frac{1}{N_1}\sum_{x^r\in C_1}x^r+\frac{1}{\sum_{x^u}P(C_1|x^u)}\sum_{x^u}P(C_1|x^u)x^u$
• Back to step1

The algorithm converges eventually, but the initialization influences the results.

Why?

• Maximum likelihood with labelled data(Closed-form solution)
$\log L(\theta)=\sum_{x^r}\log P_{\theta}(x^r,\hat{y}^r), P_{\theta}(x^r,\hat{y}^r)=P_{\theta}(x^r|\hat{y}^r)P(\hat{y}^r)$.
• Maximum likelihood with labelled + unlabeled data(Solved iteratively)
$\log L(\theta)=\sum_{x^r}\log P_{\theta}(x^r,\hat{y}^r)+\sum_{x^u}\log P(\theta)(x^u), P_{\theta}(x^u)=P_{\theta}(x^u|C_1)P(C_1)+P_{\theta}(x^u|C_2)P(C_2)$, $x^u$ can come from either $C_1,C_2$.

### Semi-supervised Learning - Low-density Separation

#### Self-training

Given: labelled data set = $\{(x^r,\hat{y}^r)\}^{R}_ {r=1}$, unlabeled data set = $\{x^u\}^{R+U}_ {u=l}$
Repeat:

• Train model $f^{\ast}$ from labelled data set.(Independent to the model)
• Apply $f^{\ast}$ to the unlabeled data set
• Obtain $\{x^u,y^u\}^{R+U}_ {u=l}$(Pseudo-label)
• Remove a set of data from unlabeled data set, and add them into the labeled data set
How to choose the data set remains open
You can also provide a weight to each data.

Similar to semi-supervised learning for generative model
Hard label v.s. Soft lable
neural network的时候一定是用Hard label

#### Entropy-based Regularization

Entropy of $y^u$: Evaluate how concentrate the distribution $y^u$ is $E(y^u)=-\sum_{m}y^{u}_ {m}\ln (y^{u}_ {m})$, $m$: number of class. As small as possible.
$L=\sum_{x^r}C(y^r,y^r)+\lambda \sum_{x^u}E(y^u)$ 可微分，用Gradient Descent minimize，后面的可以让他不Overfitted

### Semi-supervosed Learning Smoothness Assumption

• Assumption: “similar” $x$ has the same $\hat{y}$
• More precisely:
• x is not uniform.
• If $x^1$ and $x^2$ are close in a high density region,$\hat{y}^1$ and $\hat{y}^2$ are the same.
connected by a high density path

Cluster and then label
Using all the data to learn a classifier as usual

Graph-based Approach
Represented the data points as a graph
Graph representation is nature sometimes.

Graph-based Approach - Graph Construction

• Define the similarity $s(x^i,x^j)$ between $x^i$ and $x^j$
• K Nearest Neighbor
• e-Neighborhood
• Edge weight is proportional to $s(x^i,x^j)$

$s(x^i,x^j)=\exp (-\gamma || x^i-x^i ||^2)$

The labelled data influence their neighbors.
Propagate through the graph

• Define the smoothness of the labels on the graph
$S=\frac{1}{2}\sum_{i,j}w_{i,j}(y^i-y^j)^2$ For all data(no matter labelled or not)
$\mathbf{y}$: (R+U)-dim vector.
$\mathbf{y}=[\cdots y^i \cdots y^j \cdots]^{T}$.
$L$:(R+U) x (R+U) matrix (Graph Laplacian).
$L = D - W$, $W$就是边权建出的图，$D$就是$W$每行的值求和放在$(i,i)$的位置
Then $S=\mathbf{y}^{T}L\mathbf{y}$,$y$ depenting on network parameters
原来Loss Function $L=\sum_{x^r}C(y^r,\hat{y}^r)+\lambda S$, As a regularization term

### Semi-supervised Learning - Better Representation

• Find the latent factors behind the obvservation.
• The latent factors (usually simpler) are better representations.

## Unsupervised Learning - Linear Methods

### Unsupercised Learning

• Clustering & Dimension Reduction(化繁为简)
only having function input.
• Generation(无中生有)
only having function output.

### Clustering

K-means

• Clustering $X=\{x^1, \cdots, x^n, \cdots, x^{N}\}$ into K clusters
• Initalize cluster center $c^i, i=1,2,\cdots, K$(K random $x^n$ from X)
• Repeat
For all $x^n$ in X: $b^n_i=\left\{\begin{matrix}1 &x^n\text{ is most “close” to }c^i \\ 0 &Otherwise \end{matrix}\right.$
Updating all $c^i$: $c^i=\sum_{x^n}b^n_ix^n/\sum_{x^n}b^n_i$

Hierarchical Agglomerative Clustering(HAC)

• Step1: build a tree
建树过程是这样的，$n$个data，选两个最像的做平均连起来变成1个data，这样就有$n-1$个data，继续做下去就可以建一棵树
• Step2: pick a threshold
在树上切一刀可以决定分成多少个Cluster

### Distributed Representation

Clustering: an object must belong to one cluster
Distributed representation: 用一个vector每一维表示一个attribute

### Dimension Reduction

Feature selection： 把没用的Dimension去掉
Principle component analysis(PCA): $z=Wx$

### Principle component analysis(PCA)

Principle component analysis(PCA)主成分分析
$z=Wx$

$var(z_i)=\sum_{z_i}(z_i-\bar{z_i})^2, ||w^i||_ 2=1$
$z_i$的variance越大越好

$W=\begin{bmatrix}(w^1)^{T} \\ (w^2)^{T} \\ \vdots \end{bmatrix}$.
$W$就是个 Orthogonal matrix 正交矩阵

Lagrange multiplier
$z_1=w^1\cdot x$
$\bar{z_1}=\sum z_1=\sum w^1\cdot x = w^1\sum x=w^1 \cdot x$.
$var(z_1)=\sum_{z_1}(z_1-\bar{z_1})^2=\sum_{x}(w^1\cdot x-w^1\cdot \bar{x})^2=\sum (w^1\cdot (x-\bar{x}))^2$.
$\because (a\cdot b)^2=(a^{T}b)^2=a^{T}ba^{T}b=a^{T}b(a^{T}b)^{T}=a^{T}bb^{T}a$.
$\therefore var(z_1)=\sum (w^1)^{T}(x-\bar{x})(x-\bar{x})^{T}w^1=(w^1)^{T}\sum (x-\bar{x})(x-\bar{x})^{T} w^1=(w^1)^{T}Cov(x)w^1$.
$Cov(x)$是$x$的covariance matrix
Let $S=Cov(x)$, Symmetric, positive-semidefinite,(non-negative eigenvalues)
Find $w^1$ maximizing $(w^1)^{T}Sw^1$.
Constraint: $||w^1||_ 2=(w^1)^{T}w^1=1$
Using Largrange multiplier
$g(w^1)=(w^1)^{T}Sw^1-\alpha ((w^1)^{T}w^1-1)$.

$(w^1)^{T}Sw^1=\alpha (w^1)^{T}w^1=\alpha$ Choose the maximum one.

Find $w^2$ maximizing $(w^2)^{T}Sw^2$, $(w^2)^{T}w^2=1,(w^2)^{T}w^1=0$.
$g(w^2)=(w^2)^{T}Sw^2-\alpha ((w^2)^{T}w^2-1)-\beta((w^2)^{T}w^1-0)$.

$Sw^2-\alpha w^2 - \beta w^1=0$
$(w^2)^{T}Sw^2-\alpha ((w^2)^{T}w^2-1)-\beta((w^2)^{T}w^1-0)=0$.
$((w^1)^{T}Sw^2)^{T}=(w^2)^{T}S^{T}w^1=(w^2)^{T}Sw^1=\lambda_1(w^2)^{T}w^1=0$.

PCA - decorrelation
$z=Wx,Cov(z)=D$ $Cov(z)$是一个Diagonal matrix
$Cov(z)=\sum(z-\bar{z})(z-\bar{z})^{T}=WSW^{T}, S=Cov(x)$
$=WS[w^1 \cdots w^K]=W[Sw^1 \cdots Sw^K]=W[\lambda_1w^1 \cdots \lambda_Kw^K]=[\lambda_1Ww^2 \cdots \lambda_KWw^zk]=[\lambda_1e_1 \cdots \lambda_Ke_K]$, $e_K$是第$K$维是1其他是0

### PCA - Another Point of View

Basic Component: 就是一些基本的组成，每一个都是一个vector, $u^K$
$x\approx c_1u^1+c_2u^2+\cdots + c_Ku^K+\bar{x}$

$x-\bar{x}\approx c_1u^1+c_2u^2+\cdots + c_Ku^K=\hat{x}$.
Reconstruction error: $||(x-\bar{x})-\hat{x}||_ 2$.
Find $\{u^1,\cdots, u^K\}$ minimizing the error.

PCA求得$w^1,\cdots, w^K$其实就是$u^1,\cdots, u^K$，证明好像是线代相关SVD…感觉我们学的线代内容好基础…看不太懂233

PCA looks like a neural network with one hidden layer(linear activation function). Autoencoder
$\hat{x}=\sum_{k=1}^{K}c_kw^k=x-\bar{x}$.
To minimize reconstruction error: $c_k=(x-\bar{x})\cdot w^k$.

It can be deep. Deep autoencoder.

### Weakness of PCA

• Unsupervised
• Linear

LDA = linear discriminant analysis， 是supervised

### What happens to PCA?

PCA involves adding up and subtracting some components(images)

• Then the components may not be “parts of digits”
Non-negative matrix factiorization(NMF)
• Forcing $a_1,a_2,…$ be non-negative
• Forcing $w^1,w^2,…$ be non-negative
• More like “parts of digits”

### Matrix Factorization

Minimizing $L=\sum(r^i\cdot r^j-n_{ij})^2$
Find $r^i,r^j$ by Gradient Descent.

Considering the induvial characteristics
$r^A\cdot r^1\approx 5\rightarrow r^A\cdot b_A+b_1\approx 5$.
$b_A$: otakus A likes to by figures.
$b_1$: how popular character 1 is.

Minimizing $L=\sum_{(i,j)}(r^i\cdot r^j + b_i + b_j -n_{ij})^2$.
Find $r^i,r^j,b_i,b_j$ by gradient descent(can add regularization)

• Ref: Matrix Factorization Techniques for Recommender Systems.

Matrix Factorization for Topic analysis

## Unsupervised Learning: Word Embedding

### Word Embedding

Machine learn the meaning of words from reading a lot of documents without supervision.
Genarating Word Vector is unsupervised.

A word can be understood by its context, You shall know a word by the company it keeps.

### How to exploit the context

Count based

• If two words $w_i$ and $w_j$ frequently co-occur, $V(w_i)$ and $V(w_j)$ would be close to each other
• E.g. Glove Vector
• $V(w_i)\cdot V(w_j)$inner product和$N_{i,j}$ Number of times $w_i$ and $w_j$ in the same document越接近越好

#### Prediction based

• 1-of-N encoding of the word $w_{i-1}$
• Take out the input of the neurons in the first layer
• Use it to represent a word $w$
• Word vector, word embedding feature: $V(w)$
• output: The probability for each word as the next word $w_i$

Sharing Parameters
The weights with the $w_i,w_j$ should be the same.
Or, one word would have two word vectors. 我觉得应该是因为出现的位置不同嘛

The length of $x_{i-1}$ and $x_{i-2}$ are both $|V|$, The length of $z$ is $|Z|$.
$z=W_1x_{i-2}+W_2x_{i-1}$, the weight matrix $W_1$ and $W_2$ are both $|Z|\times |V|$ matrices.
Let $W_1=W_2=W$, $z=W(x_{i-2}+x_{i-1})$

How to make $w_i$ equal to $w_j$
Given $w_i$ and $w_j$ the same initialization
$w_i\leftarrow w_i-\eta \frac{\partial C}{\partial w_i}-\eta \frac{\partial C}{\partial w_j},w_j\leftarrow w_j-\eta \frac{\partial C}{\partial w_j}-\eta \frac{\partial C}{\partial w_i}$

Prediction-based Training

Prediction-based - Various Architectures

• Continuous bag of word(CBOW) model
用前后的匹配中间的 predicting the word given its context.
• Skip-gram
predicting the context given a word

Beyond Bag of Word

• To understand the meaning of a word sequence, the order of the words can not be ignored.

## Unsupervised Learning: Neighbor Embedding

### Manifold Learning

Manifold Learning 流形学习

Locally Linear Embedding (LLE)
$w_{ij}$ respesents the relation between $x^i$ and $x^j$
Find a set of $w_{ij}$ minimizing $\sum_{i} || x^i-\sum_{j}w_{ij}x^j||_ 2$
Then find the dimension reduction results $z^i$ and $z^j$ based on $w_{ij}$.
Keep $w_{ij}$ unchanged, find a set of $z^i$ minimizing $\sum_{i} || z^i-\sum_{j}w_{ij}z^j||_ 2$

### Laplacian Eigenmaps

Graph-based approach
Distance defined by graph approximate the distance on manifold
Construct the data points as a graph
Dimension Reduction: If $x^1$ and $x^2$ are close in a high density region, $z^1$ and $z^2$ are close to each other.
smoothness: $S=\frac{1}{2}\sum_{i,j}w_{i,j}(z^i-z^j)^2$, constraints to $z$: If the dim of $z$ is $M$,$Span\{z^1, z^2, … z^N\} = R^M$，找的是Laplacian Matrix 的eignenvector

### T-distributed Stochastic Neighbor Embedding(t-SNE)

T-distributed Stochastic Neighbor Embedding(t-SNE) t-分布领域嵌入算法
Problem of the previous approaches

• Similar data are close, but different data may collapse
只保证了相似的会很接近，但没保证不相似要远一些

KL divergence
Similarity Measure
$S(x^i,x^j)=\exp (-||x^i-x^j||_ {2})$.

## Recurrent Neural Network

The output of hidden layer are stored in the memory.
Memory can be considered as another input.

All the weights are “1”, no bias.
All activation functions are linear

Changing the sequence order will change the output.

Elman Network. 就是每层的output当成这层的input
Jordan Network. 最后的output当成input

### Learning

RNN也是比较难Train的
The error surface is rough.
RNN的Total Loss是非常陡峭的，Error Surface有些地方非常平坦，有些地方非常陡峭

Long Short-term Memory(LSTM)

• Memory and input are added.
• The influence never disappears unless forget gate is closed.

Gated Recurrent Unit(GRU)

Clockwise RNN
Structurally Constrained Trcurrent Network(SCRN)
Vanilla RNN Initialized with Identity matrix + ReLU activation function.

### More Application

Many to One : Input is a vector sequence, but output is only one vector

• Sentiment Analysis
• Key Term Extraction

Many to Many : Both input and output are both sequences, but the output is shorter.

• Speech Recognition
但是这个，在每一小段时间中识别，最后去除重复的不能识别叠字
• Connectionist Temporal Classification(CTC)
在之间可以添加符号null，可以解决叠字问题
• Sequence to sequence learning
Both input and output are both sequences with different lengths.
比如做Machine Translation
• Beyond Sequence
Syntactic parsing
• Sequence-to-sequence Auto-encoder - Text
To understand the meaning of a word sequence, the order of the words can not be ignored.
• Sequence-to-sequence Auto-encoder - Speech
Dimension reduction for a sequence with variable length audio segments(word-level) -> fixed-length vector

### Attention-based Model

Neural Turing Machine

### Integrated Together

deep learning和structured learning结合起来。input features 先通过RNN/LSTM，然后RNN/LSTM的output再做为HMM/svm的input。用RNN/LSTM的output来定义HMM/structured SVM的evaluation function，如此就可以同时享有deep learning的好处，也可以有structure learning的好处。

## Ensemble

### Framework of Ensemble

Get a set of classifiers. They should be diverse.
Aggregate the classifiers(properly).

### Bagging

Decision Tree 就比较容易overfit，需要做bagging

Random Forest

Descision Tree is easy to achieve 0% error rate on training data, if each training example has its own leaf..
Random forest: Bagging of decision tree

• Resampling trainging data is not sufficient.
• Randomly restrict the features/questions used in each split.
Out-of-bag validation for bagging

### Boosting

Improving Weak Classifiers

Math Warning，说实话这段没打听懂

Majority Vote

## Deep Reinforcement Learning

Environment->observation/state & reward -> Agent -> action -> Environment

Reward delay
Agent’s actions affect the subsequent data it receives