
Reinforcement Learning Notes

Policy Gradient

Policy of Actor

Policy $\pi$ is a network with parameter $\theta$

  • Input: the observation of the machine, represented as a vector or a matrix.
  • Output: each action corresponds to a neuron in the output layer (a minimal sketch follows this list).
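
A minimal sketch of such a policy network in PyTorch, assuming a vector observation of size `obs_dim` and a discrete action space of size `n_actions` (both names are illustrative, not from the original notes):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps an observation vector to a probability for each discrete action."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),  # one output neuron per action
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)  # action probabilities

# Sampling an action from the current policy:
# probs = policy(torch.tensor(obs, dtype=torch.float32))
# action = torch.distributions.Categorical(probs).sample()
```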

Three components: actor, environment (env), reward.

One complete round of interaction is an episode.
total reward $R=\sum_{t} r_{t}$

Trajectory $\tau =\{s_1, a_1, s_2, a_2,\cdots, s_{T}, a_{T}\}$
$p_{\theta}(\tau) = p(s_1)p_{\theta}(a_1|s_1)p(s_2|s_1,a_1)p_{\theta}(a_2|s_2) \cdots = p(s_1)\prod_{t=1}^{T}p_{\theta}(a_t|s_t)p(s_{t+1}|s_t,a_t)$
Expected Reward $\bar{R}_{\theta}=\sum_{\tau}R(\tau)p_{\theta}(\tau)$
$\nabla \bar{R}_{\theta}=\sum_{\tau}R(\tau)\nabla p_{\theta}(\tau)=\sum_{\tau}R(\tau)p_{\theta}(\tau)\nabla \log p_{\theta}(\tau)=E_{\tau\sim p_{\theta}(\tau)}[R(\tau)\nabla \log p_{\theta}(\tau)]\approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^{n})\nabla \log p_{\theta}(a_t^{n}|s_t^{n})$
(The environment terms $p(s_{t+1}|s_t,a_t)$ do not depend on $\theta$, so only the policy terms survive in $\nabla \log p_{\theta}(\tau)$.)
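
A minimal REINFORCE-style sketch of this update in PyTorch; `policy`, `optimizer`, and the way episode data is collected are assumptions for illustration:

```python
import torch

def reinforce_update(policy, optimizer, log_probs, total_reward):
    """One policy-gradient step: maximize R(tau) * sum_t log pi_theta(a_t|s_t).

    log_probs: list of log pi_theta(a_t|s_t) tensors collected during one episode.
    total_reward: scalar R(tau) for that episode.
    """
    loss = -total_reward * torch.stack(log_probs).sum()  # minus sign: gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```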

Baseline
Replace $R(\tau)$ with $R(\tau)-b$, so that actions with below-average return get a negative weight even when all raw rewards are positive.

Assign Suitable Credit
At time step $t$, replace $R(\tau)-b$ with $\sum_{t'=t}^{T}r_{t'}$, i.e. only the rewards obtained from $t$ onward are credited to $a_t$.
Since the influence of an action decays with distance, this can be further replaced by $\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$ with $\gamma < 1$.
The baseline $b$ can also be state-dependent.
The Advantage Function $A^{\theta}(s_t, a_t)$ is exactly this "$R(\tau)-b$" term (a small sketch follows the list below):

  • How good it is to take $a_t$ rather than other actions at $s_t$.
  • Estimated by a 'critic'.
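
A small sketch of the credit-assignment weights above, assuming `rewards` is the list $r_1,\dots,r_T$ from one episode and `baselines` holds the (possibly state-dependent) values of $b$ (both names are illustrative):

```python
def advantages(rewards, baselines, gamma=0.99):
    """Return sum_{t'=t}^{T} gamma^{t'-t} r_{t'} - b(s_t) for every time step t."""
    result = []
    running = 0.0
    for r in reversed(rewards):          # accumulate discounted reward-to-go backwards
        running = r + gamma * running
        result.append(running)
    result.reverse()
    return [g - b for g, b in zip(result, baselines)]
```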

From on-policy to off-policy

  • On-policy: the agent being trained and the agent interacting with the environment are the same.
  • Off-policy: the agent being trained and the agent interacting with the environment are different.

Importance Sampling
Expectations under the two distributions are equal: $E_{x\sim p}[f(x)]=E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]$. Writing the right-hand side as $\int f(x)\frac{p(x)}{q(x)}q(x)dx$, the sampling density $q(x)$ cancels the denominator, leaving $\int f(x)p(x)dx$.
As an example of what can go wrong in practice: suppose the peaks of $p$ and $q$ lie on opposite sides, with $f(x)$ negative where $p$ peaks and positive where $q$ peaks. Samples drawn from $q$ mostly give positive $f(x)$; on the rare occasion a sample lands on the other side, it is multiplied by an extremely large weight $\frac{p(x)}{q(x)}$.
In other words, low-probability samples get huge weights and high-probability samples get small ones; the expectation is unchanged, but the variance can be very large.
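
A quick numerical sketch of this effect, using two unit-variance Gaussians with means on opposite sides (the particular densities and choice of $f$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x                                   # negative on the left, positive on the right
pdf = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

xs_p = rng.normal(-2.0, 1.0, size=100_000)        # samples from p (peak on the left)
xs_q = rng.normal(+2.0, 1.0, size=100_000)        # samples from q (peak on the right)
w = pdf(xs_q, -2.0) / pdf(xs_q, +2.0)             # importance weights p(x)/q(x)

print("E_p[f(x)]          ≈", f(xs_p).mean())     # close to -2
print("IS estimate from q ≈", (f(xs_q) * w).mean())
print("max weight:", w.max())                     # a few rare samples dominate
```

With a finite sample the weighted estimate can still be far from the true value even though it is unbiased; that is exactly the variance problem described above.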

PPO/TRPO

Trust Region Policy Optimization (TRPO)
The predecessor of PPO; it keeps the KL divergence as a separate constraint on the optimization rather than folding it into the objective.
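
In the standard formulation, TRPO maximizes the same surrogate objective as PPO below, but with the KL term as an explicit constraint:
$J_{TRPO}^{\theta'}(\theta)=E_{(s_t, a_t)\sim \pi_{\theta'}}[\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)]$, subject to $KL(\theta, \theta')<\delta$.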

Proximal Policy Optimization (PPO)
$J_{PPO}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta KL(\theta, \theta')$
$J^{\theta'}(\theta)=E_{(s_t, a_t)\sim \pi_{\theta'}}[\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)]$

PPO2 simply clips the ratio $\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$ to a bounded range instead of using a KL penalty.
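
For reference, the usual clipped objective, followed by a minimal PyTorch sketch of the corresponding loss (the tensor names are assumptions):
$J_{PPO2}^{\theta'}(\theta)\approx\sum_{(s_t, a_t)}\min\left(\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t),\ \mathrm{clip}\left(\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}, 1-\varepsilon, 1+\varepsilon\right)A^{\theta'}(s_t, a_t)\right)$

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate loss: the negative of the PPO2 objective above."""
    ratio = torch.exp(new_log_probs - old_log_probs)     # p_theta / p_theta'
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```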

Q-Learning

A critic does not take actions itself; it evaluates how good an action (or the current policy) is.
Keep training: each iteration yields a better estimate $V^{\pi}$, and the policy is updated accordingly; the policy that always picks the action with the largest estimated expected return is guaranteed to be at least as good as the previous one.
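
A minimal tabular Q-learning sketch showing the "pick the action with the largest estimated return" idea; the environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) is a simplified assumption for illustration:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: repeatedly improve Q and act greedily w.r.t. it."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly take the action with the largest estimated return
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[s].argmax()
            s_next, r, done = env.step(a)
            # move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q
```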

Have fun.