The objective is to minimize $J(w) = \mathbb{E}\big[(v_\pi(S) - \hat v(S, w))^2\big]$ by gradient descent:

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k),$$

where

$$\nabla_w J(w) = \nabla_w \mathbb{E}\big[(v_\pi(S) - \hat v(S, w))^2\big] = \mathbb{E}\big[\nabla_w (v_\pi(S) - \hat v(S, w))^2\big] = -2\,\mathbb{E}\big[(v_\pi(S) - \hat v(S, w))\,\nabla_w \hat v(S, w)\big].$$

Substituting the gradient back gives

$$w_{k+1} = w_k + 2\alpha_k\,\mathbb{E}\big[(v_\pi(S) - \hat v(S, w_k))\,\nabla_w \hat v(S, w_k)\big].$$

Algorithm 8.1: TD learning of state values with function approximation

- Initialization: A function $\hat v(s, w)$ that is differentiable in $w$. Initial parameter $w_0$.
- Goal: Learn the true state values of a given policy $\pi$.
- For each episode $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated by $\pi$, do
  - For each sample $(s_t, r_{t+1}, s_{t+1})$, do
    - In the general case, $w_{t+1} = w_t + \alpha_t\big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big]\nabla_w \hat v(s_t, w_t)$
    - In the linear case, $w_{t+1} = w_t + \alpha_t\big[r_{t+1} + \gamma \phi^\top(s_{t+1}) w_t - \phi^\top(s_t) w_t\big]\phi(s_t)$
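A minimal numpy sketch of the linear case, assuming one-hot features and a made-up 1-D random-walk environment (the environment and its reward are placeholders for illustration, not part of the algorithm):

```python
import numpy as np

# Sketch of Algorithm 8.1 (linear case): TD(0) learning of state values with
# v_hat(s, w) = phi(s)^T w.  The 5-state random walk below is a toy stand-in.

n_states, gamma, alpha = 5, 0.9, 0.05
w = np.zeros(n_states)                       # parameter vector w

def phi(s):                                  # one-hot feature vector phi(s)
    x = np.zeros(n_states); x[s] = 1.0
    return x

def v_hat(s):                                # v_hat(s, w) = phi(s)^T w
    return phi(s) @ w

rng = np.random.default_rng(0)
for episode in range(500):
    s = n_states // 2
    while True:
        s_next = s + rng.choice([-1, 1])     # random-walk policy pi
        done = s_next < 0 or s_next == n_states
        r = 1.0 if s_next == n_states else 0.0
        target = r if done else r + gamma * v_hat(s_next)
        # TD update: w <- w + alpha * [r + gamma*v_hat(s') - v_hat(s)] * phi(s)
        w += alpha * (target - v_hat(s)) * phi(s)
        if done:
            break
        s = s_next

print("estimated state values:", w)
```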
The same idea carries over to action values. Sarsa with function approximation uses the update below (a code sketch covering both Sarsa and the Q-learning variant follows Algorithm 8.3):

$$w_{t+1} = w_t + \alpha_t\big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\big]\nabla_w \hat q(s_t, a_t, w_t).$$

Algorithm 8.2: Sarsa with function approximation

- Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0, 1)$.
- Goal: Learn an optimal policy that can lead the agent to the target state from an initial state $s_0$.
- For each episode, do
  - Generate $a_0$ at $s_0$ following $\pi_0(s_0)$
  - If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do
    - Collect the experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$ by interacting with the environment; generate $a_{t+1}$ following $\pi_t(s_{t+1})$.
    - Update q-value: $w_{t+1} = w_t + \alpha_t\big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\big]\nabla_w \hat q(s_t, a_t, w_t)$
    - Update policy: $\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}\big(|\mathcal{A}(s_t)| - 1\big)$ if $a = \arg\max_{a \in \mathcal{A}(s_t)} \hat q(s_t, a, w_{t+1})$, and $\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise
    - $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
Q-learning follows the same idea:
$$w_{t+1} = w_t + \alpha_t\Big[r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t)\Big]\nabla_w \hat q(s_t, a_t, w_t).$$

Algorithm 8.3: Q-learning with function approximation (on-policy version)

- Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0, 1)$.
- Goal: Learn an optimal path that can lead the agent to the target state from an initial state $s_0$.
- For each episode, do
  - If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do
    - Collect the experience sample $(a_t, r_{t+1}, s_{t+1})$ given $s_t$: generate $a_t$ following $\pi_t(s_t)$; generate $r_{t+1}, s_{t+1}$ by interacting with the environment.
    - Update q-value: $w_{t+1} = w_t + \alpha_t\big[r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t)\big]\nabla_w \hat q(s_t, a_t, w_t)$
    - Update policy: $\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}\big(|\mathcal{A}(s_t)| - 1\big)$ if $a = \arg\max_{a \in \mathcal{A}(s_t)} \hat q(s_t, a, w_{t+1})$, and $\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise
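A numpy sketch covering both Algorithm 8.2 and Algorithm 8.3, assuming one-hot state-action features and a made-up 1-D chain environment (the dynamics and rewards are placeholders); the two algorithms differ only in the bootstrap target, toggled here by `use_sarsa`:

```python
import numpy as np

# Sketch of Sarsa / Q-learning with linear function approximation:
# q_hat(s, a, w) = phi(s, a)^T w with one-hot features, epsilon-greedy improvement.
# The chain environment (move left/right, reward 1 at the rightmost state) is a toy.

n_states, n_actions = 6, 2
gamma, alpha, eps = 0.9, 0.1, 0.1
w = np.zeros(n_states * n_actions)

def phi(s, a):                               # one-hot feature for the (s, a) pair
    x = np.zeros(n_states * n_actions); x[s * n_actions + a] = 1.0
    return x

def q_hat(s, a):
    return phi(s, a) @ w

def eps_greedy(s, rng):                      # the epsilon-greedy policy update rule
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

def step(s, a):                              # toy dynamics: a=0 left, a=1 right
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

use_sarsa = False                            # False -> Q-learning target
rng = np.random.default_rng(0)
for episode in range(300):
    s, a, done = 0, eps_greedy(0, rng), False
    while not done:
        s_next, r, done = step(s, a)
        a_next = eps_greedy(s_next, rng)
        if done:
            target = r
        elif use_sarsa:                      # Sarsa: bootstrap on the action actually taken
            target = r + gamma * q_hat(s_next, a_next)
        else:                                # Q-learning: bootstrap on the greedy action
            target = r + gamma * max(q_hat(s_next, b) for b in range(n_actions))
        w += alpha * (target - q_hat(s, a)) * phi(s, a)
        s, a = s_next, a_next
```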
Deep Q-learning
DQN is Q-learning that uses a neural network to approximate the action values; overall it is not much different from the Q-learning with function approximation described above. The gradient of the objective function to be minimized is:
$$\nabla_w J = -\mathbb{E}\Big[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat q(S', a, w_T) - \hat q(S, A, w)\Big)\nabla_w \hat q(S, A, w)\Big],$$

where $w_T$ is held fixed when taking the gradient.

Algorithm 8.4: Deep Q-learning (off-policy version)

- Initialization: A main network and a target network with the same initial parameter.
- Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy $\pi_b$.
- Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s, a, r, s')\}$
- For each iteration, do
  - Uniformly draw a mini-batch of samples from $\mathcal{B}$
  - For each sample $(s, a, r, s')$, calculate the target value as $y_T = r + \gamma \max_{a \in \mathcal{A}(s')} \hat q(s', a, w_T)$, where $w_T$ is the parameter of the target network
  - Update the main network to minimize $(y_T - \hat q(s, a, w))^2$ using the mini-batch of samples
  - Set $w_T = w$ every $C$ iterations
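A compact PyTorch sketch of the boxed procedure, assuming a replay buffer pre-filled by some behavior policy (random placeholder transitions here); the network architecture, buffer size, batch size, and $C$ are arbitrary choices for illustration, not prescribed by the algorithm:

```python
import random
import torch
import torch.nn as nn

# Sketch of Deep Q-learning: a main network and a target network, a replay buffer,
# and mini-batch regression toward y_T = r + gamma * max_a q_hat(s', a, w_T).
# State dimension, action count, and the random transitions are placeholders.

state_dim, n_actions, gamma, C = 4, 3, 0.9, 50

def make_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())       # same initial parameter
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Replay buffer B = {(s, a, r, s')} generated by a behavior policy (random here).
buffer = [(torch.randn(state_dim), random.randrange(n_actions),
           random.random(), torch.randn(state_dim)) for _ in range(1000)]

for it in range(500):
    batch = random.sample(buffer, 32)                 # uniformly draw a mini-batch
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                             # target uses the frozen parameters w_T
        y = r + gamma * target_net(s_next).max(dim=1).values

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)               # minimize (y_T - q_hat(s, a, w))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (it + 1) % C == 0:                             # set w_T = w every C iterations
        target_net.load_state_dict(q_net.state_dict())
```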
Policy Gradient
The earlier part of this article described how to turn the value function from a table into a parameterized function. The policies discussed so far have also mostly been tabular, picking the action with the largest value by table lookup. Deep RL likewise needs the policy itself to become a function: $\pi(a \mid s, \theta)$ with $\theta \in \mathbb{R}^m$. Such a function may take an $(s, a)$ pair as input and output the probability of taking that action, or take $s$ alone and output a probability for every action in the action space, as in the sketch below.
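For example, a softmax policy over linear preferences takes a state and returns a probability for every action; the state and action counts and the tabular-style parameterization here are made-up choices for illustration:

```python
import numpy as np

# Sketch of a parameterized policy pi(a|s, theta): softmax over linear preferences.
# Input: a state index; output: a probability for every action in the action space.

n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))      # theta in R^m, reshaped per state for clarity

def pi(s, theta):
    prefs = theta[s]                         # preference of each action at state s
    z = np.exp(prefs - prefs.max())          # numerically stable softmax
    return z / z.sum()                       # probabilities sum to 1

print(pi(0, theta))                          # uniform before any learning: [1/3, 1/3, 1/3]
```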
The question that follows is how to decide which policy is optimal. In the tabular case an optimal policy can be found by making the value of every state maximal; this is achievable because there exists a policy that is simultaneously optimal in all states. With a parameterized policy, however, the parameter has a finite dimension, and you cannot maximize the value of every state at the same time. You therefore need a single number that rates the whole policy, and then run gradient ascent on that number; this number is the scalar metric $J(\theta)$. Updating the policy also changes from directly editing table entries to updating the function's weights.
Algorithm 9.1: Policy gradient by Monte Carlo (REINFORCE)

- Initialization: Initial parameter $\theta$; $\gamma \in (0, 1)$; $\alpha > 0$.
- Goal: Learn an optimal policy for maximizing $J(\theta)$.
- For each episode, do
  - Generate an episode $\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$ following $\pi(\theta)$.
  - For $t = 0, 1, \ldots, T-1$:
    - Value update: $q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
    - Policy update: $\theta \leftarrow \theta + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta)\, q_t(s_t, a_t)$
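A numpy sketch of Algorithm 9.1 using the softmax-over-preferences policy from the earlier sketch; the 1-D chain environment and its small step penalty are made up for illustration:

```python
import numpy as np

# Sketch of REINFORCE: Monte Carlo returns as q_t(s_t, a_t), softmax policy gradient.

n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.1
theta = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):                                  # toy dynamics: a=0 left, a=1 right
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == n_states - 1
    return s_next, (1.0 if done else -0.01), done

for episode in range(2000):
    # generate an episode {s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T} following pi(theta)
    traj, s, done = [], 0, False
    while not done:
        a = rng.choice(n_actions, p=policy(s))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next

    g = 0.0
    for s, a, r in reversed(traj):               # q_t(s_t, a_t): discounted return from t
        g = r + gamma * g
        grad_log = -policy(s)                    # grad of ln pi(a|s, theta) for a softmax
        grad_log[a] += 1.0
        theta[s] += alpha * grad_log * g         # theta <- theta + alpha * grad ln pi * q_t
```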
Actor-critic
The methods covered so far fall roughly into two classes: policy-based and value-based. The former, such as REINFORCE and other policy gradient algorithms, directly update the policy parameters; the latter, such as Q-learning and DQN, directly update the value function. Actor-critic combines the two: the actor is the parameterized policy being updated, and the critic is the value function used to evaluate it.
Algorithm 10.1: The simplest actor-critic algorithm (QAC)

- Initialization: A policy function $\pi(a \mid s, \theta_0)$ where $\theta_0$ is the initial parameter. A value function $q(s, a, w_0)$ where $w_0$ is the initial parameter. $\alpha_w, \alpha_\theta > 0$.
- Goal: Learn an optimal policy to maximize $J(\theta)$.
- At time step $t$ in each episode, do
  - Generate $a_t$ following $\pi(a \mid s_t, \theta_t)$, observe $r_{t+1}, s_{t+1}$, and then generate $a_{t+1}$ following $\pi(a \mid s_{t+1}, \theta_t)$.
  - Actor (policy update): $\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q(s_t, a_t, w_t)$
  - Critic (value update): $w_{t+1} = w_t + \alpha_w\big[r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t)\big]\nabla_w q(s_t, a_t, w_t)$
Algorithm 10.2: Advantage actor-critic (A2C) or TD actor-critic

- Initialization: A policy function $\pi(a \mid s, \theta_0)$ where $\theta_0$ is the initial parameter. A value function $v(s, w_0)$ where $w_0$ is the initial parameter. $\alpha_w, \alpha_\theta > 0$.
- Goal: Learn an optimal policy to maximize $J(\theta)$.
- At time step $t$ in each episode, do
  - Generate $a_t$ following $\pi(a \mid s_t, \theta_t)$ and then observe $r_{t+1}, s_{t+1}$.
  - Advantage (TD error): $\delta_t = r_{t+1} + \gamma v(s_{t+1}, w_t) - v(s_t, w_t)$
  - Actor (policy update): $\theta_{t+1} = \theta_t + \alpha_\theta \delta_t \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)$
  - Critic (value update): $w_{t+1} = w_t + \alpha_w \delta_t \nabla_w v(s_t, w_t)$
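A numpy sketch of Algorithm 10.2; the QAC variant above differs only in maintaining an action-value critic $q(s, a, w)$ with a Sarsa-style update instead of a state-value critic and TD error. The toy chain environment and one-hot critic features are stand-ins:

```python
import numpy as np

# Sketch of advantage actor-critic (A2C): softmax actor pi(a|s, theta),
# linear state-value critic v(s, w), TD error delta_t used as the advantage.

n_states, n_actions = 5, 2
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1
theta = np.zeros((n_states, n_actions))
w = np.zeros(n_states)                            # v(s, w) = w[s] (one-hot features)
rng = np.random.default_rng(0)

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):                                   # toy dynamics: a=0 left, a=1 right
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == n_states - 1
    return s_next, (1.0 if done else -0.01), done

for episode in range(2000):
    s, done = 0, False
    while not done:
        a = rng.choice(n_actions, p=policy(s))    # generate a_t following pi(a|s_t, theta_t)
        s_next, r, done = step(s, a)

        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]         # advantage (TD error)

        grad_log = -policy(s)                     # grad of ln pi(a_t|s_t, theta) (softmax)
        grad_log[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log    # actor (policy update)
        w[s] += alpha_w * delta                       # critic (value update, one-hot grad)

        s = s_next
```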
Off-policy policy gradient
Policy-based algorithms are essentially on-policy by nature: if the gradient is not taken with respect to the agent's own policy, the optimization is not for that policy. A policy can therefore only take gradients on trajectories generated by its current parameters; once the parameters change even slightly, those trajectories can no longer be used, which makes on-policy methods sample-inefficient.
Intuitively, since actor-critic improves an action using its action value $q(s, a)$, if two policies agree on $q(s, a)$ for a given state, the gradient obtained at that state should be the same; the only differences are the probability of selecting that action and the probability of visiting that state under the stationary distribution, so it suffices to reweight the update by the ratio of those probabilities (importance sampling):
Algorithm 10.3: Off-policy actor-critic based on importance sampling

- Initialization: A given behavior policy $\beta(a \mid s)$. A target policy $\pi(a \mid s, \theta_0)$ where $\theta_0$ is the initial parameter. A value function $v(s, w_0)$ where $w_0$ is the initial parameter. $\alpha_w, \alpha_\theta > 0$.
- Goal: Learn an optimal policy to maximize $J(\theta)$.
- At time step $t$ in each episode, do
  - Generate $a_t$ following $\beta(s_t)$ and then observe $r_{t+1}, s_{t+1}$.
  - Advantage (TD error): $\delta_t = r_{t+1} + \gamma v(s_{t+1}, w_t) - v(s_t, w_t)$
  - Actor (policy update): $\theta_{t+1} = \theta_t + \alpha_\theta \dfrac{\pi(a_t \mid s_t, \theta_t)}{\beta(a_t \mid s_t)} \delta_t \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)$
  - Critic (value update): $w_{t+1} = w_t + \alpha_w \dfrac{\pi(a_t \mid s_t, \theta_t)}{\beta(a_t \mid s_t)} \delta_t \nabla_w v(s_t, w_t)$
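A numpy sketch of Algorithm 10.3 that mirrors the A2C sketch above; the only changes are that $a_t$ is drawn from a fixed uniform behavior policy $\beta$ (a placeholder choice) and that both updates are scaled by the importance ratio $\pi(a_t \mid s_t, \theta_t)/\beta(a_t \mid s_t)$:

```python
import numpy as np

# Sketch of off-policy actor-critic with importance sampling: same toy chain and
# softmax/linear parameterization as the A2C sketch; actions come from beta.

n_states, n_actions = 5, 2
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1
theta = np.zeros((n_states, n_actions))           # target policy parameters
w = np.zeros(n_states)                            # critic v(s, w) = w[s]
rng = np.random.default_rng(0)

def policy(s):                                    # target policy pi(a|s, theta)
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

beta = np.full(n_actions, 1.0 / n_actions)        # behavior policy beta(a|s): uniform

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == n_states - 1
    return s_next, (1.0 if done else -0.01), done

for episode in range(2000):
    s, done = 0, False
    while not done:
        a = rng.choice(n_actions, p=beta)         # a_t drawn from beta, not pi
        s_next, r, done = step(s, a)

        rho = policy(s)[a] / beta[a]              # importance weight pi / beta
        delta = r + gamma * (0.0 if done else w[s_next]) - w[s]

        grad_log = -policy(s)
        grad_log[a] += 1.0
        theta[s] += alpha_theta * rho * delta * grad_log   # importance-weighted actor update
        w[s] += alpha_w * rho * delta                      # importance-weighted critic update
        s = s_next
```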