Principles of Artificial Intelligence: Review Notes 6
Markov Decision Processes
An MDP is defined by:
A set of states s ∈ S
A set of actions a ∈ A
A transition function T(s, a, s’)
(1)Probability that a from s leads to s’, i.e., P(s’| s, a)
(2)Also called the model or the dynamics
A reward function R(s, a, s'); sometimes just R(s) or R(s')
A start state
Maybe a terminal state
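To make the definition concrete, here is a minimal sketch of how such an MDP could be stored in Python. The states, actions, probabilities, and rewards below are invented purely for illustration and are not part of the course material.

# A tiny MDP stored as plain Python dictionaries (illustrative values only).
states = ["cool", "warm", "overheated"]        # s ∈ S
actions = ["slow", "fast"]                     # a ∈ A

# T[(s, a)] lists the possible next states with their probabilities P(s'|s,a)
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}

# R[(s, a, s')] is the reward received on that transition
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

start_state = "cool"                           # a start state
terminal_states = {"overheated"}               # maybe a terminal state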
What is Markov about MDPs?
“Markov” generally means that, given the present state, the future and the past are independent
For Markov decision processes, “Markov” means that the outcome of an action depends only on the current state
This is just like search, where the successor function can only depend on the current state (and not the history)
Policies
For MDPs, we want an optimal policy
π*: S → A
A policy gives an action for each state
An optimal policy is one that maximizes expected utility if followed
MDP Search Trees
(Figure: an MDP search tree; below each state s are its q-states (s, a), and each q-state branches on transitions (s, a, s').)
The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting out having taken action a from
state s and (thereafter) acting optimally
Compared with V, a Q-value additionally fixes the action a; this is what the policy-extraction step is built on
The optimal policy:
π*(s) = optimal action from state s
The mapping π* itself is what represents the policy
Recursive definition of value
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
You must know these formulas!
Value Iteration
https://pic.imgdb.cn/item/667b8afdd9c307b7e9e0f811.png
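A compact value-iteration sketch, assuming the dictionary-based MDP representation from the example above (states, actions, T, R are that example's names, not anything defined in the notes):

def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    # Repeats the Bellman update:
    #   V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
    V = {s: 0.0 for s in states}                        # V_0 = 0 everywhere
    for _ in range(iterations):
        new_V = {}
        for s in states:
            q_values = [
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T.get((s, a), []))
                for a in actions if (s, a) in T         # only actions available in s
            ]
            new_V[s] = max(q_values) if q_values else 0.0   # terminal states keep 0
        V = new_V
    return V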
Problems with Value Iteration
Value iteration repeats the Bellman updates:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
o Problem 1: It’s slow, O(S²A) per iteration
o Problem 2: The “max” at each state rarely changes
o Problem 3: The policy often converges long before the values
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
o Repeat steps until policy converges
o This is policy iteration
o It’s still optimal!
o Can converge (much) faster under some conditions
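A policy-iteration sketch under the same assumed representation. The fixed number of evaluation sweeps (eval_sweeps) is an arbitrary choice for the example; an exact convergence test would also work.

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
    # Start from an arbitrary policy: the first action available in each state.
    policy = {s: next((a for a in actions if (s, a) in T), None) for s in states}
    while True:
        # Step 1: policy evaluation -- utilities of the current (fixed) policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: (sum(p * (R.get((s, policy[s], s2), 0.0) + gamma * V[s2])
                         for s2, p in T.get((s, policy[s]), []))
                     if policy[s] is not None else 0.0)
                 for s in states}
        # Step 2: policy improvement -- one-step lookahead with the evaluated V.
        new_policy = {}
        for s in states:
            candidates = [a for a in actions if (s, a) in T]
            new_policy[s] = (max(candidates,
                                 key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                                   for s2, p in T.get((s, a), [])))
                             if candidates else None)
        if new_policy == policy:        # policy stable -> it is optimal
            return policy, V
        policy = new_policy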

Comparison
Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
(1)Every iteration updates both the values and (implicitly) the policy
(2)We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
(1)We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
(2)After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
(3)The new policy will be better (or we’re done)
Both are dynamic programs for solving MDPs
Summary: MDP Algorithms
So you want to….
o Compute optimal values: use value iteration or policy iteration
o Compute values for a particular policy: use policy evaluation
o Turn your values into a policy: use policy extraction (one-step lookahead)
These all look the same!
o They basically are – they are all variations of Bellman updates
o They all use one-step lookahead expectimax fragments
o They differ only in whether we plug in a fixed policy or max over actions
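As a small example of the third case, a policy-extraction sketch (one-step expectimax lookahead over a given value table V, using the same assumed T/R dictionaries):

def extract_policy(states, actions, T, R, V, gamma=0.9):
    # For each state, pick the action whose one-step lookahead value is largest.
    policy = {}
    for s in states:
        candidates = [a for a in actions if (s, a) in T]
        policy[s] = (max(candidates,
                         key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                           for s2, p in T.get((s, a), [])))
                     if candidates else None)
    return policy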

Reinforcement Learning
That wasn’t planning, it was learning! Specifically, reinforcement learning
There was an MDP, but you couldn’t solve it with just computation; you needed to actually act to figure it out
Important ideas in reinforcement learning that came up
Exploration: you have to try unknown actions to get information
Exploitation: eventually, you have to use what you know
Regret: even if you learn intelligently, you make mistakes
Sampling: because of chance, you have to try things repeatedly
Difficulty: learning can be much harder than solving a known MDP
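The exploration/exploitation trade-off above is commonly handled with ε-greedy action selection; a minimal sketch (the dictionary-based Q-table is an assumption made for this illustration):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon pick a random action (explore),
    # otherwise pick the action with the highest current Q-value (exploit).
    # Q is assumed to map (state, action) pairs to values.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))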
New twist: don’t know T or R
I.e. we don’t know which states are good or what the actions do
Must actually try actions and states out to learn

Basic idea:
Receive feedback in the form of rewards
Agent’s utility is defined by the reward function
Must (learn to) act so as to maximize expected rewards
All learning is based on observed samples of outcomes!
Offline (MDPs) vs. Online (RL)
Model-Based Learning (probabilistic model)
Model-based reinforcement learning
Estimate T and R using historical data (episodes).
This is done using the empirical frequencies from repeated trials
What we compute are estimates of T and R, considering each step from state s via action a to state s'


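A sketch of this estimation step. The episode format, a list of (s, a, s', r) transitions per episode, is an assumption made for the example:

from collections import defaultdict

def estimate_model(episodes):
    # T_hat(s,a,s') = count(s,a,s') / count(s,a); R_hat(s,a,s') = average observed reward.
    counts = defaultdict(int)          # (s, a, s') -> times this outcome was seen
    totals = defaultdict(int)          # (s, a)     -> times this action was tried
    reward_sums = defaultdict(float)   # (s, a, s') -> accumulated reward
    for episode in episodes:
        for s, a, s_next, r in episode:
            counts[(s, a, s_next)] += 1
            totals[(s, a)] += 1
            reward_sums[(s, a, s_next)] += r
    T_hat = {k: c / totals[(k[0], k[1])] for k, c in counts.items()}
    R_hat = {k: reward_sums[k] / c for k, c in counts.items()}
    return T_hat, R_hat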
Model-Free Learning
Model-free reinforcement learning is further divided into passive reinforcement learning and active reinforcement learning
1. Passive reinforcement learning
Goal: learn the state values
(1) Direct evaluation
Goal: compute values for each state under a fixed policy π
Idea: average the observed sample values
Direct evaluation computes V-values; no specific action is involved
sample:



Advantages of direct evaluation:
·It does not require knowledge of T or R
·It uses sample transitions to compute the correct averages
Disadvantages of direct evaluation:
·It wastes the information about how states are connected
·Each state must be learned separately, which takes a long time
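A direct-evaluation sketch that averages the observed discounted returns after each visit to a state (same assumed episode format as in the model-based example):

from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    # V(s) = average of the discounted returns observed after visiting s.
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)
    for episode in episodes:                       # episode: [(s, a, s', r), ...]
        G = 0.0
        for s, a, s_next, r in reversed(episode):  # walk backwards to accumulate returns
            G = r + gamma * G
            return_sums[s] += G
            visit_counts[s] += 1
    return {s: return_sums[s] / visit_counts[s] for s in return_sums}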
(2) Temporal difference learning
Big idea: learn from every experience!
We still estimate V, but the update rule is improved:
sample = R(s, π(s), s') + γ V^π(s')
V^π(s) ← (1 − α) V^π(s) + α · sample
The “difference” is best understood as the change across one state transition
It fully exploits the connection between the state before and the state after each transition
Compared with direct evaluation, it learns the true state values faster and from fewer samples
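A sketch of a single TD(0) update to the value table after observing one transition while following the fixed policy (the function and variable names are illustrative, not from the notes):

def td_value_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    # Move V(s) a small step toward the observed sample r + γ V(s').
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V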
What TD value learning still lacks is a way to turn state values into a policy; this is the policy-extraction problem
Therefore we need to compute Q-values instead of V-values, because a Q-value includes the specific action
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
In Q-value iteration, note the details in which this update differs from the value-iteration update
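For comparison with value iteration, a Q-value-iteration sketch under the same assumed model; note that the max is now taken over the next state's actions, inside the expectation:

def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    # Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
    Q = {(s, a): 0.0 for s in states for a in actions if (s, a) in T}
    for _ in range(iterations):
        new_Q = {}
        for (s, a) in Q:
            new_Q[(s, a)] = sum(
                p * (R.get((s, a, s2), 0.0)
                     + gamma * max((Q[(s2, a2)] for a2 in actions if (s2, a2) in Q),
                                   default=0.0))
                for s2, p in T.get((s, a), []))
        Q = new_Q
    return Q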
2. Active reinforcement learning: Q-learning
https://blog.csdn.net/qq_30615903/article/details/80739243
This post is specifically about Q-learning and explains it well; it is worth reading carefully
Q-learning also iterates the Q-values based on samples
Q-learning is a value-based reinforcement learning algorithm. Q stands for Q(s, a), the expected return from taking action a in state s at a given moment; the environment feeds back a reward for the agent's action. The main idea of the algorithm is to build a Q-table indexed by state and action to store the Q-values, and then choose the action that yields the largest return according to those Q-values

The update rule is as follows:
Q(s, a) ← Q(s, a) + α [ R + γ max_{a'} Q(s', a') − Q(s, a) ]
The “Q target” is the largest Q(s', a') over the next state s', multiplied by the discount γ, plus the observed reward R; the current Q(s, a) stored in the Q-table is the “Q estimate”.
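A sketch of one Q-learning update matching the rule above; the defaultdict-based Q-table and the variable names are assumptions for the example:

from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q target: r + γ max_a' Q(s', a'); Q estimate: the current Q(s, a).
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Usage: Q = defaultdict(float) gives 0 for unseen (state, action) pairs;
# after each observed transition call q_learning_update(Q, s, a, r, s_next, actions).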