
Reinforcement Learning 2023, Lecture 1 (Dimitri P. Bertsekas)

2023-02-12 08:32 Author: 听听我的脑洞


06:47


On-Line Play algorithm.

Online tree search.

Search over all the moves, determine the resulting final values, and choose the move based on those values.

Decide the action by its outcome.
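A minimal Python sketch of this idea (the toy game, legal_moves, next_state, and evaluate below are hypothetical stand-ins, not from the lecture):

```python
def legal_moves(state):
    # toy game: the state is an integer; a move adds -1, 0, or +1
    return [-1, 0, 1]

def next_state(state, move):
    return state + move

def evaluate(state):
    # stand-in for the (off-line trained) value function
    return -abs(state - 3)  # the value is best when the state equals 3

def online_play(state):
    """Search all moves, evaluate the resulting positions, pick the best move."""
    scored = [(evaluate(next_state(state, m)), m) for m in legal_moves(state)]
    best_value, best_move = max(scored)
    return best_move

print(online_play(0))  # -> 1 (moves toward the high-value state)
```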


11:34


Off-Line Training in AlphaZero: Approximate Policy Iteration (PI)

A value neural network is obtained through training.

A policy neural network is obtained through training.
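A minimal Python sketch of the approximate policy iteration loop (lookup tables and a made-up 4-state chain stand in for the value/policy neural nets and the game; everything here is a hypothetical toy, not the lecture's setup):

```python
import random

STATES, ACTIONS, HORIZON = range(4), [0, 1], 20

def step(s, a):
    # toy dynamics: action 1 moves right, action 0 moves left; reward 1 on reaching state 3
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

def evaluate_policy(policy, episodes=10, gamma=0.9):
    """Policy evaluation by simulation (the role played by value-net training)."""
    value = {}
    for s0 in STATES:
        total = 0.0
        for _ in range(episodes):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(HORIZON):
                s, r = step(s, policy[s])
                ret += discount * r
                discount *= gamma
            total += ret
        value[s0] = total / episodes
    return value

def improve_policy(value, gamma=0.9):
    """One-step lookahead against the current value estimate (the role played by policy-net training)."""
    new_policy = {}
    for s in STATES:
        q = {}
        for a in ACTIONS:
            s2, r = step(s, a)
            q[a] = r + gamma * value[s2]
        new_policy[s] = max(q, key=q.get)
    return new_policy

policy = {s: random.choice(ACTIONS) for s in STATES}
for _ in range(5):                                    # a few policy iteration cycles
    policy = improve_policy(evaluate_policy(policy))
print(policy)                                         # every state ends up choosing action 1 (move right)
```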


16:04


The on-line player plays better than the off-line-trained player.


Central role of Newton's method?

mathematical connection?


23:27


Skipped this part.


40:00


Reference page.


40:25


Terminology.

RL uses Max/Value

DP uses Min/Cost

  • Reward of a stage = (opposite of) cost of a stage
  • State value = (opposite of) state cost
  • Value (or state-value) function = (opposite of) cost function

Controlled system terminology

  • Agent = Decision maker or controller
  • Action = Decision or control
  • Environment = Dynamic system

Methods terminology

  • Learning = Solving a DP-related problem using simulation
  • Self-learning (or self-play in the context of games) = Solving a DP problem using simulation-based policy iteration.
  • Planning vs. learning distinction = Solving a DP problem with model-based vs. model-free simulation


44:59


Notations.

Two types of notation: transition probabilities and the discrete-time system equation.
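For reference, the two notations in the usual conventions (my reconstruction, not a verbatim quote of the slide):

\[
p_{ij}(u) = P\big(x_{k+1} = j \mid x_k = i,\ u_k = u\big) \qquad \text{(transition-probability notation)}
\]
\[
x_{k+1} = f_k(x_k, u_k, w_k), \quad w_k \ \text{a random disturbance} \qquad \text{(system-equation notation)}
\]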


50:53


Finite Horizon Deterministic Optimal Control Model

The system evolves over N stages and ends in the terminal state x_N.
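Written out in the standard notation (a reconstruction, assuming the usual symbols g_k for the stage cost and g_N for the terminal cost):

\[
x_{k+1} = f_k(x_k, u_k), \quad u_k \in U_k(x_k), \quad k = 0, 1, \ldots, N-1,
\]
\[
J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k).
\]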



54:40


A Special Case: Finite Number of States and Controls.

The main point: this special case is equivalent to a shortest path problem (nodes are the state-stage pairs, arcs are the control choices)...


59:05


Principle of Optimality:

THE TAIL OF AN OPTIMAL SEQUENCE IS OPTIMAL FOR THE TAIL SUBPROBLEM.

If there were a better solution to the tail subproblem, we could replace the tail of the optimal sequence with it and improve the overall solution, contradicting optimality. Hence the principle of optimality holds. For example, if a shortest path from A to C passes through B, then its B-to-C portion must be a shortest path from B to C.


01:04:18


From One Tail Subproblem to the Next.


I think the point of this part is that we can solve the problem backward: start from the last tail subproblem and work toward the first.


01:06:16


DP Algorithm: Solves all tail subproblems efficiently by using the Principle of Optimality.
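In symbols, the backward recursion (standard form; reconstructed rather than quoted from the slide):

\[
J_N^*(x_N) = g_N(x_N),
\]
\[
J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \Big[ g_k(x_k, u_k) + J_{k+1}^*\big(f_k(x_k, u_k)\big) \Big], \quad k = N-1, \ldots, 0.
\]

The optimal cost of the whole problem is J_0^*(x_0).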


Two examples were given in between; skipped.




01:25:24


General Discrete Optimization.
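A minimal Python sketch of casting a multi-stage discrete decision problem as DP and solving it with the backward recursion above (the 3-stage instance below is a made-up toy, not from the lecture):

```python
def backward_dp(states, controls, f, g, g_N, N):
    """Generic backward DP: returns cost-to-go tables J[k][x] and an optimal policy."""
    J = [dict() for _ in range(N + 1)]
    policy = [dict() for _ in range(N)]
    for x in states[N]:
        J[N][x] = g_N(x)                      # terminal cost
    for k in range(N - 1, -1, -1):            # k = N-1, ..., 0
        for x in states[k]:
            best_u, best_cost = None, float("inf")
            for u in controls(k, x):
                cost = g(k, x, u) + J[k + 1][f(k, x, u)]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], policy[k][x] = best_cost, best_u
    return J, policy

# Toy instance: at each of N=3 stages choose u in {0,1}; the state counts how many
# 1's were chosen so far; each 1 costs 2, and the terminal cost is 10 unless at
# least two 1's were chosen.
N = 3
states = [list(range(k + 1)) for k in range(N + 1)]   # reachable counts at each stage
controls = lambda k, x: [0, 1]
f = lambda k, x, u: x + u
g = lambda k, x, u: 2.0 * u
g_N = lambda x: 0.0 if x >= 2 else 10.0

J, policy = backward_dp(states, controls, f, g, g_N, N)
print(J[0][0])        # optimal total cost from the initial state (here 4.0)
print(policy[0][0])   # an optimal first decision
```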



01:29:47


Connecting DP to Reinforcement Learning.

Use approximations \tilde{J}_k in place of the optimal costs J^*_k (off-line training).

Generate all of these approximations in advance.

Then, going forward, use them to find the controls \tilde{u}_k (on-line play).
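In formula form for the deterministic problem (a reconstruction of the standard one-step lookahead expression):

\[
\tilde{u}_k \in \arg\min_{u_k \in U_k(x_k)} \Big[ g_k(x_k, u_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k)\big) \Big],
\]

i.e., the off-line approximations \tilde{J}_{k+1} replace the exact costs-to-go J^*_{k+1} inside the on-line minimization.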


01:33:17


Extensions:

Stochastic finite horizon problems: x_{k+1} is random.

Infinite horizon problems: the process does not end at stage N...

Stochastic partial state information problems:

the state is not observed perfectly.

Minimax/game problems.


01:40:48


Course requirements; skipped.


