Train a Mario-playing RL Agent
Authors: Yuansong Feng, Suraj Subramanian, Howard Wang, Steven Guo.
This tutorial walks you through the fundamentals of Deep Reinforcement Learning. At the end, you will implement an AI-powered Mario (using Double Deep Q-Networks) that can play the game by itself.
Although no prior knowledge of RL is necessary for this tutorial, you can familiarize yourself with these RL concepts, and have this handy cheatsheet as your companion. The full code is available here.
RL Definitions
Environment The world that an agent interacts with and learns from.
Action a : How the Agent responds to the Environment. The set of all possible Actions is called action-space.
State s : The current characteristic of the Environment. The set of all possible States the Environment can be in is called state-space.
Reward r : Reward is the key feedback from Environment to Agent. It is what drives the Agent to learn and to change its future action. An aggregation of rewards over multiple time steps is called Return.
Optimal Action-Value function Q*(s, a): Gives the expected return if you start in state s, take an arbitrary action a, and then for each future time step take the action that maximizes returns. Q can be said to stand for the “quality” of the action in a state. We try to approximate this function.
Environment
Initialize Environment
In Mario, the environment consists of tubes, mushrooms and other components.
When Mario makes an action, the environment responds with the changed (next) state, reward and other info.
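As a minimal setup sketch, assuming the gym-super-mario-bros and nes-py packages and the classic Gym step() API that returns (next_state, reward, done, info); newer Gymnasium releases use a slightly different signature:

```python
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace

# Create the Super Mario Bros environment (stage 1-1)
env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")

# Limit the action space to: walk right, and jump right
env = JoypadSpace(env, [["right"], ["right", "A"]])

env.reset()
next_state, reward, done, info = env.step(action=0)
print(f"{next_state.shape},\n {reward},\n {done},\n {info}")
```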
Preprocess Environment
Environment data is returned to the agent in next_state. As you saw above, each state is represented by a [3, 240, 256] size array. Often that is more information than our agent needs; for instance, Mario’s actions do not depend on the color of the pipes or the sky!
We use Wrappers to preprocess environment data before sending it to the agent.
GrayScaleObservation is a common wrapper to transform an RGB image to grayscale; doing so reduces the size of the state representation without losing useful information. Now the size of each state: [1, 240, 256]
ResizeObservation downsamples each observation into a square image. New size: [1, 84, 84]
SkipFrame is a custom wrapper that inherits from gym.Wrapper and implements the step() function. Because consecutive frames don’t vary much, we can skip n intermediate frames without losing much information. The n-th frame aggregates rewards accumulated over each skipped frame.
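A sketch of such a wrapper, again assuming the classic Gym step() signature:

```python
import gym


class SkipFrame(gym.Wrapper):
    def __init__(self, env, skip):
        """Return only every `skip`-th frame."""
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        """Repeat the action over `skip` frames and sum the rewards."""
        total_reward = 0.0
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```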
FrameStack is a wrapper that allows us to squash consecutive frames of the environment into a single observation point to feed to our learning model. This way, we can identify if Mario was landing or jumping based on the direction of his movement in the previous several frames.
After applying the above wrappers to the environment, the final wrapped state consists of 4 gray-scaled consecutive frames stacked together. Each time Mario makes an action, the environment responds with a state of this structure. The structure is represented by a 3-D array of size [4, 84, 84].
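The wrappers can then be chained in order. GrayScaleObservation, ResizeObservation, and FrameStack are assumed to come from gym.wrappers here; their exact module path and keyword names vary slightly across Gym/Gymnasium releases:

```python
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

env = SkipFrame(env, skip=4)            # keep 1 of every 4 frames
env = GrayScaleObservation(env)         # [1, 240, 256]
env = ResizeObservation(env, shape=84)  # [1, 84, 84]
env = FrameStack(env, num_stack=4)      # [4, 84, 84]
```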

Agent
We create a class Mario to represent our agent in the game. Mario should be able to:
Act according to the optimal action policy based on the current state (of the environment).
Remember experiences. Experience = (current state, current action, reward, next state). Mario caches and later recalls his experiences to update his action policy.
Learn a better action policy over time.
In the following sections, we will populate Mario’s parameters and define his functions.
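A skeleton of the Mario class as a sketch; the hyperparameter values are illustrative defaults rather than tuned settings, and MarioNet is defined in the Learn section below:

```python
import random
from collections import deque

import numpy as np
import torch


class Mario:
    def __init__(self, state_dim, action_dim, save_dir):
        self.state_dim = state_dim    # e.g. (4, 84, 84)
        self.action_dim = action_dim  # number of discrete actions
        self.save_dir = save_dir
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Mario's DNN to predict the most optimal action (see the Learn section)
        self.net = MarioNet(self.state_dim, self.action_dim).to(self.device)

        # epsilon-greedy exploration parameters
        self.exploration_rate = 1.0
        self.exploration_rate_decay = 0.99999975
        self.exploration_rate_min = 0.1
        self.curr_step = 0

        # replay buffer and learning hyperparameters
        self.memory = deque(maxlen=100000)
        self.batch_size = 32
        self.gamma = 0.9
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=0.00025)
        self.loss_fn = torch.nn.SmoothL1Loss()
```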
Act
For any given state, an agent can choose to do the most optimal action (exploit) or a random action (explore).
Mario randomly explores with a chance of self.exploration_rate; when he chooses to exploit, he relies on MarioNet (implemented in the Learn section) to provide the most optimal action.
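A sketch of act(), using the attributes from the skeleton above:

```python
def act(self, state):
    """Choose an epsilon-greedy action for the given state and update counters."""
    # EXPLORE: with probability exploration_rate, pick a random action
    if np.random.rand() < self.exploration_rate:
        action_idx = np.random.randint(self.action_dim)
    # EXPLOIT: otherwise ask the online network for the highest-value action
    else:
        state = torch.tensor(np.asarray(state), dtype=torch.float32, device=self.device)
        state = state.unsqueeze(0)  # add a batch dimension
        action_values = self.net(state, model="online")
        action_idx = torch.argmax(action_values, dim=1).item()

    # decay exploration_rate, but never below the minimum
    self.exploration_rate *= self.exploration_rate_decay
    self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)

    self.curr_step += 1
    return action_idx
```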
Cache and Recall
These two functions serve as Mario’s “memory” process.
cache(): Each time Mario performs an action, he stores the experience to his memory. His experience includes the current state, action performed, reward from the action, the next state, and whether the game is done.
recall(): Mario randomly samples a batch of experiences from his memory, and uses that to learn the game.
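A sketch of both methods, storing each experience as tensors in the deque created in the skeleton:

```python
def cache(self, state, next_state, action, reward, done):
    """Store an experience tuple (s, s', a, r, done) in the replay buffer."""
    state = torch.tensor(np.asarray(state), dtype=torch.float32)
    next_state = torch.tensor(np.asarray(next_state), dtype=torch.float32)
    action = torch.tensor([action])
    reward = torch.tensor([reward])
    done = torch.tensor([done])
    self.memory.append((state, next_state, action, reward, done))

def recall(self):
    """Sample a random batch of experiences from the replay buffer."""
    batch = random.sample(self.memory, self.batch_size)
    state, next_state, action, reward, done = map(torch.stack, zip(*batch))
    return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
```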
Learn
Mario uses the DDQN algorithm under the hood. DDQN uses two ConvNets - Q_online and Q_target - that independently approximate the optimal action-value function.
In our implementation, we share the feature generator features across Q_online and Q_target, but maintain separate FC classifiers for each. θ_target (the parameters of Q_target) is frozen to prevent updating by backprop. Instead, it is periodically synced with θ_online (more on this later).
Neural Network
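A sketch of MarioNet matching the description above: a shared convolutional feature extractor plus separate online and target FC heads, with the target head frozen. The flattened size of 3136 assumes an input of [4, 84, 84]:

```python
import copy

from torch import nn


class MarioNet(nn.Module):
    """Shared conv features with separate online/target FC heads (mini CNN)."""

    def __init__(self, input_dim, output_dim):
        super().__init__()
        c, h, w = input_dim  # e.g. (4, 84, 84)

        # shared feature generator
        self.features = nn.Sequential(
            nn.Conv2d(c, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )

        # separate FC classifiers for Q_online and Q_target
        self.online = nn.Sequential(nn.Linear(3136, 512), nn.ReLU(), nn.Linear(512, output_dim))
        self.target = copy.deepcopy(self.online)

        # theta_target is frozen: updated only by periodic sync, never by backprop
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, x, model):
        x = self.features(x)
        return self.online(x) if model == "online" else self.target(x)
```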
TD Estimate & TD Target
Two values are involved in learning:
TD Estimate - the predicted optimal Q* for a given state s:

$TD_e = Q^*_{online}(s, a)$
TD Target - aggregation of the current reward and the estimated Q* in the next state s':

$a' = \operatorname{argmax}_{a} Q_{online}(s', a)$

$TD_t = r + \gamma \, Q^*_{target}(s', a')$
Because we don’t know what the next action a' will be, we use the action a' that maximizes Q_online in the next state s'.
Notice we use the @torch.no_grad() decorator on td_target() to disable gradient calculations here (because we don’t need to backpropagate on θ_target).
Updating the model
As Mario samples inputs from his replay buffer, we compute TD_t and TD_e and backpropagate this loss down Q_online to update its parameters θ_online (α is the learning rate lr passed to the optimizer):
$\theta_{online} \leftarrow \theta_{online} + \alpha \, \nabla (TD_e - TD_t)$
θ_target does not update through backpropagation. Instead, we periodically copy θ_online to θ_target:
$\theta_{target} \leftarrow \theta_{online}$
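A sketch of the two update steps, using the optimizer and loss function from the agent skeleton:

```python
def update_Q_online(self, td_estimate, td_target):
    """One gradient step on theta_online: minimize the loss between TD_e and TD_t."""
    loss = self.loss_fn(td_estimate, td_target)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss.item()

def sync_Q_target(self):
    """Periodically copy theta_online into theta_target."""
    self.net.target.load_state_dict(self.net.online.state_dict())
```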
Save checkpoint
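A minimal checkpointing sketch; the file naming and save interval are illustrative, and save_dir is assumed to be a pathlib.Path:

```python
def save(self):
    """Save the network weights and the current exploration rate to save_dir."""
    save_path = self.save_dir / f"mario_net_{int(self.curr_step // 5e5)}.chkpt"
    torch.save(
        dict(model=self.net.state_dict(), exploration_rate=self.exploration_rate),
        save_path,
    )
    print(f"MarioNet saved to {save_path} at step {self.curr_step}")
```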
Putting it all together
Logging
Let’s play!
In this example we run the training loop for 40 episodes, but for Mario to truly learn the ways of his world, we suggest running the loop for at least 40,000 episodes!
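A sketch of the main training loop, assuming a learn() method that samples a batch via recall(), computes TD_e and TD_t, and calls update_Q_online() and (periodically) sync_Q_target(); the original tutorial's logging calls are omitted here:

```python
episodes = 40
for e in range(episodes):
    state = env.reset()

    while True:
        action = mario.act(state)                              # pick an action
        next_state, reward, done, info = env.step(action)      # act in the environment
        mario.cache(state, next_state, action, reward, done)   # remember the experience
        mario.learn()                                          # learn from a sampled batch
        state = next_state

        # end of episode: Mario died or reached the flag
        if done or info.get("flag_get", False):
            break
```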

Conclusion
In this tutorial, we saw how we can use PyTorch to train a game-playing AI. You can use the same methods to train an AI to play any of the games at the OpenAI gym. Hope you enjoyed this tutorial, feel free to reach us at our github!
Total running time of the script: ( 1 minutes 50.444 seconds)
Original: https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html
GitHub: https://github.com/yfeng997/MadMario