
Key Concepts

On this page, we’ll cover the key concepts to help you understand how RLlib works and how to use it. In RLlib, you use Algorithms to learn how to solve problem environments. The algorithms use policies to select actions. Given a policy, rollouts throughout an environment produce sample batches (or trajectories) of experiences. You can also customize the training_steps of your RL experiments.


Environments

Solving a problem in RL begins with an environment. In the simplest definition of RL:

An agent interacts with an environment and receives a reward.

An environment in RL is the agent’s world; it is a simulation of the problem to be solved.

An RLlib environment consists of:


  1. all possible actions (action space)

  2. a complete description of the environment, nothing hidden (state space)

  3. an observation by the agent of certain parts of the state (observation space)

  4. reward, which is the only feedback the agent receives per action.

The model that tries to maximize the expected sum over all future rewards is called a policy. The policy is a function mapping the environment’s observations to an action to take, usually written π(s(t)) -> a(t).

The RL simulation feedback loop repeatedly collects data, for one (single-agent case) or multiple (multi-agent case) policies, trains the policies on these collected data, and makes sure the policies’ weights are kept in sync. Thereby, the collected environment data contains observations, taken actions, received rewards and so-called done flags, indicating the boundaries of different episodes the agents play through in the simulation.


The simulation iteration of action -> reward -> next state -> train -> repeat, until the end state is reached, is called an episode, or, in RLlib, a rollout. The most common API for defining environments is the Farama-Foundation Gymnasium API, which we also use in most of our examples.
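
As a rough illustration of this loop, here is a minimal Gymnasium sketch (it assumes the gymnasium package is installed; a random action stands in for a learned policy):

    import gymnasium as gym

    # Run one episode of CartPole with random actions.
    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    terminated = truncated = False
    episode_return = 0.0
    while not (terminated or truncated):
        # A random action stands in for a learned policy pi(s(t)) -> a(t).
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward

    print("Episode return:", episode_return)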

Algorithms

Algorithms bring all RLlib components together, making learning of different tasks accessible via RLlib’s Python API and its command line interface (CLI). Each Algorithm class is managed by its respective AlgorithmConfig; for example, to configure a PPO instance, you should use the PPOConfig class. An Algorithm sets up its rollout workers and optimizers, and collects training metrics. Algorithms also implement the Tune Trainable API for easy experiment management.

You have three ways to interact with an algorithm. You can use the basic Python API or the command line to train it, or you can use Ray Tune to tune hyperparameters of your reinforcement learning algorithm. The following example shows three equivalent ways of interacting with PPO, which implements the proximal policy optimization algorithm in RLlib.

Basic RLlib Algorithm
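
A minimal sketch of the basic Python API (assuming a Ray 2.x installation; exact config method names have shifted slightly between RLlib releases):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Configure and build a PPO Algorithm for CartPole.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=2)
    )
    algo = config.build()

    # Each call to train() runs one training iteration
    # (which invokes training_step() under the hood).
    for i in range(3):
        result = algo.train()
        print(f"iter {i}: episode_reward_mean={result.get('episode_reward_mean')}")

    algo.stop()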

RLlib Algorithms and Tune
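
An equivalent sketch that hands the same PPO experiment to Ray Tune (the RunConfig import path has moved between Ray releases, so treat the air import as version-dependent):

    from ray import air, tune

    # Tune resolves "PPO" to the registered RLlib algorithm and forwards
    # the param_space entries to its AlgorithmConfig.
    tuner = tune.Tuner(
        "PPO",
        param_space={
            "env": "CartPole-v1",
            "num_workers": 2,
            "lr": tune.grid_search([1e-4, 1e-3]),
        },
        run_config=air.RunConfig(stop={"episode_reward_mean": 150.0}),
    )
    results = tuner.fit()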

RLlib Algorithm classes coordinate the distributed workflow of running rollouts and optimizing policies. Algorithm classes leverage parallel iterators to implement the desired computation pattern. The following figure shows synchronous sampling, the simplest of these patterns:


Synchronous Sampling (e.g., A2C, PG, PPO)

RLlib uses Ray actors to scale training from a single core to many thousands of cores in a cluster. You can configure the parallelism used for training by changing the num_workers parameter. Check out our scaling guide for more details here.

RL Modules

RLModules are framework-specific neural network containers. In a nutshell, they carry the neural networks and define how to use them during the three phases that occur in reinforcement learning: exploration, inference, and training. A minimal RL Module can contain a single neural network and define its exploration, inference, and training logic to only map observations to actions. Since RL Modules can map observations to actions, they naturally implement reinforcement learning policies in RLlib and can therefore be found in the RolloutWorker, where their exploration and inference logic is used to sample from an environment. The second place in RLlib where RL Modules commonly occur is the Learner, where their training logic is used in training the neural network. RL Modules extend to the multi-agent case, where a single MultiAgentRLModule contains multiple RL Modules.
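
The actual RLModule base classes live in ray.rllib.core and are still evolving, so the following plain-PyTorch sketch only illustrates the idea of a single container exposing separate exploration, inference, and training forward passes; the class and method names are illustrative rather than RLlib's real API:

    import torch
    import torch.nn as nn

    class TinyRLModule(nn.Module):
        """Illustrative container: one network, three usage phases."""

        def __init__(self, obs_dim: int, num_actions: int):
            super().__init__()
            self.pi = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions)
            )

        def forward_exploration(self, obs: torch.Tensor) -> torch.Tensor:
            # Sample stochastically while collecting experiences.
            return torch.distributions.Categorical(logits=self.pi(obs)).sample()

        def forward_inference(self, obs: torch.Tensor) -> torch.Tensor:
            # Act greedily at evaluation/serving time.
            return self.pi(obs).argmax(dim=-1)

        def forward_train(self, obs: torch.Tensor) -> torch.Tensor:
            # Return raw logits so a loss (defined elsewhere) can be computed.
            return self.pi(obs)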


Note

RL Modules are currently in alpha stage. They are wrapped in legacy Policy objects to be used in RolloutWorker for sampling. This should be transparent to the user, but the following Policy Evaluation section still refers to these legacy Policy objects.

Policy Evaluation

Given an environment and policy, policy evaluation produces batches of experiences. This is your classic “environment interaction loop”. Efficient policy evaluation can be burdensome to get right, especially when leveraging vectorization, RNNs, or when operating in a multi-agent environment. RLlib provides a RolloutWorker class that manages all of this, and this class is used in most RLlib algorithms.

You can use rollout workers standalone to produce batches of experiences. This can be done by calling worker.sample() on a worker instance, or worker.sample.remote() in parallel on worker instances created as Ray actors (see WorkerSet).

Here is an example of creating a set of rollout workers and using them to gather experiences in parallel. The trajectories are concatenated, the policy learns on the trajectory batch, and then we broadcast the policy weights to the workers for the next round of rollouts:
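
A rough sketch of that loop (CustomPolicy is a placeholder for any Policy subclass; the WorkerSet constructor arguments, e.g. policy_class vs. default_policy_class and whether a config object is required, differ across RLlib versions, so check the docs for your release):

    import gymnasium as gym
    import ray
    from ray.rllib.evaluation.worker_set import WorkerSet
    from ray.rllib.policy.sample_batch import SampleBatch

    # CustomPolicy is hypothetical; substitute your own Policy subclass.
    env = gym.make("CartPole-v1")
    policy = CustomPolicy(env.observation_space, env.action_space, {})
    workers = WorkerSet(
        policy_class=CustomPolicy,  # may be `default_policy_class` in newer versions
        env_creator=lambda ctx: gym.make("CartPole-v1"),
        num_workers=10,
    )

    for _ in range(10):
        # Gather experiences from all remote workers in parallel
        # (remote_workers() has been superseded by foreach_worker() in newer releases).
        T1 = SampleBatch.concat_samples(
            ray.get([w.sample.remote() for w in workers.remote_workers()])
        )

        # Improve the local policy on the concatenated trajectory batch.
        policy.learn_on_batch(T1)

        # Broadcast the new weights; the local policy acts as a parameter server.
        weights = ray.put({"default_policy": policy.get_weights()})
        for w in workers.remote_workers():
            w.set_weights.remote(weights)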

Sample Batches

Whether running in a single process or a large cluster, all data in RLlib is interchanged in the form of sample batches. Sample batches encode one or more fragments of a trajectory. Typically, RLlib collects batches of size rollout_fragment_length from rollout workers, and concatenates one or more of these batches into a batch of size train_batch_size that is the input to SGD.

A typical sample batch looks something like the following when summarized. Since all values are kept in arrays, this allows for efficient encoding and transmission across the network:
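
As a small, made-up illustration (the field names follow RLlib's conventions, the values are arbitrary):

    import numpy as np
    from ray.rllib.policy.sample_batch import SampleBatch

    # Three CartPole-like timesteps, stored column-wise as arrays.
    batch = SampleBatch({
        "obs": np.random.randn(3, 4).astype(np.float32),
        "actions": np.array([0, 1, 0]),
        "rewards": np.array([1.0, 1.0, 1.0], dtype=np.float32),
        "dones": np.array([False, False, True]),
        "new_obs": np.random.randn(3, 4).astype(np.float32),
    })
    print(batch.count)         # -> 3 (number of timesteps in the batch)
    print(list(batch.keys()))  # -> ['obs', 'actions', 'rewards', 'dones', 'new_obs']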

In multi-agent mode, sample batches are collected separately for each individual policy. These batches are wrapped up together in a MultiAgentBatch, serving as a container for the individual agents’ sample batches.
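
A minimal (made-up) illustration of that container:

    from ray.rllib.policy.sample_batch import SampleBatch, MultiAgentBatch

    # One hypothetical per-policy batch each for two policies.
    batch_a = SampleBatch({"obs": [[0.1, 0.2]], "actions": [0], "rewards": [1.0]})
    batch_b = SampleBatch({"obs": [[0.3, 0.4]], "actions": [1], "rewards": [0.5]})

    ma_batch = MultiAgentBatch(
        policy_batches={"policy_a": batch_a, "policy_b": batch_b},
        env_steps=1,  # both agents acted within the same single env step
    )
    print(list(ma_batch.policy_batches.keys()))  # -> ['policy_a', 'policy_b']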


Training Step Method (Algorithm.training_step())

Note

It’s important to have a good understanding of the basic ray core methods before reading this section. Furthermore, we utilize concepts such as the SampleBatch (and its more advanced sibling: the MultiAgentBatch), RolloutWorker, and Algorithm, which can be read about on this page and the rollout worker reference docs.

Finally, developers who are looking to implement custom algorithms should familiarize themselves with the Policy and Model classes.


What is it?

The training_step() method of the Algorithm class defines the repeatable execution logic that sits at the core of any algorithm. Think of it as the Python implementation of an algorithm’s pseudocode that you can find in research papers. You can use training_step() to express how you want to coordinate the collection of samples from the environment(s), the movement of this data to other parts of the algorithm, and the updates and management of your policy’s weights across the different distributed components.

In short, a developer will need to override/modify the training_step method if they want to make custom changes to an existing algorithm, write their own algorithm from scratch, or implement an algorithm from a paper.

When is training_step() invoked?

The Algorithm’s training_step() method is called:

  1. when the train() method of Algorithm is called (e.g. “manually” by a user that has constructed an Algorithm instance).

  2. when an RLlib Algorithm is being run by Ray Tune. training_step() will be called continuously until the Ray Tune stop criteria are met.

Key Subconcepts

In the following, using the example of VPG (“vanilla policy gradient”), we will try to illustrate how to use the training_step() method to implement this algorithm in RLlib. The “vanilla policy gradient” algo can be thought of as a sequence of repeating steps, or dataflow, of:

  1. Sampling (to collect data from an env)

  2. Updating the Policy (to learn a behavior)

  3. Broadcasting the updated Policy’s weights (to make sure all distributed units have the same weights again)

  4. Metrics reporting (returning relevant stats from all the above operations with regards to performance and runtime)

An example implementation of VPG could look like the following:
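
A sketch of such a training_step() implementation, built from the RLlib utilities described below (whether config values are read with attribute or dict access depends on the RLlib version):

    from ray.rllib.algorithms.algorithm import Algorithm
    from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
    from ray.rllib.execution.train_ops import train_one_step
    from ray.rllib.utils.annotations import override
    from ray.rllib.utils.typing import ResultDict

    class VPG(Algorithm):
        @override(Algorithm)
        def training_step(self) -> ResultDict:
            # 1. Sampling: collect trajectories from all rollout workers (blocking).
            train_batch = synchronous_parallel_sample(
                worker_set=self.workers,
                max_env_steps=self.config["train_batch_size"],
            )

            # 2. Updating the policy on the collected batch (on the local worker).
            train_results = train_one_step(self, train_batch)

            # 3. Broadcasting the updated weights to all remote workers.
            self.workers.sync_weights()

            # 4. Metrics reporting: return the stats gathered above.
            return train_results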


Note

Note that the training_step method is deep learning framework agnostic. This means that you should not write PyTorch- or TensorFlow-specific code inside this method, allowing for a strict separation of concerns and enabling us to use the same training_step() method for both the TF and PyTorch versions of your algorithms. DL-framework-specific code should only be added to the Policy (e.g. in its loss function(s)) and Model (e.g. tf.keras or torch.nn neural network code) classes.

Let’s further break down our above training_step() code. In the first step, we collect trajectory data from the environment(s):
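
That sampling step corresponds to this part of the sketch above:

    train_batch = synchronous_parallel_sample(
        worker_set=self.workers,
        max_env_steps=self.config["train_batch_size"],
    )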


Here, self.workers is a set of RolloutWorkers that are created in the Algorithm’s setup() method (prior to calling training_step()). This WorkerSet is covered in greater depth on the WorkerSet documentation page. The utility function synchronous_parallel_sample can be used for parallel sampling in a blocking fashion across multiple rollout workers (returns once all rollout workers are done sampling). It returns one final MultiAgentBatch resulting from concatenating n smaller MultiAgentBatches (exactly one from each remote rollout worker).

The train_batch is then passed to another utility function: train_one_step.
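
In the sketch above this is a single call:

    train_results = train_one_step(self, train_batch)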

Methods like train_one_step and multi_gpu_train_one_step are used for training our Policy. Further documentation with examples can be found on the train ops documentation page.

The training updates on the policy are only applied to its version inside self.workers.local_worker. Note that each WorkerSet has n remote workers and exactly one “local worker” and that each worker (remote and local ones) holds a copy of the policy.

Now that we updated the local policy (the copy in self.workers.local_worker), we need to make sure that the copies in all remote workers (self.workers.remote_workers) have their weights synchronized (from the local one):
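
Again, this is one line in the sketch above:

    self.workers.sync_weights()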

By calling self.workers.sync_weights(), weights are broadcasted from the local worker to the remote workers. See rollout worker reference docs for further details.

A dictionary is expected to be returned that contains the results of the training update. It maps keys of type str to values that are of type float or to dictionaries of the same form, allowing for a nested structure.

For example, a results dictionary could map policy_ids to learning and sampling statistics for that policy:
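
An illustrative structure (the exact keys depend on the algorithm and RLlib version) could be:

    {
        "policy_1": {
            "learner_stats": {"policy_loss": 6.7, "vf_loss": 4.5},
            "num_agent_steps_trained": 64,
        },
        "policy_2": {
            "learner_stats": {"policy_loss": 3.2, "vf_loss": 7.4},
            "num_agent_steps_trained": 64,
        },
    }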

Training Step Method Utilities

RLlib provides a collection of utilities that abstract away common tasks in RL training. In particular, if you would like to work with the various training_step methods or implement your own, it’s recommended to familiarize yourself first with these following concepts here:

Sample Batch: SampleBatch and MultiAgentBatch are the two types that we use for storing trajectory data in RLlib. All of our RLlib abstractions (policies, replay buffers, etc.) operate on these two types.

Rollout Workers: Rollout workers are an abstraction that wraps a policy (or policies in the case of multi-agent) and an environment. From a high level, we can use rollout workers to collect experiences from the environment by calling their sample() method and we can train their policies by calling their learn_on_batch() method. By default, in RLlib, we create a set of workers that can be used for sampling and training. We create a WorkerSet object inside of setup which is called when an RLlib algorithm is created. The WorkerSet has a local_worker and remote_workers if num_workers > 0 in the experiment config. In RLlib we typically use local_worker for training and remote_workers for sampling.

Train Ops: These are methods that improve the policy and update workers. The most basic operator, train_one_step, takes in as input a batch of experiences and emits a ResultDict with metrics as output. For training with GPUs, use multi_gpu_train_one_step. These methods use the learn_on_batch method of rollout workers to complete the training update.

Replay Buffers: RLlib provides a collection of replay buffers that can be used for storing and sampling experiences.

