Rest Stop and an Off Ramp

2023-02-24 14:30 | Author: 学的很杂的一个人

Source: https://e2eml.school/transformers.html#softmax
A Chinese-English bilingual edition, with Chinese annotations made using various translation tools and a little of my own interpretation.

Related articles are collected in the series: Transformers from Scratch (Chinese annotations)
--------------------------------------------------------------------------------------------------------------------

Congratulations on making it this far. You can stop if you want.

The selective-second-order-with-skips model is a useful way to think about what transformers do, at least on the decoder side.

It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.

It doesn't tell the complete story, but it represents the central thrust of it.

The next sections cover more of the gap between this intuitive explanation and how transformers are implemented.

These are largely driven by three practical considerations.

1. Computers are especially good at matrix multiplications.

There is an entire industry around building computer hardware specifically for fast matrix multiplications. 

Any computation that can be expressed as a matrix multiplication can be made shockingly efficient. 

It's a bullet train. 

If you can get your baggage into it, it will get you where you want to go real fast.

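As a rough illustration of that point, here is a minimal sketch (in NumPy, with a made-up vocabulary and made-up transition probabilities) of how the kind of table lookup we have been doing by hand can be phrased as a matrix multiplication: a one-hot row vector times a transition matrix picks out one row of the table.

```python
import numpy as np

# A toy transition matrix: rows are the current word, columns are the next word.
# The words and probabilities are invented purely for illustration.
vocab = ["battery", "program", "ran", "down", "please"]
transitions = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0],   # battery -> ran
    [0.0, 0.0, 1.0, 0.0, 0.0],   # program -> ran
    [0.0, 0.0, 0.0, 0.5, 0.5],   # ran     -> down / please
    [0.2, 0.2, 0.2, 0.2, 0.2],   # down    -> anything
    [0.2, 0.2, 0.2, 0.2, 0.2],   # please  -> anything
])

# Looking up the row for "ran" can be written as a matrix multiplication:
# a one-hot row vector times the transition matrix selects that row.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("ran")] = 1.0

next_word_probs = one_hot @ transitions   # same as transitions[vocab.index("ran")]
print(dict(zip(vocab, next_word_probs)))
```

Once a lookup is phrased this way, it can be handed to the same hardware that makes matrix multiplication so fast.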

2. Each step needs to be differentiable.

So far we've just been working with toy examples, and have had the luxury of hand-picking all the transition probabilities and mask values—the model parameters.

In practice, these have to be learned via backpropagation, which depends on each computation step being differentiable. 

This means that for any small change in a parameter, we can calculate the corresponding change in the model error or loss.

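To make that concrete, here is a toy sketch (a one-parameter model invented for illustration, nothing from the transformer itself) of what differentiability buys us: nudge a parameter by a small amount, and the derivative tells you how much the loss will change.

```python
# A toy model with a single parameter w: prediction = w * x,
# scored with a squared-error loss against a target y.
# The numbers are made up; the point is that the loss is differentiable in w.
x, y = 2.0, 3.0

def loss(w):
    prediction = w * x
    return (prediction - y) ** 2

w = 0.5

# Analytic derivative of the loss with respect to w: 2 * (w*x - y) * x.
analytic = 2 * (w * x - y) * x

# Nudge w slightly and the loss changes by roughly analytic * eps.
eps = 1e-6
numerical = (loss(w + eps) - loss(w)) / eps

print(analytic, numerical)   # both close to -8.0
```

Backpropagation does this bookkeeping for every parameter at once, but it only works if every step of the computation admits a derivative.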

3. The gradient needs to be smooth and well conditioned.

The combination of all the derivatives for all the parameters is the loss gradient. 

In practice, getting backpropagation to behave well requires gradients that are smooth, that is, the slope doesn’t change very quickly as you make small steps in any direction. 

They also behave much better when the gradient is well conditioned, that is, it’s not radically larger in one direction than another. 

If you picture a loss function as a landscape, The Grand Canyon would be a poorly conditioned one.

Depending on whether you are traveling along the bottom, or up the side, you will have very different slopes to travel.

By contrast, the rolling hills of the classic Windows screensaver would have a well conditioned gradient.

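To put rough numbers on the landscape analogy, here is a small sketch (two invented quadratic losses, unrelated to any real model) showing how the gradient of a poorly conditioned loss is radically larger in one direction than another, while a well conditioned one is about the same in every direction.

```python
import numpy as np

# Two toy loss landscapes, invented purely for illustration.
# "Rolling hills": curvature is about the same in every direction.
def well_conditioned(p):
    x, y = p
    return x**2 + y**2

# "Grand Canyon": one direction is vastly steeper than the other.
def poorly_conditioned(p):
    x, y = p
    return x**2 + 1000 * y**2

def gradient(f, p, eps=1e-6):
    # Simple central-difference estimate of the gradient.
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = eps
        g[i] = (f(p + step) - f(p - step)) / (2 * eps)
    return g

point = [1.0, 1.0]
print(gradient(well_conditioned, point))    # ~[2, 2]: similar magnitude each way
print(gradient(poorly_conditioned, point))  # ~[2, 2000]: wildly different magnitudes
```

With a single learning rate shared by every direction, gradient descent tends to overshoot back and forth across the steep canyon walls while creeping along the canyon floor, which is why well conditioned gradients are so much easier to work with.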

If the science of architecting neural networks is creating differentiable building blocks, the art of them is stacking the pieces in such a way that the gradient doesn’t change too quickly and is roughly of the same magnitude in every direction.
