Masking

2023-02-22 22:04 Author: 学的很杂的一个人

Source: https://e2eml.school/transformers.html#softmax
A Chinese-English bilingual edition, with Chinese annotations produced by various translation programs and a little of my own understanding.

Related articles in this series are collected in: Transformers from Scratch(中文注释)

--------------------------------------------------------------------------------------------------------------------

On more careful consideration, this is unsatisfying. The difference between a vote total of 4 and 5 is relatively small. It suggests that the model isn't as confident as it could be. And in a larger, more organic language model it's easy to imagine that such a slight difference could be lost in the statistical noise.

We can sharpen the prediction by weeding out all the uninformative feature votes, with the exception of "battery, ran" and "program, ran". It's helpful to remember at this point that we pull the relevant rows out of the transition matrix by multiplying it with a vector showing which features are currently active.

For this example so far, we've been using the implied feature vector shown here. It includes a one for each feature that is a combination of "ran" with each of the words that come before it. Any words that come after it don't get included in the feature set. (In the next word prediction problem these haven't been seen yet, and so it's not fair to use them to predict what comes next.) And this doesn't include all the other possible word combinations. We can safely ignore these for this example because they will all be zero.
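To make the shape of that vector concrete, here is a minimal numpy sketch. The feature ordering, and the earlier words it assumes have already been seen ("check", "whether", "the", "battery"), are illustrative assumptions for the sake of the example, not something fixed by the text above.

```python
import numpy as np

# Hypothetical ordering of the pairwise features for this toy example;
# each feature pairs an earlier word with "ran".
features = ["check, ran", "whether, ran", "the, ran",
            "battery, ran", "program, ran", "down, ran", "please, ran"]

# The implied feature vector: a one for "ran" combined with every word
# already seen before it, and a zero for everything else (including
# words that only appear after "ran").
words_seen_before_ran = {"check", "whether", "the", "battery"}
feature_vector = np.array([
    1.0 if f.split(",")[0] in words_seen_before_ran else 0.0
    for f in features
])

print(feature_vector)   # [1. 1. 1. 1. 0. 0. 0.]
```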

To improve our results, we can additionally force the unhelpful features to zero by creating a mask. It's a vector full of ones except for the positions you'd like to hide or mask, and those are set to zero.

In our case we'd like to mask everything except for "battery, ran" and "program, ran", the only two features that have been of any help. To apply the mask, we multiply the two vectors element by element. Any feature activity value in an unmasked position will be multiplied by one and left unchanged. Any feature activity value in a masked position will be multiplied by zero, and thus forced to zero.
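A minimal sketch of building the mask and applying it element by element, reusing the same assumed feature ordering as the previous sketch.

```python
import numpy as np

# Same assumed feature ordering as in the previous sketch.
features = ["check, ran", "whether, ran", "the, ran",
            "battery, ran", "program, ran", "down, ran", "please, ran"]
feature_vector = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# The mask: a vector full of ones, except the positions to hide, which
# are set to zero. Here only "battery, ran" and "program, ran" are kept.
keep = {"battery, ran", "program, ran"}
mask = np.array([1.0 if f in keep else 0.0 for f in features])

# Element-by-element multiplication: unmasked values are multiplied by
# one and left unchanged, masked values are multiplied by zero and
# thereby forced to zero.
masked_features = feature_vector * mask
print(masked_features)   # [0. 0. 0. 1. 0. 0. 0.]
```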

The mask has the effect of hiding a lot of the transition matrix. It hides the combination of "ran" with everything except "battery" and "program", leaving just the features that matter.

After masking the unhelpful features, the next word predictions become much stronger. When the word "battery" occurs earlier in the sentence, the word after "ran" is predicted to be "down" with a weight of 1 and "please" with a weight of 0. What was a weight difference of 25 percent has become a difference of infinity percent.
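As a rough sketch of the effect on the votes, here the masked feature vector is multiplied against a toy transition matrix. The values in the uninformative rows are made up for illustration (so the unmasked totals below do not reproduce the 4-versus-5 totals mentioned earlier); only the "battery, ran" and "program, ran" rows follow the text above.

```python
import numpy as np

# Assumed orderings, matching the earlier sketches.
features   = ["check, ran", "whether, ran", "the, ran",
              "battery, ran", "program, ran", "down, ran", "please, ran"]
next_words = ["down", "please"]

# Toy transition matrix: one row per feature, one column per candidate
# next word. The uninformative rows vote equally for both words
# (made-up values); "battery, ran" votes for "down" and
# "program, ran" votes for "please", as described above.
transition = np.array([
    [1.0, 1.0],   # check, ran
    [1.0, 1.0],   # whether, ran
    [1.0, 1.0],   # the, ran
    [1.0, 0.0],   # battery, ran
    [0.0, 1.0],   # program, ran
    [0.0, 0.0],   # down, ran
    [0.0, 0.0],   # please, ran
])

feature_vector = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
mask           = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

unmasked_votes = feature_vector @ transition           # a close call
masked_votes   = (feature_vector * mask) @ transition  # a clear winner

print(unmasked_votes)  # [4. 3.]  votes for "down" vs "please"
print(masked_votes)    # [1. 0.]  "down" wins outright
```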

There is no doubt what word comes next. The same strong prediction occurs for "please" when "program" occurs early on.

This process of selective masking is the attention called out in the title of the original paper on transformers. So far, what we've described is just an approximation of how attention is implemented in the paper. It captures the important concepts, but the details are different. We'll close that gap later.
