Second order sequence model with skips

2023-02-22 12:02 · Author: 学的很杂的一个人

Source: https://e2eml.school/transformers.html#softmax
A bilingual Chinese-English edition; the Chinese annotations were made with various translation tools plus a little of my own understanding.

Related articles are collected in the series: Transformers from Scratch (Chinese annotations)

--------------------------------------------------------------------------------------------------------------------

A second order model works well when we only have to look back two words to decide what word comes next.

What about when we have to look back further?

Imagine we are building yet another language model.

This one only has to represent two sentences, each equally likely to occur.

    Check the program log and find out whether it ran please.

    Check the battery log and find out whether it ran down please.

In this example, in order to determine which word should come after ran, we would have to look back 8 words into the past.

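A quick way to check that distance, counting from the position being predicted (a tiny sketch, with the sentence lowercased and its punctuation dropped):

```python
# Distance from the deciding word ("program") to the slot after "ran".
sent = "check the program log and find out whether it ran please".split()
slot = sent.index("ran") + 1           # the position we want to fill
print(slot - sent.index("program"))    # 8 -- an eight word lookback
```
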
If we want to improve on our second order language model, we can of course consider third- and higher order models.

However, with a significant vocabulary size this takes a combination of creativity and brute force to execute.

A naive implementation of an eighth order model would have N^8 rows, a ridiculous number for any reasonable vocabulary.

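To get a feel for the scale, here's a quick back-of-the-envelope check, assuming an illustrative vocabulary of 50,000 words (the text leaves N unspecified):

```python
# Row count of a naive eighth order transition matrix for N = 50,000,
# an assumed vocabulary size chosen only for illustration.
N = 50_000
print(f"{N ** 8:.2e}")   # 3.91e+37 possible eight-word contexts
```
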
Instead, we can do something sly and make a second order model, but consider the combinations of the most recent word with each of the words that came before.

It's still second order, because we're only considering two words at a time, but it allows us to reach back further and capture long range dependencies.

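Concretely, the features this model considers for a given prefix are just the pairings of each earlier word with the most recent one. A minimal sketch (skip_features is my own illustrative name, not from the article):

```python
# Features of a second order model with skips: the most recent word
# paired with every word that came before it in the sequence.
def skip_features(prefix):
    recent = prefix[-1]
    return [(skip, recent) for skip in prefix[:-1]]

prefix = "check the program log and find out whether it ran".split()
print(skip_features(prefix))
# [('check', 'ran'), ('the', 'ran'), ('program', 'ran'), ..., ('it', 'ran')]
```
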
The difference between this second-order-with-skips and a full umpteenth-order model is that we discard most of the word order information and combinations of preceding words.

What remains is still pretty powerful.

Markov chains fail us entirely now, but we can still represent the link between each pair of preceding words and the words that follow.

Here we've dispensed with numerical weights, and instead are showing only the arrows associated with non-zero weights.

Larger weights are shown with heavier lines.

Here's what it might look like in a transition matrix.    

This view only shows the rows relevant to predicting the word that comes after ran. 

It shows instances where the most recent word (ran) is preceded by each of the other words in the vocabulary. 

Only the relevant values are shown. All the empty cells are zeros.

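Since the original figure can't be reproduced here, a small sketch can reconstruct those rows directly from the two training sentences (build_vote_table is an illustrative helper of mine, not code from the article):

```python
from collections import defaultdict

# Build the second-order-with-skips table from the two training sentences
# and print the rows relevant to predicting the word that comes after "ran".
sentences = [
    "check the program log and find out whether it ran please".split(),
    "check the battery log and find out whether it ran down please".split(),
]

def build_vote_table(corpus):
    counts = defaultdict(lambda: defaultdict(float))
    for sent in corpus:
        for i in range(1, len(sent) - 1):
            recent, nxt = sent[i], sent[i + 1]
            for skip in sent[:i]:          # every word seen earlier
                counts[(skip, recent)][nxt] += 1.0
    # Normalize each row's counts, giving the 0 / .5 / 1 weights in the text;
    # zeros stay implicit as missing entries, like the empty cells.
    return {feat: {w: c / sum(nc.values()) for w, c in nc.items()}
            for feat, nc in counts.items()}

table = build_vote_table(sentences)
for (skip, recent), weights in sorted(table.items()):
    if recent == "ran":
        print(f"{skip}, ran -> {weights}")
# and, ran -> {'please': 0.5, 'down': 0.5}
# battery, ran -> {'down': 1.0}
# check, ran -> {'please': 0.5, 'down': 0.5}
# ... (program, ran -> {'please': 1.0}; every other row is 0.5 / 0.5)
```
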
The first thing that becomes apparent is that, when trying to predict the word that comes after ran, we no longer look at just one line, but rather a whole set of them. 

We've moved out of the Markov realm now. 

Each row no longer represents the state of the sequence at a particular point. 

Instead, each row represents one of many features that may describe the sequence at a particular point. 

The combination of the most recent word with each of the words that came before makes for a collection of applicable rows, maybe a large collection. 

Because of this change in meaning, each value in the matrix no longer represents a probability, but rather a vote. Votes will be summed and compared to determine next word predictions.

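As a toy illustration of the difference: the active rows are simply added together, and unlike probabilities the totals need not sum to 1 (the two rows below use the actual weights from this example):

```python
# Summing votes across active feature rows; the result is a tally,
# not a probability distribution.
rows = {
    ("it", "ran"):      {"down": 0.5, "please": 0.5},
    ("program", "ran"): {"please": 1.0},
}
votes = {}
for row in rows.values():
    for word, weight in row.items():
        votes[word] = votes.get(word, 0.0) + weight
print(votes)   # {'down': 0.5, 'please': 1.5}
```
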
The next thing that becomes apparent is that most of the features don't matter. 

Most of the words appear in both sentences, and so the fact that they have been seen is of no help in predicting what comes next. 

They all have a value of .5. 

The only two exceptions are battery and program. 

They have some 1 and 0 weights associated with them.

The feature battery, ran indicates that ran was the most recent word and that battery occurred somewhere earlier in the sentence. 

This feature has a weight of 1 associated with down and a weight of 0 associated with please.

Similarly, the feature program, ran has the opposite set of weights. 

This structure shows that it is the presence of these two words earlier in the sentence that is decisive in predicting which word comes next.

To convert this set of word-pair features into a next word estimate, the values of all the relevant rows need to be summed.

Adding down the column, the sequence Check the program log and find out whether it ran generates sums of 0 for all the words, except a 4 for down and a 5 for please. 

The sequence Check the battery log and find out whether it ran does the same, except with a 5 for down and a 4 for please. 

By choosing the word with the highest vote total as the next word prediction, this model gets us the right answer, despite having an eight word deep dependency.

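To check that arithmetic end to end, here is a sketch that builds the same table as above and tallies the votes for both sequences (helper names are illustrative, not from the article):

```python
from collections import defaultdict

# Reproduce the 4-versus-5 vote totals described in the text.
sentences = [
    "check the program log and find out whether it ran please".split(),
    "check the battery log and find out whether it ran down please".split(),
]

def build_vote_table(corpus):
    counts = defaultdict(lambda: defaultdict(float))
    for sent in corpus:
        for i in range(1, len(sent) - 1):
            recent, nxt = sent[i], sent[i + 1]
            for skip in sent[:i]:
                counts[(skip, recent)][nxt] += 1.0
    return {feat: {w: c / sum(nc.values()) for w, c in nc.items()}
            for feat, nc in counts.items()}

def predict_next(table, prefix):
    """Sum the votes of every active (skip word, most recent word) row."""
    votes = defaultdict(float)
    for skip in prefix[:-1]:
        for word, weight in table.get((skip, prefix[-1]), {}).items():
            votes[word] += weight
    return max(votes, key=votes.get), dict(votes)

table = build_vote_table(sentences)
for middle in ["program", "battery"]:
    prefix = f"check the {middle} log and find out whether it ran".split()
    print(middle, predict_next(table, prefix))
# program ('please', {'please': 5.0, 'down': 4.0})
# battery ('down', {'please': 4.0, 'down': 5.0})
```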