Second order sequence model
Source: https://e2eml.school/transformers.html#softmax
Chinese-English bilingual edition, with Chinese annotations produced by various translation programs and a little of my own interpretation
----------------------------------------------------------------------------------------------------------------------
Predicting the next word based on only the current word is hard.
That's like predicting the rest of a tune after being given just the first note.
Our chances are a lot better if we can at least get two notes to go on.
We can see how this works in another toy language model for our computer commands.
We expect that this one will only ever see two sentences, in a 40/60 proportion.
Check whether the battery ran down please.
Check whether the program ran please.
A Markov chain illustrates a first order model for this.

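A rough sketch of how this first order model could be tabulated in Python (my own illustration, not from the original article), assuming the 40/60 proportion acts as a relative weight on each sentence; all variable names here are invented for the example:

from collections import defaultdict

# The only two sentences this toy model will ever see, weighted 40/60
# (assumption: the proportion is applied as a per-sentence weight).
corpus = [
    ("check whether the battery ran down please", 0.4),
    ("check whether the program ran please", 0.6),
]

# Accumulate weighted counts of each (current word -> next word) transition.
counts = defaultdict(lambda: defaultdict(float))
for sentence, weight in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += weight

# Normalize each row into a probability distribution over next words.
transitions = {
    current: {nxt: w / sum(row.values()) for nxt, w in row.items()}
    for current, row in counts.items()
}

print(transitions["ran"])  # {'down': 0.4, 'please': 0.6} -- an uncertain branch
print(transitions["the"])  # {'battery': 0.4, 'program': 0.6} -- the other branch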
Here we can see that if our model looked at the two most recent words, instead of just one, it could do a better job.
When it encounters battery ran, it knows that the next word will be down, and when it sees program ran, the next word will be please.
This eliminates one of the branches in the model, reducing uncertainty and increasing confidence.
Looking back two words turns this into a second order Markov model.
It gives more context on which to base next word predictions.
Second order Markov chains are more challenging to draw, but here are the connections that demonstrate their value.

To highlight the difference between the two, here is the first order transition matrix,

and here is the second order transition matrix.

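Under the same assumptions as the earlier sketch, the second order version only changes the row key: each row is indexed by the pair of the two most recent words rather than a single word.

from collections import defaultdict

corpus = [
    ("check whether the battery ran down please", 0.4),
    ("check whether the program ran please", 0.6),
]

# Same weighted counting as before, but each row key is now a word pair.
counts = defaultdict(lambda: defaultdict(float))
for sentence, weight in corpus:
    words = sentence.split()
    for w1, w2, nxt in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][nxt] += weight

# Normalize each row into a probability distribution over next words.
transitions = {
    pair: {nxt: w / sum(row.values()) for nxt, w in row.items()}
    for pair, row in counts.items()
}

for pair, row in transitions.items():
    print(pair, row)
# Only ('whether', 'the') keeps fractional probabilities (0.4 and 0.6);
# every other observed pair, including ('battery', 'ran') and
# ('program', 'ran'), maps to a single next word with probability 1.0.

Note that this sketch only creates rows for word pairs actually seen in the data; the full matrix discussed next has a row for every possible pair.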
Notice how the second order matrix has a separate row for every combination of words (most of which are not shown here).
That means that if we start with a vocabulary size of N, then the transition matrix has N^2 rows. With the eight distinct words of our toy example, that is already 8^2 = 64 possible rows.
What this buys us is more confidence.
There are more ones and fewer fractions in the second order model.
There's only one row with fractions in it, one branch in our model.
Intuitively, looking at two words instead of just one gives more context, more information on which to base a next word guess.