What is the training data input to the transformers (attention is all you need)?
Question
Sorry, I'm new to NLP. Please bear with me. Say I have two sentences:
French: Le chat mange.
English: The cat eats.
In the following, I will denote a training example as a tuple (x, y), where x is the input data and y is the annotation.
When I train a transformer network, do I A. input these two sentences together as one training example, i.e. (Le chat mange, The cat eats)? Or do I B. use ((Le chat mange, ), The), ((Le chat mange, The), cat), ((Le chat mange, The cat), eats) as training data?

If it's A, it sounds like I have to wait for the network to produce the words one by one during training, which would not be parallelizable. So I guess it should be B?
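For concreteness, here are the two layouts written out as plain Python tuples (illustrative only; no tokenizer or model is involved, and the names src, tgt, option_a, option_b are made up for this sketch):

```python
# Illustrative only: the two candidate training-data layouts from the question.
src = "Le chat mange"
tgt = ["The", "cat", "eats"]

# Option A: one training example per sentence pair.
option_a = (src, " ".join(tgt))
# ('Le chat mange', 'The cat eats')

# Option B: one training example per target word, with the already-produced
# target prefix included in the input.
option_b = [((src, " ".join(tgt[:i])), tgt[i]) for i in range(len(tgt))]
# [(('Le chat mange', ''), 'The'),
#  (('Le chat mange', 'The'), 'cat'),
#  (('Le chat mange', 'The cat'), 'eats')]

print(option_a)
print(option_b)
```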
Answer 1
Score: 0
I figured it out. This "shifting" of the target (output) sentence is done by applying the "mask" mentioned in the paper.
The mask looks like this (lower triangular, with ones on and below the diagonal):

M = [1, 0, ..., 0
     1, 1, ..., 0
     ...
     1, 1, ..., 1]
In self-attention, the matrix QK^T (ignoring the scaling factor) represents the cross-correlation between the "queries" and the "keys". When the mask is applied as M ∘ (QK^T) (where ∘ denotes elementwise multiplication), the correlations between the "current query" Q[i,:] and the "future" keys K[i+k,:], for k = 1, ..., N-i, are ignored. (In the paper this is implemented by setting the masked entries of QK^T to -inf before the softmax, which drives their attention weights to zero.)
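To make this concrete, here is a minimal NumPy sketch of the masked self-attention step with toy shapes and random Q, K, V (illustrative only, not the paper's full multi-head attention); it applies the mask by setting future entries to -inf before the softmax, as described above.

```python
import numpy as np

# Toy masked self-attention over N = 3 decoder positions ("The", "cat", "eats").
# Shapes and values are made up for illustration; d is a toy model dimension.
rng = np.random.default_rng(0)
N, d = 3, 4
Q = rng.normal(size=(N, d))   # queries, one row per target position
K = rng.normal(size=(N, d))   # keys
V = rng.normal(size=(N, d))   # values

scores = Q @ K.T / np.sqrt(d)           # scaled QK^T, shape (N, N)

# Causal mask M: True on and below the diagonal (allowed), False above (future).
M = np.tril(np.ones((N, N), dtype=bool))
scores = np.where(M, scores, -np.inf)   # future entries -> -inf before softmax

# Row-wise softmax: the -inf entries get exactly zero attention weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V   # row i mixes only values from positions 0..i
print(np.round(weights, 3))   # upper triangle is all zeros
```

Because the mask is applied to the whole N x N score matrix at once, the per-word supervision of option B is obtained for every target position in a single forward pass, with no sequential loop over words; that is what makes decoder training parallelizable.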