BERT: why must the hidden size be a multiple of the number of attention heads?

ff29svar · asked 4 months ago in Other

if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))
Why must the hidden size be a multiple of the number of attention heads?
(From line 804 of modeling.py.)


cngwdvgl1#

The reason is in the next line of code:
attention_head_size = int(hidden_size / num_attention_heads)
attention_head_size is used in the attention layer as the per-head dimension of the query and key vectors (the query and value dimensions may differ, but the query and key dimensions must be equal). Since the hidden dimension is split evenly across num_attention_heads heads and the per-head outputs are concatenated back to hidden_size, the division must leave no remainder.
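
A minimal NumPy sketch (not the actual modeling.py code; the 768/12 sizes are just the BERT-base defaults) of why the division must be exact: the hidden dimension is sliced into num_attention_heads equal pieces, and the per-head results are concatenated back to hidden_size.

import numpy as np

hidden_size = 768          # BERT-base default
num_attention_heads = 12   # BERT-base default
attention_head_size = hidden_size // num_attention_heads  # 64

x = np.random.randn(4, hidden_size)  # toy input: (seq_len, hidden_size)

# Split the hidden dimension into num_attention_heads equal slices;
# this reshape only works when hidden_size % num_attention_heads == 0.
heads = x.reshape(4, num_attention_heads, attention_head_size)

# ...each head would run attention on its own 64-dim slice here...

# Concatenating the per-head outputs must give back exactly hidden_size.
merged = heads.reshape(4, hidden_size)
assert merged.shape == x.shape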


vsaztqbk2#

The dimensions of the query and key should be equal => is this because they need to be multiplied to compute attention?

May I know how hidden_size, num_attention_heads, and the query vector size are linked with each other? Can they not be independent?


jm2pwxwz3#

dimension of query and key should be equal => is this because these need to be multiplied to calculate attention?
May I know how hidden_size, num_attention_heads and query vector size are linked with each other? Can these not be independent?
1. Query and key are multiplied to compute similarity, so their dimensions must be equal (see the sketch below).
2. num_attention_heads is independent of the query vector size; I think the author tied them together for convenience, but it is not strictly necessary.
3. You can find the details here: https://jalammar.github.io/illustrated-transformer/
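
A toy scaled dot-product attention in NumPy (an illustration, not the BERT implementation), showing point 1: Q is multiplied by the transpose of K, so their last dimensions must match, while the value dimension is free to differ.

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (seq_q, d_k), k: (seq_k, d_k), v: (seq_k, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_q, seq_k): needs q and k to share d_k
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v               # (seq_q, d_v)

q = np.random.randn(5, 64)  # query dimension 64
k = np.random.randn(7, 64)  # key dimension must also be 64
v = np.random.randn(7, 32)  # value dimension may differ
out = scaled_dot_product_attention(q, k, v)
assert out.shape == (5, 32)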


d8tt03nd4#

I had the same question, and after re-reading The Illustrated Transformer it seems resolved. I added some annotations to one of the images in that post.
I think it is now clear why the hidden size must be a multiple of the number of attention heads. Please correct me if I am wrong.
