Is there a mistake at 09:50? "You have 12 or 24 attention heads"; shouldn't that be 12 or 24 layers, with as many attention heads as there are tokens in the input/output sequence? Also, this is a VERY well done lecture series! We will probably have our own NLU course at our university based on these materials! This is a huge service for the next generation of natural-language data scientists!
I think, going by the original paper, both the number of attention heads and the number of layers are arbitrary hyperparameters and need not be affected by the number of tokens, though the paper does state its particular choices.
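A minimal sketch of this point (in PyTorch, with hypothetical BERT-base-style dimensions): the number of heads and layers is fixed when the model is built, and the same model then accepts sequences of any length.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters, chosen once at construction time.
d_model, num_heads, num_layers = 768, 12, 12

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# The same encoder processes inputs of different lengths;
# only the attention-matrix sizes change, not the number of heads or layers.
for seq_len in (5, 50, 512):
    x = torch.randn(1, seq_len, d_model)   # (batch, tokens, d_model)
    out = encoder(x)
    print(seq_len, out.shape)
```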
This is the most notable video about attention I have come across so far. Thank you for uploading this.
true