@10:05 - Excellent explanation of Byte-Pair Encoding, thanks.
The presenter says these models "do not use RNNs" (correct) but that instead "they use CNNs" (incorrect: there are no convolution kernels). They use simple linear transformations of the form XWᵀ + b.
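In plain NumPy terms, that position-wise transformation looks roughly like the sketch below (toy sizes, purely illustrative, not taken from the talk):

import numpy as np

# Position-wise linear transformation Y = X @ W.T + b: the same weight
# matrix is applied independently at every time step; no recurrence and
# no convolution kernel wider than one position.
seq_len, d_in, d_out = 8, 16, 10          # toy sizes, just for illustration
X = np.random.rand(seq_len, d_in)         # one sequence of 8 token vectors
W = np.random.rand(d_out, d_in)
b = np.random.rand(d_out)

Y = X @ W.T + b
print(Y.shape)                            # (8, 10)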
You can model convolution operations with transformers.
IMHO, that's debatable. Think about what happens when you apply the same dense layer to each input in a sequence: you're effectively running a 1D convolutional layer with kernel size 1. If you're familiar with Keras, try building a model with:
TimeDistributed(Dense(10, activation="relu"))
then replace it with this:
Conv1D(10, kernel_size=1, activation="relu")
You'll see that it gives precisely the same result (assuming you use the same random seeds).
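For what it's worth, here is a minimal sketch of that comparison (assuming TensorFlow 2.x / Keras; instead of relying on matching random seeds, it simply copies the dense layer's weights into the conv layer):

import numpy as np
from tensorflow.keras.layers import Conv1D, Dense, TimeDistributed

# Toy batch: 4 sequences, 8 time steps, 16 features per step.
x = np.random.rand(4, 8, 16).astype("float32")

dense = TimeDistributed(Dense(10, activation="relu"))
conv = Conv1D(10, kernel_size=1, activation="relu")

# Build both layers, then copy the dense kernel/bias into the conv layer
# (the conv kernel only has an extra leading axis of length 1).
dense(x), conv(x)
kernel, bias = dense.get_weights()
conv.set_weights([kernel[np.newaxis, ...], bias])

# With identical weights, the two layers produce identical outputs.
print(np.allclose(dense(x), conv(x)))   # True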
Since the Transformer architecture applies the same dense layers across all time steps, you can think of the whole architecture as a stack of 1D-Convolutional layers with kernel size 1 (then of course there's the important Multihead attention part, which is a different beast altogether).
Granted, it's not the most typical CNN architecture (typical CNNs use fairly few convolutional layers with kernel size 1), but still, it's not really an error to say the Transformer is based on convolutions. I think Martin's goal was mostly to highlight the fact that, unlike in RNNs, every time step gets processed in parallel.
Just my $.02! :))
Thank you. This summary/introduction is very, very helpful.
Can we have a link to the slides, please?
Wow, engineers sg sure has come a long way, ha! Great talk.
Where can I get this PPT? Please share the link.
Could I extract word embeddings from BERT and use them for unsupervised learning, e.g. topic modeling? :)
Very good updates for NLP enthusiasts.
Very clear and helpful talk
Does BPE also work well for non-English languages like Chinese and French?
BERT uses WordPiece; ALBERT uses SentencePiece.
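If anyone wants to see what those subword pieces look like, here is a quick sketch using the Hugging Face transformers library (not from the talk, just for illustration):

from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (downloads the vocab on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the vocabulary are split into subword pieces;
# WordPiece marks continuation pieces with a "##" prefix.
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']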
How is it CNN-based?
Very Informative.
very nice speech!
thank you!
Nice !
What in God's name are you talking about?? What is an LSTM chain?? I came here because I need to know I'm writing the correct content for my website, and I haven't a fucking clue what the hell you are on about.