It's so kind of you to introduce these papers to us in such a decent way. Thanks a lot.
“the man went to MASK store” makes a lot of sense these days.
Holy crap I had to laugh at this 😁
LOL
I watch the ads entirely just to show my support for this amazing channel
It's better if you just click it 10 times :))
This is one of your best NLP videos for me: a very quick but clear recap of language models, RNNs, word vectors, and attention, all to explain the BERT revolution. This is awesome, and I would love a series of recap videos like this. Kudos!
I like the way he knows he isn't the best at explaining stuff but still gives 110% to explain it! Thanks, man, for the amazing paper explanations.
This is my second comment on your videos. I am really thankful to you for creating such an informative video on BERT. Now I can go through the paper with some confidence.
Thanks for the feedback. Glad it helped
"the problem is that a character in itself doesn't really have a meaning"
f
Fantastic overview. Really appreciate your patient and detailed walk-through of the paper.
special thanks for tokenization detour and deeper dive into finetuning/evaluation tasks!
Thank you a lot. I searched a lot and read the paper, but I had difficulty understanding it until I watched your video. You make everything easy.
Hey, what BERT claims is in fact very similar to how a Transformer encoder layer works, as described in the "Attention Is All You Need" paper. The encoder sub-model is allowed to peek at future tokens as well.
That's not a secret; indeed, they describe the architecture in the paper as a Transformer encoder. The novelty is in using this encoder for language model pre-training.
Thank you. At 10:51, I think that although ELMo concatenates the left and right sides, when making a prediction through a softmax, the error back-propagated to the left side should be affected by the right side and vice versa. However, I understand what you mean by saying they're not that tightly coupled.
Does BERT take in fixed-length sequences for the question-and-paragraph task? If not, how is the variable-length input handled? Basically, what is the size of the data fed into the network?
The total state is fixed length, with padding or cropping if needed.
Amazing Explanation
Very nice explanation. Can you please elaborate on the token embeddings used in BERT? Are these the same 300-dimensional vectors from GloVe, or are these embeddings trained from scratch in BERT? How we get the base embeddings is something I am not able to understand. Thanks in advance for clarifying.
Thanks for the illustrations.
Thank you so much for your efforts.
I cannot say that I understood BERT from this video.
To understand a technical paper, a basic technical foundation is required. There are explanation videos out there targeted at laymen, but this video is for an audience that either can already read the paper and wants a summary instead, or knows what is going on but gets thrown off by the academic language and jargon in the paper.
At 25:10 you're talking about character-level tokens. Does that refer to the "Enriching Word Vectors with Subword Information" paper?
I'm referring to wordpieces, which are the sub-words, yes.
Somehow the Figure 1 comparison image is different in the arXiv 2019 version of the paper?
arXiv papers are pre-publication, not necessarily final versions. The text is a bit different too.
Great explanation! Thank you!
Sir, wonderful and clear explanation. I have a doubt: is a QA system built with BERT supervised or unsupervised? And is BERT a pre-trained model?
Thanks for doing your part :)
Absolutely LOVED the video.
Nicely! Thanks a lot.
Nicely done!
I have a question.
When you train a BERT model, let's say for a named-entity recognition task on a sentence like "Subscribe to Pewdiepie", does the BERT model automatically map the words 'Subscribe', 'to', and 'Pewdiepie' to its already-trained word embeddings learned from the corpus? If it does, that means the BERT model comes with a huge bag of word embeddings.
If you are using PyTorch, there is a BERT tokenizer available for it! I am not sure if TensorFlow has this.
It splits words into word pieces, and in the worst case into characters.
Dear Yannic,
Could you please share with me how to use BERT for fine-tuning on a regression task? My data looks like:
input: a sentence about 30 words long
output: a score in [0, 5].
Is it a good idea to use BERT for a dataset like this? I found some documents saying transfer learning is effective for a new dataset that is similar to the source task/dataset.
Thank you!
It depends. If your sentences are natural language (and preferably English), then it can make sense. Take a pre-trained BERT and feed the CLS vector into a regression head. Maybe huggingface has pre-built modules for exactly that already.
@@YannicKilcher Thank you for your prompt response! I still have some questions.
1) Do I need to put [SEP] at the end of my sentence, or only put [CLS] at the beginning? Some tutorials put [SEP] at the end and some do not, for a classification task (here I think we don't need [SEP]).
2) I did not see any pre-built module for regression on huggingface, only classification, question answering, and so on! Do you mean I should use the 'general' BertModel as you used in your tutorial and modify it for the regression task? I am sorry if it is a silly question; I am just taking my first steps into DL and I do not usually use PyTorch. Could you please show me more detail for this part?
Thank you so much!
@@tamvominh3272 1) You just need to try these things out. 2) In that case, just take the standard BERT encoder, take the CLS output, and run it through a linear layer with a regression loss.
@@YannicKilcher Thank you so much for your help!
@@tamvominh3272 HuggingFace has a model with a classification head built in. Follow their tutorials and examples. Let the tokenizer do the work. Very handy.
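For reference, a minimal sketch of that setup in PyTorch with the huggingface transformers library (the BertRegressor class, model name, and example score here are illustrative assumptions, not code from the video; it assumes a recent transformers version):

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Hypothetical regression head on top of the [CLS] vector.
class BertRegressor(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)  # predicts one score

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.head(cls_vector).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertRegressor()
batch = tokenizer(["an example sentence"], return_tensors="pt", padding=True, truncation=True)
score = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(score, torch.tensor([3.5]))  # regression loss against a gold score in [0, 5]

Fine-tuning then just means back-propagating this loss through both the head and the BERT encoder.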
Nice video. This is an excerpt from the paper which I am not able to understand: "Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context." Can you please help me? I am not able to understand how a word can see itself after incorporating bidirectionality. Thanks.
Consider the sequence "A B C" and try to reconstruct the tokens with bidirectional contexts and two hidden layers. The embedding of C in hidden layer 1 will have attention to the input B and the embedding of B in hidden layer 2 will have attention to the layer 1 embedding of C. So the embedding of B has direct access to the input token B, which makes the reconstruction task trivial.
@@YannicKilcher Oh, I get it now. Thanks.
BERT was ready for the pandemic way before it even started.
Thank you so much for the amazing paper explanation!
Does the part at 16:28 mean they pre-train on the two tasks at the same time (predict the mask "and" the isNext label), or train them in order (pre-train on task 1, then on task 2)?
Bert is so cool!
Why are the inputs (word, segment, and position embeddings) summed together instead of concatenated into a vector? Doesn't the summation lead to ambiguity / information loss?
How does BERT handle various sized inputs?
Usually you pad them.
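For instance, with the huggingface tokenizer (a sketch, assuming a recent transformers version), padding and truncation to a fixed length look roughly like this:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentences = ["a short sentence", "a somewhat longer sentence that still fits easily"]

# Pad every sequence to the same length and mask out the padding tokens.
batch = tokenizer(sentences, padding="max_length", truncation=True,
                  max_length=16, return_tensors="pt")
print(batch["input_ids"].shape)    # torch.Size([2, 16])
print(batch["attention_mask"][0])  # 1 for real tokens, 0 for [PAD]

The attention mask tells the model to ignore the [PAD] positions.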
Can we download the pretrained BERT model and use it on our GPU machines?
github.com/google-research/bert has what you need
You can also get sentence vectors with github.com/hanxiao/bert-as-service.
HuggingFace has what you want
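A minimal sketch of loading a pretrained checkpoint from huggingface and running it on a GPU (assuming the transformers library and a CUDA device are available; the example sentence is just for illustration):

import torch
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

inputs = tokenizer("Subscribe to Pewdiepie", return_tensors="pt").to(device)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape (1, seq_len, 768) for BERT-base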
Hi! Thanks a lot for this video! I was searching for information about out-of-vocabulary words, and I found it in your talk :) However, one point remains unclear: how do we tokenize out-of-vocabulary words? I mean, how do we divide words into characters or word-pieces? What algorithm is used to divide "subscribe" into "sub + s + c + r + i + b + e" and not "sub + scribe"? I understand that it depends on the vocabulary, but how exactly is it performed? Thanks a lot again) (BTW I subscribed :)) )
That's usually determined by a heuristic. It tries to split it into as few tokens as possible, given some vocabulary.
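A rough sketch of that greedy longest-match-first heuristic, which is how WordPiece tokenization is usually described (the tiny vocabulary here is made up for illustration; the real vocab is learned from the training corpus):

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedily take the longest prefix that is in the vocabulary,
    # then continue on the remainder, marking continuation pieces with "##".
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # the vocabulary cannot cover this span at all
        pieces.append(piece)
        start = end
    return pieces

vocab = {"sub", "##scribe", "##s", "to"}
print(wordpiece_tokenize("subscribe", vocab))  # ['sub', '##scribe']

So "sub + scribe" wins over the character-level split simply because longer matches are tried first.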
Sir, is it possible to apply the BERT model to Wikipedia tagging? And could we combine BERT with another classifier?
What do you mean by tagging?
a big thank you
Can someone please explain how the language modeling task used to train OpenAI GPT is unsupervised, as mentioned at 12:43? Thanks
All training signal comes from the input data itself, there are no external labels.
Thanks a lot !
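As a toy illustration of "the training signal comes from the data itself": a left-to-right language model like GPT just predicts each next token from the prefix, so inputs and targets are shifted copies of the same text (a sketch, not the actual GPT data pipeline):

text = ["the", "man", "went", "to", "the", "store"]

# For a left-to-right LM, the target at position i is simply the token at position i + 1.
inputs  = text[:-1]   # ['the', 'man', 'went', 'to', 'the']
targets = text[1:]    # ['man', 'went', 'to', 'the', 'store']

for context_end, target in enumerate(targets, start=1):
    print(inputs[:context_end], "->", target)

No human-annotated labels are needed; raw text is enough.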
ELMo does left and right also. Why does it not do as well as BERT? Maybe because BERT uses attention... any thoughts?
He explained that very well in the video. To me, BERT is actually not traditionally bidirectional like ELMo; it is more like omnidirectional!
Yeah, but in ELMo it is not at the same time: they go from left to right and then from right to left and concatenate the two, so it's a shallow form of bidirectionality.
Could I extract word embeddings from BERT and use them for unsupervised learning, e.g. topic modeling? :)
Sure
But why not take the encoding of the full sentence for topic modeling? Why stop at word embeddings? You will lose all the context in the sentence/paragraph.
Hi, firstly thanks for posting this one.
I have a question. Let's say I want to pre-train BERT, so I have some text, but how is the word embedding input (the token embedding part) generated? Is it first generated randomly? For example, if we only have two words, "yes" and "no", then after one-hot encoding we can say yes -> 10 and no -> 01. When a sentence like "yes no" enters the BERT model, how are the word embeddings of these two words initialized?
Also, if we want to fine-tune the model, does that mean we have pre-trained embeddings (for example, in the same way as GloVe word embeddings)? Or are they random, as in the pre-training stage?
In other words, does fine-tuning BERT start from pre-trained embeddings?
A pre-trained model learns the embeddings of the vocabulary provided to it. These learnt embeddings are then used as the initial embeddings in the fine-tuning phase, along with the classification layer weights.
ELMo and BERT...
What next?
Kermit?
Loved It :)
Hello Sir, thanks for the video. I have a question. What confuses me when I see BERT or GPT in the picture at 14:16 is why transformers are shown so many times in a layered format. When I read about the transformer, it takes all the words of a sequence at once and passes them through layers of encoders (attention + fully connected layer). In BERT we are also passing all the words to the transformer, right? Then why are so many transformers shown (in circles)? Is BERT a collection of many transformers (a combination of encoders + decoders)?
A transformer is multiple layers of attention and feed-forward networks stacked on top of each other.
@@YannicKilcher Thanks for your kind reply. But why does the picture show a series of transformers? Shouldn't it be one transformer within which there is a series of encoder layers (attention + feed-forward)?
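A minimal sketch of that stacking in PyTorch, using the built-in encoder modules (assuming a recent PyTorch version; the hyperparameters roughly match BERT-base but are only illustrative):

import torch
import torch.nn as nn

# One "transformer" here is a stack of identical encoder layers,
# each consisting of self-attention followed by a feed-forward network.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                           dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # BERT-base stacks 12 such layers

tokens = torch.randn(1, 10, 768)  # (batch, sequence length, hidden size)
contextual = encoder(tokens)      # same shape; every position attends to every other position
print(contextual.shape)           # torch.Size([1, 10, 768])

So the repeated circles in the figure are the layers of one encoder stack; BERT is one such stack (encoder only, no decoder).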
Lol loved the explanation and the pewdiepie reference. Hope to learn a lot more from your paper explanations.
It would be good if you ran your audio through a compressor; it's a little hard to understand.
Thanks, will try
HYPE!
I just hope Amazon doesn't come out with ERNIE...
You explained it well, brother.
I've subscribed to PewDiePie!
21:16
9 year olds teaching NLP
Kobe Bryant said the best type of success is to have a child’s heart of passion. Constantly asking questions.
These transformers are a bad idea; they will never lead to any commercial product.
The best part was when you explained the word table using "Subscribe to Pewdiepie".
I lost interest after 10 minutes, as you explain useless parts and not the core. It might make sense to focus on the topic and assume some things are known.