It's so kind of you to introduce these papers to us in such a decent way. Thanks a lot.
“the man went to MASK store” makes a lot of sense these days.
Holy crap I had to laugh at this 😁
LOL
I watch the ads entirely just to show my support for this amazing channel
It's better if you just click it 10 times :))
This is one of your best NLP videos for me: a very quick but clear recap of language models, RNNs, word vectors, and attention, all to explain the BERT revolution. This is awesome, and I would love a series of recap videos like this. Kudos!
I like the way he knows he isn't the best at explaining stuff but still gives 110% to explain it! Thanks, man, for the amazing paper explanations.
This is my second comment on your videos. I am really thankful to you for creating such an informative video on BERT. Now I can go through the paper with some confidence.
Thanks for the feedback. Glad it helped
"the problem is that a character in itself doesn't really have a meaning"
f
Fantastic overview. Really appreciate your patient and detailed walk-through of the paper.
special thanks for tokenization detour and deeper dive into finetuning/evaluation tasks!
Thank you a lot. I searched a lot and read the paper, but I had difficulty understanding it until I watched your video. You make everything easy.
Hey, what BERT claims is in fact very similar to how a Transformer encoder layer works, as described in the "Attention Is All You Need" paper. The encoder sub-model is allowed to peek at future tokens as well.
That's not a secret; indeed, they describe the architecture in the paper as a Transformer encoder. The novelty is in using this encoder for language model pre-training.
Thank you. At 10:51, I think that although ELMo concatenates the left and right sides, when making a prediction through a softmax, the error back-propagated to the left side should be affected by the right side and vice versa. However, I understand what you mean by saying they're not that tightly coupled.
Does BERT take in fixed-length sequences for the question-and-paragraph task? If not, how is the variable-length input handled? Basically, what is the size of the data fed into the network?
The total state is fixed length, with padding or cropping if needed.
Amazing Explanation
Very nice explanation. Can you please elaborate on the token embeddings used in BERT? Are these the same 300-dimensional vectors from GloVe, or are these embeddings trained from scratch in BERT? How we get the base embeddings is something I am not able to understand. Thanks in advance for clarifying.
Thanks for the illustrations.
Thank you so much for your efforts.
I cannot say that I understood BERT from this video.
To understand a technical paper, a basic technical foundation is required. There are explanation videos out there targeted at laymen, but this video is for an audience that either can already read the paper and wants a summary instead, or knows what is going on but gets thrown off by the academic language and jargon in the paper.
At 25:10 you're talking about character-level tokens. Does that refer to the "Enriching Word Vectors with Subword Information" paper?
I'm referring to wordpieces, which are the sub-words, yes.
Somehow the Figure 1 comparison image is different in the arXiv 2019 version of the paper?
arXiv papers are pre-publication, not necessarily final versions. The text is a bit different too.
Great explanation! Thank you!
Sir, wonderful and clear explanation. I have a doubt: is a QA system built with BERT supervised or unsupervised? And is BERT a pre-trained model?
Thanks for doing your part :)
Absolutely LOVED the video.
Nicely! Thanks a lot.
Nicely done!
I have a question.
When you train a BERT model, let's say for a named-entity recognition task on a sentence like "Subscribe to Pewdiepie", does the BERT model automatically map the words 'Subscribe', 'to', and 'Pewdiepie' to its already-trained word embeddings learned from the corpus? If it does, that means the BERT model comes with a huge bag of word embeddings.
If you are using PyTorch, there is a BERT tokenizer available for it! I am not sure if TensorFlow has this.
It splits words into word pieces, and in the worst case into characters.
Dear Yannic,
Could you please share with me how to use BERT for fine-tuning on a regression task? My data looks like:
input: a sentence about 30 words long
output: a score in [0, 5].
Is it a good idea to use BERT for a dataset like this? I found some documents saying transfer learning is effective for a new dataset that is similar to the source task/dataset.
Thank you!
It depends. If your sentences are natural language (and preferably English), then it can make sense. Take a pre-trained BERT and feed the CLS vector into a regression head. Maybe huggingface has pre-built modules for exactly that already.
@@YannicKilcher Thank you for your prompt response! I still have some questions.
1) Do I need to put [SEP] at the end of my sentence, or only put [CLS] at the beginning? Some tutorials put [SEP] at the end and some do not, for a classification task (here I think we don't need [SEP]).
2) I did not see any pre-built module for regression on huggingface, only classification, question answering, and so on! Do you mean I should use the 'general' BertModel as you used in your tutorial and modify it for the regression task? I am sorry if it is a silly question; I am just taking my first steps into DL and I do not usually use PyTorch. Could you please show me more detail for this part?
Thank you so much!
@@tamvominh3272 1) You just need to try these things out. 2) In that case, just take the standard BERT encoder, take the CLS output, and run it through a linear layer with a regression loss.
@@YannicKilcher Thank you so much for your help!
@@tamvominh3272 HuggingFace has a model with a classification head built in. Follow their tutorials and examples. Let the tokenizer do the work. Very handy.
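For reference, a minimal sketch of that setup in PyTorch with the huggingface transformers library (the BertRegressor class, model name, and example score here are illustrative assumptions, not code from the video; it assumes a recent transformers version):

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Hypothetical regression head on top of the [CLS] vector.
class BertRegressor(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)  # predicts one score

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.head(cls_vector).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertRegressor()
batch = tokenizer(["an example sentence"], return_tensors="pt", padding=True, truncation=True)
score = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(score, torch.tensor([3.5]))  # regression loss against a gold score in [0, 5]

Fine-tuning then just means back-propagating this loss through both the head and the BERT encoder.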
Nice video. This is an excerpt from the paper which I am not able to understand: "Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context." Can you please help me? I am not able to understand how a word can see itself after incorporating bidirectionality. Thanks.
Consider the sequence "A B C" and try to reconstruct the tokens with bidirectional contexts and two hidden layers. The embedding of C in hidden layer 1 will have attention to the input B and the embedding of B in hidden layer 2 will have attention to the layer 1 embedding of C. So the embedding of B has direct access to the input token B, which makes the reconstruction task trivial.
@@YannicKilcher Oh, I get it now. Thanks.
BERT was ready for the pandemic way before it even started.
Thank you so much for the amazing paper explanation!
Does the part at 16:28 mean they pre-train on the two tasks at the same time (predict the mask "and" the isNext label), or train them in order (pre-train on task 1, then on task 2)?
Bert is so cool!
Why are the inputs (word, segment, and position embeddings) summed together instead of concatenated into a vector? Doesn't the summation lead to ambiguity / information loss?
How does BERT handle various sized inputs?
Usually you pad them.
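For instance, with the huggingface tokenizer (a sketch, assuming a recent transformers version), padding and truncation to a fixed length look roughly like this:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentences = ["a short sentence", "a somewhat longer sentence that still fits easily"]

# Pad every sequence to the same length and mask out the padding tokens.
batch = tokenizer(sentences, padding="max_length", truncation=True,
                  max_length=16, return_tensors="pt")
print(batch["input_ids"].shape)    # torch.Size([2, 16])
print(batch["attention_mask"][0])  # 1 for real tokens, 0 for [PAD]

The attention mask tells the model to ignore the [PAD] positions.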
Can we download the pretrained BERT model and use it on our GPU machines?
github.com/google-research/bert has what you need
You can also get sentence vectors with github.com/hanxiao/bert-as-service.
HuggingFace has what you want
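A minimal sketch of loading a pretrained checkpoint from huggingface and running it on a GPU (assuming the transformers library and a CUDA device are available; the example sentence is just for illustration):

import torch
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

inputs = tokenizer("Subscribe to Pewdiepie", return_tensors="pt").to(device)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape (1, seq_len, 768) for BERT-base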
Hi! Thanks a lot for this video! I was searching for information about out-of-vocabulary words, and I found it in your talk :) However, one point remains unclear: how do we tokenize out-of-vocabulary words? I mean, how do we divide words into characters or word-pieces? What algorithm is used to divide "subscribe" into "sub + s + c + r + i + b + e" and not "sub + scribe"? I understand that it depends on the vocabulary, but how exactly is it performed? Thanks a lot again) (BTW I subscribed :)) )
That's usually determined by a heuristic. It tries to split it into as few tokens as possible, given some vocabulary.
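A rough sketch of that greedy longest-match-first heuristic, which is how WordPiece tokenization is usually described (the tiny vocabulary here is made up for illustration; the real vocab is learned from the training corpus):

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedily take the longest prefix that is in the vocabulary,
    # then continue on the remainder, marking continuation pieces with "##".
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # the vocabulary cannot cover this span at all
        pieces.append(piece)
        start = end
    return pieces

vocab = {"sub", "##scribe", "##s", "to"}
print(wordpiece_tokenize("subscribe", vocab))  # ['sub', '##scribe']

So "sub + scribe" wins over the character-level split simply because longer matches are tried first.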
Sir, is it possible to apply the BERT model to Wikipedia tagging? And could we combine BERT with another classifier?
What do you mean by tagging?
a big thank you
Can someone please explain how the language modeling task used to train OpenAI GPT is unsupervised, as mentioned at 12:43? Thanks
All training signal comes from the input data itself, there are no external labels.
Thanks a lot !
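As a toy illustration of "the training signal comes from the data itself": a left-to-right language model like GPT just predicts each next token from the prefix, so inputs and targets are shifted copies of the same text (a sketch, not the actual GPT data pipeline):

text = ["the", "man", "went", "to", "the", "store"]

# For a left-to-right LM, the target at position i is simply the token at position i + 1.
inputs  = text[:-1]   # ['the', 'man', 'went', 'to', 'the']
targets = text[1:]    # ['man', 'went', 'to', 'the', 'store']

for context_end, target in enumerate(targets, start=1):
    print(inputs[:context_end], "->", target)

No human-annotated labels are needed; raw text is enough.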
ELMo does left and right also. Why does it not do as well as BERT? Maybe because BERT uses attention... any thoughts?
He explained that very well in the video. To me, BERT is actually not traditionally bidirectional like ELMo; it is more like omnidirectional!
Yeah, but in ELMo it is not at the same time: they go from left to right and then from right to left and concatenate the two, so it's a shallow form of bidirectionality.
Could I extract word embeddings from BERT and use them for unsupervised learning, e.g. topic modeling? :)
Sure
But why not take the encoding of the full sentence for topic modeling? Why stop at word embeddings? You will lose all the context in the sentence/paragraph.
Hi, firstly thanks for posting this one.
I have a question. Let's say I want to pre-train BERT, so I have some text, but how is the word embedding input (the token embedding part) generated? Is it first generated randomly? For example, if we only have two words, "yes" and "no", then after one-hot encoding we can say yes -> 10 and no -> 01. When a sentence like "yes no" enters the BERT model, how are the word embeddings of these two words initialized?
Also, if we want to fine-tune the model, does that mean we have pre-trained embeddings (for example, in the same way as GloVe word embeddings)? Or are they random, as in the pre-training stage?
In other words, does fine-tuning BERT start from pre-trained embeddings?
A pre-trained model learns the embeddings of the vocabulary provided to it. These learnt embeddings are then used as the initial embeddings in the fine-tuning phase, along with the classification layer weights.
ELMo and BERT...
What next?
Kermit?
Loved It :)
Hello Sir, thanks for the video. I have a question. What confuses me when I see BERT or GPT in the picture at 14:16 is why transformers are shown so many times in a layered format. When I read about the transformer, it takes all the words of a sequence at once and passes them through layers of encoders (attention + fully connected layer). In BERT we are also passing all the words to the transformer, right? Then why are so many transformers shown (in circles)? Is BERT a collection of many transformers (a combination of encoders + decoders)?
A transformer is multiple layers of attention and feed-forward networks stacked on top of each other.
@@YannicKilcher Thanks for your kind reply. But why does the picture show a series of transformers? Shouldn't it be one transformer within which there is a series of encoder layers (attention + feed-forward)?
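A minimal sketch of that stacking in PyTorch, using the built-in encoder modules (assuming a recent PyTorch version; the hyperparameters roughly match BERT-base but are only illustrative):

import torch
import torch.nn as nn

# One "transformer" here is a stack of identical encoder layers,
# each consisting of self-attention followed by a feed-forward network.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                           dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # BERT-base stacks 12 such layers

tokens = torch.randn(1, 10, 768)  # (batch, sequence length, hidden size)
contextual = encoder(tokens)      # same shape; every position attends to every other position
print(contextual.shape)           # torch.Size([1, 10, 768])

So the repeated circles in the figure are the layers of one encoder stack; BERT is one such stack (encoder only, no decoder).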
Lol loved the explanation and the pewdiepie reference. Hope to learn a lot more from your paper explanations.
It would be good if you ran your audio through a compressor; it's a little hard to understand.
Thanks, will try
HYPE!
I just hope Amazon doesn't come out with ERNIE...
You explained it well, brother.
I've subscribed to PewDiePie!
21:16
9 year olds teaching NLP
Kobe Bryant said the best type of success is to have a child’s heart of passion. Constantly asking questions.
These transformers are a bad idea; they will never lead to any commercial product.
The best part was when you explained the word table using "Subscribe to Pewdiepie".
I lost interest after 10 minutes, as you explain useless parts and not the core. It might make sense to focus on the topic and assume some things are known.