BERT Research - Ep. 2 - WordPiece Embeddings

  • Published: 17 Nov 2024

Comments • 126

  • @amo6002
    @amo6002 5 years ago +79

    Chris, I am extremely impressed. You are doing a great job explaining what's going on and breaking things down so things make sense. Please keep up the good work.

  • @SumanDebnathMTAIE
    @SumanDebnathMTAIE 3 years ago +3

    Chris, you will have a special place in heaven, we all are so blessed to have you.

  • @TheSiddhartha2u
    @TheSiddhartha2u 4 years ago +2

    Hi Chris,
    I am an enthusiast who also started learning deep learning and related areas some time back. I came across your videos on YouTube while primarily searching for something related to BERT. You are really explaining things in a simple way. You are really doing a great job. Thank you very much for helping me and other enthusiastic people like me.

  • @VasudevSharma01
    @VasudevSharma01 1 year ago

    This channel does not have enough visibility; this is top-notch content, and it almost feels like we are doing a group activity.
    Every subscriber over here is worth 100x more than any popular media subscriber.

  • @Breaking_Bold
    @Breaking_Bold 9 months ago

    Excellent... very good... Chris, I wish I had found this channel sooner.

  • @kyungtaelim4412
    @kyungtaelim4412 4 years ago +4

    What a nice explanation! Speaking as an NLP researcher, this is a wonderful lecture even for people who are already familiar with BERT.

  • @JamesBond-ux1uo
    @JamesBond-ux1uo 2 years ago +1

    Thanks, Chris, for the great explanation.

  • @vinodp8577
    @vinodp8577 4 years ago +1

    I think the reason why words are broken into chunks with the 1st chunk not having the ## prefix is to be able to decode a BERT output. Say we are doing a Q&A-like task: if every token the model gave out was a ##subword token, then we cannot uniquely decode the output. If only the 2nd chunk onwards has the ## prefix, you can uniquely decode BERT's output. Great video. Thank you!

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks Vinod, that’s a good point! By differentiating, you have (mostly) what you need to reconstruct the original sentence. I know that SentencePiece took this a step deeper by representing white space.
      As for Q&A, BERT on its own isn't able to generate text sequences at its output--it's the same tokens on the input and output. But maybe you know of a way to add something on top of BERT to generate text? Thanks!
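
To illustrate the point about the '##' markers in this thread, here is a minimal sketch (assuming the Hugging Face `transformers` package and the `bert-base-uncased` tokenizer) showing how sub-tokens can be stitched back into whole words precisely because only non-initial pieces carry the '##' prefix:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("embeddings")
print(tokens)  # ['em', '##bed', '##ding', '##s']

# Rebuild whole words: a '##' piece glues onto the previous piece,
# a piece without '##' starts a new word.
words = []
for tok in tokens:
    if tok.startswith("##"):
        words[-1] += tok[2:]
    else:
        words.append(tok)
print(words)  # ['embeddings']
```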

  • @mahmoudtarek6859
    @mahmoudtarek6859 3 years ago

    I hope you reach 10 million subs. Great Instructor. God bless you

  • @nospecialmeaning2
    @nospecialmeaning2 4 years ago +1

    I like how additional pieces of information are thrown in - I am super new to this, and I hadn't known what fastText was! Thank you so much!

  • @sahibsingh1563
    @sahibsingh1563 4 years ago +17

    Real Gem of work.
    This tutorial really flabbergasted me :)

  • @briancase6180
    @briancase6180 3 years ago

    Finally! Someone explains how the embeddings work. For some reason, I just never saw that explained. I figured it out from other explanations of how the embeddings are combined, but I still wondered if I understood correctly. Now I know! Thanks!!

  • @ms1153
    @ms1153 3 years ago

    Thank you very much for your effort to bring light to this field. I spent weeks digging on the web looking for good BERT/Transformers explanations. My feeling is that there are a lot of videos/tutorials made by people who don't actually understand what they are trying to explain. After finding you and buying your material I started to understand BERT and Transformers. MANY THANKS.

  • @deeplearningai5523
    @deeplearningai5523 3 years ago +3

    Looking at the number of likes, it seems not many learners look hard for quality stuff. This whole series not only teaches the topic but also gives a good idea of investigative research and building up knowledge. Thank you for such amazing work.

  • @mohamedramadanmohamed3744
    @mohamedramadanmohamed3744 4 years ago

    There is no better series for such an important topic as BERT; it is amazing.
    Chris, thanks and please keep going.

  • @thalanayarmuthukumar5472
    @thalanayarmuthukumar5472 4 years ago

    Explained very simply. It feels like we are with you on your learning journey. Thanks

  • @isaacho4181
    @isaacho4181 4 years ago +1

    Clear explanation. Rewatched many times.

  • @wiseconcepts774
    @wiseconcepts774 1 year ago

    Awesome videos, helpful to understand the applications properly. Keep up the excellent work Chris

  • @DarkerThanBlack89
    @DarkerThanBlack89 4 years ago +5

    Really amazing work!! I am so glad I found this. You would make a great teacher for sure! Thank you for all your work.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +3

      Thank you! It's a lot of work to put this stuff together, so I really appreciate your encouragement :)

  • @nikhilcherian8931
    @nikhilcherian8931 4 years ago

    Awesome video series, covering a lot of areas and analogies. I am also researching BERT, and your videos provided a nice picture of it.

  • @saeideh1223
    @saeideh1223 2 years ago +1

    This is amazing. Thank you for your understandable explanation.

  • @emmaliu6524
    @emmaliu6524 4 years ago

    Thank you so much for sharing! This is incredibly useful for someone searching for tutorials to learn about BERT.

  • @shardulkulkarni3999
    @shardulkulkarni3999 2 years ago

    Very informative and crisp and clear. I find you very unique and confident in what you teach. I've been surfing the web and watching YouTube videos on NLP for about a week now, and bluntly put, I did not understand jack shit with what's already out there. Someone who explains the theory doesn't teach you the code, while someone who puts out the code doesn't explain the theory, and as a result I cannot understand what is really going on. I really, really love your videos and hope that you keep on making them.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  2 years ago

      Loved reading this--thanks so much, Shardul! 😁

  • @computervisionetc
    @computervisionetc 4 years ago +1

    Chris, the excellent quality of this tutorial "flabbergasts" me (just kidding, all your tutorials are excellent)

  • @roushanraj8530
    @roushanraj8530 4 years ago

    Chris, you are awesome sir, you make it very simple and easy to understand 💯✔✔💯

  • @qian2718
    @qian2718 4 years ago +2

    Thank you so much! These videos really save my life

  • @notengonickname
    @notengonickname 4 years ago

    Thank you... just what I needed to start my journey in BERT

  • @СергейПащенко-р5ж
    @СергейПащенко-р5ж 4 years ago

    Thanks a lot for your lessons, Chris. Hello from Ukraine)

  • @AG-en1ht
    @AG-en1ht 4 years ago +1

    You make really great videos for beginners. Thank you very much!

  • @alassanndiallo
    @alassanndiallo 3 years ago

    Amazing Professor! A very impressive way to teach.

  • @mahadevanpadmanabhan9314
    @mahadevanpadmanabhan9314 4 years ago

    Very well done. A lot of effort went into analyzing the vocab.txt.

  • @akashkadel7281
    @akashkadel7281 4 years ago

    Hey Chris, this video helped me understand the concepts with so much ease. You are doing amazing work for people new to advanced NLP. Thank you so much :)

  • @majinfu
    @majinfu 4 years ago

    Thank you for your great explanation and exploration! That helps me a lot to understand the tokenizer of BERT! Many thanks!

  • @dec13666
    @dec13666 3 years ago

    A very kroxldyphivc video!
    Keep it up with the good work!

  • @rahulpadhy7544
    @rahulpadhy7544 1 year ago

    Just awesome, simply too good!!

  • @prasanthkara
    @prasanthkara 4 years ago

    Chris, very informative video, especially for beginners. Thanks a lot

  • @vladimirbosinceanu5778
    @vladimirbosinceanu5778 3 years ago

    This is amazing Chris. A huge thank you!

  • @sushasuresh1687
    @sushasuresh1687 4 years ago

    Love love love your videos. Thank you !!

  • @scottmeredith4578
    @scottmeredith4578 4 years ago

    Very good video. For your further research on names, you could check against the Social Security Death Index file: every first and last name registered in the USA Social Security database from the start of SS up to the year 2013, several million unique names, broken into first and last.

  • @vijayendrasdm
    @vijayendrasdm 4 years ago +3

    Hi Chris
    Great video. I have a question.
    At 12:10 you say "The tokens at the beginning of the word can be redundant with the token for the whole word."
    So in the bedding example (bed, ##ding): does it mean the tokens for "bed" and "bedding" would be similar?
    I see there is a token for ##ding in the vocab, so must the token for "bedding" be a combination of "token for bed" + "token for ##ding"?

    • @bavalpreetsingh4664
      @bavalpreetsingh4664 3 years ago

      I am glad that someone has raised this question; the same doubt is in my mind as well.

    • @bavalpreetsingh4664
      @bavalpreetsingh4664 3 years ago

      @ChrisMcCormickAI please let us know ASAP, thank you :)

  • @xiquandong1183
    @xiquandong1183 4 years ago +2

    Hey, this was an excellent video. You said at the beginning that we have a dictionary of token indices and embeddings. How are these embeddings obtained during pre-training? Are they initialized with some context-independent embedding like GloVe, or do we get the embeddings by having a weight matrix of size (30,000 x 768) where each input is represented as a one-hot vector, and then use those trained parameters to get the embeddings?

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +2

      Hi Rajesh, thanks for your comment.
      When the Google team trained BERT on the Book Corpus (the "pre-training" step), I have to assume that they started with randomly initialized embeddings. Because they used a WordPiece model, with a vocabulary selected based on the statistics of their training set, I don't think they could have used other pre-trained embeddings even if they wanted to. So the embeddings would have been randomly initialized, and then learned as part of training the whole BERT model. Does that answer your question?
      Thanks,
      Chris
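
A rough PyTorch sketch of what a randomly initialized WordPiece embedding table looks like (illustrative only, not Google's actual pre-training code; the token IDs below are hypothetical): the table is just another weight matrix, so its rows get updated by backpropagation along with the rest of the model during pre-training.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768                      # bert-base vocabulary and hidden sizes
token_embeddings = nn.Embedding(vocab_size, hidden_size)  # rows start out random

token_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 102]])  # hypothetical IDs for one short input
vectors = token_embeddings(token_ids)                     # lookup: shape (1, 6, 768)
print(vectors.shape)
```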

  • @sarthakdargan6447
    @sarthakdargan6447 4 years ago

    Chris, you are doing really, really great work. Big fan! Will definitely look out for more awesome content.

  • @fdfdfd65
    @fdfdfd65 4 years ago

    Hello Chris. I'm wondering how to get the hidden states of a word split up by the tokenizer. For example, in the case of the word 'embedding', which is split into: em ##bed ##ding. Each token would be a different hidden state vector, right? Do you know some way of combining these hidden states into a single one that represents the original whole word, in order to do some manipulations? My objective is to compare similarity between words like 'embeddings' and 'embedded'. Thanks for your attention. Congrats on the EXCELLENT content of your YouTube channel. Sorry for my poor English skills.
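
One possible way to do this (and the approach Chris suggests in a reply further down the thread) is to average the hidden states of the word's sub-tokens into a single vector. A minimal sketch, assuming the Hugging Face `transformers` package and PyTorch:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("the embedding layer", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# tokens: ['[CLS]', 'the', 'em', '##bed', '##ding', 'layer', '[SEP]']

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)

# Average the sub-token vectors that belong to the word "embedding".
piece_idx = [i for i, t in enumerate(tokens) if t in ("em", "##bed", "##ding")]
word_vector = hidden[piece_idx].mean(dim=0)         # single 768-dim vector
print(word_vector.shape)                            # torch.Size([768])
```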

  • @gayatrivenugopal3
    @gayatrivenugopal3 4 years ago

    Thank you for the video; very informative.

  • @j045ua
    @j045ua 4 years ago +4

    Hey Chris! Great content! :D

  • @gulabpatel1480
    @gulabpatel1480 2 years ago

    Really great job!! I have one doubt: why do we still have [UNK] and [unused] tokens when subword embeddings are used to handle unknown words?

    • @nickryan368
      @nickryan368 2 years ago +1

      Good question! [unused] tokens are deliberately left as empty slots so that, if you need to, after training you can initialize a custom word using one of these open slots. [UNK] is for characters or subwords that appeared so infrequently that they did not make it into the vocabulary. The vocabulary is limited to a certain number of words/subwords/characters, and if you train on a large corpus of internet text you are likely to run into some very rare symbols/emojis/unicode characters that aren't used frequently enough to merit a slot in the vocabulary and can't be decomposed into any of the other more common subwords in the vocabulary.
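
As a quick illustration of both points (a sketch assuming the `transformers` package and `bert-base-uncased`): a rare symbol that can't be decomposed into known subwords maps to [UNK], while the [unused*] entries are just reserved empty slots in the vocabulary.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A rare unicode symbol has no vocabulary entry and no subword decomposition,
# so it likely falls back to the [UNK] token.
print(tokenizer.tokenize("hello \u2603"))   # the snowman symbol -> e.g. ['hello', '[UNK]']

# The reserved slots and [UNK] each have their own fixed IDs.
print(tokenizer.vocab["[unused0]"])         # 1
print(tokenizer.vocab["[UNK]"])             # 100
```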

  • @ZhonghaiWang
    @ZhonghaiWang 3 years ago

    thanks for such great content Chris!

  • @turkibaghlaf4565
    @turkibaghlaf4565 4 years ago

    Thank you so much for this great content Chris !

  • @breezhang4660
    @breezhang4660 4 years ago

    Thanks so much for this great video! How funny that the name list does not have your last name - you are a uniquely great e-educator! You mentioned you wonder why "##" is not added to the first subword. Looking around the web, I see others' work where the first subword of a word is labeled, but the remaining "##" pieces do not have labels (or only default labels). Then they can, if they want, give the same label to the ##-pieces at least, especially when labeling the subwords of someone's last name ;)

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +1

      Thanks Bree! When you mention "labelling" subwords, are you referring to marking them with the '##' (e.g., 'em' is "labeled" as '##em'), or are you referring to something else, like classification labels for something like Named Entity Recognition?

    • @breezhang4660
      @breezhang4660 4 years ago

      @@ChrisMcCormickAI Yes, I was referring to the latter one, classification tasks like labeling entities. It seems like one useful consequence of not marking the first subword with ##: even if a one-word last name gets incorrectly broken into subwords, the labels remain roughly correct.

  • @akshatgupta1710
    @akshatgupta1710 3 years ago

    Nice video! One question: why does BERT have Hindi and Chinese characters when the corresponding words are not in the vocabulary? What are some applications/uses of having them in the vocabulary?

  • @manikanthreddy8906
    @manikanthreddy8906 4 years ago +1

    Thanks Chris. That really helped.

  • @adityachhabra2629
    @adityachhabra2629 4 years ago

    super informative! please continue the good work

  • @maikeguderlei2146
    @maikeguderlei2146 5 years ago

    Hi Chris,
    thanks so much for the input and insights on how the BERT embeddings work. I think there is a small mistake though, as the values for the special tokens are one too high. The [PAD] token in BERT is indexed as 0, the [UNK] token as 100, and so forth, since Python usually starts enumerating with 0 and not with 1, which gets lost when saving the tokens to a txt file. At least these are the values that I get when I try out tokenizer.vocab["[PAD]"] etc.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  5 years ago +2

      Thanks for pointing that out, Maike! In the Colab Notebook I mention that the indices I'm reporting are 1-indexed rather than 0-indexed. I did this because I was referring to the line numbers in the "vocabulary.txt" file. Probably would have been better to keep things 0-indexed like you're saying, though.
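
For reference, the 0-indexed special token IDs Maike describes can be checked directly (a quick sketch assuming the `transformers` package):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The vocabulary maps token strings to 0-indexed IDs.
print(tokenizer.vocab["[PAD]"])   # 0
print(tokenizer.vocab["[UNK]"])   # 100
print(tokenizer.vocab["[CLS]"])   # 101
print(tokenizer.vocab["[SEP]"])   # 102
print(tokenizer.vocab["[MASK]"])  # 103
```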

  • @ppujari
    @ppujari 4 years ago

    This is one of the best tutorials on BERT. I have a question: in the word embedding lookup table, what would the values of, say, 5 features look like for "I"?

  • @arjunkoneru5461
    @arjunkoneru5461 4 years ago

    I think they do it because by adding ## they can reconstruct the original tokens back.

  • @lukt
    @lukt 2 years ago

    Thank you for this awesome explanation! :) I just have some trouble wrapping my head around why e.g. splitting "embedding" into "em", "bed" and "ding" would make any sense at all? Yes, the subtokens might be in the vocabulary, but "bed", "em" and "ding" have little to do with the meaning of the word "embedding" and should therefore be quite far from "embedding" in the vector space, or am I missing something?

    • @nickryan368
      @nickryan368 2 years ago +1

      Good question! Note that "embedding" gets split into "em" "##bed" and "##ding", so "bed" is a totally different token than "##bed," the first is one you sleep on and the second is "the middle or end subtoken of some larger word." That helps the problem somewhat. Also note that the embeddings are learned in the context of the sentence, so while river "bank" and financial "bank" have the same embedding, this embedding "understands" what the meaning should be in different contexts. But the short answer is simply that these embeddings are trained for a very long time on a lot of data. It is unintuitive that they should work in different contexts, and that when combined they should form a word that has a distinct meaning from the individual parts, but it's mostly a result of the sheer amount of pretraining. Hope that helps.

    • @lukt
      @lukt 2 years ago

      @@nickryan368 Ahh I see! That makes a whole lot more sense now! I missed the fact that ##bed and "bed" are in fact different tokens. I guess the training process differentiates the two even further (by learning the context in which they stand). Thank you for taking the time to write a detailed response! :)
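
To make the "bed" vs. "##bed" distinction from the reply above concrete, a small sketch (assuming the `transformers` package): the two strings are separate vocabulary entries, each with its own learned embedding row.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("embedding"))   # ['em', '##bed', '##ding']

# 'bed' (the thing you sleep on) and '##bed' (a word-internal piece)
# are distinct tokens with different vocabulary IDs.
print(tokenizer.vocab["bed"])
print(tokenizer.vocab["##bed"])
```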

  • @techwithshrikar3236
    @techwithshrikar3236 4 years ago

    Great content. What distance measure do you normally use for similarity?

  • @felipeacunagonzalez4844
    @felipeacunagonzalez4844 4 years ago

    Thank you Chris, you are great

  • @vivekmehta4862
    @vivekmehta4862 4 years ago

    Hi ChrisMcCormick! This video gave me an easy start, thanks for it, but I really want to know how to get the feature vector for any single word. This is not explained in the video.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Hi Vivek, I just responded to another of your comments with this, but for anyone else reading--Nick and I wrote a blog post / Colab Notebook showing how to do this here: mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

    • @vivekmehta4862
      @vivekmehta4862 4 years ago

      @@ChrisMcCormickAI Ok Chris, thank you so much. By the way, do you have any idea how to fine-tune the word embeddings with the help of a document clustering task? I have a set of unlabelled documents.

  • @venkatramanirajgopal7364
    @venkatramanirajgopal7364 4 years ago

    Hi Chris. Thank you for this video. I notice lots of single words from different languages: Chinese, Tamil, Japanese, Bengali. Is BERT base uncased trained on a multi-language corpus?

  • @vivekmittal3478
    @vivekmittal3478 4 years ago +1

    great series! Keep it up!

  • @eliminatorism
    @eliminatorism 4 years ago

    You are amazing. Thank you so much!

  • @monart4210
    @monart4210 4 years ago

    I really hope you didn't actually mention this, as I watched (and enjoyed! ;) ) the full video two weeks ago, but I don't remember you mentioning it... I am a bit confused as to how to then get embeddings for the actual words instead of WordPieces. I am using different kinds of embeddings for topic modeling, and for topic exploration (nearest neighbours, ...) I want to use actual words and not "em", "bed", "ding" instead of "embedding" :) Does applying BERT not make sense in such an unsupervised task, and should I rather stick to ELMo, which gives me an embedding per word? Maybe you could give me some feedback :)

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Yeah, I never thought about that! I would probably try just averaging the embeddings for the different word pieces together to create a single embedding.
      I'm curious about your application--I've used word embeddings from word2vec to find "similar terms", but this isn't possible with BERT embeddings, since they are context-dependent (there's not a single static vocab of embeddings, as there is in word2vec). How will you make use of these "contextualized" embeddings?
      Thanks,
      Chris

  • @shyambv
    @shyambv 5 years ago

    Thanks Chris for creating this video. Is it possible to add additional vocabulary to the pretrained BERT embeddings? A lot of the words which I want to use are not available in the BERT embeddings.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  5 years ago +1

      Hi Shyam, it's not possible to modify the BERT vocabulary. However, the BERT tokenizer will break down any word that it doesn't have in its vocabulary into a set of subwords (and individual characters, if necessary!).
      Something I am interested to learn more about is how big of a problem this is for datasets with a lot of "jargon" (application-specific language). When you fine-tune BERT on your task, the subword embedding vectors may be adjusted to better match your task. But in general, I *suspect* that BERT will not perform well on applications with lots of jargon.
      What's your dataset? I'd be curious to hear how it goes for you applying BERT!

  • @prakashkafle454
    @prakashkafle454 3 years ago

    Like word2vec, how does BERT make an embedding for every word? Waiting for the detailed steps. If I pass 512 as the sequence length, what will the dimensions be?

  • @RedionXhepa
    @RedionXhepa 4 years ago

    Great video. Thanks a lot!

  • @anaghajose691
    @anaghajose691 4 years ago

    Great job!! I have one doubt: why are we using a neural network? BERT itself can give the output, right?

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks, Anagha! Can you clarify your question for me--which neural network are you referring to? Thanks!

  • @mikemihay
    @mikemihay 5 years ago

    Hi Chris, do you know where I can find "pseudocode" for how BERT works? And also for word2vec? For word2vec I found your repo where you commented the original C code, but I feel overwhelmed going through 1000 lines of code :)

  • @kelvinjose
    @kelvinjose 4 years ago

    Hi, where does the initial lookup table exist? Or how does the whole lookup process work?

  • @dinasamir2778
    @dinasamir2778 4 years ago

    Thanks a lot for the great explanation

  • @ais3153
    @ais3153 3 years ago

    As I understand it, BERT has only 30,000 words with 768 features each... but BERT represents each word with a specific contextualized vector (i.e., depending on the context of the word)? Could someone explain that to me?
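
A sketch of what "contextualized" means here (assuming the `transformers` package and PyTorch): the 30,000-row lookup table itself is static, but the vector the model outputs for a token depends on the whole sentence, so the same word gets a different output vector in different contexts.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def word_output_vector(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[tokens.index(word)]

v1 = word_output_vector("he sat by the river bank", "bank")
v2 = word_output_vector("she deposited cash at the bank", "bank")

# Same input token, different output vectors depending on context.
print(torch.cosine_similarity(v1, v2, dim=0))
```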

  • @mouleshm210
    @mouleshm210 3 years ago

    Thanks, you made my day :)

  • @DebangaRajNeog
    @DebangaRajNeog 4 years ago

    Great tutorial.

  • @sumeetseth22
    @sumeetseth22 4 years ago

    Thanks Chris!

  • @HassanAmin77
    @HassanAmin77 4 years ago

    Chris, I need slides for your lectures in PPT or PDF format, to help with developing some lectures on BERT. Is it possible to share these?

  • @shaofengli5605
    @shaofengli5605 4 years ago

    Great Job!

  • @Julia-ej4jz
    @Julia-ej4jz 2 years ago

    1) It is interesting why em-bed-ding was split this way. These wordpieces have nothing to do with the meaning of the whole word. It seems that careless splitting only deteriorates the results. 2) How are different senses handled in this dictionary? For example, "like" as a verb and as a conjunction. What about "like" as a noun and "like" as a verb? Do they correspond to different entries in the dictionary?

  • @ZohanSyahFatomi
    @ZohanSyahFatomi 1 year ago

    JUST WOW WOW WOW

  • @yashwatwani4574
    @yashwatwani4574 4 years ago

    Where can I find the vocabulary.txt file?

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Hi yash, you can generate it by running the `Inspect BERT Vocabulary` notebook here:
      colab.research.google.com/drive/1MXg4G-twzdDGqrjGVymI86dJdm3kXADA
      But since you asked, I also uploaded the file and hosted it here:
      drive.google.com/open?id=12jxEvIxAmLXsskVzVhsC49sLAgZi-h8Q

  • @tervancovan
    @tervancovan 4 years ago

    I loooove it

  • @seonhighlightsvods9193
    @seonhighlightsvods9193 3 years ago

    thanks :)

  • @Aliabbashassan3402
    @Aliabbashassan3402 4 years ago

    Hi... does BERT work with the Arabic language?

  • @ilanaizelman3993
    @ilanaizelman3993 4 years ago

    Amazing

  • @jean-baptistedelabroise5391
    @jean-baptistedelabroise5391 3 years ago

    I feel like the BERT vocabulary is not really well built in the end. It feels like you should only have single-digit numbers. I feel like nouns might take up too much space in this vocab also. To modify the vocab I found this paper: www.aclweb.org/anthology/2020.findings-emnlp.129.pdf - it seems very interesting, but I have not tested their method yet.

  • @DrOsbert
    @DrOsbert 4 years ago

    Oops! You took more than 2 min for word emb 😆😜

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +1

      :D It's true, I have a hard time not going into depth on everything.

  • @vinayreddy8683
    @vinayreddy8683 4 years ago

    BERT only has 30000 tokens!!!!
    You might be wrong on that.

  • @alimansour951
    @alimansour951 3 years ago

    Thank you very much for this very useful video!