Chris, I am extremely impressed. You are doing a great job explaining what's going on and breaking things down so things make sense. Please keep up the good work.
That's good to hear, thanks Laurin!
Chris, you will have a special place in heaven, we all are so blessed to have you.
Hi Chris,
I am an enthusiast who started learning deep learning and related areas some time back. I came across your videos on YouTube while searching for something related to BERT. You explain things in a really simple way, and you are doing a great job. Thank you very much for helping me and other enthusiastic people like me.
What a nice explanation! As I am an NLP researcher, this is a wonderful lecture even for the people who are familiar with BERT.
This channel does not have enough visibility, this is top notch content and almost feels like we are doing a group activity.
Every subscriber over here is worth 100x more than any popular media subscriber.
I like how additional pieces of information are thrown around - I am super new to this, and I hadn't known what Fasttext was! Thank you so much!
Finally! Someone explains how the embeddings work. For some reason, I just never saw that explained. I figured it out from other explanations of how the embeddings are combined, but I still wondered if I understood correctly. Now I know! Thanks!!
Thank you very much for your effort to bring light to this field. I spent weeks digging on the web looking for good BERT/Transformers explanations. My feeling is that there are a lot of videos/tutorials made by people who don't actually understand what they are trying to explain. After finding you and buying your material I started to understand BERT and Transformers. MANY THANKS.
Looking at the number of likes, it seems not many learners look hard for quality stuff. This whole series not only teaches the topic but also gives a good idea of how to do investigative research and build knowledge that way. Thank you for such amazing work.
There is nothing better than this series for such an important topic as BERT. It is amazing.
Chris, thanks and please keep going.
I think the reason words are broken into chunks with the 1st chunk not having the ## prefix is to be able to decode BERT's output. Say we are doing a Q&A-like task: if every token the model gave out were a ##subword token, we could not uniquely decode the output. If only the 2nd chunk onwards has the ## prefix, you can uniquely decode BERT's output. Great video. Thank you!
Thanks Vinod, that’s a good point! By differentiating, you have (mostly) what you need to reconstruct the original sentence. I know that SentencePiece took this a step deeper by representing white space.
As for Q&A, BERT on its own isn’t able to generate text sequences at its output; it's the same tokens on the input and output. But maybe you know of a way to add something on top of BERT to generate text? Thanks!
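The decoding point above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the convention described in this thread (a token starting with `##` continues the previous word, anything else starts a new word); the function name `merge_wordpieces` is made up for this example and is not part of any library.

```python
def merge_wordpieces(tokens):
    """Rebuild whole words from a list of WordPiece-style tokens.

    Assumption: a token starting with '##' continues the previous word,
    and any other token starts a new word.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and glue it on.
            words[-1] += token[2:]
        else:
            # Word-initial piece (or a stray '##' piece with nothing before it).
            words.append(token)
    return words


tokens = ["the", "em", "##bed", "##ding", "layer"]
print(merge_wordpieces(tokens))  # ['the', 'embedding', 'layer']
```

Because only continuation pieces carry the marker, word boundaries can be recovered without any extra information. Casing and the original whitespace are not recovered by this sketch.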
Real Gem of work.
This tutorial really flabergasted me :)
Thanks Sahib :)
*flabbergasted
Excellent... very good... Chris, I wish I had found this channel sooner.
I hope you reach 10 million subs. Great Instructor. God bless you
Very informative and crystal clear. I find you unique and confident in what you teach. I've been surfing the web and watching YouTube videos on NLP for about a week now, and bluntly put, I did not understand jack shit with what's already there. Someone who explains the theory doesn't teach you the code, while someone who puts out the code doesn't explain the theory, and as a result I cannot understand what is really going on. I really, really love your videos and hope that you keep on making them.
Loved reading this--thanks so much, Shardul! 😁
Explained very simply. It feels we are with you in your learning journey. And things explained very simply. Thanks
Really amazing work!! I am so glad I found this. You would make a great teacher for sure! Thank you for all your work.
Thank you! It's a lot of work to put this stuff together, so I really appreciate your encouragement :)
thanks, Chris for great explanation.
Glad it was helpful!
Awesome videos, helpful to understand the applications properly. Keep up the excellent work Chris
Chris, the excellent quality of this tutorial "flabbergasts" me (just kidding, all your tutorials are excellent)
Ha! Thank you :)
Awesome video series, covering a lot of areas and analogies. I am also researching about BERT and your videos provided a nice picture of BERT.
Clear explanation. Rewatched many times.
Thank you so much for sharing! This is incredibly useful for someone searching for tutorials to learn on BERT.
Thank you so much! These videos really save my life
Awesome, good to hear!
Hey Chris, This video helped me understand the concepts with so much ease. You are doing an amazing work for people new to advanced NLP. Thank you so much :)
This is amazing. Thank you for your understandable explanation.
Thanks a lot for your lessons, Chris. Hello from Ukraine)
You make really great videos for beginners. Thank you very much!
Thanks, Antonio!
Thank you... just what I needed to start my journey in BERT
Thank you for your great explanation and exploration! That helps me a lot to understand the tokenizer of BERT! Many thanks!
Chris, you are awesome sir, you make it very simple and easy to understand 💯✔✔💯
Amazing Professor! A very impressive way to teach.
Very well done. A lot of effort went into analyzing the vocab.txt.
Thanks Mahadevan!
This is amazing Chris. A huge thank you!
Chris, you are doing really, really great work. Big fan! Will definitely look for more awesome content.
Thanks Sarthak, appreciate the encouragement!
Chris, Very informative video especially for beginners. Thanks a lot
Great! Thanks for the encouragement!
A very kroxldyphivc video!
Keep it up with the good work!
Love love love your videos. Thank you !!
Hey Chris! Great content! :D
Thanks Joshua! Have any plans to use BERT?
Just awesome, simply too good!!
Thank you so much for this great content Chris !
My pleasure!
thanks for such great content Chris!
super informative! please continue the good work
Thanks Chris. That really helped.
great series! Keep it up!
Thanks Vivek!
Thank you for the video; very informative.
Hey, this was an excellent video. You said at the beginning that we have a dictionary of token indices and embeddings. How are these embeddings obtained during pre-training? Are they initialized with some context-independent embedding like GloVe, or do we get the embeddings by having a weight matrix of size (30,000 x 768), where each input is represented as a one-hot vector, and then using those trained parameters as the embeddings?
Hi Rajesh, thanks for your comment.
When the Google team trained BERT on BooksCorpus and English Wikipedia (the "pre-training" step), I have to assume that they started with randomly initialized embeddings. Because they used a WordPiece model, with a vocabulary selected on the statistics of their training set, I don't think they could have used other pre-trained embeddings even if they wanted to. So the embeddings would have been randomly initialized, and then learned as part of training the whole BERT model. Does that answer your question?
Thanks,
Chris
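The two views in the question above (a one-hot vector multiplied by a weight matrix versus a lookup table) give the same result, which a small sketch can show. The sizes here are toy values chosen for readability, and the random initialization is only an assumption standing in for whatever scheme was actually used; all names are hypothetical.

```python
import random

VOCAB_SIZE = 6   # toy value; the question above mentions roughly 30,000
HIDDEN_SIZE = 4  # toy value; the question above mentions 768

random.seed(0)
# Randomly initialised embedding table: one row per token id.
# (Assumption: small Gaussian noise, purely for illustration.)
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
    for _ in range(VOCAB_SIZE)
]


def one_hot(token_id, size):
    """Vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


def one_hot_times_table(token_id, table):
    """Multiply a one-hot row vector by the (vocab x hidden) matrix."""
    vec = one_hot(token_id, len(table))
    return [
        sum(vec[row] * table[row][col] for row in range(len(table)))
        for col in range(len(table[0]))
    ]


def lookup(token_id, table):
    """Pick out row token_id directly."""
    return table[token_id]


token_id = 3
via_matmul = one_hot_times_table(token_id, embedding_table)
via_lookup = lookup(token_id, embedding_table)
print(all(abs(a - b) < 1e-12 for a, b in zip(via_matmul, via_lookup)))  # True
```

The lookup is just the cheap way of doing the multiplication. During training, the rows of the table are parameters like any other weights and get updated by backpropagation.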
Thanks much for this great video! How funny the name list does not have your last name - you are a uniquely great e-educator! You mentioned you wonder why "##" is not added to the first subword. I was looking around the web pages, I see others' work where the first subword of a word is labeled, but the rest "##-word" ones do not have labels (or only default labels). Then they can, if they want, give the same label to the ##-words at least, especially for labeling the subwords of someone's last name;)
Thanks Bree! When you mention "labelling" subwords, are you referring to marking them with the '##' (e.g., 'em' is "labeled" as '##em'), or are you referring to something else, like classification labels for something like Named Entity Recognition?
@@ChrisMcCormickAI Yes, I was referring to the latter, the classification tasks like labeling identities. I thought that not marking the first subword with ## seems useful for this. So even if a one-word last name gets wrongly split into subwords, the labels remain roughly correct.
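The labeling scheme described above (the first subword of a word carries the label, the remaining `##` pieces get a default label) might look like the following sketch. The tokenizer here is a hard-coded toy, the split shown for the surname is invented for illustration, and the `"X"` placeholder label is an assumption, not a standard.

```python
def align_labels(words, word_labels, tokenize, default_label="X"):
    """Spread word-level labels over subword tokens.

    The first piece of each word keeps the word's label; every
    continuation piece gets default_label.
    """
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        for i, piece in enumerate(tokenize(word)):
            tokens.append(piece)
            labels.append(label if i == 0 else default_label)
    return tokens, labels


# Hypothetical toy tokenizer: hard-coded splits, for illustration only.
TOY_SPLITS = {"mccormick": ["mc", "##cor", "##mick"]}


def toy_tokenize(word):
    return TOY_SPLITS.get(word, [word])


words = ["chris", "mccormick", "teaches"]
word_labels = ["B-PER", "I-PER", "O"]
tokens, labels = align_labels(words, word_labels, toy_tokenize)
print(tokens)  # ['chris', 'mc', '##cor', '##mick', 'teaches']
print(labels)  # ['B-PER', 'I-PER', 'X', 'X', 'O']
```

Passing `default_label=label`-style behaviour instead (copying the word's label to every piece) is the other option mentioned above; which one works better depends on the task.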
Hi Chris
Great video. I have a question.
At 12:10 you say "The tokens at the beginning of the word can be redundant with the token for whole word."
So in the bedding example (bed, ##ding): does it mean the tokens for bed and bedding would be similar?
I see there is a token for ##ding in the vocab, so the token for bedding must be a combination of "token for bed" + "token for ##ding"?
I am glad that someone has raised this question; I have the same doubt.
@ChrisMcCormickAI please let us know asap, thank you :)
Thank you Chris, you are great
Hello Chris. I'm wondering how to get the hidden states of a word split by the tokenizer. For example, the word 'embedding' is split into: em ##bed ##ding. Each token would have a different hidden state vector, right? Do you know a way of combining these hidden states into a single one that represents the original whole word, in order to do some manipulations? My objective is to compare similarity between words like 'embeddings' and 'embedded'. Thanks for your attention. Congrats on the EXCELLENT content of your YouTube channel. Sorry for my poor English skills.
Really great job!! I have one doubt: why do we still have [UNK] and [unused] tokens when subword tokenization is used to handle unknown words?
Good question! [unused] tokens are deliberately left as empty slots so that, if you need to, after training you can initialize a custom word using one of these open slots. [UNK] is for characters or subwords that appeared so infrequently that they did not make it into the vocabulary. The vocabulary is limited to a certain number of words/subwords/characters, and if you train on a large corpus of internet text you are likely to run into some very rare symbols/emojis/unicode characters that aren't used frequently enough to merit a slot in the vocabulary and can't be decomposed into any of the other more common subwords existing in the vocabulary.
Nice video! One question: why does BERT have Hindi and Chinese characters when words in those languages are not in the vocabulary? What are some applications/uses of having them in the vocabulary?
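To make the [UNK] fallback above concrete, here is a minimal sketch of greedy longest-match-first subword splitting, which is how WordPiece tokenization at inference time is commonly described. The vocabulary is a tiny made-up set, the function name is hypothetical, and a real tokenizer does more (lowercasing, punctuation splitting, length limits).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces after the first are looked up with a '##' prefix. If some
    position cannot be matched at all, the whole word becomes unk_token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing in the vocabulary covers this position.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, for illustration only.
TOY_VOCAB = {"em", "bed", "##bed", "##ding", "##s"}

print(wordpiece_tokenize("embedding", TOY_VOCAB))  # ['em', '##bed', '##ding']
print(wordpiece_tokenize("bed", TOY_VOCAB))        # ['bed']
print(wordpiece_tokenize("☃", TOY_VOCAB))          # ['[UNK]']
```

The snowman character has no entry and no smaller pieces to fall back on, so it ends up as [UNK], which is the situation described in the reply above.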
You are amazing. Thank you so much!
Haha, thank you!
Very good video. For your further research on names, you could check against the Social Security Death Index file: every first and last name registered in the USA social security database from the start of SS up to the year 2013, several million unique names, broken into first and last.
Hi Chris,
thanks so much for the input and insights on how BERT embeddings work. I think there is a small mistake though, as the values for the special tokens are one too high. The [PAD] token in BERT has index 0, the [UNK] token 100, and so forth, since Python starts enumerating at 0 and not at 1, which gets lost when saving the tokens with their embeddings to a txt file. At least these are the values that I get when I try out tokenizer.vocab["[PAD]"] etc.
Thanks for pointing that out, Maike! In the Colab Notebook I mention that the indices I'm reporting are 1-indexed rather than 0-indexed. I did this because I was referring to the line numbers in the "vocabulary.txt" file. Probably would have been better to keep things 0-indexed like you're saying, though.
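The off-by-one discussed above comes down to counting lines from 1 versus counting list positions from 0. A small sketch with a made-up five-line vocabulary (not the real file, where the positions differ) shows the relationship:

```python
# Toy stand-in for the lines of a vocabulary file, in file order.
vocab_lines = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "the"]

# Token ids: position in the file, counting from 0.
token_to_id = {token: idx for idx, token in enumerate(vocab_lines)}

# Line numbers as a text editor shows them, counting from 1.
token_to_line = {token: idx for idx, token in enumerate(vocab_lines, start=1)}

for token in vocab_lines:
    print(token, "id:", token_to_id[token], "line:", token_to_line[token])
```

For every token the line number is the id plus one, so either convention works as long as it is stated which one is in use.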
Like word2vec, how does BERT make an embedding for every word? Waiting for a detailed step-by-step. If I pass 512 as the sequence length, what will the dimensions be?
I think they do it because by adding ## they can reconstruct the original tokens back.
This is one of the most excellent tutorials on BERT. I have a question: in the word embedding lookup table, what would, say, 5 of the features look like for "I"?
Great content. Normally what distance measure do you use for similarity?
Hi Chris. Thank you for this video. I notice lots of single words from different languages (Chinese, Tamil, Japanese, Bengali). Is BERT base uncased trained on a multilingual corpus?
As I understand it, BERT has only 30,000 words with 768 features each... but BERT represents each word with a specific contextualized vector (i.e., depending on the context of the word)? Could someone explain that to me?
Thanks a lot for the great explanation
You are welcome!
Thank you for this awesome explanation! :) I just have some trouble wrapping my head around why e.g. splitting "embedding" into "em", "bed" and "ding" would make any sense at all. Yes, the subtokens might be in the vocabulary, but "bed", "em" and "ding" have little to do with the meaning of the word "embedding" and should therefore be quite far from "embedding" in the vector space, or am I missing something?
Good question! Note that "embedding" gets split into "em" "##bed" and "##ding", so "bed" is a totally different token than "##bed": the first is one you sleep on and the second is "the middle or end subtoken of some larger word." That helps the problem somewhat. Also note that the embeddings are learned in the context of the sentence, so while river "bank" and financial "bank" have the same input embedding, this embedding "understands" what the meaning should be in different contexts. But the short answer is simply that these embeddings are trained for a very long time on a lot of data. It is unintuitive that they should work in different contexts, and that when combined they should form a word that has a distinct meaning from the individual parts, but it's mostly a result of the sheer amount of pretraining. Hope that helps.
@@nickryan368 Ahh I see! That makes a whole lot more sense now! I missed the fact that ##bed and "bed" are in fact different tokens. I guess the training process differentiates the two even further (by learning the context in which they stand). Thank you for taking the time to write a detailed response! :)
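The point in this exchange, that one fixed vocabulary entry can still end up with different vectors depending on the sentence, can be illustrated with a deliberately crude toy. To be clear about the assumptions: BERT mixes context with stacked self-attention layers, not with the simple averaging used below, and the 2-dimensional vectors are invented. The sketch only shows the shape of the idea.

```python
# Invented static (context-independent) vectors, one per token.
STATIC = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "loan": [0.0, 1.0],
}


def static_vectors(tokens):
    """Plain table lookup: the same token always gets the same vector."""
    return [STATIC[t] for t in tokens]


def toy_contextualize(tokens):
    """Crude stand-in for a contextual model (NOT what BERT does):
    blend each token's static vector with the sentence mean."""
    vecs = static_vectors(tokens)
    dim = len(vecs[0])
    mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vecs]


bank_by_river = toy_contextualize(["river", "bank"])[1]
bank_with_loan = toy_contextualize(["bank", "loan"])[0]

print(static_vectors(["river", "bank"])[1] == static_vectors(["bank", "loan"])[0])  # True
print(bank_by_river)   # [0.625, 0.375]
print(bank_with_loan)  # [0.375, 0.625]
```

The table holds one row for "bank", but the vector that comes out after mixing in the neighbours differs between the two sentences. That is the sense in which a fixed-size vocabulary and context-dependent word vectors are compatible.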
Hi, where does the initial lookup table exist? Or how does the whole lookup process work?
Great video. Thanks a lot!
Thanks for the encouragement!
Hi Chris, do you know where I can find "pseudocode" of how BERT works? And also for word2vec? For word2vec I found your repo where you commented the original C code, but I feel overwhelmed going through 1000 lines of code :)
Thanks u made my day :)
Hi ChrisMcCormick! This video gave me an easy start, thanks for it, but I really want to know how to get the feature vector for any single word. This is not explained in the video.
Hi Vivek, I just responded to another of your comments with this, but for anyone else reading--Nick and I wrote a blog post / Colab Notebook showing how to do this here: mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
@@ChrisMcCormickAI Ok Chris, thank you so much. By the way, do you have any ideas about fine-tuning the word embeddings with the help of a document clustering task? I ask because I have a set of unlabelled documents.
Chris, I need slides for your lectures in ppt or pdf format, to help with developing some lectures on BERT. Is it possible to share these?
Thanks Chris for creating this video. Is it possible to add additional vocabulary to pretrained BERT embeddings? A lot of the words which I wanted to use are not available in BERT embeddings.
Hi Shyam, it's not possible to modify the BERT vocabulary. However, the BERT tokenizer will break down any word that it doesn't have in its vocabulary into a set of subwords (and individual characters, if necessary!).
Something I am interested to learn more about is how big of a problem this is for datasets with a lot of "jargon" (application-specific language). When you fine-tune BERT on your task, the subword embedding vectors may be adjusted to better match your task. But in general, I *suspect* that BERT will not perform well on applications with lots of jargon.
What's your dataset? I'd be curious to hear how it goes for you applying BERT!
I really hope you didn't actually mention this, as I watched (and enjoyed! ;) ) the full video two weeks ago, but I don't remember you mentioning it... I am a bit confused as to how to then get embeddings for the actual words instead of WordPieces. I am using different kinds of embeddings for topic modeling, and for topic exploration (nearest neighbours, ...) I want to use actual words and not "em", "bed", "ding" instead of embedding. :) Does applying BERT not make sense in such an unsupervised task, and should I rather stick to ELMo, which gives me an embedding per word? Maybe you could give me some feedback :)
Yeah, I never thought about that! I would probably try just averaging the embeddings for the different word pieces together to create a single embedding.
I'm curious about your application--I've used word embeddings from word2vec to find "similar terms", but this isn't possible with BERT embeddings, since they are context-dependent (there's not a single static vocab of embeddings, as there is in word2vec). How will you make use of these "contextualized" embeddings?
Thanks,
Chris
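The averaging suggestion above can be sketched as follows, together with cosine similarity as one common way to compare the resulting word vectors. The 3-dimensional vectors are made up and merely stand in for per-token hidden states; the subword splits in the comments are assumed for illustration, and the function names are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the hidden states of each word piece.
embedding_pieces = [[0.2, 0.1, 0.4], [0.6, 0.3, 0.0], [0.1, 0.5, 0.2]]  # e.g. em ##bed ##ding
embedded_pieces = [[0.3, 0.1, 0.3], [0.5, 0.4, 0.1], [0.2, 0.2, 0.2]]   # assumed split

embedding_vec = mean_pool(embedding_pieces)
embedded_vec = mean_pool(embedded_pieces)
print(cosine_similarity(embedding_vec, embedded_vec))
```

Averaging is only one pooling choice; taking just the first piece's vector, or summing, are alternatives that would be worth comparing on the actual task. Since the hidden states depend on the surrounding sentence, the same word will get a somewhat different pooled vector in each context.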
Great tutorial.
Great job!! I have one doubt: why are we using a neural network? BERT itself can give output, right?
Thanks, Anagha! Can you clarify your question for me--which neural network are you referring to? Thanks!
Great Job!
1) It is interesting that em-bed-ding was split this way. These wordpieces have nothing to do with the meaning of the whole word. It seems that careless splitting only degrades the results. 2) How are different senses handled in this dictionary? For example, "like" as a verb and as a conjunction. What about "like" the noun and "like" the verb? Do they correspond to different entries in the dictionary?
Thanks Chris!
You bet!
Hi... does BERT work with the Arabic language?
I loooove it
Where can I find the vocabulary.txt file?
Hi yash, you can generate it by running the `Inspect BERT Vocabulary` notebook here:
colab.research.google.com/drive/1MXg4G-twzdDGqrjGVymI86dJdm3kXADA
But since you asked, I also uploaded the file and hosted it here:
drive.google.com/open?id=12jxEvIxAmLXsskVzVhsC49sLAgZi-h8Q
JUST WOW WOW WOW
I feel like the BERT vocabulary is not really well built in the end. It feels like it should contain only single-digit numbers. I also feel like nouns might take up too much space in this vocab. To modify the vocab I found this paper: www.aclweb.org/anthology/2020.findings-emnlp.129.pdf. It seems very interesting, but I have not tested their method yet.
Amazing
Thanks Ilan :D
thanks :)
BERT only has 30000 tokens!!!!
You might be wrong on that.
Oops! You took more than 2 min for word emb 😆😜
:D It's true, I have a hard time not going into depth on everything.
Thank you very much for this very useful video!