Significance [0:08]
Research Posts [1:27]
BERT Mountain [3:02]
Can We Skip Over LSTMs? [5:14]
BERT Paper [5:50]
BERT Repo [7:15]
BERT Announcement Post [7:40]
Attention is All You Need (Transformer) [8:11]
The Annotated Transformer [8:42]
Jay Alammar's Posts [10:28]
Sequence Models on Coursera [11:23]
Next Up [13:19]
Hi Chris, I read your articles on BERT before and have learned a ton from them. Can't believe you have videos as well. Thanks for sharing the knowledge!
That's what I like to hear :) Thanks, Yikai!
A very no-nonsense way of presenting the work you are doing. It felt like I was sitting and studying with you. Thanks. I am planning to go through the rest of your videos on my journey to learn BERT.
Thanks so much Thalanayar! I'm so glad the videos are helping you on your BERT journey! :D
Your explanation is super clear, and I like the BERT Mountain, which shows what I need to understand first.
Thanks!
OMG, that BERT Mountain picture at the beginning is exactly what I've been conceptualizing!! I love this series of videos! Thanks a lot!
Hi Chris, I like the way you explain things. I like visual explanations and the BERT's mountain was "all I need" :D !, thanks a lot.
Thanks, appreciate the encouragement!
"Token indices sequence length is longer than the specified maximum sequence length for this model (1312 > 512). Running this sequence through the model will result in indexing errors." I get this error message while doing news classification.
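For anyone hitting the same warning: it means the tokenized article is longer than BERT's 512-token limit, so it needs to be truncated (or split into chunks) before going into the model. A minimal sketch with the Hugging Face tokenizer, assuming the stock bert-base-uncased checkpoint and a made-up long article:

```python
from transformers import BertTokenizer

# Hypothetical example: one long news article (repeated text just to exceed 512 tokens).
text = "The market rallied sharply after the announcement. " * 200

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Truncate anything past BERT's 512-token limit so the model never sees out-of-range indices.
encoded = tokenizer(
    text,
    max_length=512,
    truncation=True,        # drop tokens beyond max_length
    padding='max_length',   # pad shorter texts up to max_length
    return_tensors='pt',
)

print(encoded['input_ids'].shape)  # torch.Size([1, 512])
```

Truncation throws away everything after the first 512 tokens; if the tail of the article matters for classification, a chunk-and-aggregate approach may work better.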
Feeling so lucky to find your website, resources, and channel. Thanks a lot!
Thanks for making this video, I am enjoying the series. I would especially like to see hands-on demos as Jupyter notebooks! :)
I like the way you teach. Not many people are teaching NLP, so it's good to have a person like you.
Btw, 1000th subscriber.
Awesome series. I have a basic idea of how the attention mechanism works, but this builds on the concepts.
Hi Chris, thanks so much for the video.
I actually got stuck on the same line in the BERT paper, where it says "we will omit an exhaustive explanation". From there I went down the BERT mountain and finally got to your video, so thanks a lot for picking me up on the way.
Looking forward to the rest of the series!
I understand RNNs, LSTMs, bidirectional LSTMs, and attention, but I still found the BERT paper hard to read and had the exact same feeling of the mountain you drew. This video and the subsequent one are getting me much more confident about BERT; hoping to watch the 3rd video in the morning. Thanks for this contribution, your explanation is very concise.
Glad I'm not the only one! Thanks for your comment :)
Hi,
Please keep going with this (hands-on) series, I'm pretty sure you will help lots of people out there!!
Thanks giavo, I'll keep them coming!
Anything in particular that you'd like to see explained?
Thanks!
@@ChrisMcCormickAI I think BERT research is perfect for now. Soon there's going to be wide application of BERT in the NLP area, and research like this is perfect for anyone who wants to understand all of its aspects (plus general aspects such as word embeddings and what exactly the attention mechanism is...). It would be great to talk about how we can adapt BERT to a certain domain with domain-specific terms...
Also, personally I would like to understand how to use BERT to compute similarity between 2 documents (I've already tried cosine similarity based on TF-IDF, chi-square, and KeyGraph-based keywords, but I'm still not happy with the results).
Thanks again!
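On the document-similarity question: one common approach is to mean-pool BERT's last-layer token embeddings for each document and compare the resulting vectors with cosine similarity. A rough sketch, assuming the plain bert-base-uncased model (not fine-tuned) and two short example documents:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def embed(text):
    # Mean-pool the last hidden layer over the non-padding tokens.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state            # [1, seq_len, 768]
    mask = inputs['attention_mask'].unsqueeze(-1)  # [1, seq_len, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

doc_a = "The central bank raised interest rates again this quarter."
doc_b = "Interest rates were increased by the central bank."

similarity = torch.cosine_similarity(embed(doc_a), embed(doc_b))
print(similarity.item())
```

Results from the raw pre-trained model can be mediocre; models trained specifically for sentence/document embeddings tend to give better similarity scores.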
Hahaha, I burst into laughter at 3:47. Chris, you are exactly right: I started researching BERT and then just kept bouncing from topic to topic (as a beginner to deep NNs).
Nice :) Glad I'm not the only one!
Excellent work... it's very informative, especially the prerequisite domain knowledge area. Waiting to see more from you!
Thank you a lot.
You are making it easier for us to understand hard topics.
Hi Chris,
Firstly, thanks a lot for writing the most comprehensive blog post; it's extremely helpful. I have been following it to understand BERT more closely.
Secondly, besides creating word and sentence vectors by using different pooling strategies and layers, could you please extend the blog post by showing how to compute the word attentions and their respective positions?
Thanks!
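On the attention question above: one way to inspect per-token attention weights (just a sketch, not something from the blog post itself) is to load the model with output_attentions=True and look at the returned attention tensors, which are indexed by layer, head, and token position:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
model.eval()

inputs = tokenizer("The bank raised interest rates.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per layer),
# each shaped [batch, num_heads, seq_len, seq_len].
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
last_layer = outputs.attentions[-1][0]   # [num_heads, seq_len, seq_len]
avg_heads = last_layer.mean(dim=0)       # average attention across heads

# For each token position, show which token it attends to most.
for i, token in enumerate(tokens):
    j = int(avg_heads[i].argmax())
    print(f"{token:>12s} attends most to {tokens[j]}")
```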
Great video, thanks for this looking forward to the rest of the series!
Thanks, Luke!
You, sir, are a legend in your own right! Keep up all this work you are doing! At some point it would be helpful if you could put together a guide to effective science writing like you do! :)
Thanks a lot for your videos. It's almost the end of 2020 and still there are no books on Amazon about BERT!
Wonderful Explanation Chris!!!!
What is the difference between attention and self-attention?
Hi Chris, absolutely amazing series on Transformers. I have a question regarding how transformers handle variable-length inputs. Suppose I set the max_length for my sequences to 32 and feed the input_ids and attention_mask for only 32 tokens during training, where some of those tokens may be padding tokens since each sequence won't be exactly 32 tokens long. Now, for BERT the default max_length is 512 tokens, so my question is: does the transformer implicitly add 512-32 padding tokens and calculate MHA over 512 tokens, even though it won't attend to tokens with the padding token ID? If that's the case, then are we not updating the parameters directly attached to the remaining 512-32 positional vectors?
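For anyone else wondering about this: the model does not implicitly pad out to 512. Self-attention runs only over the tokens you actually pass in, and position embeddings beyond your chosen max_length simply go unused (and therefore receive no gradient). A quick sketch illustrating the shapes, assuming bert-base-uncased:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Pad/truncate to 32 tokens, not 512.
inputs = tokenizer(
    ["a short sentence", "another, slightly longer example sentence"],
    padding='max_length', max_length=32, truncation=True, return_tensors='pt',
)
print(inputs['input_ids'].shape)           # torch.Size([2, 32])

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)     # torch.Size([2, 32, 768]) -- attention runs over 32 positions only

# The position-embedding table still has 512 rows, but only rows 0-31 are looked up here.
print(model.embeddings.position_embeddings.weight.shape)  # torch.Size([512, 768])
```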
10:16 "all these sound very discouraging" - says Chris :)
Cool Video Chris!
Thanks!
Hi Chris, thanks for the wonderful video. I would like to know whether the topics covered in the ebook are different from the videos or not. Thank you.
Hi Chris, I am back here in your first video again after a year. I guess this time I'll be able to follow you better.
So, BERT is a model different from other language models like word2vec or GloVe, right?
Do you have a playlist for all the episodes regarding BERT? It'd be really organized and helpful.
Really great content! Does anyone know how I can contact Chris? I need to ask permission to use and quote some of his work.
great content Chris!
Thanks George!
I would appreciate it if you could cover other models as well; these tutorials are good for a noob to start with.
What a great video! You earned a new subscriber.
Thanks! Glad to have you :D
Hi Chris, thanks for the post. Feeling lucky I've found your videos. Currently, I'm going through what you've been through basically. Can't wait to watch the whole series. Have you tried Google's Natural Questions challenge yet? Thanks again.
Hi Chris, this is a really amazing explanation.
Can you please help me with how to use this BERT model with LIME to explain the model?
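Not Chris, but the usual pattern with LIME is to wrap the fine-tuned classifier in a function that takes a list of raw strings and returns class probabilities, then hand that function to LimeTextExplainer. A rough sketch, assuming a hypothetical 2-class sentiment model (swap in your own fine-tuned checkpoint; the base model below isn't fine-tuned, so its outputs are essentially random):

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical 2-class checkpoint; replace with your fine-tuned model path.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval()

def predict_proba(texts):
    # LIME passes a list of perturbed strings; return an (n_samples, n_classes) array.
    inputs = tokenizer(list(texts), padding=True, truncation=True,
                       max_length=128, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=['negative', 'positive'])
exp = explainer.explain_instance("This movie was surprisingly good!",
                                 predict_proba, num_features=6)
print(exp.as_list())  # (word, weight) pairs showing each word's contribution
```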
Hey, the link to your blog page is throwing a 404: Page Not Found error. Could you please help me with the problem?
I gave you the 800th like, good work
Cool, thanks Aziz!
Nice Job, Thanks a lot for sharing
Thanks, King!
Hi Chris,
Great video!
Do you have a Medium/Twitter channel where we can follow your latest work in data science?
What an amazing effort. Super.
Sir, I'm new to this field. My research topic is automatically evaluating essay answers using BERT. What should I learn in advance so that I pick up only the main points related to my research, and don't get distracted by too much information? Could you also please give me your email? I'd like to consult you.
And thank you.
Hello Chris, thanks for this beautiful series. You described the training tasks as fake/bogus tasks. I prefer to name them proxy tasks - as in proxy war, but for good purposes. :)
What do you think?
Could I use a BERT model on a language like Arabic?
Yes. Check out Multilingual BERT.
Thank you!
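Following up on the multilingual suggestion above, a tiny sketch loading the multilingual checkpoint (assuming it fits your task; there are also Arabic-specific models such as AraBERT):

```python
from transformers import BertTokenizer, BertModel

# Multilingual BERT covers Arabic among ~100 languages.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

tokens = tokenizer.tokenize("مرحبا بالعالم")  # "Hello, world" in Arabic
print(tokens)
```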
Thank you sir!