I always wanted a teacher who could explain the hard concepts simply and in detail. I have found that kind of teacher in you. Thank you, sir, for this awesome and detailed video.
I wish I could like 👍 his videos unlimited times!!!!! This is a masterpiece series on DL and Language Models... Still the likes are less than 1000... You deserve a lot Nitish... #CampusX
It's truly amazing to experience how beautifully you make us understand the concepts and give us a crystal-clear intuition. Hats off to your level of patience and calmness. Those soothing explanations...🤌😲
My friend got an interview question on how the vanishing and exploding gradient problems (VGP/EGP) are solved in transformers... This is the 3rd time I'm watching this playlist. It is the one-stop solution for DL. Love you, Nitish Bhai. Love from Maharashtra
Great explanation. Many coaching-institute teachers are copying your content and teaching it to students. They don't have their own content, and their concepts aren't clear. I don't think anyone else can explain these concepts in such an easy manner. You are a great teacher!
I have seen many tutorials and explanations of transformers and their architecture. I have never seen a detailed explanation like this in such a crisp and precise way. Thanks, Nitish
In our NIT, everyone is following your channel for Data Science content, even the professors.
Thanks, sir🖤
which nit are you from btw
In future This playlist will be the most viewed playlist for deep learning
Yes
INDEED
yes
Easily, waiting for that time
it is criminal to have this level of explanation for free. 😍
CRIME*
This was the approach to solving the high-variance problem, not some timepass explanation, so it's good that you taught it; something like this will come in handy somewhere else too . . . 😃
Thank you, Sir. You are a great teacher, my favourite and the best.
Sir, you're an absolute gem. Incredible!! For this kind of intuition behind scaling, hats off to you. Because of this topic, most of my variance- and softmax-related concepts got cleared up. Thank you so much, sir. I always recommend that everyone watch your playlists for Data Science/ML concepts.
No, Sir, this is the greatest explanation ever. The content you are providing — I think no one else is providing this type of content with the same effort and the same energy. And the best part is that your explanation makes the topic simple and easy.
Amazing explanation, Bhaiya. Hats off to you.
Please tell us the source of this knowledge too. Where did you get all of it? Which books cover it in this much detail? Please, Bhaiya, it would help us a lot.
Nice. I knew that we divide by root d to handle the softmax, but I didn't know the reasoning behind why it's root d.
brilliant!
Over these 20 days, I would refresh at least 5-6 times a day to check whether Sir had uploaded the video.
Hi Smriti, are you learning DL?
Sir, I want to buy the DSMP 2.0 course and I'm from Bangladesh. PayPal isn't available in our country. Can I pay by card? Or is there any other method? Please reply; I am very interested in buying the course, but I don't know how to make the payment from Bangladesh.
No one else can explain this concept this way. And if anyone does... he/she follows you. Please don't shorten content. We need this level.
Nitish!!! Truly awesome, outstanding, remarkable! It's rare to find a gem like you who not only illuminates the intricate world of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) but also makes it accessible to novices and professionals alike. Your explanation of transformers with self-attention mechanisms is a standout; it's a concept that lies at the heart of many modern AI breakthroughs. This video is an out-of-this-world explanation of why to divide QKᵀ by the square root of d_k; no one else explains at this level of detail. Truly, truly hats off, outstanding beyond expectation. Keep going, Nitish, and keep this momentum!!
This guy... is going to put the coaching institutes out of business....... for giving away this heavenly content.....💯
Sir, one small request: if possible, please increase the frequency of video uploads. I have been waiting for your lecture video for the last 20 days.
Hi sir, first of all, thank you for all your videos and teaching. Sir, I request you to please upload one detailed video on YOLOv8: its file structure and how to create our own files in this architecture.
By far the best explanation anywhere. I can't believe how great you are as a teacher, teaching things from such a fundamental level with this astonishing clarity is god gifted. And Nitish sir, you are a god's gift to people like us. I am utterly in awe of your acumen and more so your teaching skills. Because not every great mind is a good teacher, you are a great mind AND a great teacher. Thank you for everything 🙏
Thank you, Bhaiya, you explained the concept really well.
You are like a research book, covering every single detail with a lot of patience.
I always end up finishing your videos.
I feel like each of your videos is a combination of many books.
Thank you so much for sharing your knowledge, sir.
If you can, please upload the transformer lectures this week; it will be helpful.
Sir, I was working on the DETR transformer and was unable to visualize how attention works. Your playlist was like a light in this darkness. Thanks a lot for such awesome teaching, sir. Please share your approach to reading research papers.
Absolutely brilliant, sir. Your lectures are the only ones I enjoy studying in detail along with the maths.... Please keep making detailed videos like this on every topic you feel should be taught; the future's millions of AI engineers now depend on you.
Love from 🇵🇰
At 30 min, he said gradients would be small for smaller values and larger for large values. Here's where I'm confused: the softmax gradient formula is p_i*(1 - p_i). I'm writing the short form, but this holds for the general class. Now, if something reaches 0.99, then (1 - 0.99) is small; and when something reaches 1%, then (1 - 0.01) is big, but p_i = 0.01 itself is small. So in a nutshell, both gradients would be small, right?
Can someone explain clearly why he said that?
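The arithmetic in the question checks out: p_i*(1 - p_i) is tiny both when p_i ≈ 1 and when p_i ≈ 0 (in the second case p_i itself is the small factor), which is exactly the softmax saturation the video describes. A quick numerical check with plain NumPy; the logit values here are made up purely for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# High-variance scores: after softmax, one entry dominates.
z = np.array([10.0, 1.0, 0.5])
p = softmax(z)

# Diagonal of the softmax Jacobian: d p_i / d z_i = p_i * (1 - p_i).
grad_diag = p * (1 - p)

print(p)          # one entry near 1, the rest near 0
print(grad_diag)  # every entry is tiny: saturation at BOTH ends
```

So the commenter is right that each local gradient shrinks; the video's point is that high-variance scores push the whole softmax output into this saturated regime, where every p_i*(1 - p_i) term collapses and very little gradient flows back.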
As I watch your videos, blessings just keep coming from the heart. Sir, I never comment, but I can't hold back on your videos.
this is hardwork
Respected sir, I can't express how much I appreciate you and your hard work. You explain each and every thing very thoroughly. You are a gem; accept love and respect from Pakistan. Sir, I'm a below-average student and I learn a lot from you; it's as if you put everything straight into my mind.
🎯 Key points for quick navigation:
00:00 *🎥 Introduction and Overview*
- Introduction to the video series and continuation of the self-attention concept.
- Emphasis on explaining the scaling concept in self-attention.
- Mention of the conceptual depth and importance of understanding this concept.
01:05 *🧠 Recap of Previous Video*
- Summary of the previous video on creating self-attention from first principles.
- Explanation of generating embeddings for words and creating query, key, and value matrices.
- Description of the dot product operations to obtain query, key, and value vectors.
03:18 *🔍 Applying Self-Attention*
- Steps to apply self-attention using query, key, and value matrices.
- Detailed process of dot product operations and applying softmax.
- Final calculation of the contextual embedding.
04:46 *📊 Mathematical Formulation*
- Compact mathematical representation of the self-attention process.
- Explanation of transforming key matrices and applying softmax.
- Final formula summarizing the attention calculation.
05:01 *📝 Comparison with Original Paper*
- Discussion of the formulation developed versus the original "Attention is All You Need" paper.
- Highlighting the difference in scaling the operation with the square root of the key dimension (Dk).
- Introduction to the concept of scaled dot product attention and its importance.
06:49 *🔄 Need for Scaling in Attention*
- Explanation of the need to scale in attention to avoid unstable gradients.
- Introduction to the concept of Dk (dimension of the key vector).
09:55 *📐 Dimension Calculation*
- Detailed explanation of calculating the dimension of key vectors.
- Example scenarios to simplify understanding: dimensions could be 3, 10, or 512.
- How embedding dimensions and matrix shapes affect the resulting dimensions.
11:06 *🧮 Dot Product Nature*
- Explanation of why scaling by the square root of Dk is necessary, linked to the nature of the dot product.
- Discussion on how the dot product operates between multiple vectors behind the scenes.
- How matrix dot products consist of multiple vector dot products.
13:45 *📊 Variance in Dot Product*
- Explanation of the variance in dot products and its dependence on vector dimensions.
- Calculation of mean and variance with multiple dot products.
- Low-dimensional vectors produce low variance; high-dimensional vectors produce high variance.
16:06 *🧮 Practical Examples*
- Comparison of variance in low-dimensional vs. high-dimensional vectors.
- Example of 2D and 3D vectors demonstrating variance differences.
- High-dimensional vectors show greater variance, leading to potential issues.
18:23 *🔬 Experimental Proof*
- Experiment demonstrating variance in dot products with varying vector dimensions.
- Histogram plots showing variance spread for different dimensional vectors.
- Higher dimension vectors result in larger variance, illustrating the scaling necessity.
22:00 *📈 High Variance Problem*
- Explaining why high variance in dot product calculations is problematic.
- High variance leads to significant differences in softmax outputs, creating large probability gaps.
- Larger numbers get much higher probabilities, while smaller ones get very low probabilities, affecting training focus.
25:04 *🧮 Training Issues*
- High variance affects backpropagation in neural networks.
- Training focuses on correcting larger numbers, ignoring smaller numbers, leading to vanishing gradient problems.
- Small gradients mean parameters do not update, hampering the training process.
26:15 *🏫 Classroom Analogy*
- Analogy of a classroom with students of varying heights to explain training issues.
- Taller students get more attention from the teacher, similar to larger numbers in training.
- A class with similar height students leads to better overall learning, just like balanced variance leads to better training.
28:18 *🔢 Reducing Variance*
- Discussing the importance of reducing variance in high-dimensional vectors for better training.
- High variance in vectors leads to extreme probabilities in softmax, causing focus on large values and ignoring small ones.
- The goal is to reduce variance so the training process distributes focus evenly.
30:21 *📏 Scaling for Variance Reduction*
- Describing the technique of scaling to reduce variance in matrices.
- Scaling the numbers in a matrix by a factor can reduce variance effectively.
- The key challenge is determining the appropriate scaling factor for optimal variance reduction.
32:35 *🔍 Understanding Scaling Factor*
- Introducing the concept of a scaling factor to control variance.
- Explaining that the scaling factor needs careful consideration and mathematical understanding.
- Focusing on the first row of the matrix to simplify the problem and then applying the solution to the entire matrix.
35:52 *📊 Calculating Population Variance*
- Explanation on the need to calculate population variance instead of sample variance for accuracy.
- Describing expected variance for potential new vector values.
- Emphasizing the importance of considering all possible values in variance calculation.
38:02 *🧮 Variance with Increased Dimensions*
- Exploring the effects of increasing vector dimensions on variance.
- Demonstrating how adding dimensions increases variance.
- Establishing that variance increases linearly with dimensions.
42:02 *📈 Linear Relationship of Variance and Dimensions*
- Summarizing the linear relationship between dimension increase and variance.
- Showing that as dimensions increase, variance also increases proportionally.
- Confirming the mathematical quantification of variance growth with dimension expansion.
43:09 *📉 Maintaining Constant Variance*
- Explanation on maintaining variance constant across dimensions.
- Use of division by a specific factor to achieve consistent variance.
- Introduction of a mathematical rule to support the variance adjustment.
44:43 *🔢 Mathematical Rule Application*
- Detailed explanation of using a constant to scale and adjust variance.
- Calculations showing how dividing by the square root of the dimension maintains variance.
- Examples of applying the rule to different dimensions.
48:03 *🤔 Summary and Practical Application*
- Summary of the scaling process to maintain variance.
- Integration of the scaling step into the self-attention model.
- Final formula for calculating attention in transformers using the scaling factor.
Made with HARPA AI
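The 13:45-48:03 chapters above (variance of the dot product grows linearly with dimension, and dividing the scores by √d_k restores it) can be reproduced in a few lines. A minimal sketch assuming zero-mean, unit-variance components, as in the video's experiment; the exact setup there may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_variance(d, n=10_000):
    """Sample variance of q.k over n random pairs of d-dimensional
    vectors with zero-mean, unit-variance components."""
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    return np.var((q * k).sum(axis=1))

for d in (3, 64, 512):
    raw = dot_variance(d)   # grows roughly linearly with d
    scaled = raw / d        # dividing scores by sqrt(d) divides variance by d
    print(f"d={d:4d}  var(q.k) ~ {raw:7.1f}  var(q.k/sqrt(d)) ~ {scaled:.2f}")
```

The unscaled variance comes out close to d itself, while the scaled column stays near 1 for every dimension, which is the "Maintaining Constant Variance" argument from 43:09 in code form.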
Nitish Sir.... your knowledge is MIT-equivalent. Really outstanding!!
Sir, it's just a suggestion, but there are some viewers on your channel who have the mathematical background to understand things but are missing the basic conceptual idea (common among university students). So if you need to go into more mathematical detail, please do, because we can relate things better that way.
The mathematical grounding is sufficient, relevant, and well explained in this tutorial. You are advised to watch separate videos on those mathematical concepts that you could not understand.
Your background knowledge for explanation is so amazing. Where did you study? Please complete the whole playlist.
If softmax is so sensitive to the size of the dimensions, then why isn't dividing by (Dk)^1/2 before applying softmax (to normalise the values) a standard process?
This was good and easy. It was also explained so well that we don't have to memorise it again; it will flow automatically.... Thank you so much!!!
I haven't found an explanation like this anywhere else. Never stop explaining like this, Nitish Sir. ❤❤
Very nice explanation. Really enjoyed the Transformer lecture series. The teaching style surely improves the thinking approach, which is helpful when reading a research paper.
How do we handle a multiclass classification problem? Please make a video; it's very important.
This is the pinnacle of teaching. Lots of respect from the other side of the border.
Dear viewers, please, please do hit like, share, and subscribe to this channel, so that Nitish sir never gets upset for any reason and stops making videos. This channel is a Granth, it's a Bible, it's a Quran for all students. I wish you a billion subscribers and a trillion likes on each video.
Thank you!! Sir, can you try to make a video on an LLM project? Because I understand all the concepts, but I don't know how to implement them or how to write code using LLMs.
Sir, we appreciate your efforts, and they are truly helpful for us. However, we have one issue to address. In your videos, you thoroughly explain all types of questions that arise in our minds about concepts, but there is a lack of focus on the actual code. For instance, I personally encountered this issue when learning from your videos on LSTM, GRU, Bidirectional LSTM, and their concepts. While I grasped the concepts well, I faced difficulties when it came to implementing them in coding. Therefore, it would be immensely beneficial if you could create some videos specifically focusing on coding, using different datasets. This would greatly assist us in applying the concepts practically. Thank you
6:40 DK = Dinesh Kartik XD
Thank you so much for the explanation. I feel satisfied after understanding your explanation.
Thank you, Nitish sir, you have given an amazing explanation of these concepts. I appreciate the time and effort you put into this. Will clear interviews soon.
Sir, very well explained. We want this kind of detail; this is what makes you different from the others. Please keep it up. I get a solution from your channel every time, sir.
"NITISH_SIR" Is All You Need ......#Master_Blaster of the Data Domain!!!
Sir, you are a good teacher. The long explanation is so good. I have understood how things work in self-attention.
Very well taught. Thank you for putting in so much effort.
Why root Dk?? In our case Dq, Dk and Dv are the same, but as you said they may be different, so why are we using Dk to adjust the variance? Is there any particular reason, unless Dq, Dk and Dv are always the same? Please shed some light.
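A small sketch of the variance argument from the video (my own illustration, not the instructor's code). Note that d_q must always equal d_k, since each score is a dot product q·k; only d_v can differ, and it just sets the output width, which is why d_k is the right quantity for scaling. If q and k have i.i.d. components with mean 0 and variance 1, the dot product q·k has variance d_k, so raw scores spread out as the key dimension grows; dividing by √d_k brings the variance back to about 1 for any d_k, keeping softmax out of its saturated region.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_variance(d_k, n_samples=100_000, scale=True):
    """Empirical variance of attention scores q.k for random unit-variance q, k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.einsum("ij,ij->i", q, k)  # one dot product per (q, k) pair
    if scale:
        scores = scores / np.sqrt(d_k)    # the 1/sqrt(d_k) scaling from the paper
    return scores.var()

for d_k in (16, 64, 256):
    raw = score_variance(d_k, scale=False)   # grows roughly like d_k
    scaled = score_variance(d_k)             # stays near 1
    print(f"d_k={d_k:4d}  raw var ≈ {raw:6.1f}  scaled var ≈ {scaled:.2f}")
```

Running this shows the unscaled variance tracking d_k (≈16, ≈64, ≈256) while the scaled variance stays near 1, which is exactly the effect the √d_k division is there for.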
We really do appreciate the details. Explanation of the intuition behind every little step is very useful in understanding the concept
Amazing!!! You are truly unbelievable Nitish Sir ❤❤
I think it is a wonderful explanation, and whatever time you took to explain it, it was amazing.
Bro, these are the best videos I have seen to understand Attention mechanism. Thank you.
What amazing things you ended up explaining along the way, sir......
If you want to truly feel a subject, go to Nitish sir.
Best explanation
thank you very much for this video..
But we need a project video using transformers...
Too good sir, i loved the detailed explanation. Absolute banger video again.
Can see the passion of the teacher in teaching, particularly for a concept which others ignore.
Sir, this is a really nice approach; I would love to learn the topics in this much detail. One request: please increase the frequency of videos and complete this playlist.
Sir, I want the full course up to deep learning... what should I do?
Sir, please please increase the frequency of video uploads. At least 1-2 in a week.
lot of love and respect from pakistan sir
Hi @Nitish, all the videos on Transformers are truly remarkable. There are so many videos on this topic on RUclips, but yours are GEMs. I came here just for the explanation, and deep down I am satisfied. Keep making these kinds of videos with proper explanations; otherwise, nowadays everyone just makes short videos for views. Truly appreciating your hard work.
Thank you so much for restarting this playlist
It was an amazing explanation, so thorough and still so simple to understand for such a complex topic. I have followed courses from NPTEL, Stanford, and Deep Learning, but yet this was the smoothest explanation! Your content is highly underrated. I wish I had found your channel sooner! Thanks 🙂
There couldn't be a better playlist for Deep Learning.
I always wanted a teacher who could explain hard concepts simply and in detail. I have found that kind of teacher in you. Thank you sir for this awesome and detailed video.
Sir, do you have any plan of making tutorials on reinforcement learning? I want to learn it from you, sir.
Truly, a marvelous effort and a superb way of explaining as you said.
The explanation is detailed. kindly keep the explanation always this detailed. thanks :)
What will I need to study in SQL for data science? Please make a video?? 🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏
WOW WOW WOW, BEST EXPLAINED VIDEO I HAVE EVER WATCHED ON RUclips!
Honestly, Your way of explanation is D best Sir!
♾🌟
In favor of this detailed explanation! You get to know the very minute details!
Really want to watch this. We'll study the entire transformer in one go. I used to forget the previous video by the time the next one arrived.
I wish I could like 👍 his videos unlimited times!!!!! This is a masterpiece series on DL and Language Models... Still the likes are less than 1000... You deserve a lot Nitish... #CampusX
Perfect way of explanation, thanks, keep it up
Its Truly Amazing to experience how beautifully you make us understand the concepts and give us a crystal clear intuition..
Hats-off to your level of patience/calmness. That soothing explanations...🤌😲
Your explanation was so awesome. Can you upload videos within 2-3 days?
Truly awesome! Eagerly waiting for more videos!
Great Video sir, After this video I understood that you are a very curious teacher sir!
Sir, please also cover the part where we write the code for self-attention.
When you explain in detail, we understand it well, sir.
This is a very good method. You are teaching very well. This is exactly how we want to be taught.
😄😄😄 Your videos will one day turn students into researchers.
Can anyone tell which topics are left now in 100 days of DL?
My friend got an interview question on how the VGP and EGP problems are solved in transformers..... This is the 3rd time I'm watching this playlist... it is the one-stop solution for DL.... Love you Nitish Bhai..... Love from Maharashtra
Hi! I am going through this masterpiece video by Nitish sir (all my love to him). Wanted to ask: what do you mean by VGP and EGP here?
no words sir for your explanation, love you sir...
Great explanation. Many coaching institute teachers are copying your content and teaching it to students. They don't have their own content, nor are their concepts clear.
I don't think anyone can explain these concepts in such an easy manner. You are a great teacher!
Extraordinary…. no one else can do it like this… or has done it like this.
A detailed approach like this develops the creators of the next AI model.
I have seen many tutorials and explanations of transformers and their architecture. I have never seen a detailed explanation like this, in such a crisp and precise way.
Thanks Nitish
Superb Explanation .... keep it detailed :)
Thanks a lot Sir for this. You are amazing.
Good .. this detailed way of Explanation is good
Sir, can you do a separate course for MLOps and deep learning?
Superb explanation. Loved the video.
Great video sir, best explanation. This is what we wanted.
Great explanation . Thank you Sir .
other (AI,ML,DL) teacher ❌ nitish sir ✅
This type of explanation is what we need to make our concepts strong.. thank you sir...
Good explanation; I need the same kind of explanations for other topics too.