At first I was inclined to click away from the video because of the unorthodox explanation of LSTM in "steps", which was different to what I had seen in other videos and blog posts which focus on the infamous LSTM diagram. However, I was struggling to fully grasp LSTMs so I decided to give the video a try. And it paid off! I can't believe LSTMs are that simple! This video is absolutely essential for understanding LSTMs at a fundamental level.
I listened to a 2-hour lecture in my MSc in data science and still didn't know what was happening. Your video explains it in a succinct way!!! Thank you!!!
The great thing about your videos is that I am always guaranteed to learn something, and to learn it with a much better understanding.
Happy to hear that!
Your videos (specifically sampling and deep learning videos) helped me a lot during my master's. Thanks for all the videos!
Thanks for watching!
Best explanation of LSTMs on the internet
Thanks!
I've been trying to understand LSTMs through multiple blogs and videos, but the thing I kept wondering was why they need to be this complex. You specifically targeted that point of view, which is what makes this one of the best videos: you showed why there was a need for an LSTM and how the gaps could be filled, and that made it very easy to understand. Could you please also list the references for the video, so that anyone who wants to go deeper into the concepts can follow up? That would be very helpful! Thanks a lot for this video!
Please continue the same good work, blending mathematics with simple real-life examples. Fantastic explanation👍
Thank you so much for your videos! They are super informative and much more intuitive than the hundreds of slides I have from my master's class. Keep up the great work!
Extremely good and helpful! A genuine desire to help learners by explaining difficult ideas in a most self-effacing manner! Many thanks!
Thanks for the video!! Just what I needed for my ML midterm exam. Will be waiting for the Transformers topic, which I believe builds upon this concept.
Glad it was helpful!
This is an extremely good explanation. Thanks for all the effort and sharing!!
Super helpful. I can't thank you enough for making this explanation.
Amazing explanation!
great job on this topic explanation!
I love your videos, keep up the awesome work!!!
Thank you! Will do!
So happy you did this video!!! :D Thank you for all the great work!
You are so welcome!
Really easy to follow! Thanks a lot.
Glad it helped!
Great video as always!
The part that still perplexes me:
How does the LSTM "know" what is important (like dog) and when to actually use that to predict the next word?
It learns through lots of training, just like every other aspect of DL
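To make "lots of training" concrete, here is a minimal sketch of next-word prediction with an LSTM (my own toy setup, assuming PyTorch; none of these names come from the video). Nothing ever tells the network that "dog" is important; gradient descent just adjusts the gate weights so that keeping useful words in the cell state lowers the prediction loss.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

params = list(embed.parameters()) + list(lstm.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 20))   # fake token ids; real data would be text
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict the next word at every position

for step in range(100):
    h, _ = lstm(embed(inputs))                   # hidden states carry whatever the gates kept
    logits = head(h)                             # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                              # gradients flow back through the gates
    opt.step()
```

With real text instead of random token ids, the forget and input gates end up learning which tokens (like the subject of the sentence) are worth holding on to.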
Just amazing intuition! Thanks so much for the great content.
Glad you enjoyed it!
Thanks! Could you also explain GRUs?
A video about transformers and GANs in this style would be awesome as well.
Thanks for the suggestion!
Hi Ritvik,
It would be really great if you could create videos explaining the maths behind ML models like SVM and PCA. I am also curious about ODEs, PDEs, real analysis, complex analysis, and stochastic calculus, but the problem is that I want to explore topics which are relevant to financial engineering, so I could read all the quant finance related textbooks. I am a professional and really don't have time to read all the applied maths textbooks 😅.
He has covered the ML models you listed in depth. It's illegal to cover ODEs and PDEs though, because nobody likes them.
Awesome, like always - good job ;-)
Thank you! Cheers!
Thank you very much! I have a few questions:
1. Could you please explain the reasoning behind using a candidate cell state and why the tanh activation function is necessary?
2. I have noticed that many implementations, papers, and blogs use a concatenation of h[t-1] and x[t] with a single learnable weight matrix W, instead of the separate U and V used in this video. Can you clarify why this is the case?
3. Despite the success of the model in predicting words, I remain somewhat skeptical about how it achieves such accuracy. :)
1. The candidate cell state is the same weighted combination of the input and the previous hidden state as all the other gates. However, the other gates pass through a sigmoid because we want to essentially binarize them: we want them to be ~0 or ~1, which is what a sigmoid gives. This way, the gates function as True/False activations, letting data through with a 1 and stopping data with a 0. With the candidate cell state, we don't want a binarized output; rather, we want all the information from the input and the hidden state. Applying tanh bounds the values between -1 and 1, which lets it retain more information (basically we don't want 0s).
To summarize, we use the candidate cell state to obtain the actual information from the input and the prior hidden state, which is why tanh is applied. The candidate cell state is then multiplied by the input gate (which IS ~0s and 1s), and that determines how much of its information gets written into the actual cell state.
2. Concatenating the weight matrices is the same thing as what he showed; it's just a little more efficient to store all the numbers in a single matrix. Conceptually it's exactly the same though (the small sketch below this reply checks it numerically).
3. That's kind of how it is with all of deep learning lol
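To make both points concrete, here is a small NumPy sketch of a single LSTM step (not from the video; using U for the hidden-state weights and V for the input weights is my assumption about its notation). It shows the sigmoid gates next to the tanh candidate, and checks that one concatenated matrix W = [U | V] acting on [h; x] matches separate U and V.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, D = 4, 3                      # hidden size, input size (arbitrary toy values)
rng = np.random.default_rng(0)

def mats():                      # one (U, V, b) triple per gate
    return rng.normal(size=(H, H)), rng.normal(size=(H, D)), np.zeros(H)

(U_f, V_f, b_f), (U_i, V_i, b_i) = mats(), mats()   # forget gate, input gate
(U_o, V_o, b_o), (U_c, V_c, b_c) = mats(), mats()   # output gate, candidate

def lstm_step(x, h_prev, c_prev):
    f = sigmoid(U_f @ h_prev + V_f @ x + b_f)        # ~0/1: keep or drop old cell state
    i = sigmoid(U_i @ h_prev + V_i @ x + b_i)        # ~0/1: how much candidate to write
    o = sigmoid(U_o @ h_prev + V_o @ x + b_o)        # ~0/1: how much state to expose
    c_tilde = np.tanh(U_c @ h_prev + V_c @ x + b_c)  # candidate cell state in (-1, 1)
    c = f * c_prev + i * c_tilde                     # candidate scaled by the INPUT gate
    h = o * np.tanh(c)                               # output gate scales the new hidden state
    return h, c

# Concatenated form: W = [U | V] applied to [h; x] gives the exact same pre-activation.
x, h_prev = rng.normal(size=D), rng.normal(size=H)
W_f = np.hstack([U_f, V_f])
assert np.allclose(W_f @ np.concatenate([h_prev, x]) + b_f,
                   U_f @ h_prev + V_f @ x + b_f)
```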
Hi Ritvik. It would be amazing if you could better organize the playlists (chronological order, and the right videos in the right playlists).
Noted!
Thank you
This is probably a not-so-great comment, but something seems "wrong" with the gradient descent method (and the chain rule) if it generates a vanishing gradient problem (or an exploding gradient problem). I mean, it isn't really "matching reality", because in reality we can speak about a character on page 500 who was introduced on page 1. We are forced to apply this "band-aid" of the LSTM because the basic method of gradient descent is not "good enough", or is in some way artificial. Does anyone agree with this? I am not sure what I mean by "matching reality".
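For what it's worth, the vanishing/exploding issue isn't really gradient descent being "wrong"; it's just arithmetic. Backpropagation through time multiplies one factor per time step, so a long chain of factors a bit below 1 dies out and a chain a bit above 1 blows up. A toy illustration (my own numbers, not from the video):

```python
import numpy as np

steps = 100
per_step_small = np.full(steps, 0.9)   # stand-in for per-step derivatives a bit below 1
per_step_large = np.full(steps, 1.1)   # stand-in for per-step derivatives a bit above 1

print(np.prod(per_step_small))  # ~2.7e-05 -> the gradient from 100 steps back has vanished
print(np.prod(per_step_large))  # ~1.4e+04 -> or it explodes
```

The LSTM's additive cell-state update (with a forget gate near 1) is exactly the fix: it gives the gradient a path that avoids that long chain of multiplications, so the character from page 1 can still influence page 500.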
Isn't h9 also affected by x9?
what da dogg doin😄😄
Bro literally explained what the dog doing
who cares? we have transformers
It’s part of the series building up to more complex things
LSTMs are still surprisingly powerful for a lot of applications.