I had no idea about CNN at all, this was great and has given me immense confidence in learning about CNN. Great video. Explained beautifully from scratch to end.
you made me realize there are indeed other YouTubers who "don't really know much about" what they're saying (0:17). You explain it the best way on YouTube, especially the structure of the CNN
YOU HAVE MADE ME ACTUALLY LIKE ML DL for the first time
Thanks a lot for not having a superficial touch of the topic. Keep it up!
The way you explained made me feel like I didn't know so much about CNN. I wonder when did you read so many papers. Thanks for sharing your knowledge. Helps a lot.
Thank you so much! Everyone is just explaining like: "So this is convolution and that generates these numbers and this is our feature cubes and you apply pooling and get that... let's jump in to the python code i wrote in 5 weeks but imma explain in 15 seconds". You've explained all these concepts clearly and one by one. Can you make a video about training the CNN? It would be awesome.
5 weeks? Nah bro they're not as dumb as you are lol. But seriously code is a shit way of explaining something. You should check out lectures from universities though, this video was pretty shit too
You explain better than well established organizations boi!! Keep it up.
mate I have been working as a junior AI engineer for over a year now and I have successfully deployed custom-built CNNs on NVIDIA hardware but I am still learning from your videos! Just discovered and watched them all back to back. Best videos I have found and I watch a hell of a lot of videos on this topic! I have also read some hardcore books on it. Your videos are par excellence, please keep making them! Would love to see some practical examples. There are many tutorials on how things like segmentation and superpixels WORK but nobody wants to show us how to actually implement them into a custom network and display the results, i.e. detect flame or smoke. When it comes to practical solutions nobody really goes beneath the provided API examples! Very frustrating.
thank you so much for an amazing video. Even after going through several videos I did not get the concept clear. After this video all of my doubts are clear
please make hands-on tutorials it's a humble request, hope to see you soon
small correction @16:40: the calculation gives 12.5, not 13.5: (26-2+1)/2 = 12.5
17:00 - From 13x13x32 to conv3x3,64. How is the volume/depth of 32 handled? I understand the result of 11x11x64 (filters) but are those 32 layers summed/packed and sent to conv3x3x64?
lmao I have the same question. pretty sure there are 64, 3*3*32 filters.
Such an amazing video man. The best educational I have watched in a while
very well explained! good job! thank you so much for putting the effort in this video!
Thanks so much!
Should've found this a month ago before i proceeded to try and learn this on the fly and just embarrassed myself in front of my department
This is a great video. I have one small doubt. @17:11 How do you apply 64 kernels on 32 response maps and get 64 response maps in the next layer?
remember the depth of each filter is 32. so actually, you apply 64 3*3*32 filters, which is why the output depth is 64.
Thank you for this question! Wondering the same thing!
Ah thank you, so each takes the 3x3 over all of the previous filters.
17:27 Out.width = 13 - 2 + 1 = 11. Something is wrong here, as 13-2+1 is 12.
Well done! Your voice and method left me wanting a more detailed explanation from you.
Maybe you could give that explanation over a cup of hot chocolate by the fire as we cuddle up, listening to the latest episode of the Lex Fridman Podcast together. We laugh as Lex goes off on some profound tangent about how the human mind is hard to understand. "That's not the only thing that's hard" I think to myself, as you spoon me ever so gently. It's a perfect night. Just you and me, by the fire, as the sky darkens outside the cabin windows. I know that you could never leave me wanting more...
Sorry, I got a paper due in 9 days that I don't want to write.
Still relevant today, thanks.
Sorry, one point is not clear. After the first convolution (and maxpool) we end up with 13x13x32. Then conv3x3,64 is applied. How does that work? We had 32 layers (feature maps). If we apply conv3x3,64 to each layer we would end up with 32x64 layers. But we end up with only 64 layers. Thanks
When we have a 13×13×32 volume, and apply one filter of 3×3×32, then we get an 11×11 feature map. So if we apply 64 such filters to the 13×13×32 volume, we end up with 64 such 11×11 feature maps. In other words, an output of 11 × 11 × 64
Sorry, allow me to rephrase the question. At 4:50 you apply the conv filter 3x3x1 to image 5x5x1. Basically just weighting and adding pixels that fit into a 3x3 square. How would you apply a 3x3x1 filter to image 5x5x2 (2 layers of 5x5x1)? Weighting and adding pixels from both layers?
Depth of the filter and the input should be the SAME. 3 x 3 x 1 filter convolves with a 5 x 5 x 1 image as they have the same depth (1). But in the case of 5 x 5 x 2, we NEED to apply a filter of shape 3 x 3 x 2. A 3 x 3 x 1 filter will only convolve with one of the 5 x 5 x 1 layers. We don't take the average of both layers as they represent different data. Hope that makes sense.
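To make the reply above concrete, here is a minimal pure-Python sketch of one 3 x 3 x 2 filter sliding over a 5 x 5 x 2 input. The function name `conv2d_single_filter` and the toy values are made up for illustration, and it assumes stride 1 and no padding. Each output value is the weighted sum over the 3 x 3 window across BOTH channels at once, so one filter gives one 2D map.

```python
def conv2d_single_filter(volume, kernel):
    """Stride-1, no-padding convolution of an H x W x D volume with one
    k x k x D filter. Returns a 2D map of size (H-k+1) x (W-k+1)."""
    height, width, depth = len(volume), len(volume[0]), len(volume[0][0])
    k = len(kernel)
    assert len(kernel[0][0]) == depth, "filter depth must equal input depth"
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            total = 0.0
            # Sum over the k x k window AND over every input channel.
            for di in range(k):
                for dj in range(k):
                    for c in range(depth):
                        total += volume[i + di][j + dj][c] * kernel[di][dj][c]
            row.append(total)
        out.append(row)
    return out


# Toy 5 x 5 x 2 input: channel 0 is all ones, channel 1 is all twos.
volume = [[[1.0, 2.0] for _ in range(5)] for _ in range(5)]
# Toy 3 x 3 x 2 filter of all ones.
kernel = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]

feature_map = conv2d_single_filter(volume, kernel)
print(len(feature_map), len(feature_map[0]))  # 3 3
print(feature_map[0][0])                      # 27.0  (9 * 1 + 9 * 2)
```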
15:35. After first convolution and pooling we end up with 13x13x32. So how do we apply convolution 3x3x64 to it? We got 32 layers of 13x13 grid. So now we apply 3x3 convolution filter 64 times and end up with 64 layers. How do we do it since we have 32 layer in the source?
We don't apply convolution with a 3 x 3 x 64 filter. We apply convolution for 64 filters of shape 3 x 3 x 32, each with the input 13 x 13 x 32. The result of each convolution will be a 11 x 11 output. Since we have 64 such convolution operations, we end up with 11 x 11 x 64. Just note the OUTPUT depth is equal to the number of filters chosen for convolution. And the depth of filter is equal to the depth of INPUT.
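The two rules in the reply above (filter depth = input depth, output depth = number of filters) can be sketched as a small shape calculator. The helper name `conv_layer_shapes` is hypothetical, and it assumes square kernels and no padding.

```python
def conv_layer_shapes(input_shape, kernel_size, num_filters, stride=1):
    """Return (shape of each filter, shape of the layer output),
    assuming a square kernel and no padding."""
    h, w, d = input_shape
    out_h = (h - kernel_size) // stride + 1
    out_w = (w - kernel_size) // stride + 1
    filter_shape = (kernel_size, kernel_size, d)   # depth matches the INPUT
    output_shape = (out_h, out_w, num_filters)     # depth matches the FILTER COUNT
    return filter_shape, output_shape


filter_shape, output_shape = conv_layer_shapes((13, 13, 32), kernel_size=3, num_filters=64)
print(filter_shape)   # (3, 3, 32)
print(output_shape)   # (11, 11, 64)
```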
This is genuinely a brilliant explanation. Many thanks
Please explain again why we have 32 and 64 layers (feature maps)? Where do these numbers come from, are they calculated or just picked? Thanks.
Sir, it depends on how many feature vectors you need. These numbers are the most commonly used.
I am also having same query, how to decide how many filters are required
Be careful!!! Thank you. At 17:28 in the clip there is a mistake in the equation (13-3+1=11 is true, however you have typed 13-2+1=11).
Thank you very much.
Cleared lots of doubts.
I just noticed that we round up when pooling, we don't floor. cause (26 - 2 + 1)/2 is 12.5 not 13.5
7 months later but I noticed the same. Either that or by mistake calculated using the first output and took 28 instead of 26: (28-2+1)/2 = 13.5
greatest of all the other videos
Thanks for the compliments :)
Great videos. One small question at 5:07 how did you select the weights of the 3 by 3 filter
@17:25, Output (width) = (13-3+1)/1. So the result will be 11
You are right. Will like this so others can see it. Nice catch!
@@CodeEmporium You're welcome. You should do some tutorials on Kaggle Problem solving, it will be helpful.
this is just so good. thank you for this.
I think the value of this video is not so much that you will be able to sit down and use CNN from the get-go. Rather, it demonstrates some of the key concepts quite well (convolving layers for example). Looking at the final example is helpful and should probably be viewed several times to get the full meaning. But in all, the video is - when used with other information sources - a good start to learning CNN.
Bang on. Explained very good
Actually, CNNs were introduced a bit earlier. I recall it was LeCun's 1989 paper.
man you are a genius.
you provide references, thank you very much. your videos are great.
this video really helped me a lot
Great video, filled in a lot of gaps of understanding.
Location independence is an important feature
I have never understood cnn like I do after this video.
Awesome explanation. Loved it. Just a little correction , at 17:24 I think "hwidth" is 3 not 2 .
Thanks for the catch! Yeah there are definitely a few typos here that you and some others called out. (Also thanks for the compliments) :)
Awesome video! Keep it up!
hi can you tell me how to find the confusion matrix for image retrieval using CNN?
Great explanation. Thank you!
I know how a filter in a Convolutional Neural Network "scans" the input image and multiplies the values of the kernel with the corresponding receptive field in the input image and adds it all up to get a new pixel in the output activation map. But I'm unsure how the numbers in a filter are decided.
Is the kernel a patch chosen from the image? Like a 5x5 patch of the image that the network decides is good to use as a filter? Or are they random numbers that backpropagation will soon change to fit best with the data? And would these numbers in the filter be considered the weights of the network?
Thanks for any help.
The values in the kernel are randomly initialised and altered via backpropagation. If you know about simple densely connected networks, then you can consider a single weight in this type of network to be analogous to a 2D kernel that convolves a single channel in the input image. If you consider a 3-channel image as the input to a layer, and a single channel as the layer output, then the output (a 2D image) is obtained by convolving each input channel with its own K*K kernel and summing (superimposing) the resulting 3 images. This is analogous to a simple densely connected network except each weight in the layer is a K*K kernel rather than a scalar. However it makes more sense to consider a K*K*3 kernel rather than summing 3 K*K kernels for the 3 input channels. If N is the number of input channels, M the number of output channels and K the width of a kernel, then you have K*K*N*M parameters for a single layer (ignoring biases).
16:47 you explained the pooling output width and in the equation used (26-2+1)/2, which would be 12.5, but you said it will be 13.5! And I don't know how you get to 13. Can you please explain?
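As a quick check of the K*K*N*M count in the reply above, here is a small sketch. The function name `conv_param_count` is made up, and the optional bias term (one per output channel) is an assumption based on how conv layers are commonly set up, not something stated in the video.

```python
def conv_param_count(k, n_in, m_out, use_bias=True):
    """Learnable parameters in one conv layer with M filters of size K x K x N."""
    weights = k * k * n_in * m_out
    biases = m_out if use_bias else 0  # assumed: one bias per output channel
    return weights + biases


# Second conv layer from the video: 64 filters of 3 x 3 x 32.
print(conv_param_count(3, 32, 64, use_bias=False))  # 18432
print(conv_param_count(3, 32, 64))                  # 18496
```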
The formula is {[input length - pooling window length] ÷ stride} + 1.
Then {[26 - 2] ÷ 2} + 1 = 13
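The same formula as a tiny sketch, with the +1 outside the fraction. The helper name `pool_output_size` is hypothetical, and it assumes no padding with the division floored, so any leftover row or column that does not fill a whole window is dropped.

```python
def pool_output_size(input_size, window, stride):
    """Output length of a pooling (or conv) pass with no padding:
    floor((input - window) / stride) + 1."""
    return (input_size - window) // stride + 1


print(pool_output_size(26, 2, 2))  # 13  (first maxpool in the video)
print(pool_output_size(11, 2, 2))  # 5   (odd input: the last column is dropped)
```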
your videos are great please make a video on U-net plz
Thanks, good explanation of filters. Can you share links on how filters/kernels are prepared? For an object, what is the minimum number of filters required? Also on the development and updating of filters up to the latest YOLO model.
About CNNs url is broken ... Pls update the latest one
Yann Lecun is great
great video mate.
Excellent explanation
One doubt: in the last image shown, what will the width of each filter be in the second conv. layer? My understanding is that it will be 32 as the input width is 32, i.e. a filter of 3x3x32. Am I right or is there something I have misunderstood? Plz help.
i have the same question. have you figured it out?
Please make a video on visual question answering
hello dear, thank you for video i have question how to deal with pooling in one dimensional input case?
I think at 16:32 the +1 should be outside the fraction in the end again?
Hey can I get the whole content with diagram
How does back propagation work for Convolutional Neural Network?
Why is the filter size 3x3 @8:06? Can we take a different size for the filter?
Yes, you can
Hello AJ, today I discovered your channel (subscribed long back but never explored this much) and guess what, you provide very simple intuition, within minutes, for topics that are hard to grasp. Can you do the same for some machine learning topics like ARIMA and other predictive models? Anyhow, great content. Really appreciate your effort and knowledge.
I've been playing around with time series models recently too. Not sure if there is enough drive for a video at this time. But will definitely keep this in mind
CodeEmporium That would be a great help. thanks for the reply AJ can’t thank enough for your efforts.
why do we use convolution ??? why not just simple ANN in case of image ?? main question is what is need of convolution in CNN?? please Answer....
ANN takes 1D input and thus loses the spatial details of the image, but in cnn those are extracted and presented to ANN in a more meaningful and trainable manner
Well done! Thanks buddy.
Great video, this is really helpful and detailed. Loved it!!!
17:25 how come h(width) is 2 and after doing the arithmetic Out(width) is 11? As per my observation, while doing conv3x3, 64 the kernel size (h(width)) should be 3, right?
When we have a 13×13×32 volume and apply convolution with one filter of 3×3×32, this will give us an 11×11 feature map (as the stride is 1). Apply 64 such kernels and we get 64 such 11×11 feature maps, i.e. an 11×11×64 volume.
Mistake in the slide: should be 13 - 3 + 1 = 11
@@CodeEmporium where does this 3*3*32 filter come from? did I miss something or is something missing in the images shown?
Nice video, quick question though. How do you determine the weights in each filter? I would assume they are randomly assigned like the weights in a normal neural network on the first feed-forward pass.
Follow up question:
How would one then go about updating the weights in each filter?
Thank you
The 32 Filters that are demonstrated at 8:46, are those filters in the other layers behind the first the same or different?
how does h-height change from 3 to 2?
Thank you very much! This is a great video containing a lot of helpful information. Really appreciate the time and effort you spent on making this video. Here is a question: when conv 3*3, 64 is applied on 13*13*32 images, isn't the result 11*11*(64*32)? For each of the 32 layers, the 64 filters were applied. One more thing, I believe 13-2+1 = 11 is not correct (should be 12) @17:29
Yes! I thought the same... it is confusing enough as it is! :D ... maybe a mistake or something not mentioned about how the convolution works?
thank u, teacher
Question, why is there an increase of kernels for every convolution layer and where are those kernels coming from? What is the basis of those kernels?
The network tries to understand features of the input (image). The shallower layers extract low-level features (edges, strokes, shadowing, texture, etc). The deeper we go, the higher-level the extracted features become (could be anything. Most likely not human interpretable). Such higher-level features are more complex. Hence we need more parameters to learn them. So the deeper we go, the more kernels we use.
@@CodeEmporium Follow up question, where can I get the parameters? What is the basis of these parameters? Are parameters and features the same?
Just also wanna give appreciation and thanks for your videos and answer! The backstory of this question is, me and my thesismates are creating a CNN model that revolves around genre classification with some enhancement from new techniques and methodologies. This video was actually our basis for learning how CNN works and its specifics in terms of layers - from nothing to almost intuitively knowing the basics.
gonna need to subscribe bc multiple videos about audio and cnns ! :) yes!
Thank you very much!
can i get the slides of this.
Whoaa!!!!
how to calculate 512 and 512 dense
good one
May I know how to calculate the input, output and learnable parameters in the following case?
Assumptions:
- Input size is (32, 32, 3)
- No padding for all convolutions
---------------------------------------------------------------------------------------------------
Layer  Type   Kernel  Stride  Neurons/feature maps  Input size   Output size  No. of parameters
---------------------------------------------------------------------------------------------------
1      Conv   (3, 3)  (1, 1)  16                    (32, 32, 3)
2      Pool   (2, 2)  (2, 2)  16
3      Conv   (5, 5)  (1, 1)  32
4      Pool   (2, 2)  (2, 2)  32
5      Conv   (3, 3)  (1, 1)  64
6      Dense  --      --      128
7      Dense  --      --      2
---------------------------------------------------------------------------------------------------
thank you
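One way to fill in a table like the one above is to walk through the layers one at a time. This is only a sketch under stated assumptions, none of which come from the question itself: every conv and dense layer has a bias, pooling floors odd sizes, and the last conv output is flattened before the first dense layer. The helper names `conv`, `pool` and `dense` are made up for illustration.

```python
def conv(shape, k, filters, stride=1):
    """No-padding conv: returns (output shape, learnable parameters incl. biases)."""
    h, w, d = shape
    out = ((h - k) // stride + 1, (w - k) // stride + 1, filters)
    return out, k * k * d * filters + filters


def pool(shape, k, stride):
    """Pooling keeps the depth and has no learnable parameters."""
    h, w, d = shape
    return ((h - k) // stride + 1, (w - k) // stride + 1, d), 0


def dense(n_inputs, units):
    """Fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return units, n_inputs * units + units


report = []
shape = (32, 32, 3)
shape, p = conv(shape, 3, 16);  report.append(("1 Conv", shape, p))
shape, p = pool(shape, 2, 2);   report.append(("2 Pool", shape, p))
shape, p = conv(shape, 5, 32);  report.append(("3 Conv", shape, p))
shape, p = pool(shape, 2, 2);   report.append(("4 Pool", shape, p))
shape, p = conv(shape, 3, 64);  report.append(("5 Conv", shape, p))
flat = shape[0] * shape[1] * shape[2]          # assumed flatten step: 3 * 3 * 64 = 576
units, p = dense(flat, 128);    report.append(("6 Dense", units, p))
units, p = dense(units, 2);     report.append(("7 Dense", units, p))

for name, out, params in report:
    print(name, out, params)
total_params = sum(params for _, _, params in report)
print("total", total_params)  # total 105890
```

Under these assumptions the output sizes come out as (30, 30, 16), (15, 15, 16), (11, 11, 32), (5, 5, 32), (3, 3, 64), 128 and 2, with 448, 0, 12832, 0, 18496, 73856 and 258 parameters respectively.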
Good job, thanks.
hey man,
is it somehow possible to ask you some questions in terms of my master thesis? ;)
finally the video i wanted, how to convert the deep volume matrix into ANN input. I have one doubt: suppose we have an image of 28x28 pixels and the first CNN layer with 3 kernels, we will get 3 feature maps. Now in the next layer, if we have "64" kernels, how many feature maps do we get, is it 64 * 3 or is it just x no of feature maps? If it is only 64 maps, then how do we convolve the 3 feature maps into 64 feature maps using only 64 kernels? Should we sum the 64 * 3 maps we get into 64 maps??
What is dense layer, why it is 512??
Thank u bro
How are the values in a filter (kernel) set? Are they randomized?
Initially, yes. They take on random values, which are later "learned".
how are they 'learned'? do you have this cnn code in Keras?
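A rough idea of what "learned" means, as a toy sketch and not the code from the video: start from a random kernel, measure how wrong its output is, and nudge each kernel value against the gradient of that error. Everything here is an illustrative assumption: a 1D signal instead of an image, a made-up target kernel `[1.0, -1.0]`, mean squared error as the loss, and plain gradient descent. In a real CNN, backpropagation computes these same kinds of gradients through all the layers.

```python
import random


def conv1d(signal, kernel):
    """Stride-1, no-padding 1D convolution (really cross-correlation, as in CNNs)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


signal = [0.0, 1.0, 0.5, -1.0, 2.0, 0.3, -0.7, 1.5]
true_kernel = [1.0, -1.0]              # hypothetical kernel we pretend is unknown
target = conv1d(signal, true_kernel)   # the outputs we want the layer to produce

random.seed(0)
kernel = [random.uniform(-1.0, 1.0) for _ in range(2)]  # random initial weights
learning_rate = 0.1

for step in range(500):
    output = conv1d(signal, kernel)
    errors = [o - t for o, t in zip(output, target)]
    n = len(errors)
    # Gradient of the mean squared error with respect to each kernel weight.
    grads = [sum(2.0 * errors[i] * signal[i + j] for i in range(n)) / n
             for j in range(len(kernel))]
    # Gradient descent step: move each weight against its gradient.
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

print([round(w, 4) for w in kernel])  # ends up very close to the target kernel
```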
Thank you!
Hey Can you do intuitive explanation of CNN on text data
Sure. Maybe a future video.
ty
Thank you soo much ...you saved me alot of reading time....
Perfect! Glad it helped
Where does 32 come from?
merci
good
Someone sending me conversation like AI Chatbot through all of actions in neural networks by inner voice using brain!!! Is it possible or not, if it is than how can I control this thing??
#Thanks in advanced.
Thanks for this video! You are cool, keep going 🤗
Yay! Thanks! Imma keep it up ;)
@17:25 13-2+1=11 is not correct.
17:21 your filter in round 2 convolution is (3, 3). So it should be 13-3+1=11. Not 13-2+1, which is 12.
And my fake PhD supervisor don’t even know or understand a single thing about this!!!! Damn those quacks! My country sucks!
Which country?
can you please make a video on Keras - container
Dude study on your own lol
16:34 shouldn't that be 12.5, not 13.5? (26-2+1)/2 = 12.5
Yes
13-2+1 is not 11 its 12
memo 13:30
poorly explained the layers. The same surface level explanation with no intuition behind it for the core concepts
The easier concepts were explained well but that wasn't why people watch these vids
Poorly explained!! Anyway a good try