Great work! I have personally been confused about how a 2D conv is applied with respect to depth, and the fact that the function is called Conv2D just added to it, though I understand the reasoning behind the naming. Seeing your animations has made my basics strong. Looking forward to future videos.
That means that for an RGB image, each kernel is actually expanded into 3 dimensions when applying Conv2D, right?
Amazing work.
I have always wondered why nobody talks about the differences, as I am from a signal processing background.
You may or may not have learned that doing a depthwise convolution first, followed by a pointwise convolution, results in exactly the same computation with a factor of 10 fewer multiplications. It first does a convolution over each colour channel, and then looks at each pixel of those results.
Exactly, because convolution is an associative (and even commutative) operation.
Yes, in fact I've got a video on depthwise-separable convolution for anyone interested: ruclips.net/video/vVaRhZXovbw/видео.htmlfeature=shared
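For anyone who wants to check the "factor of 10" figure from this thread, here is a minimal sketch that only counts multiplications per output position. The layer sizes are hypothetical, and the helper names `standard_mults` and `separable_mults` are made up for illustration. The sketch says nothing about whether the two outputs are equal; that holds only when the full filter bank happens to factor into a depthwise part and a pointwise part.

```python
def standard_mults(k, c_in, c_out):
    """Multiplications per output position for a standard convolution:
    every output channel uses its own k x k x c_in filter."""
    return k * k * c_in * c_out


def separable_mults(k, c_in, c_out):
    """Multiplications per output position for depthwise + pointwise."""
    depthwise = k * k * c_in   # one k x k kernel per input channel
    pointwise = c_in * c_out   # one 1 x 1 x c_in filter per output channel
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
ratio = standard_mults(k, c_in, c_out) / separable_mults(k, c_in, c_out)
print(f"{ratio:.2f}x fewer multiplications")
```

The cost ratio works out to 1/c_out + 1/k², so with 3x3 kernels and many output channels the saving approaches 9x, which is in the neighbourhood of the factor of 10 mentioned above.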
Oh thank you so much!
I waited with a lot of joy for a new episode!
Thank you so much for the amazing work!
(And this explanation was exactly what I was lacking from the previous episode; everything clicked in my head.)
Nice, you did the video! Maybe you remember but I asked you this question before in one of your previous videos. The video turned out great! Thank you.
Yes! I was actually planning to reply to your comment and give you a link to this video, but you already found it :)
Did you already release the Transformers video on Patreon?
This is such an amazing video. Thank you.
Hey, great content! Isn't each output channel actually the sum of the filtered input channels in neural nets? One filter should equal exactly one feature map. So your 1x1 convolution filter should actually be just a scalar in 2D-convolution (simple multiplication). I think you presented a 3D-convolution. Hope to hear your answer!
The reason the filter at 3:00 being 2D gets glossed over is because most image signal processing is taught in grayscale
I think part of the misunderstanding here is the difference between applying a 'known' filter (Gaussian) to an image, which is the equivalent of performing a forward pass on a convolutional layer, versus learning different filters via backprop. The image processing convolution and the neural network convolution (forward pass) ARE exactly the same operation if you don't oversimplify the convolution in the image processing example. In the example at 3:06 you apply the same kernel to all 3 color channels, and you simplify it by showing only a single kernel. But if you truly wanted to match what a neural network does, you could have shown 3 Gaussian kernels (one for each channel), applied the 3 separate (but identical) Gaussian kernels to the 3 separate channels, and then summed the results. This is exactly the same operation a Conv2D layer performs in a neural network's forward pass. It's just that in the image processing example the filter you want to apply is known, so you can simplify and apply a single, known kernel to multiple channels.
I think this video also ties in with the misunderstanding of what a kernel is versus what a filter is in deep learning. If your input image is of size W x H x C, where W, H, and C are the width, height, and number of channels, the dimension of the filter would be K x K x C, where K is the side length of the kernel. The kernel shown in parts of this video is 3x3, and the filter is 3x3x3, which is 3-dimensional. With your example at 3:06 you inadvertently showed the application of a 'filter' in a roundabout way by showing a single Gaussian 'kernel' applied to 3 separate channels, one at a time. The terms kernel, feature, and filter get interchanged a lot, which causes confusion. I myself am guilty of this mix-up a lot of the time.
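The claim in this comment (one 2D kernel per channel, then a sum, equals one K x K x C filter) can be checked numerically. The following is a rough NumPy sketch, not any library's implementation. The helper names `correlate2d_valid` and `apply_filter_3d` are hypothetical, there is no padding, and the kernel is not flipped (cross-correlation, which makes no difference for a symmetric Gaussian).

```python
import numpy as np


def correlate2d_valid(channel, kernel):
    """Slide a 2D kernel over a single 2D channel (no padding, no flip)."""
    kh, kw = kernel.shape
    h, w = channel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(channel[i:i + kh, j:j + kw] * kernel)
    return out


def apply_filter_3d(image, filt):
    """Slide one K x K x C filter over an H x W x C image in x and y only.
    The filter spans every channel, so the result is one 2D feature map."""
    kh, kw, _ = filt.shape
    h, w, _ = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * filt)
    return out


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                        # stand-in RGB image
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0              # 3x3 "kernel"
filt = np.stack([gaussian] * 3, axis=-1)             # 3x3x3 "filter"

# Three identical kernels, one per channel, then summed.
per_channel_sum = sum(correlate2d_valid(image[:, :, c], gaussian)
                      for c in range(3))
feature_map = apply_filter_3d(image, filt)

print(np.allclose(per_channel_sum, feature_map))
```

Note that both paths produce a single 2D feature map, whereas the classical per-channel blur keeps three output channels because it skips the final sum.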
You're having the same thoughts that I did about a year ago when I ranted about how all the convolution animations were wrong in this video: ruclips.net/video/w4kNHKcBGzA/видео.html. And specifically, I complained that they over-simplified neural network convolution.
But viewers with a classical image processing background disagreed, and I wasn't sure why. So I took a closer look at classical image processing libraries and was surprised that their convolution really is this "simplified" version.
A perfect example is OpenCV's filter2D(), which takes a "single-channel" kernel and applies that same kernel to each channel of the input. It doesn't even have an option to provide one kernel per channel like you describe. The documentation explicitly says, "if you want to apply different kernels to different channels, split the image into separate color planes using split and process them individually."
That realization motivated me to create this video to highlight the differences. So my example at 3:06 isn't meant to match what Neural Networks do; it's meant to match what OpenCV and other classical libraries do. The true neural network example starts around 7:52.
Source: docs.opencv.org/3.4/d4/d86/group__imgproc__filter.html#ga27c049795ce870216ddfb366086b5a04
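To make the classical behaviour described in this reply concrete, here is a small NumPy emulation. It is not OpenCV code: `filter_each_channel` is a hypothetical helper, and unlike a real library call this sketch does no border padding, so the output is smaller than the input.

```python
import numpy as np


def filter_each_channel(image, kernel):
    """Apply the same single-channel kernel to every channel independently.
    Channels are never mixed, so the output has as many channels as the input."""
    kh, kw = kernel.shape
    h, w, c = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c))
    for ch in range(c):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, ch] = np.sum(image[i:i + kh, j:j + kw, ch] * kernel)
    return out


image = np.zeros((5, 5, 3))
image[:, :, 0] = 1.0                      # a pure red test image
box_blur = np.ones((3, 3)) / 9.0          # single-channel kernel

blurred = filter_each_channel(image, box_blur)
print(blurred.shape)                      # still 3 channels
```

Blurring the pure red image leaves it pure red, because nothing from the red channel ever leaks into green or blue. A neural-network filter would instead collapse all three channels into one feature map.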
@@animatedai My comment was not really a critique of the video. I thought the video was great. I understand why you showed the examples that you did. My comment was more about my own realizations / takeaways based on the video. I view things more from the deep learning side rather than the image processing side. So my comment was more about adding an explanation using the specific deep learning terms. I just felt like I would comment my own understanding of the comparison in case it would help someone else draw the same conclusion.
Even in the deep learning space alone there is a lot of inconsistent terminology. Across the major neural network libraries you will see parameters with different names that represent the exact same concept. Take Google's Flax for example. For the Conv layer the first parameter is called "features", and the definition right under it says "number of convolution filters". Compare that to TensorFlow (also made by Google), which calls the first parameter of its Conv2D layer "filters". Both pieces of software are written by the same company and represent the exact same concept, but under a different name. Multiply that confusion by however many libraries and layers are out there and we are left with the cauldron of varying definitions / terminology we have today lol.
Very nice video! It fully explains what I was confused about regarding CNN and image processing convolution.
But as a newcomer to CNNs my understanding is still a bit limited. For example, what other channels could we have (besides the R, G, B colors) if the input is an image?
Also, I understand how im2col works on a 2D input, but once the input is 3D, does it work the same way as in 2D or differently?
An RGB image only has three channels: R, G, and B. He mentions more than 3 channels because a network can pick up more features (like yellow, seen in the video) through training, each of which is just added as an extra channel before being passed on to the next layer.
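On the im2col question above: one common way to handle a multi-channel input is to flatten each K x K x C patch (all channels included) into a single row, so the convolution still becomes one matrix multiplication. The sketch below assumes that layout. The function name `im2col`, the sizes, and the row ordering are illustrative choices, and real libraries may order the data differently.

```python
import numpy as np


def im2col(image, k):
    """Flatten every k x k x C patch of an H x W x C image into one row."""
    h, w, c = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.zeros((out_h * out_w, k * k * c))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + k, j:j + k, :].ravel()
    return cols


rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))             # H x W x C input
filters = rng.random((4, 3, 3, 3))        # 4 filters, each 3 x 3 x 3

cols = im2col(image, 3)                   # (16, 27): one row per position
weights = filters.reshape(4, -1).T        # (27, 4): one column per filter
out = (cols @ weights).reshape(4, 4, 4)   # (out_h, out_w, n_filters)

print(cols.shape, weights.shape, out.shape)
```

So under these assumptions the 3D case works the same way as 2D; the rows just get longer (k·k·C instead of k·k).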
"actively confusing" this choice of words is so abstract to me
so so so good !
As someone who watched the previous video: you made a lot of assumptions about the user's knowledge. I have a little knowledge about CNNs and I was worse off than before I watched it.
The difference between this and that in terms of the ease of understanding is actually staggering.
In short, wonderful video and explanation. Thanks
Great content
Amazing content!!
Great work!
thanks thanks thanks this is the best
Are you still going to make videos on attention? Your style of teaching is great because it focuses on details
Great videos!!! So helpful to visualize and understand. Do you think you can make a video about Graph CNNs and how they are different from CNNs for images?
Very clearly explained! Good Job!
amazing!
Thx very much. I couldn’t stand all those comments of wannabe guys under ur previous video
One thing I always wondered about image nets: why do they use LAB? Why not use HSV/HSL or RGB?
Wasn't aware that they're using LAB but LAB is superior to everything else when it comes to colour spaces: It covers everything very well and is the ideal reference to convert anything to anything. Luminosity having its own channel is probably also beneficial when it comes to helping AI make sense of stuff. The same might be true for the a/b channels being pairs of complementary colours: since that already encodes a good chunk of human perception, the AI doesn't have to learn those aspects of colour theory.
@Amejonah @aoeuable I think I've only seen RGB as the input to image-based neural networks. My guess is that it doesn't matter much which format you use. Theoretically, the network is a universal function approximator. So if it was useful to have the data in a different format, the network could learn to transform from one format to the other. There might still be practical implications, but the only one that I've seen is normalizing the RGB values to a range of [-1, 1] (vs [0, 255]) or something close to that.
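The normalization mentioned in this reply can be sketched in a couple of lines. This is only one possible scaling, assuming 8-bit RGB input; `normalize_rgb` is a hypothetical helper name, and different models use different conventions.

```python
import numpy as np


def normalize_rgb(pixels):
    """Map 8-bit values in [0, 255] to floats in [-1, 1]."""
    return pixels.astype(np.float32) / 127.5 - 1.0


pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)   # one RGB pixel
scaled = normalize_rgb(pixels)
print(scaled.min(), scaled.max())
```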
@@animatedai I saw it in Colorizing Nets. Reason probably being that RGB doesn't capture lightness and so, LAB is used to be able to detect color changes better. But I dunno if this is true or not.
As you said, it probably doesn't affect the result at all (as in this paper: "ColorNet: Investigating the importance of color spaces for image classification")
What are some example algorithms for image processing convolution?
Great video! Thanks!
I want to know what tools you used to make this amazing video. I've been following you for a long time.
I'm using Blender with a heavy reliance on its Geometry Nodes feature. I've got a behind-the-scenes video on Patreon, and I'm planning to upload more in the future.
Please make a video on batch norm and layer norm in CNNs.
The red, green, and blue channels are independent features. That's another reason why we use a 3D filter, in contrast to the first example.
In the case of a depthwise CNN, each RGB channel has its own independent receptive field.
thanks for the previous videos.
Maybe someone can help me understand this.
If I have just one 3D volume, would it ever make sense to do a 3D convolution in, say, PyTorch? Because doing a 2D convolution will work on all the slices, right? So say that I have a volume that's 300,300,100. Should I just move the slice dimension to the channel dimension and apply a 2D convolution? What would a 3D convolution even do here?
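One way to see the difference asked about here is to compare output shapes and weight counts on a tiny stand-in volume. This is a plain NumPy sketch, not PyTorch code: the helper names are hypothetical, there is no padding, and each function applies a single filter.

```python
import numpy as np


def conv2d_slices_as_channels(volume, filt):
    """Treat the slice axis as channels: the filter spans every slice at once
    and only slides in x and y, so the depth axis disappears."""
    kh, kw, _ = filt.shape
    h, w, _ = volume.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(volume[i:i + kh, j:j + kw, :] * filt)
    return out


def conv3d(volume, filt):
    """The filter is small along depth too and slides in x, y, and z,
    reusing the same weights at every depth position."""
    kh, kw, kd = filt.shape
    h, w, d = volume.shape
    out = np.zeros((h - kh + 1, w - kw + 1, d - kd + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for z in range(out.shape[2]):
                out[i, j, z] = np.sum(
                    volume[i:i + kh, j:j + kw, z:z + kd] * filt)
    return out


rng = np.random.default_rng(0)
volume = rng.random((12, 12, 6))      # small stand-in for 300 x 300 x 100

filt_2d = rng.random((3, 3, 6))       # 54 weights, a separate set per slice
filt_3d = rng.random((3, 3, 3))       # 27 weights, shared across depth

print(conv2d_slices_as_channels(volume, filt_2d).shape)
print(conv3d(volume, filt_3d).shape)
```

With slices as channels, every slice gets its own weights and the output has no depth axis left. With a 3D filter, the same small pattern detector is reused along depth and the output is still a volume.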
_"Let's not pretend that greyscale is a thing in 2023."_ Christopher Nolan would like a word with you.
The convolution is the same, you're just using different dimensions and then using the same dimension label for the operation. You're comparing apples to oranges.
Do you do your animations with blender?
Yes, I do! I'm uploading some behind-the-scenes content to my Patreon if you want to see more of my workflow.
I want to know @@animatedai
Really great explanation and visualization, like all of your videos. I would just disagree with the conclusion that image processing and NN convolutions are fundamentally different. The only difference is the kernel size of 3x3x3 vs 3x3x1. The way you separate the RGB channels during the 2D convolutions just complicates the process. It would be clearer to keep the image as a single 3d array (width x height x channel) and sweep the kernels (both the 3x3x1 and the 3x3x3) simply *through every possible position*. Each possible placement results in one corresponding output value. Then it could be recognized that increasing the kernel size (from 3x3x1 to 3x3x3) just combinatorically reduces the number of possible positions at which it can be placed into the image array (and thereby the number of output values/size of output dimensions) and the conclusion would be that both kinds of convolutions are exactly the same.
More abstractly, any convolution can be thought of as working between two arrays/tensors/signals of the same dimensionality but (optionally) different sizes along each dimension. For example, applying 10 kernels of size 5x5x3 at once to an image of size 10x10x3 would be the same as applying a single 5x5x3x10 kernel to a 10x10x3x1 image, as long as the dimensions are ordered correctly to match up. The result would be an array of size 6x6x1x10. The output size can be determined along each dimension separately: (10-5+1)x(10-5+1)x(3-3+1)x(10-1+1), as explained in your other videos. The same works for higher dimensions. A kernel for video processing could be of size 10x10x3x20 to span 10x10 pixels vertically/horizontally, 3 color channels, and 20 frames in time. The video might have a spatial resolution of 720x1280, consist of 4 channels (RGBA), and be 500 frames long, resulting in an output of size 711x1271x2x481.
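The per-dimension size rule used in this comment (input size minus kernel size plus one, with no padding and stride 1) is easy to check with a short sketch. The function name `valid_output_shape` is made up for illustration.

```python
def valid_output_shape(input_shape, kernel_shape):
    """Output size per dimension for a no-padding, stride-1 convolution."""
    if len(input_shape) != len(kernel_shape):
        raise ValueError("input and kernel must have the same dimensionality")
    return tuple(n - k + 1 for n, k in zip(input_shape, kernel_shape))


# One 5x5x3 kernel on a 10x10x3 image.
print(valid_output_shape((10, 10, 3), (5, 5, 3)))

# The video example: 720x1280 pixels, 4 channels, 500 frames,
# with a 10x10x3x20 kernel.
print(valid_output_shape((720, 1280, 4, 500), (10, 10, 3, 20)))
```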
A linear colorspace conversion (weighted sum of channels) would be an example of a simple pointwise convolution in classical image processing.
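As a small illustration of that last point, here is a sketch of an RGB-to-grayscale conversion written as a pointwise (1x1) convolution. The luma weights are the common Rec. 601 values, picked here only as an example, and `pointwise_conv` is a hypothetical helper.

```python
import numpy as np

# Rec. 601 luma weights, used here only as an example conversion.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def pointwise_conv(image, weights):
    """1x1 convolution: at every pixel, take weighted sums of the channels.
    image is H x W x C_in, weights is C_in x C_out."""
    return image @ weights


rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))                            # stand-in RGB image
gray = pointwise_conv(image, LUMA_WEIGHTS.reshape(3, 1)) # one output channel

print(gray.shape)
```

A full 3x3 colour matrix (for example RGB to another linear colour space) would be the same call with a 3 x 3 weight array, giving three output channels.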
I normally don't reply to comments that don't contain a direct question, but I think your generalization is especially interesting.
I don't think we actually disagree on anything here. I agree that, theoretically, you could certainly frame the convolution operations in that kind of a structure.
My video comes from a practical application perspective. And I think you'll agree that, in practice, both disciplines (neural networks and image processing) would refer to your generalization of sliding a 3D filter over a 3D input at every possible position as "3D" convolution, not 2D convolution. My video is just pointing out that in the current state of the fields there's a fundamental conceptual difference, both in the research papers and code libraries, between the two disciplines when they refer to "2D convolution". In other words, the two disciplines currently have a different mental model that they apply when they say "2D convolution".
@@animatedai Thanks for your reply. I agree that the "2D" label is a strong source of confusion. There might be an analogy to the column space, row space, and null space from linear algebra, in that the dimensionality implied by calling a convolution "2D" could mean either the dimensions of the kernel, or of the image, or of the difference between the size of the image and kernel.
Anyway, I really love all of your videos. I used them as inspiration for some of my own visualizations (see my channel). Have you ever thought about publishing your visualizations as interactive WebGL widgets so that one can interactively configure the kernel size/count and manually move the kernel around?
Yes, I have thought about making an interactive visualization tool. Right now, I render everything using Blender's raytracing (cycles render engine) which is much too slow for real-time rendering. So the output of the webapp wouldn't look as fancy as my videos but could potentially be more useful for learning/teaching. I added the webapp as an option in a poll on my Patreon for what I should do next.
I think that hearing that I was the inspiration for some of your videos is the best compliment I've ever gotten! So thank you! @3blue1brown was my personal inspiration.
This man is not just a man. He is an angel sent personally by God in his chair of light. God spoke and he said: "Send forth a man. He will be the light in the darkness, the path for the lost. The sight for the blind" and so this channel was born
What I mean is thanks for the good content
Lol grayscale is a real thing still! Medical and microscopy imaging
hi