If I didn't already know what was going on, I'd be supremely confused by this explanation at 3:31...
The channel @AnimatedAI has a great explanation on 1x1 convolutions.
The way I think about it is: you've made a previous layer with a bunch of filters. So maybe you have one filter detecting vertical edges, another doing horizontal edges, another doing angled edges, another detecting red-to-black transitions, another for yellow colors, etc. Now you've got a stack of those images, and you want to go through each pixel and combine the results of those filters with some weights. So if you want to get complete edge detections, you might add the horizontal edge channel + vertical edge channel + diagonal edge channel (and not include the yellow channel results or the red-to-black channel). That's what the 1x1 convolution is doing: mixing the results of the various filters.
Maybe you have a second 1x1 filter that is trying to isolate yellow objects next to red objects (like mustard bottles next to ketchup bottles, idk). Then that second filter might heavily weight the yellow channel pixels and the red-to-black channel pixels but ignore the other channels.
You inherently need mixing like this if you want to eventually get to "detect a dog's face".
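Here's a minimal NumPy sketch of that mixing idea (the five-channel lineup is hypothetical, just mirroring the example above):

```python
import numpy as np

# Hypothetical previous-layer output: 6x6 feature maps from 5 filters
# (vertical edges, horizontal edges, diagonal edges, red-to-black, yellow).
h, w, c_in = 6, 6, 5
feature_maps = np.random.rand(h, w, c_in)

# A 1x1 convolution is just a weighted sum across channels at each pixel.
# One filter: combine the three edge channels, ignore the color channels.
edge_mixer = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
all_edges = feature_maps @ edge_mixer        # shape (6, 6): one "all edges" map

# A second filter weighting the yellow and red-to-black channels instead.
yellow_near_red = np.array([0.0, 0.0, 0.0, 0.8, 0.9])

# Stacking the filters as columns gives a (c_in, c_out) matrix, so a 1x1
# conv layer is really a per-pixel matrix multiply over the channel axis.
filters = np.stack([edge_mixer, yellow_near_red], axis=1)   # (5, 2)
mixed = feature_maps @ filters               # shape (6, 6, 2)
print(all_edges.shape, mixed.shape)          # (6, 6) (6, 6, 2)
```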
For the example at 6:12, with the 32 1x1 filters: even though the channel number dropped down to 32, wouldn't each output channel be almost identical, differing only by the value of its 1x1 filter? Is this correct? What is a typical use case for this example?
Isn't it just a regular filter with 1x1 dimensions that is used not for edge detection but to change the channel dimension or add non-linearity?
Is there information sharing happening across the channels in this case?
Curious: if we use filters of some other dimension but fewer of them, won't that also reduce the resulting channel count?
yes.
When we multiply a 1x1x32 filter with the 6x6x32 input, we get a number for each of the 32 channels; we then take the sum over them and apply the ReLU function to it. Am I right?
Yes
@sammathew243 And ReLU will output the number if it's greater than zero, and 0 if the number is less than or equal to 0, right?
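For anyone following this thread, a tiny NumPy sketch of the computation at one pixel (random values, just to show the multiply-sum-ReLU steps):

```python
import numpy as np

# One position in a 6x6x32 input: a vector of 32 channel values.
pixel = np.random.randn(32)
# One 1x1x32 filter: 32 weights, one per input channel.
filt = np.random.randn(32)

# Elementwise multiply over the 32 channels, then sum...
z = np.sum(pixel * filt)
# ...then ReLU: keep the number if positive, otherwise output 0.
a = max(z, 0.0)
print(z, a)
```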
I'm not quite sure what he means when he says the output is the # of filters. Isn't the output of one of those 1 x 1 x 32 (in this case) filters just a single real number?
Maybe it's that a channel (R, G, B) and the number of filters are different things.
The output after applying the filters = (n - f + 1) x (n - f + 1) x # of filters
n = input dimension, which is 6 x 6, so n = 6
f = filter dimension, which is 1 x 1, so f = 1
# of filters = 32
So the final output after applying all filters will be:
(6 - 1 + 1) x (6 - 1 + 1) x 32 = 6 x 6 x 32
The formula n - f + 1 works when stride = 1; watch: ruclips.net/video/smHa2442Ah4/видео.html
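The same arithmetic as a small Python helper, assuming stride 1 and no padding:

```python
def conv_output_shape(n, f, num_filters):
    """Output of a valid convolution with stride 1: (n-f+1) x (n-f+1) x num_filters."""
    out = n - f + 1
    return (out, out, num_filters)

# 6x6x32 input, 32 filters of size 1x1 -> 6x6x32 output.
print(conv_output_shape(n=6, f=1, num_filters=32))   # (6, 6, 32)
```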
Yes, the output of each filter at each location is a single real number. Therefore for one 1x1 filter, the output over all locations in the input volume is an image with depth = 1. Usually we have multiple filters, and hence the output depth is equal to the number of filters.
(1 x 1 x 32) is the filter volume, and there are n of them, i.e. n x (1 x 1 x 32), where n = number of filters. Each (1 x 1 x 32) filter gives a scalar value at each pixel position across the 32 input channels, and n becomes the number of channels in the output.
This is old but I will answer for future viewers: in my understanding the # of filters is not necessarily 32. The number of filters is the number of different 1x1x32 filters you apply, so if you apply Z filters of size 1x1x32, you get a 6x6xZ output.
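A quick PyTorch shape check of that point, with a hypothetical Z = 16:

```python
import torch
import torch.nn as nn

Z = 16  # hypothetical number of 1x1x32 filters
x = torch.randn(1, 32, 6, 6)   # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=32, out_channels=Z, kernel_size=1)
print(conv(x).shape)           # torch.Size([1, 16, 6, 6]) -> 6x6xZ
```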
Is this also the case for normal-sized filters? Filters aren't applied in 2D, per channel, but in 3D, over all the channels at once?
As far as I understood, yeah. But I think sometimes (especially at the beginning of the network) the filter is shared over the three RGB channels. That is, instead of, for example, a 3x3x3 filter, you only have a 3x3x1 filter and the parameters are shared. However, this is a trick, and in general the filters are applied in 3D.
Yes. Every filter in a CNN will have 3 dimensions (height, width, depth), with depth being equal to the depth of the input feature maps.
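You can confirm this from PyTorch's weight shapes; each filter spans the full input depth:

```python
import torch.nn as nn

# A "3x3" conv on a 32-channel input: each filter is really 3x3x32.
conv = nn.Conv2d(in_channels=32, out_channels=8, kernel_size=3)
print(conv.weight.shape)   # torch.Size([8, 32, 3, 3]): 8 filters, each 32x3x3
```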
I don't see the point in what he said about 1x1 convolutions reducing channels. For any convolution filter, 3x3, 5x5, or any size, the output channels are always determined by the number of filters, not by the filter size. So if you have 192 input channels and you use 32 filters of size 3x3, that will also reduce the channel dimension to 32, just like using 32 1x1 filters. So why decouple reducing height/width and reducing channels? Filters of any size do both at the same time anyway.
Okay, one use case is explained in the next video, on the Inception motivation.
In what situations is it useful? Can you please provide some case study/example?
To reduce the number of channels.
1. GoogLeNet Inception network
2. ResNet when they get to more than 50 layers
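For the Inception case, a rough PyTorch sketch of why the 1x1 "bottleneck" helps (the channel sizes loosely follow the lecture's 192-channel example; the 64-channel output is a hypothetical choice):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Direct 5x5 conv on 192 channels vs. a 1x1 bottleneck down to 32 first.
direct = nn.Conv2d(192, 64, kernel_size=5, padding=2)
bottleneck = nn.Sequential(
    nn.Conv2d(192, 32, kernel_size=1),            # shrink channels cheaply
    nn.Conv2d(32, 64, kernel_size=5, padding=2),  # then do the spatial conv
)
print(n_params(direct), n_params(bottleneck))     # ~307k vs ~57k parameters
```

Same output channel count either way, but the 1x1 layer makes the expensive spatial convolution operate on far fewer channels, which is the point of decoupling the two reductions.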
Does the yellow block of size 1x1x32 have the same number across all 32 voxels?
No. It can have 32 different weights.
Thanks
good explanation
Great 👍🏼
Can a Siamese network be built on top of 1x1 convolutions if we have precomputed 1-D features?
You can use an Inception network, which uses 1x1 convolutions, for computing the embedding for the Siamese network. Siamese (which means same or similar) networks use a contrastive loss / triplet loss at the final layer to optimize so that similar vectors end up at a smaller distance than dissimilar vectors, by a certain margin.
@sandyz1000 When should one use it, and when not?
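A minimal sketch of that kind of setup, assuming a hypothetical 1x1-conv embedding head and PyTorch's built-in triplet loss (all sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical embedding head: a 1x1 conv mixes channels, pooling gives a vector.
embed = nn.Sequential(
    nn.Conv2d(192, 128, kernel_size=1),  # channel mixing, no spatial context
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),             # average over all spatial positions
    nn.Flatten(),                        # -> 128-d embedding per image
)

anchor, positive, negative = (torch.randn(4, 192, 6, 6) for _ in range(3))
loss = nn.TripletMarginLoss(margin=1.0)(embed(anchor), embed(positive), embed(negative))
print(loss.item())
```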
thanks (⌐■_■)
shrink!
Scary movie