Would've loved an example image for the pixel shuffle there too, to really grasp what is happening.
Was just about to leave a comment to say this! Was waiting for some example images, would be great to keep in mind for future videos!
@ziggycross I wanted some images too. I didn't fully understand what the output is going to be with pixel shuffle.
Edit: grammar will always be difficult for me
It's not working on actual pixels. The 'depth', i.e. the input to the shuffle, is the set of feature maps generated from the low-res image, and it's only at this last stage that the image is upsampled. This is in contrast to older methods that would upsample the image straight away and then try to process that into the super-resolution output, which was both less efficient and potentially introduced the artifacts mentioned in the video. For more information, see the paper referenced in the video: "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network" by Shi et al.
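A rough PyTorch sketch of that layout, loosely following the three-layer network from the Shi et al. paper (treat the exact kernel sizes and channel counts as illustrative): every convolution runs at the low resolution, and only the final PixelShuffle produces the high-res image.

import torch
import torch.nn as nn

r = 3  # upscale factor

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.Tanh(),   # feature extraction at low res
    nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.Tanh(),
    nn.Conv2d(32, 3 * r * r, kernel_size=3, padding=1),      # 3*r^2 feature maps per low-res pixel
    nn.PixelShuffle(r),                                       # (N, 3*r^2, H, W) -> (N, 3, H*r, W*r)
)

lr = torch.randn(1, 3, 64, 64)   # low-res input
print(model(lr).shape)           # torch.Size([1, 3, 192, 192])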
@djmips - now I understand, thanks.
I have to say, that's really awesome! Especially the hint that transposed convolution is just the gradient computation of convolution w.r.t. its inputs. I regularly contribute to the backends of deep learning frameworks in the Julia programming language, and transposed convolution (or deconvolution, or, to say it in a freaky way, fractionally strided convolution) is really just a call to the function that calculates the adjoint (gradient) of a normal convolution (except for output_padding, but that only affects the size calculation anyway).
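You can verify that equivalence numerically in PyTorch (the shapes here are arbitrary):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
w = torch.randn(4, 3, 3, 3)

y = F.conv2d(x, w, stride=2, padding=1)   # (1, 4, 4, 4)
g = torch.randn_like(y)                   # pretend upstream gradient
(y * g).sum().backward()                  # dL/dx via autograd

# the same gradient, computed as a transposed convolution of g with the same kernel;
# output_padding only resolves the ambiguity in the output size
via_transpose = F.conv_transpose2d(g, w, stride=2, padding=1, output_padding=1)

print(torch.allclose(x.grad, via_transpose, atol=1e-5))   # True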
Thanks. I was already using that for quite some time in my super-resolution upscaler. Downside of the TensorFlow implementation, as far as I know: you can only use square factors, but it would make sense to also do it in just one dimension, or more generally with a rectangular factor. Some work to be done there...
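The rectangular (or 1-D) version is just a reshape and permute, though. A PyTorch-style sketch of a hypothetical pixel_shuffle_rect helper (not an existing library function):

import torch

def pixel_shuffle_rect(x, rh, rw):
    # rearrange (N, C*rh*rw, H, W) -> (N, C, H*rh, W*rw); rw=1 gives a purely 1-D shuffle
    n, c, h, w = x.shape
    c_out = c // (rh * rw)
    x = x.view(n, c_out, rh, rw, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3)            # (N, C, H, rh, W, rw)
    return x.reshape(n, c_out, h * rh, w * rw)

x = torch.randn(2, 12, 10, 10)
print(pixel_shuffle_rect(x, 3, 2).shape)   # torch.Size([2, 2, 30, 20])
print(pixel_shuffle_rect(x, 4, 1).shape)   # torch.Size([2, 3, 40, 10])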
Beautiful work as always
This made it make a ton of sense. But one problem: pixel shuffle does not get rid of the artifacts, it introduces its own artifacts.
One of the best channels! I wish you'd cover more topics than only CNNs, but I guess you can't be a top pro in every topic. I def subbed and wish you had way more videos already. But I can see that it takes a lot of time and effort, so I will wait. Thank you so much for this work ❤
Great content. Thank you!
👍 You make awesome illustrations ❤ Can you explain Transformer encoding and inference?
That would be a big hit also. 👏
this video should have way more likes...
This is really cool! 😄 Thanks for the information.
Loved the animation thank you!!
Great series! Keep it up :)
thanks for your effort
Would be nice to have a video about TensorTrain technique
Hi, isn't this virtually the same effect as a stride-2, 2x2 transposed convolution, with the output channel count just being 4 times smaller? It's a convolutional filter with some binary weights that causes each pixel's channel to be mapped to some new channel. The aforementioned transposed convolution would be the same if you just had a linear layer before the pixel shuffle.
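That equivalence is easy to check numerically: a 1x1 convolution (the "linear layer") followed by pixel shuffle matches a single stride-2, 2x2 transposed convolution whose kernel is just the 1x1 weights rearranged. The shapes below are arbitrary:

import torch
import torch.nn.functional as F

n, c_in, c_out, r = 1, 8, 3, 2
x = torch.randn(n, c_in, 5, 7)
w1x1 = torch.randn(c_out * r * r, c_in, 1, 1)          # the "linear layer" weights

# path A: 1x1 conv to c_out*r^2 channels, then pixel shuffle
a = F.pixel_shuffle(F.conv2d(x, w1x1), r)

# path B: one stride-2, 2x2 transposed conv with the same weights rearranged
wt = w1x1.reshape(c_out, r, r, c_in).permute(3, 0, 1, 2).contiguous()
b = F.conv_transpose2d(x, wt, stride=r)

print(torch.allclose(a, b, atol=1e-5))                  # True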
Do you have a paper or resource about the artifacts in the gradient when using strided 3x3 convs?
If you accept that a transposed convolution (kernel size 3, stride 2) produces gridding artifacts in the output image, then by definition a standard convolution (kernel size 3, stride 2) produces gridding artifacts in the gradient of its input. The reason is that transposed convolution is implemented as a literal call to the gradient function of standard convolution in TensorFlow and PyTorch.
I learned this at some point while studying the papers and code of the StyleGAN saga (nvlabs.github.io/stylegan2/versions.html). I wish I could narrow it down more for you if you're trying to cite this; I have a feeling I learned it from reading their code or one of their references. You'll notice that in all the versions of their code, they go out of their way to implement downsampling as a blur -> convolution rather than just a plain strided convolution. StyleGAN3 is all about aliasing.
It's probably because some input pixels overlap the convolutional filter only once (the ones in the centers), some overlap it 2 times (the ones on the sides but not the corners), and some overlap it 4 times (the ones in the corners). I wonder if using ConvNeXt's 2x2 convolutional layers still results in this sort of gradient artifact.
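Those overlap counts are easy to see directly: push a uniform gradient through a stride-2 conv and inspect the input gradient. The 2x2 stride-2 case (the kind of downsampling layer ConvNeXt uses) is included for comparison, though whether that removes the artifacts in a real trained network is only a guess:

import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 8, 8, requires_grad=True)

# stride-2, 3x3 conv: input pixels are covered by 1, 2, or 4 kernel placements
F.conv2d(x, torch.ones(1, 1, 3, 3), stride=2, padding=1).sum().backward()
print(x.grad[0, 0])   # values 1, 2 and 4 in a grid pattern -> gridding in the gradient

# stride-2, 2x2 conv: every input pixel is covered exactly once
x.grad = None
F.conv2d(x, torch.ones(1, 1, 2, 2), stride=2).sum().backward()
print(x.grad[0, 0])   # all ones -> uniform gradient, no gridding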
Can you explain how you would pixel_unshuffle if the resolution is 4000x3000 (WxH) and the downscale_factor is 16?
There's zero explanation about how this would work with real images.
It's not working on actual pixels. The 'depth', i.e. the input to the shuffle, is the set of feature maps generated from the low-res image, and it's only at this last stage that the image is upsampled. This is in contrast to older methods that would upsample the image straight away and then try to process that into the super-resolution output, which was both less efficient and potentially introduced the artifacts mentioned in the video. For more information, see the paper referenced in the video: "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network" by Shi et al.
But why is it necessary to do pixel shuffle? Why can't we just output an rH x rW x 3 matrix directly?
Reminds me a bit of sub-pixel interpolation.
hi
jif
yes
Super cool. Waiting for Transformers and BN/LN.