Can you look at your last but one fully connected layer and calculate the typical "distance" between different digits? E.g. just Euclidean distance on the normalized terms in FC1. Would those distances depend on the neural network you're using, or would they be similar across all successful neural networks? That is, could you say something like a 1 and a 7 are typically closer than a 0 and a 4?
That is an excellent question! Unfortunately it requires at least a moderate amount of knowledge in the subject matter to answer, so I doubt you'll be getting a satisfactory response from this resource any time soon.
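The measurement asked about above can be sketched in a few lines. This is only an illustration: the `fc1_mean` vectors are made-up placeholders standing in for averaged FC1 activations, not values from any real network, and the helper names are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Placeholder data: invented 4-value "activations" per digit, for illustration only.
# A real experiment would average the FC1 output over many test images per digit.
fc1_mean = {
    "1": [0.9, 0.1, 0.0, 0.3],
    "7": [0.8, 0.2, 0.1, 0.4],
    "0": [0.0, 0.9, 0.7, 0.1],
}

units = {digit: l2_normalize(v) for digit, v in fc1_mean.items()}
d_17 = euclidean(units["1"], units["7"])
d_10 = euclidean(units["1"], units["0"])
print(f"1 vs 7: {d_17:.3f}   1 vs 0: {d_10:.3f}")
```

Whether such distances come out similar across different trained networks is exactly the open part of the question; the sketch only shows how one would measure them.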
Can someone explain how the final convolutional layer is 4x4x50? My understanding based on the previous Neural Network video is that the first convolution will produce an output of 24x24x20, but then wouldn't the next convolution, which has 20 kernels, produce 20 images of the first image layer of the 20 produced from the first convolution, and then another 20 on the second image layer of the 20 produced from the first convolution, such that at the end of the second layer you'd have a 20x20x400 output, and so forth until at the end you'd have 4x4x(some large number) not 4x4x50?
You decide the depth on each layer. So the first layer will have 20 different 4x4x1 kernels but the second layer will have 20 different 4x4x20. Then after that he uses 50 kernels of 4x4x20 and then 50 kernels of 4x4x50 until the last layer before the fully connected network
A thought that I've gotten when thinking about this and the previous episode, would it be possible to "reverse" the order of the convolutional neural network, getting a sort of idealized result, probably not extremely useful in most cases, but likely somewhat usable for seeing what extra data can be used to train it for more accurate results or perhaps some sort of data generation. Doing the same for a standard neural network would not result in any useable data I know, but it seems like it might be possible with the convolutional one.
This is single digit recognition, multi digit/character recognition is a whole 'nother can of worms. When there are no activations on the output layer you know it is not a digit.
+Francois Molinier That's not how it works. The grayscale pixels represent the probability of it being a given number. If you input NaN you will probably see a few bright grays or a couple of almost whites.
So would it be possible to use convolutional neural networks for something general like arbitrary image matching or are they limited to narrowly trained applications like the one here?
Why do you need a GitHub page? He just literally explained the full architecture of his built CNN (Convolutional Neural Network). Now, if you want to test this for yourself, you can easily implement all he said. Just find a programming language that is supported by the libraries to complete the task. He even mentioned what he used, but you could also look at Lua with Torch for example. All the libraries that he mentioned are already there, so you won't need to code any of the layers, just use them.
It's a really good lecture to understand what is going on inside an NN. I am using an NN for target classification in thermal images. Is an NN a good approach for that, or should I go for another option?
Is there any chance you could upload a copy of the source code for the CNN somewhere? (or even pseudo code) I am sure many people would greatly appreciate it :D
I am curious, would it be possible to run this sort of neural network in reverse in order to produce the sort of "Deep Dream" images that you can see on the Internet? For instance, instead of asking the network 'what digit does this image resemble?', ask 'what does a 2 look like?'
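One common way to ask "what does a 2 look like?" is to keep the network fixed and adjust the input image so that the score for that class goes up. The sketch below assumes a drastically simplified stand-in for the network: a single linear scorer with hypothetical weights `w_two` over a four-pixel "image". A real CNN would need the gradient computed by backpropagation, but the loop has the same shape.

```python
def class_score(weights, image):
    """Score of one class under a toy linear model: sum of weight * pixel."""
    return sum(w * x for w, x in zip(weights, image))

def maximize_input(weights, steps=20, lr=0.1):
    """Gradient ascent on the input, starting from mid-gray.

    For this linear scorer the gradient of the score with respect to
    pixel i is simply weights[i]. Pixels are clipped to the range [0, 1].
    """
    image = [0.5] * len(weights)
    for _ in range(steps):
        image = [min(1.0, max(0.0, x + lr * w)) for x, w in zip(image, weights)]
    return image

# Hypothetical weights for the class "2" (not taken from any trained model).
w_two = [1.0, -1.0, 0.5, -0.5]
ideal = maximize_input(w_two)
print(ideal, class_score(w_two, ideal))
```

The result is the input this toy model "likes" most, which is not necessarily something a person would read as a 2.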
As I understand it, "convolution" in this context just means that you apply some sort of filter (or function) to your data and then run another set of filters on the results. Filter can mean anything here. For this task he ran some filter on the image, from the looks of it a Sobel, which highlights edges that have a certain angle. After doing this a few times in a row with different filters you get those 4x4 images, which are brighter the more edges at specific angles were in the picture. I guess I'm trying to tell you that convolution is not the filter, but rather the method of generating feature-specific data from your original data.
Convolution is a very specific operation between two things: an image and a kernel. By choosing different kernels, the result will vary: you may start detecting horizontal edges, thin diagonal lines, dark spots...
A convolution is a function that takes a grid of values (an image) and returns another image. It works by replacing each pixel in the image with a function of the pixels around it, so a "blur" kernel will replace each pixel with the average value of a grid of pixels centred around it. Other options exist: there are convolutions that do many different things, like making diagonal edges sharper, or removing small spots and such. Each convolution produces another image, meaning you can easily chain them together, much like you can blur an image in Photoshop (which uses a convolution) and then call edge detection (also a convolution) on the result. This channel has a video on convolutions that explains it better than I can.
TheHDreality I already watched that episode, but they only talked about a blur and an edge detector; they didn't speak about how it can be modified with parameters at all. But I think I understand now that it just gives every pixel in the grid a weight and comes up with a total, which it divides by a number.
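The "weight per pixel, total, divide by a number" description can be written out directly. The sketch below is a minimal version under a few assumptions: no padding (so the output is smaller than the input), no kernel flip (the usual convention in neural network libraries, strictly a cross-correlation), and a hypothetical function name `convolve2d`.

```python
def convolve2d(image, kernel, divisor=1.0):
    """Slide the kernel over the image; each output pixel is the weighted
    sum of the pixels under the kernel, divided by `divisor`."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = sum(kernel[i][j] * image[r + i][c + j]
                        for i in range(kh) for j in range(kw))
            row.append(total / divisor)
        output.append(row)
    return output

# A 4x4 image with a vertical edge: dark on the left, bright on the right.
image = [[0, 0, 9, 9] for _ in range(4)]

box_blur = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # average of a 3x3 neighbourhood
edge = [[-1, 0, 1]]                             # responds to left-to-right brightening

blurred = convolve2d(image, box_blur, divisor=9)
edges = convolve2d(image, edge)
print(blurred)  # [[3.0, 6.0], [3.0, 6.0]]
print(edges)    # [[9.0, 9.0], [9.0, 9.0], [9.0, 9.0], [9.0, 9.0]]
```

Only the numbers in the kernel differ between the blur and the edge detector; those numbers are the "parameters".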
Florian H. As a math student, I know of function convolution, which is a binary operator very similar to what is shown here. But you're saying there exists more general convolution, usable on databases. This seems interesting. Could you please give me a link to a reliable source?
It’s like a David Lynch movie to me: I almost think I understood it and then everything just becomes a convoluted mess and I feel dumber than before...
Based on PAL (25/50/100 Hz), whereas NA is based on NTSC (30/60/120 Hz). There is no technical reason for it anymore, but it was originally because it was clocked off the AC power grid, which ran at 60 Hz in NA and 50 Hz in Europe.
Hey I have a question. After the first conv layer, we are left with 20 images of 24*24 pixels. Do these 20 images transform into one 24*24 sized image, to be given as an input to the next conv. layer?
No, after the first conv layer you have a volume (24*24*20), and it is the input to the next conv layer, whose kernels have size (5*5*20). If you apply one such kernel to that input volume you'll get one image of size 20*20, and because you have 50 filters of (5*5*20), your output will be 20*20*50.
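The sizes quoted in this thread follow from simple arithmetic: an unpadded convolution with a k×k kernel shrinks each side by k - 1, and the output depth equals the number of kernels. The sketch below walks a stack through that arithmetic. The 28x28 input and the 2x2 pooling steps are assumptions (a LeNet-style layout), chosen because they are one arrangement that goes from the 24x24x20 mentioned above to the 4x4x50 mentioned elsewhere in the comments; the thread itself does not spell out the pooling.

```python
def conv_out(size, kernel, stride=1):
    """Side length after an unpadded convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window):
    """Side length after non-overlapping pooling."""
    return size // window

# (kind, kernel-or-window size, number of kernels). Assumed layout, see note above.
layers = [("conv", 5, 20), ("pool", 2, None), ("conv", 5, 50), ("pool", 2, None)]

trace = [(28, 28, 1)]  # assumed 28x28 single-channel input
for kind, size, count in layers:
    height, width, depth = trace[-1]
    if kind == "conv":
        trace.append((conv_out(height, size), conv_out(width, size), count))
    else:
        trace.append((pool_out(height, size), pool_out(width, size), depth))

for shape in trace:
    print(shape)
```

Without a pooling step in between, a second 5x5 convolution applied to 24x24 gives `conv_out(24, 5) == 20`, which is the 20*20 figure in the reply above.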
Maybe I didn't get the idea, but why are there 10, and not 11, classes for numbers? Because if I give an image of "A" to this network, it will probably tell me "this is 1" or "this is 4" instead of giving a negative answer like "it's none of the 10 numbers".
Дмитрий Сулин Because what you are expecting is that one, and only one, of the ten output nodes would have a very high confidence, say 0.9 or 0.95 out of 1.0. The other nodes would be near 0. If the input image didn't match any number 0 to 9, then all the nodes would output a low or very low confidence value.
Since the end layers are fully connected, it will have to choose randomly from any of the 10 output classes. So it will just wrongly output some number, and it won't have any output like "No Number" or something like that. You could, however, extend the network in smart ways, like adding another class. The eleventh class would then have the meaning of "NaN" (Not a Number), but you would of course have to label additional samples as "NaN" for training.
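The two replies above can be reconciled with a small sketch. It assumes the last layer is a softmax (another comment in this thread guesses as much), in which case the ten outputs always sum to 1, so "all nodes low" in practice means no node stands out. A confidence threshold then gives a "not a digit" answer without an eleventh class. The function names and the 0.9 threshold are illustrative choices, not taken from the video.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, threshold=0.9):
    """Return the winning class index, or None if no class is confident enough."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

confident = [0.0] * 10
confident[4] = 8.0            # one output clearly dominates
ambiguous = [1.0] * 10        # nothing stands out, e.g. for a letter "A"

print(classify(confident))    # 4
print(classify(ambiguous))    # None
```

A threshold is only a heuristic: a network can still be confidently wrong on inputs unlike anything it was trained on, which is the argument for training an explicit extra class.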
Dr Pound is the best lecturer here. Very clear, intelligently funny, interesting topics.
Would deserve his own channel
The pictures he printed of the layers helped me grasp the concept so much better than other videos, so thank you
Me too
Computerphile, you single handedly helped me regain my interest with computer science.
Thank you very much for all your videos (:
This is the best explanation of what is going on inside a neural net! Now I can imagine it more clearly
Thanks a lot!
I've been studying neural networks for the last couple of months and haven't come across any resource that explains it with this perfection. You have made it so easy with the visualization.
I'd really appreciate more videos on topics like RNN, how to set number of layers, filters etc (hyperparameters).
So useful. As a CS student, this was more helpful than a ton of other DLNN stuff I've seen online. Thank you!
Loving these videos with Dr. Pound, keep it up!
It'd be really interesting to take a network trained to detect random objects as seen by a camera, then give it the live feed from a camera and watch the activation of each neuron in real time as the object moves about in the camera's view, or as the camera rotates around the object, etc. I guess the earlier layers would change a lot, while the deeper layers (which have a better idea of what constitutes an object) would change less.
Projects like this have been done, but only in the sense that they usually just output the most probable class(es) because that's usually the only real way to deal with the amount of information. For modern networks you should be able to visualize activations of a single layer in real-time, but the number of pixels you'd need for a given layer can range from thousands to millions. So doable, but probably not easy to visually parse just by looking at it.
Consider looking into visualization approaches (like saliency/heat maps/deconvolutional neural networks) and approaches that focus on maximal activation (like Google's DeepDream).
SomethingUnreal I'd imagine if it was programmed properly and trained long enough, it may look similar to an fMRI.
Massively interesting and well presented, even for my aging neural network!
In the last video I asked how the images were in these various convolutions. I knew that they would be nothing like the input image, but I was very curious to see the process anyway.
And now you make a video answering exactly what I wanted! Thank you so much! :)
what a fantastic explanation, I loved the digits convolution representation
hope to see more videos about this!
(RNNs)
This guy is my second favourite on computerphile. Lovin these demos
Would love to have someone like him as my professor in my life!
Oh wow, this video made me understand neurological networks in an insanely deep way. Thank you!
Mike and Rob, the stars of computerphile.
Great content and nice puns.
Keep it up guys
Doesn't Google use those captchas as a crowd-sourced labeling technique for their own deep learning stuff?
There's a Computerphile video on that too somewhere, again a Google project. You get shown two words, and the computer knows one of them and not the other, so when you type the two words in, the computer learns what a word is. That's for transcribing libraries and things... I can't remember which Computerphile video it was though.
They do
I wonder if there’s a website that someone can go to to do the image things to help train the deep learning systems?
Now I see it, too... For some reason, YT gave me the video at a lower resolution (not watching it in full screen mode, I hadn't noticed), and I was thinking "I don't understand all the people complaining about the video being "wobbly", the video looks fine to me"... Then I saw I wasn't watching it at 50 fps, so I changed the quality.
I, too, find it a bit weird. I guess stabilization doesn't quite kick in as hard when there are more frames to interpolate between?
Excellent video! Visually seeing the neurons light up blew me away... It was like looking at an artificial, scaled down brain being imaged...
wow....
I didn't expect to understand any of that, but it was all explained perfectly. It made sense. Awesome video
13:16 What he wants to say is that if the images are segmented then it's much easier. The segmentation problem is hard. That's why Google captchas are all mushed up on each other. Google apparently fixed the segmentation problem by just training it to recognize multiple pairs of letters.
The best tutorial ever! Cheers, Mike!
Seeing that a lot of people are confused by this video being 50fps, I'd want to clear that up. 50fps is a standard frame rate for television and video in general. 60fps is a standard for animated and generated images, like animations, or games. Sure, you can do either with both, but it's generally the case that high-frame-rate TV broadcasts are always 50, not 60 fps.
The scale for TV: 25, 50, 100, 200 Hz
The scale for Computers: 30, 60, 120 Hz
(Hz = fps)
who needs dual monitors when you have dual PC! Great video btw
love this series about machine learning
Very cool!
@5:24 Grayscale is quite a few bits deep; 1-bit depth would be black and white (which is not the case in your images; it looks like you have at least 16 gray levels, if not the standard 256 levels of 8-bit grayscale).
14:36 Google's CAPTCHA API morphs the filter the more samples you get, until the training data is useless. Those image-based captchas are also being broken, after all the success of ImageNet.
Please do a video on the maths of forward and back propagation and how they are implemented
Thank you Mike, and thank you Shaun, this video is really helping me in my quest! I'm making a small game in which I'm trying to make an AI using the TensorFlow library.
How are the outputs of the multiple kernels at each layer managed? Are they somehow merged so that the kernels of the next layer all process the same input? Or do the 20 kernels of layer 2 operate on the 20 outputs of the layer 1 kernels respectively? And if the latter, then what happens when moving from a 20 kernel layer to a 50 kernel layer? Would some of the 20 kernels of the previous layer be duplicated twice, and others duplicated three times to make up the inputs to the 50 kernels in the new layer?
After watching this, the one thing I don't feel is completely explained is where the convolution kernel values come from. At first he says they are things "like Sobel edge detectors", but later says they are not manually entered, but rather learned values. That leaves the obvious question: how are they initialized? Do they start as just matrices with random entries? During the training, how are they adjusted? Is the "training" some kind of iterative search for kernel values that give the strongest response (e.g. the values that most consistently uniquely identify the one digit being learned and most strongly reject the other 9 digits)? I could use a bit more explanation on what the training process looks like and how it adjusts all the kernels.
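A toy version of "start random, then adjust" may help with the question above. The sketch assumes a much simpler setting than the video: one 1D kernel of three weights, a known target kernel `[-1, 0, 1]` to imitate, and plain gradient descent on the squared error. A real CNN adjusts all kernels at once with gradients obtained by backpropagation from a classification loss, but each individual update has the same "nudge the weights against the gradient" form. All names and constants here are illustrative.

```python
import random

def correlate1d(signal, kernel):
    """Slide a kernel along a 1D signal (no padding, no kernel flip)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
signal = [random.uniform(-1, 1) for _ in range(50)]

# The "right answer" the kernel should reproduce: output of a simple edge filter.
target_kernel = [-1.0, 0.0, 1.0]
target = correlate1d(signal, target_kernel)

# Kernel starts as small random numbers, not as a hand-designed filter.
kernel = [random.uniform(-0.1, 0.1) for _ in range(3)]
initial_loss = mse(correlate1d(signal, kernel), target)

learning_rate = 0.1
for _ in range(500):
    output = correlate1d(signal, kernel)
    errors = [o - t for o, t in zip(output, target)]
    # Gradient of the mean squared error with respect to each kernel weight.
    grads = [2 * sum(e * signal[i + j] for i, e in enumerate(errors)) / len(errors)
             for j in range(len(kernel))]
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

final_loss = mse(correlate1d(signal, kernel), target)
print([round(w, 3) for w in kernel], initial_loss, final_loss)
```

Nobody typed the edge filter into `kernel`; it emerged because those weights minimise the error, which is the sense in which learned kernels can end up looking "like Sobel edge detectors".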
Could you maybe link to the actual code? Would be interesting to look at the implementation
I love this man.
Dr Mike Pound, please make a tutorial series on q learning! In depth
Wrote Python convolution algos on bitmaps around this time just to self-learn Python; filters and convolutions are amazing to see in action. It's a little scary to see how far we are now in 2021. Covid hasn't stopped SW engineers.
Fantastic video! Interesting to see "inside the mind" of a neural network
How do you even visualize the output of the NN?
Crazy, this perspective is so insightful.
3:43 but wouldn't it mean that the digit's 2? because we're starting at index 0, and index 0 is 0, so index 2 is 2.
Index 0 is digit 1
Index 1 is digit 2
..
Index 8 is digit 9
Index 9 is digit 0
oh ok.
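The ordering described in the reply above (index 0 is digit 1, and so on, with index 9 wrapping around to digit 0) amounts to a one-line mapping. The function name is hypothetical, and the ordering itself is taken from this thread rather than checked against the network's code.

```python
def index_to_digit(index):
    """Map output-node index 0..9 to the digit it stands for, per the reply above."""
    return (index + 1) % 10

labels = [index_to_digit(i) for i in range(10)]
print(labels)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
```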
I wonder, can you work backwards somewhat to get a general idea of what the original image looked like from the convolution layers?
I don't think so. That would be like asking "what 5 numbers did I multiply to get 3600?" There is only one possibility if you do the multiplication, but many possibilities when you try to guess backwards, and with those convolutions it's the same thing, just exponentially worse.
Basically you drop a lot of information.
Well, I'd say yes, as if you have all the information every layer puts out, you just have to reverse the process the first layer did on the data it gave. Since you have many processes on the one image, there should be much redundancy and therefore a high certainty. If you however only have the output of the sixth convolution layer, I highly doubt that you could get much out of it.
Partially. The problem is that (in general) a convolution is not a reversible operation. However, you can apply something that is known as a "matched filter" which is basically a convolution with the transposed filter kernel. If you go backwards through the network you can (to some degree) reconstruct the input signal. If you look at this paper you can see what the reconstructions look like: arxiv.org/pdf/1311.2901v3.pdf
And just to prevent confusion: The author calls it "Deconvolution". But he isn't doing a "deconvolution" as he describes in his paper. He is applying a "matched filter".
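A 1D sketch of one reading of the "transposed filter" idea: run the forward filter, then push each response back through the same kernel weights onto the input positions it came from. This is an illustration under that assumption, with hypothetical function names; it is not the procedure from the linked paper, which also has to deal with pooling and nonlinearities.

```python
def correlate_valid(signal, kernel):
    """Forward pass: slide the kernel along the signal, no padding."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def transposed(response, kernel, length):
    """Transpose of the forward pass: each response value is spread back
    over the input positions that produced it, weighted by the kernel."""
    out = [0.0] * length
    for i, value in enumerate(response):
        for j, weight in enumerate(kernel):
            out[i + j] += weight * value
    return out

signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # a single spike at index 2
kernel = [1.0, 2.0, 1.0]

response = correlate_valid(signal, kernel)
back = transposed(response, kernel, len(signal))
print(response)  # [1.0, 2.0, 1.0, 0.0]
print(back)      # [1.0, 4.0, 6.0, 4.0, 1.0, 0.0]
```

The result has the right size and peaks where the original spike was, but it is a smeared version rather than the spike itself, which matches the "to some degree" caveat above.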
@5:58 I got the point I was searching for. Thank you very much.
I realize this isn't likely to get a reply this late, but I'm trying to replicate the configuration of this network. What activation function are you using for the first fully connected layer? Is it dotplus with a renormalization? I'm assuming FC2 is a softmax layer, so maybe they are both softmax.
Enjoying the neural net videos. Looks like ANNs are coming back in after not really seeing much of them since the 90s.
I remember my first exposure to the math and theory behind this was an assembly program on my 8-bit C64 back in the late 80s, creating a 3 layer backpropagation network.
Why are there 4x4x50 neurons after the last conv-layer?
I get 4x4x(20^2)x(50^4) neurons, if every 5x5 kernel runs over every image from the previous layer.
I'm confused.. maybe the kernels in the following layers are 3-dimensional? Like 20 5x5x20 kernels in the second layer?
Now I understand. Thanks a lot!
Got it, thank you!
I have the same puzzle. Could you enlighten me?
In short: "Each kernel is 5x5xD, where D is the number of features in the previous layer"
I don't know why their answers are not showing up on YouTube. Maybe a Google+ thing.
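The "5x5xD" answer can be made concrete with a small sketch. Each kernel spans every channel of its input, and all D×5×5 products are summed into a single number per position, so one kernel yields one output map and K kernels yield K maps, never D×K. To keep the pure-Python loops fast this uses a scaled-down 8x8 input with 20 channels and constant placeholder values; the function names are hypothetical.

```python
def conv_layer(volume, kernels):
    """volume: D channels, each H x W. kernels: K kernels, each D x kh x kw.
    Returns K feature maps (one per kernel), with no padding."""
    depth = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    maps = []
    for kernel in kernels:
        assert len(kernel) == depth, "kernel depth must match input depth"
        kh, kw = len(kernel[0]), len(kernel[0][0])
        fmap = [[sum(kernel[d][i][j] * volume[d][r + i][c + j]
                     for d in range(depth)
                     for i in range(kh)
                     for j in range(kw))
                 for c in range(width - kw + 1)]
                for r in range(height - kh + 1)]
        maps.append(fmap)
    return maps

def filled(depth, height, width, value):
    """Build a depth x height x width nested list holding one constant value."""
    return [[[value] * width for _ in range(height)] for _ in range(depth)]

previous_output = filled(20, 8, 8, 1.0)                    # 20 channels from the layer before
next_kernels = [filled(20, 5, 5, 0.01) for _ in range(50)] # 50 kernels, each 5x5x20

next_output = conv_layer(previous_output, next_kernels)
print(len(next_output), len(next_output[0]), len(next_output[0][0]))  # 50 4 4
```

The output depth is 50 because there are 50 kernels; the 20 input channels are consumed inside each sum.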
As always, a splendid video! However, every single clip taken from the angle where the pictures of the convolutions are visible is out of focus. Pity.
It would be nice, if you talked a bit about how much data is needed for a CNN to be any kind of useful. The datasets in this video seem extremely big. Specifically it would be nice to have an idea on how well it works on many "categories" with a low amount of data.
I never knew YT even supported 50FPS. :O
Also, cool computer learning. Today is a day of new smarts.
Can you share the Caffe scripts you used please?
If the first convolution layer has 20 filters and the second one has 20, does this mean that each C2 filter processes all 20 images from C1? That would make 400 images for C2 output
Would I be wrong in thinking that if you gave a convolutional neural network the ability to control where to click and what to type, gave it enough convolutions and kernels (perhaps beyond what current computers can handle) and trained it enough, then it would be able to solve any captcha, even a new one with a different interface that still used the same basic principles?
Really interesting! I would be interested to see if it is possible to start from the final convolution and see which image fits it the best, as in 'what looks the most like a 2'.
It would be interesting to know if there can be totally different pictures that would just get the same number, similar to hash collisions.
Well sure, that's basically the same concept (irreversible / one-way transformations giving you an abstract result)
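A minimal illustration of such a "collision" at the level of a single filter: two clearly different 3x3 patches that produce the same response under an averaging kernel. The patches and the function name are made up for the example.

```python
def mean_filter_response(patch):
    """Response of a 3x3 averaging kernel placed exactly over the patch."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)

patch_a = [[9, 0, 0],
           [0, 0, 0],
           [0, 0, 0]]   # one bright corner pixel
patch_b = [[1, 1, 1],
           [1, 1, 1],
           [1, 1, 1]]   # uniform dim gray

print(mean_filter_response(patch_a), mean_filter_response(patch_b))  # 1.0 1.0
```

Nine input values collapse into one output value, so many inputs must share each output. A full network combines many such filters, which narrows the possibilities but does not make the mapping one-to-one.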
I saw somewhere a neural network that was trained to fool convolutional neural networks, sometimes it produced normal images (in this case it would have produced a 2) other times it produced something that looked almost like pure noise but it was still able to fool the networks
How do you replicate the learned connections to other systems? How is the "knowledge" abstracted for transport, backup, and further improvements?
With discrete programming, the instructions are compact and finite and are easily copied.
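On the question above: once training stops, the learned "knowledge" is just the numbers in the kernels and connection weights, so it can be written to a file and copied like any other data. The sketch below uses JSON purely for readability; the layer names and values are invented, and real frameworks generally use their own binary formats rather than this one.

```python
import json

def save_weights(weights):
    """Serialize a dict of layer name -> nested lists of floats to a string."""
    return json.dumps(weights)

def load_weights(text):
    """Rebuild the same dict from its serialized form."""
    return json.loads(text)

# Placeholder values standing in for a trained network's parameters.
weights = {
    "conv1": [[0.5, -0.25], [0.125, 1.0]],
    "fc2_bias": [0.0, 0.1],
}

restored = load_weights(save_weights(weights))
print(restored == weights)  # True
```

Loading the same numbers into the same architecture on another machine reproduces the network's behaviour, and training can resume from them for further improvement.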
Let's see if someone can help me out here. The first layer here outputs 20 24×24 images (or a 20 channel image) after performing all the convolutions. The second layer will output 20 20×20 images. But how are they constructed? How do they combine the 20 channels from the previous layer? I mean, they are not applying all 20 filters to each of the 20 channels, that'd be a 400 channel output. Do they simply add the convolutions for each channel up? So channel 1 of layer 2 is the sum of the convolution between kernel 1 of layer 2 with each of the 20 channels of layer 1?
So how do the neural networks do this? Are there speed advantages to this network vs. just regular processing?
Thank you so much. This was very helpful.
Shouldn't one be able to generate characters (letters, whatever) by going the other way around? I'm thinking, what if you tell it to generate a picture from a fully connected layer?
how were the kernels generated for this one?
I am one of those strange people who draws a horizontal bar through the number 7. How would you deal with that? Would you need a separate set of 7+bar training digits (in effect an 11th character) and then map both 7 and 7+bar back to 7?
Excuse me if I have missed something obvious, but I'm not sure I understand what the input of, say, C2 is. Is it a sort of average of all of the images produced by C1?
With all the edge detection going on, would it be harder to recognize a 4 if some versions had the top parts join at an angle, like the 4 in this font, versus the open version as in the video? Likewise a 7 with or without the strike through it? I mean, does it remember some kind of average of all the objects in a class or all of them / all of the sufficiently different ones (which might be hard for a large database)?
I imagined that each layer uses all its kernels on all the images of the previous layer. But that can't be right, hearing that the last convolutional layer here only outputs in a size of 50*4*4. Does that mean that there essentially are "kernel pipelines"? So kernel0 of layer1 will only be fed with the output of kernel0 of layer0?
amazing, i didnt know u could visualize the high rank features
What's the library/program Dr Mike is using please ?
Big Corporate Top Secret
It's called Caffe.
using caffe in linux
Love the visualization!
people interested in this experiment, you can actually do it in the Machine Learning course (Stanford) on Coursera
Very interesting! I wonder if this gives some insight into how neurones in our brains work on a very basic level?
Points for making me look at my screen with my head turned 90 degrees to the left until I realize I look like a crazy person
it was very enjoyable, thanks for the video.
Can you look at your last but one fully connected layer and calculate the typical "distance" between different digits? E.g. just euclidean distance on the normalized terms in FC1.
Would those distances depend on the neural network you're using, or would they be similar across all successful neural networks? That is, could you say something like "a 1 and a 7 are typically closer than a 0 and a 4"?
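For anyone who wants to try the distance idea above, here is a minimal sketch of one way to do it: normalise each FC1 vector, average per digit, and take Euclidean distances between the averages. The function name, the 500-wide FC1 layer and the random stand-in data are all assumptions for illustration; real activations would have to be exported from a trained network.

```python
import numpy as np

def class_distance_matrix(features, labels, num_classes=10):
    """Euclidean distance between the mean unit-length feature vector of each class."""
    # Scale every feature vector to unit length (guarding against all-zero rows)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    # One mean vector ("centroid") per digit class
    centroids = np.stack([unit[labels == c].mean(axis=0) for c in range(num_classes)])
    # Pairwise differences between centroids, then their lengths
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical stand-in data: random vectors instead of real FC1 activations
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)      # 20 samples per digit
features = rng.normal(size=(200, 500))     # assumed FC1 width of 500
dist = class_distance_matrix(features, labels)
print(dist.shape)
```

Running this on activations from two differently trained networks and comparing the two matrices would be one way to probe whether the distances depend on the network.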
how do you know how many kernels, layers, etc. are best suited for your needs?
That is an excellent question! Unfortunately it requires at least a moderate amount of knowledge in the subject matter to answer, so I doubt you'll be getting a satisfactory response from this resource any time soon.
Can someone explain how the final convolutional layer is 4x4x50? My understanding based on the previous Neural Network video is that the first convolution will produce an output of 24x24x20, but then wouldn't the next convolution, which has 20 kernels, produce 20 images of the first image layer of the 20 produced from the first convolution, and then another 20 on the second image layer of the 20 produced from the first convolution, such that at the end of the second layer you'd have a 20x20x400 output, and so forth until at the end you'd have 4x4x(some large number) not 4x4x50?
You decide the depth on each layer. So the first layer will have 20 different 4x4x1 kernels but the second layer will have 20 different 4x4x20. Then after that he uses 50 kernels of 4x4x20 and then 50 kernels of 4x4x50 until the last layer before the fully connected network
wow. I didnt realize that kernels also got multi-dimensional on the way. thanks
A thought that I've gotten when thinking about this and the previous episode, would it be possible to "reverse" the order of the convolutional neural network, getting a sort of idealized result, probably not extremely useful in most cases, but likely somewhat usable for seeing what extra data can be used to train it for more accurate results or perhaps some sort of data generation.
Doing the same for a standard neural network would not result in any useable data I know, but it seems like it might be possible with the convolutional one.
Would you not have 11 output options? 0-9 and NaD (Not-a-Digit)?
This is single-digit recognition; multi-digit/character recognition is a whole 'nother can of worms. When there are no activations on the output layer, you know it is not a digit.
Francois Molinier that makes sense.
+Francois Molinier That's not how it works. the grayscale pixels represent the probability of it being a given number. if u input NaN u will probably see a few bright grays or a couple of almost whites.
This really clarified the previous video :)
So would it be possible to use convolutional neural networks for something general like arbitrary image matching or are they limited to narrowly trained applications like the one here?
You can. I think google uses a neural net for their "visually similar images" feature.
nice picture.
What if the training images have digits drawn at different scales?
is there a GitHub link to the project, Mike ?
github or it didn't happen ;-) "I call GnuImageManip'Prog"
Why do you need a GitHub page? He literally just explained the full architecture of the CNN (Convolutional Neural Network) he built. Now, if you want to test this for yourself, you can easily implement everything he said. You only need to find a programming language that is supported by the libraries needed to complete the task. He even mentioned what he used, but you could also look at Lua with Torch, for example. All the libraries that he mentioned are already there, so you won't need to code any of the layers, just put them together.
Great Video!
Love the out of focus shots on the pictures...
It's a really good lecture for understanding what is going on inside a NN. I am using a NN for target classification in thermal images. Is a NN a good approach for that? Or should I go for another option?
Is there any chance you could upload a copy of the source code for the CNN somewhere? (or even pseudo code) I am sure many people would greatly appreciate it :D
Is captcha a method to filter out bots, or is it a way to coerce humans into training an AI?
The person who interviews this guy doesn’t ask enough questions.
How do you decide what the convolution kernels should be? Is that important, or could they be defined randomly at the beginning?
Neural network weights are set randomly and then learnt
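To make "set randomly and then learnt" concrete, here is a toy sketch under loud assumptions: a single 3x3 kernel starts as random noise and is nudged by gradient descent until its output matches that of a fixed edge-detection kernel. A real CNN learns its kernels from class labels through backpropagation across all layers; the target kernel, image size and learning rate here are made up for illustration.

```python
import numpy as np

def extract_patches(image, k):
    """Every k x k window of the image, flattened to one row each."""
    h, w = image.shape
    return np.stack([image[i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1) for j in range(w - k + 1)])

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))        # hypothetical training image
patches = extract_patches(image, 3)      # shape (100, 9)

# Assumed "ground truth": the output of a Sobel-style vertical-edge kernel
target_kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float).ravel()
target = patches @ target_kernel

kernel = rng.normal(scale=0.1, size=9)   # random starting weights
lr = 0.05                                # assumed learning rate
for step in range(2000):
    pred = patches @ kernel                              # filter response per window
    grad = 2 * patches.T @ (pred - target) / len(target) # gradient of mean squared error
    kernel -= lr * grad                                  # gradient descent update

print(np.round(kernel.reshape(3, 3), 2))
```

The point is only that the starting values do not need to be designed by hand; the training signal shapes them.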
Thank you very much! Very helpful video!
I'm taking a two credit course in deep learning next week!
please do more on this
I am curious, would it be possible to run this sort of neural network in reverse in order to produce the sort of "Deep Dream" images that you can see on the Internet? For instance, instead of asking the network 'what digit does this image resemble?', ask 'what does a 2 look like?'
yes thats what deep dream is
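A rough sketch of the "what does a 2 look like?" idea: keep the weights fixed and run gradient ascent on the input pixels so the score for the chosen class goes up. This is only the core trick, shown on a hypothetical one-layer softmax classifier with random weights standing in for a trained network; Deep Dream itself works on a deep network and amplifies layer activations in an existing image.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def dream_input(W, b, target, steps=100, lr=0.1):
    """Gradient ascent on the input x to raise log p(target | x)."""
    x = np.zeros(W.shape[1])                    # start from a black image
    for _ in range(steps):
        p = softmax(W @ x + b)
        grad = W[target] - p @ W                # d log p[target] / dx
        x = np.clip(x + lr * grad, 0.0, 1.0)    # keep pixels in a valid range
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))   # hypothetical stand-in for trained weights
b = np.zeros(10)
dream = dream_input(W, b, target=2)
probs = softmax(W @ dream + b)
print(int(np.argmax(probs)))
```

With a real trained network, `dream.reshape(28, 28)` would be the image to look at; with random weights it is just noise that the toy classifier happens to call a 2.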
It would have been way more interesting to see different examples of the same number and how it translates into the same output.
I'm still confused ... how can you have so many different convolutions? Isn't a convolution a very specific operation?
As I understand it, "convolution" in this context just means that you apply some sort of filter (or function) to your data and use the results to run another set of filters on it. Filter can mean anything here. For this task he used some filter that he ran on the image, from the looks of it a Sobel, which highlights edges that have a certain angle. After doing this a few times in a row with different filters you get those 4x4 images, which are brighter the more edges at specific angles there were in the picture.
Guess im trying to tell you that convolution is not the filter, but rather the method of generating feature specific data from your original data.
Convolution is a very specific operation between two things : an image and a kernel. By choosing different kernels, the result will vary : you may start detecting horizontal edges, thin diagonal lines, dark spots...
A convolution is a function that takes a grid of values (an image) and returns another image. It works by replacing each pixel in the image with a function of the pixels around it, so a "blur" kernel will replace each pixel with the average value of a grid of pixels centred around it. Other options exist: there are convolutions that do many different things, like making diagonal edges sharper, or removing small spots and such.
Each convolution produces another image, meaning you can easily chain them together, much like you can blur an image in photoshop (which uses a convolution) and then call edge detection (also a convolution) on the result.
This channel has a video on convolutions that explains it better than I can.
TheHDreality I already watched that episode, but they only talked about a blur and an edge detector; they didn't speak about how it can be modified with parameters at all.
But I think I understand now that it just gives every pixel in the grid a weight and comes up with a total, which it divides by a number.
Florian H. As a math student, I know of function convolution, which is a binary operator very similar to what is shown here. But you're saying there exists more general convolution, usable on databases. This seems interesting. Could you please give me a link to a reliable source?
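The "give every pixel in the grid a weight and add it up" description in this thread can be sketched in a few lines. The function name and the tiny test image are made up for illustration. Note that, like most CNN libraries, this slides the kernel without flipping it, which is strictly speaking cross-correlation rather than the mathematical convolution.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; each output pixel is a weighted sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window by the weights and add everything up
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # brightness grows downwards

# Same operation, different weights, different effect
blur = np.full((3, 3), 1 / 9)                      # average of the 3x3 window
horizontal_edges = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]], dtype=float)

blurred = convolve2d_valid(image, blur)
edges = convolve2d_valid(image, horizontal_edges)
print(blurred[0, 0], edges[0, 0])
```

So the operation is always the same; the only thing that changes between a blur, an edge detector and a learned CNN filter is the grid of weights.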
Brilliant video.
It’s like a David Lynch movie to me: I almost think I understood it and then everything just becomes a convoluted mess and I feel dumber than before...
whoa why is it at 50 fps?
europe is poor and can't afford the extra 10
muh socialism
Except for Scandinavia
Based on PAL (25/50/100 Hz), whereas NA is based on NTSC (30/60/120 Hz). There is no technical reason for it anymore, but it was originally because it was clocked off of the AC power grid, which ran at 60 Hz in NA and 50 Hz in Europe.
"So my monitor runs at 60 fps"
Actually, the correct pronunciation of Le-Net is Lo-Net. "Le" in French is like "The" in English, but just for masculine.
How can I get these algos if I want to do it on my machine?
1:45 3:26 4:00 5:08 5:22 5:34 5:51 5:58 6:12 8:00 9:12 9:20 9:30 9:36 11:34 12:00
Hey I have a question. After the first conv layer, we are left with 20 images of 24*24 pixels. Do these 20 images transform into one 24*24 sized image, to be given as an input to the next conv. layer?
No, after the first conv layer you have a volume (24*24*20), and it is the input to the next conv layer, whose kernels are of size 5*5*20. If you apply one such kernel to that input volume you'll get one image of size 20*20, and because you have 50 filters of 5*5*20, your output will be 20*20*50.
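The shapes in the reply above can be checked with a small sketch. It assumes stride 1 and no padding, and leaves out the bias and activation function; the function name and the random data are made up for illustration. Each kernel spans all 20 input channels, and the per-channel products are summed into a single output image, which is why the channel count does not multiply.

```python
import numpy as np

def conv_layer(volume, kernels):
    """volume: (H, W, C_in); kernels: (N, k, k, C_in) -> (H-k+1, W-k+1, N)."""
    h, w, c_in = volume.shape
    n, k, _, kc = kernels.shape
    assert kc == c_in, "each kernel must be as deep as the input volume"
    out = np.zeros((h - k + 1, w - k + 1, n))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = volume[i:i + k, j:j + k, :]   # (k, k, C_in) window
            # Weighted sum over height, width AND channels, once per kernel
            out[i, j, :] = np.tensordot(kernels, patch, axes=3)
    return out

rng = np.random.default_rng(0)
c1_output = rng.normal(size=(24, 24, 20))    # 20 images of 24x24 from the first layer
c2_kernels = rng.normal(size=(50, 5, 5, 20)) # 50 kernels, each 5x5x20
c2_output = conv_layer(c1_output, c2_kernels)
print(c2_output.shape)
```

So there are no separate "kernel pipelines": every kernel in a layer looks at every channel of the previous layer's output.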
The best explanation of CNN's thanx
I always do a *very* firm two. 06:55
Maybe I didn't get the idea, but why are there 10, and not 11, classes for numbers?
Because if I give an image of "A" to this network, it will probably tell me "this is 1" or "this is 4" instead of giving a negative answer like "it's none of the 10 numbers".
Дмитрий Сулин Because what you are expecting is that one, and only one, of the ten output nodes would have a very high confidence, say 0.9 or 0.95 of 1.0. The other nodes would be near 0. If the input image didn’t match any number 0 to 9, then all the nodes would output a low or very low confidence value.
i just love this guy
Why would you not show us what it does when you put in a random squiggle? That'd be cool.
Since the end layers are fully connected, it will have to end up picking one of the 10 output classes. So it will just output some number wrongly, and it won't have any output like "No Number" or something like that. You could extend the network in smart ways, though, like adding another class. The eleventh class would then have the meaning of 'NaN' (Not a Number), but you would have to label additional 'NaN' samples for training, of course.
I hadn't even noticed the video was at 50fps. Probably because the video is, for the most part, out of focus.
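Here is a small sketch of the two views in this thread, assuming the output layer is a softmax (so the ten values always sum to 1 and can never all be near zero). Without an eleventh class, a cheap workaround is to refuse to answer when the best probability is below a threshold. The threshold of 0.8 and the example scores are made up for illustration, and a network can still be confidently wrong on inputs unlike its training data.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return e / e.sum()

def classify(logits, threshold=0.8):
    """Return the predicted digit, or None if the network is not confident enough."""
    probs = softmax(logits)
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else None

# Hypothetical raw output scores for a clear "2" and for a random squiggle
digit_logits = np.array([0.1, 0.2, 9.0, 0.3, 0.1, 0.0, 0.2, 0.1, 0.4, 0.3])
squiggle_logits = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 0.9, 1.1])

print(classify(digit_logits))      # confident, so a digit comes back
print(classify(squiggle_logits))   # flat scores, so nothing clears the threshold
```

Training an explicit eleventh "not a digit" class, as suggested above, is the other option; it needs extra labelled examples but lets the network learn what non-digits look like.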