Absolute gold for developing intuition. Many thanks, professor.
When you explained the mystery of how neural networks generalize so well, I really wanted to know the answer.
Absolute gem of a video!!
Great presentation! Refreshing to go through a math concept without maths :)
Great talk, nice slides, thanks! I am wondering: if it is so difficult to find narrow, sharp minima, why are SVMs finding them?
I think that simple models like SVMs without non-linear kernels just lack the power or expressiveness to generate loss landscapes that even contain large-margin flat minima, since they simply can't linearly separate the data. In some sense, all minima in a linear SVM are equally bad?
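A quick toy sketch of that point (my own illustration, assuming scikit-learn; nothing from the talk): on concentric circles no linear boundary with a real margin exists, so every linear fit is roughly equally bad, while a non-linear kernel lifts the data and recovers a wide margin.

```python
# Hypothetical illustration: a linear SVM on concentric circles, which it
# cannot separate, vs. an RBF-kernel SVM on the same data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # no linear margin exists for this data
rbf = SVC(kernel="rbf").fit(X, y)        # the kernel lifts the data; a margin appears

print("linear accuracy:", linear.score(X, y))  # near chance, ~0.5
print("rbf accuracy:   ", rbf.score(X, y))     # near 1.0
```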
Thanks a lot for the awesome talk! I really wish more talks in the area of deep learning would prioritize intuition and general understanding over mathematical proofs.
Absolutely fantastic video, thank you!
amazing talk
Fantastic talk.
Really good lecture, thanks!
Very interesting! So this kinda explains why CNNs tend to overfit on textures or otherwise seek simpler visual features: doing so leads to wider margins and flatter minima.
Also, it seems that for the example at 49:30 we need either more data for the inner red circle (and then we naturally get wider margins for the desirable minimum) or stronger priors. Speaking of the latter case: a "circular pattern" is geometrically much simpler than the cherry-picked formation of arcs that the model found, and we already have two circular formations in the image. Can we integrate these sorts of priors into the loss function via an attention mechanism or in some other way? Can transformers do that?
I guess it would be nice if NNs could somehow "zoom" into the dataset in this case, to artificially make linear separability with a wide margin easier (more likely). No idea how to implement that, though.
Spectacular talk!
So, we know wide-margin minima are good and easy to find when they exist, but I guess the question remains: why do wide-margin flat minima exist in the first place? My bet is that current networks tend to contain at least a few wide layers, wide layers produce high-dimensional outputs, and we know linear separability is easier in higher dimensions. Also, I think that the deeper a network is, the more likely it is that the data becomes easily separable at some layer (and therefore a wide-margin minimum can exist), since layers near the end tend to represent higher-level features.
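A toy sketch of that high-dimension intuition (my own, assuming NumPy and scikit-learn; not from the talk): a single wide, completely untrained random ReLU layer lifts the 2-D circles, which are not linearly separable, into a 1000-dimensional space where a plain linear classifier succeeds.

```python
# Hypothetical illustration of "wide layers make linear separability easier":
# a fixed random wide ReLU layer lifts the data; only a linear head is fit.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 1000))   # random weights of a width-1000 hidden layer
H = np.maximum(X @ W, 0.0)       # ReLU features; the layer itself is never trained

print("linear on raw 2-D: ", LogisticRegression(max_iter=5000).fit(X, y).score(X, y))  # ~0.5
print("linear on lifted H:", LogisticRegression(max_iter=5000).fit(H, y).score(H, y))  # ~1.0
```

This is essentially the picture behind Cover's theorem: once the number of dimensions exceeds the number of points, almost any labeling of points in general position becomes linearly separable.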
Great talk! Thanks for sharing
Extremely useful