This is my personal summary:
00:00:00 History of Deep Learning
00:07:30 "Ingredients" of the Talk
00:12:30 DNN and Information Theory
00:19:00 Information Plane Theorem
00:23:00 First Information Plane Visualization
00:29:00 Mention of Critics of the Method
00:32:00 Rethinking Learning Theory
00:37:00 "Instead of Quantizing the Hypothesis Class, let's Quantize the Input!"
00:43:00 The Information Bottleneck
00:47:30 Second Information Plane Visualization
00:50:00 Graphs for Mean and Variance of the Gradient
00:55:00 Second Mention of Critics of the Method
01:00:00 The Benefit of Hidden Layers
01:05:00 Separation of Labels by Layers (Visualization)
01:09:00 Summary of the Talk
01:12:30 Question about Optimization and Mutual Information
01:16:30 Question about Information Plane Theorem
01:19:30 Question about Number of Hidden Layers
01:22:00 Question about Mini-Batches
Thank you!
Bless your soul
I have used your personal summary as a template for a section of my personal notes.
Thank you very much!
RIP Naftali!
I wonder whether we could build better training algorithms based on this. For example, the effectiveness of dropout may be connected to this theory: dropout may introduce more randomness in the "diffusion" stage of training.
1:22:31 - thesis statement about how to choose mini batch size
11:30 "information measures are invariant to computational complexity"
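One way to read that quote is that mutual information does not change when a variable is re-encoded by an invertible map, however costly that map is to compute. Below is a minimal sketch of that reading on toy data. The `mutual_information` and `scramble` helpers are hypothetical names made up for this illustration, not anything from the talk.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # Each term is p(x,y) * log2( p(x,y) / (p(x) * p(y)) ),
        # written with raw counts: p(x,y) = c/n, p(x) = px[x]/n, p(y) = py[y]/n.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Toy data (assumption): X is uniform on 0..7 and Y is the parity of X.
samples = [(x, x % 2) for x in range(8)] * 10

def scramble(x):
    """Hypothetical invertible re-encoding of X; a bijection on 0..7 since gcd(5, 8) == 1."""
    return (x * 5 + 3) % 8

scrambled = [(scramble(x), y) for x, y in samples]

mi_plain = mutual_information(samples)
mi_scrambled = mutual_information(scrambled)
print(mi_plain, mi_scrambled)
```

Both estimates come out equal, because the estimate only depends on how often each (x, y) pair occurs and a bijection merely relabels the x values. This says nothing about how hard it is to decode Y from the scrambled X in practice, which seems to be the point of the quote.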
Aah this is so relaxing.. Thank you!
I read another paper, "On the Information Bottleneck Theory of Deep Learning", published in 2018 by researchers at Harvard, and they hold a very different view. It seems it's still unclear how neural networks work.
Is that the one he mentions at ~ 29:00 ?
Amazing talk, thank you!
Does anybody know how to show the part where the Gibbs distribution converges to the optimal IB bound?
And what is the epsilon cover of a hypothesis class?
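A hedged sketch of the standard textbook definitions behind both questions, not a transcript of the slides. The symbols $d$, $H_\epsilon$, $Z(x,\beta)$ are the usual notation and are assumptions here. It states what an epsilon-cover is and where the Gibbs-like form comes from, but it does not prove the convergence claim made in the talk.

```latex
% Epsilon-cover: a subset H_eps of H such that every hypothesis in H
% is within distance eps of some member of H_eps (d is a chosen metric on H).
H_\epsilon \subseteq H \ \text{is an $\epsilon$-cover of } H
\iff \forall h \in H \;\; \exists h' \in H_\epsilon : \; d(h, h') \le \epsilon .

% Information Bottleneck objective over stochastic encoders p(t|x),
% with T the compressed representation of X and beta > 0 the trade-off parameter:
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y) .

% Self-consistent (Gibbs-like) form of the optimal encoder:
p(t \mid x) \;=\; \frac{p(t)}{Z(x,\beta)}
\exp\!\Big( -\beta \, D_{\mathrm{KL}}\big[\, p(y \mid x) \,\big\|\, p(y \mid t) \,\big] \Big) .
```

The size of the smallest such cover is what replaces the raw size of the hypothesis class in covering-number generalization bounds.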
"Learn to ignore irrelevant labels" yes intriguing..........
Anybody know what a “pattern” is in information theory?
Can he use deep learning to fix the audio problems of this video?
probably not because there are none
Seems like this was asked in jest, but it's actually a good question.
When was this talk given? Has he published his paper yet? I found nothing online so far, but maybe I just didn't see it.
1) "Deep Learning and the Information Bottleneck Principle", 2) "Opening the Black Box of Deep Neural Networks via Information"
such a loss… blessed be his memory
23:04
44:25
This theory looks correct!
When neural networks became popular, everybody in the scientific computing community was eager to describe them in their own language. Many achieved limited success. I think the information-theoretic description makes the most sense, because it finds the simplicity of the information within the complexity of the data. It is like how humans think: we create abstract symbols that capture the essence of nature and reason logically with them, which suggests that the number of degrees of freedom behind the world should be small, since the world is structured.
Why did the ML community and industry not adopt this explanation?
Oooo! So it is SGD? If I hadn't listened to the Q&A session I wouldn't have understood it at all. Now I do. Well, with second-order algorithms (like Levenberg-Marquardt) you won't need all these floating balls to understand what's going on with your neurons. Gradient descent is the poor man's gold.
If the theories are true, maybe we can compute the weights directly without iteratively learning them via gradient descent.
Binyu Wang oh
How so?
I've been thinking about this a lot too. The weights are partly a function of the data, of course, and we also have things like the good regulator theorem that kinda point towards it. Also, a latent code and the learned parameters aren't distinguished in Bayesian model selection.