Reminds me of creating sparse neural networks that remove neurons that don't contribute a lot to make inference more efficient on computationally limited devices.
Don't sparse networks also make it difficult to run on GPUs?
You mean "differentiable width networks"? Using L1 regularisation like evolutionary fitness selection was such a wild idea
@zyansheep yes, but GPUs aren't "computationally limited devices", are they?
thank you for breaking these down. some day i'll learn math notation. but for now, it just feels like my brain is imploding.
putting this into terms i can understand as a programmer, or just a layman really helps me understand the underlying functions and adaptations these papers present 🙌
Thanks for the great video! Some thoughts on the paper:
1. The cited DeepMind paper on overcoming catastrophic forgetting (Kirkpatrick et al., 2017) has a different approach for identifying "useful" parts of the network, which relies on Fisher information. Their approach looked at the information carried by weights, but you could look at the Fisher information carried by neurons' output values instead. This would give an alternative to the "contribution utility" used in this paper. I'd be interested to see a comparison here, since it seems like Fisher information is a more direct/accurate way of determining utility (it also might be best to combine the two methods). Mathematically, it should just mean taking a running average of the squared loss gradients at the neuron, instead of using the product of the mean-adjusted output with the outgoing weights. This also means that low-utility neurons would be ones that consistently have small gradients, so resetting them makes a lot of sense (as little learning is likely taking place there).
2. Regarding the Figure 19 experiments (23:00), I really wish we could see how the saturation and gradient magnitudes correlate with a neuron's utility (and for CBP its age). Intuitively, I would guess that despite having a lower mean gradient magnitude than L2, CBP's distribution probably has an upper tail, with a small number of recently-reset neurons seeing larger gradients as they drive adaptation. Maybe what's harming the non-CBP approaches is the abundance of low-utility, low-gradient neurons, which are not contributing much and also unable to adapt. In CBP, these would just get reset, restoring them to have better gradients, so we likely wouldn't see as many low-utility and low-gradient neurons.
3. I wonder if this paper's technique would help in training GANs? It seems like adaptability would be a big benefit in that space, as the generator and discriminator are being trained against each other while both are constantly evolving. Does anyone know if something similar to this has been explored?
Disclaimer: I'm not an expert :)
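For anyone who wants to see the two scores from point 1 side by side, here is a minimal sketch in plain Python. It assumes a single hidden unit tracked with exponential running averages. `UnitStats`, `decay`, and the exact form of the contribution score (magnitude of the mean-adjusted output times the summed magnitude of the outgoing weights) are illustrative assumptions based on the description in the comment above, not the paper's reference implementation.

```python
def running_average(old, new, decay):
    """Exponential moving average of a scalar statistic."""
    return decay * old + (1.0 - decay) * new


def contribution_score(h, h_mean, w_out):
    """Magnitude of the mean-adjusted output times the summed |outgoing weights|."""
    return abs(h - h_mean) * sum(abs(w) for w in w_out)


def squared_gradient_score(grad_h):
    """Squared loss gradient at the unit's output (the Fisher-style proxy)."""
    return grad_h ** 2


class UnitStats:
    """Tracks both candidate utility measures for one hidden unit (hypothetical helper)."""

    def __init__(self):
        self.h_mean = 0.0        # running mean of the unit's output
        self.contribution = 0.0  # running contribution-style utility
        self.sq_grad = 0.0       # running squared-gradient utility

    def update(self, h, grad_h, w_out, decay=0.99):
        # Score the current step against the mean seen so far, then update the mean.
        self.contribution = running_average(
            self.contribution, contribution_score(h, self.h_mean, w_out), decay
        )
        self.sq_grad = running_average(
            self.sq_grad, squared_gradient_score(grad_h), decay
        )
        self.h_mean = running_average(self.h_mean, h, decay)


if __name__ == "__main__":
    stats = UnitStats()
    stats.update(h=1.0, grad_h=0.5, w_out=[1.0, -2.0], decay=0.5)
    print(stats.contribution, stats.sq_grad, stats.h_mean)
```

The two scores can disagree: a unit with large outgoing weights and a varying output but near-zero gradient scores high on contribution and low on squared gradient, so a comparison would have to look at which units each measure would actually reset.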
I absolutely love the extended thoughts you have on this paper! Thanks for taking the time to write this out!
My thoughts on your points:
1. I'm not familiar with Fisher information, but I'm not sure that re-initializing neurons with small gradients would have the desired effect. It would indeed be wiping out features that are not learning, but I don't think that necessarily means that the feature is bad. I could imagine a scenario where one hidden unit has learned something that is general to all or most of the problems it sees, and hence is not updated very often because it is already working. That being said, optimization is far from my strong suit so I'm really not sure. The authors do mention at the end that they realize that this heuristic is a limiting factor though, and I would certainly agree it would be interesting to see how it would compare against other measures of "importance".
2. Interesting insight! I think it would definitely be nice to see the graphs you mentioned. One of the reasons the paper got rejected actually seemed to be because the reviewers wanted a further deep dive into why the plasticity decay was happening as they thought it was the most interesting part of the paper. I also would love to see it explored a bit more. I know the authors are still working on this idea so maybe we will see some more interesting work like this in the near future.
3. My experience with GANs is limited, but I would imagine you would be right on this. The core problem CBP is attempting to solve is that of non-stationarity in the task. In GANs that is exactly what you are dealing with: a constantly changing task because the opponent is always changing. Especially when you consider how fast the problem is changing in GANs, and the fact that the decaying plasticity issue was shown to be worse the quicker the problem changed, I would love to see how an implementation of that would fare.
Disclaimer: I'm also not an expert lol
At this point it sounds easier to me to just... plug a brain with millions of electrodes and just feed it dopamine when it does something nice
@EdanMeyer Thanks for the thoughtful responses. I'll definitely have to keep an eye out for updates on this work.
Re 1., I think you're right and what I wrote was wrong. Looks like I misunderstood the Fisher computation; it should take the squared gradient magnitudes of the log output probabilities w.r.t. the weight in question, and take an expectation over the predicted output distribution (not the ground truth distribution). This apparently gives the second derivative of how much the output distribution changes (measured by KL divergence) when changing the weight, i.e., how much the weight contributes to what the model is actually predicting.
In practice though, all of the implementations I could find just used a one-hot distribution for taking the expectation, in order to avoid computing separate gradients for all output classes (this makes it the same as taking the squared gradient of a CE loss function). Anyway, I'm not sure this technique works well for regression problems, since the above assumes the model is outputting class probabilities.
In case anyone is interested, I found a lecture explaining this calculation and the related "natural gradient descent" concept which was new to me:
ruclips.net/video/eio6l-Po83o/видео.html
ruclips.net/video/uginB9Fgup4/видео.html
I'm sad that youtube took so long to recommend your channel to me.
Great videos, thanks!
Great overview! Keep up the good work
Many thanks for the overview, a very thorough explanation! A quick and very specific question after seeing the algorithm, in relation to the replacement rate p (I guess it's 'rho'): why do you think it is so small in most of the experiments? From what I saw, the biggest layer they use has 2,000 units/features/nodes (whatever!), and even assuming that at a given time all of those are eligible, if you use p=0.001, that gives you just 2 units, but sometimes they even use lower rates like 10^-4 (0.0001). That would give 0.2 units, which makes no sense to me. Am I missing anything here? On the other hand, some aspects of this paper (this somewhat 'selective' dropout) bear some similarities to a paper I saw a while ago: (Jung et al. 2021, "Continual Learning with Node-Importance based Adaptive Group Sparse Regularization"). Thanks!
Great question, and I don’t know the exact reason, but if I had to guess it would be because constantly changing a significant portion of the network would make it much more difficult for the model to ever reach solid performance. This is a continual learning setting, which means evaluation is constant: there is no training/test set separation, so the model needs to perform well consistently.
@EdanMeyer Many thanks for the quick reply. Yes, it makes sense as you say. They also try a plethora of replacement-rate combinations in their experiments, and in the conclusion they acknowledge that their approach is a 'heuristic' and point it out as something open for future research (such as using meta-learning, as you specifically mentioned)
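On the "0.2 units" arithmetic in the question above: one way a fractional value can still make sense is to read the rate as an expected number of replacements per step and carry the fraction over until it adds up to a whole unit. The sketch below shows that interpretation only. Whether the paper's experiments work this way is an assumption not confirmed in this thread, and `replacement_schedule` is a hypothetical helper. `Fraction` is used so the arithmetic is exact.

```python
from fractions import Fraction


def replacement_schedule(num_eligible, rate, steps):
    """Number of units replaced at each step when fractions are carried over.

    num_eligible: units old enough to be considered for replacement
    rate: replacement rate per step, given as a Fraction for exact arithmetic
    """
    accumulated = Fraction(0)
    schedule = []
    for _ in range(steps):
        accumulated += num_eligible * rate
        whole = int(accumulated)  # whole units available to replace this step
        accumulated -= whole      # keep the leftover fraction for later steps
        schedule.append(whole)
    return schedule


if __name__ == "__main__":
    # 2,000 eligible units at a rate of 10^-4 adds 0.2 units per step.
    print(replacement_schedule(2000, Fraction(1, 10000), 10))
    # 2,000 eligible units at a rate of 10^-3 adds 2 units per step.
    print(replacement_schedule(2000, Fraction(1, 1000), 3))
```

Under this reading, a rate of 10^-4 on 2,000 eligible units means one unit is replaced every fifth step rather than a fifth of a unit every step.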
One question: Resetting all the output weights of the neurons r to zero should cause them to "die". Their signal doesn't contribute to the output signal anymore, and thus the derivative of the error calculated by the backprop algorithm creates no training signal for the weights of those neurons. Consequently, the output weights of the neurons r stay zero forever and they are "dead"?
If that's true, all the algorithm does is prevent overfitting.
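A quick way to check this is to write out the gradients for a toy network by hand. The sketch below assumes one input, one tanh hidden unit, one linear output, and a squared-error loss; all names and numbers are illustrative. With the outgoing weight at zero, the gradient for the incoming weight is indeed zero, but the gradient for the outgoing weight itself is `error * h`, which does not depend on the outgoing weight's current value.

```python
import math


def gradients(w_in, w_out, x, target):
    """Gradients of 0.5 * (y - target)^2 for y = w_out * tanh(w_in * x)."""
    h = math.tanh(w_in * x)
    error = w_out * h - target
    grad_w_out = error * h                          # no factor of w_out here
    grad_w_in = error * w_out * (1.0 - h * h) * x   # blocked while w_out == 0
    return grad_w_in, grad_w_out


def sgd_step(w_in, w_out, x, target, lr=0.1):
    """One plain gradient-descent update of both weights."""
    grad_w_in, grad_w_out = gradients(w_in, w_out, x, target)
    return w_in - lr * grad_w_in, w_out - lr * grad_w_out


if __name__ == "__main__":
    w_in, w_out = 0.5, 0.0  # freshly reset unit: outgoing weight is zero
    print(gradients(w_in, w_out, x=1.0, target=1.0))
    w_in, w_out = sgd_step(w_in, w_out, x=1.0, target=1.0)
    print(w_out)
    print(gradients(w_in, w_out, x=1.0, target=1.0))
```

So in this toy case the unit is silent only for the first update: the outgoing weight moves away from zero as long as the hidden activation and the output error are both nonzero, and after that the incoming weight receives gradient as well.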
Thanks for the overview!
This approach reminds me a lot of dropout. The idea seems elegant and straightforward. You’d have to pick an optimizer that does not decay its learning rate to zero, but that is easy to do.
I’m curious how an approach like this could identify neurons that are important for other tasks in a multi-task setting. If we train it, for example, to play Go, and then switch to playing Chess, can we distinguish between neurons that were important for Go and neurons that weren’t important for anything? (Probably difficult to do, but maybe there is a way. Or we resort to curriculum learning and toss in some Go examples from time to time)
I’m a bit sad about the title. The alteration is not to backprop, and it has more to do with long-term learning than continuity. I’d have called it Backprop with Re-initialization or similar. Decidedly less sexy, but more to-the-point.
What kind of math is being used in the paper?
🤯Love this topic! Thank you! 🥳
Hey Edan, just found your content and I really appreciate your insight on these somewhat hidden topics. I’m curious what you think are some of the most important skills for aspiring machine learning researchers to develop. Thanks for the content!
man I love your videos. okay maybe I'm interpreting this incorrectly or my takeaways are wrong as I'm not an ML researcher/practitioner, but my understanding is that this is akin to the "garbage in / garbage out" problem that is pervasive in software development. I'm curious if results could be improved with signal processing functions applied to the inputs. For instance, in the slippery ant problem, if I was building a multi-pedal robot I would be inclined to use a movement controller that would adjust servo output based on some variable such as how the terrain is impacting the pitch or yaw of the robot. In a sense that's a signal processing function feeding the vector back to the AI controlling it: the ground is rough so we are moving at X vector, or the ground is treacherous so we are moving at vX vector, where v is the velocity coefficient given the impact of the terrain adjustments. So essentially we are normalizing our inputs into a single domain, which is the vector of the robot relative to its destination. Not sure if that made any sense. Thanks again!
Could you interview the authors of the paper? Maybe show some code and simulations
Can you please make a video on HSIC Bottleneck training? It's faster than Backprop and even Rprop
Wait so continual/lifelong learning means you CHANGE the neural network over time and not hard code its weights??
Interesting paper. The video is quite long for the benefit for me personally. I would think that this is the case for most people. I know that a more concise version would be more time consuming to produce. If you want to grow your subscriber base, I would aim for 10 minute videos.
I would love to see the experiments rerun with a more complicated dataset like ImageNet. The approach looks promising but I am curious if it holds true for less simple problems and how it performs against L2 on those more difficult tasks
please, the bug inside your profile is annoying me xD
why don't they just reset the ones that are less used. that is probably how our network works. i forget everything so i reset even those used
cool work indeed
Oh wow the inverse of transfer learning
Trying to force a square peg into a round hole. The whole thing is completely messed up, patching backpropagation seems inherently flawed. Look up HTM theory: extremely fast learning of spatio-temporal patterns after the model only sees them 2-3 times, absolutely no catastrophic forgetting as neurons specialise over time and only active neurons learn from their input using Hebbian learning. It's an algorithm inspired by the human neocortex which is just so much better than modern deep learning in every way
Catastrophic forgetting