Could it be that the "winning tickets" can be identified after only a handful of training epochs instead of after a full training run (e.g. 50 epochs or more)? If yes, it would mean that we can train for 3-4 epochs, prune 50% of the weights, then restart the training on the remaining weights only (with the same initialisation as before), rinse and repeat. In theory it could allow faster training.
I think yes, because there is a paper that discusses how different weight initializations lead to different local minima in the loss landscape for the same data. What you can do is start with a really big network and a large learning rate. The network will find one of the local minima quickly, and then you just start pruning to get to the lowest point of that minimum.
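For anyone who wants to play with the "train a few epochs, prune half, reset, repeat" idea above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy linear model, the synthetic data, and the helper names `train`, `prune_smallest` and `early_ticket_search` are all made up, and this is not the procedure from the paper, just the loop described in the comment.

```python
import random

def train(weights, mask, data, epochs, lr=0.02):
    """Run a few epochs of SGD on a toy linear model; masked weights stay at zero."""
    w = [wi * mi for wi, mi in zip(weights, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [(wi - lr * err * xi) * mi for wi, xi, mi in zip(w, x, mask)]
    return w

def prune_smallest(trained, mask, fraction):
    """Remove the given fraction of still-active weights with the smallest magnitude."""
    active = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(len(active) * fraction)
    new_mask = list(mask)
    for i in sorted(active, key=lambda i: abs(trained[i]))[:n_prune]:
        new_mask[i] = 0
    return new_mask

def early_ticket_search(init, data, rounds=3, epochs_per_round=3, fraction=0.5):
    """Short training, prune, then restart from the SAME initial weights each round."""
    mask = [1] * len(init)
    for _ in range(rounds):
        trained = train(init, mask, data, epochs_per_round)
        mask = prune_smallest(trained, mask, fraction)
    return mask

# Synthetic data (assumption): only the first two of 16 inputs actually matter.
random.seed(0)
true_w = [2.0, -3.0] + [0.0] * 14
data = []
for _ in range(200):
    x = [random.gauss(0, 1) for _ in range(16)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

init = [random.gauss(0, 0.1) for _ in range(16)]
mask = early_ticket_search(init, data)       # 16 -> 8 -> 4 -> 2 active weights
ticket = train(init, mask, data, epochs=20)  # final training of the sparse net
print(mask)
```

Whether a handful of epochs is enough on a real network is exactly the open question in the comment; on a toy problem like this one the important weights separate almost immediately, so it says nothing about the general case.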
Thanks for the wonderfully detailed walkthrough :)
It might be worth mentioning that it's also possible to train neural nets in a pruning-aware fashion, with all the good stuff like pruning schedules, maximum achievable sparsity, etc.
Amazing explanation! Thank you so much! I just looked through your channel and am excited to find that you have many of these videos. Just subscribed!
Very well explained, thanks! Please keep reviewing papers!
Where can I read more about the related finding at 17:16?
This is actually reminiscent of how human brains develop from childhood to adulthood. At birth, humans have far more connections between their neurons and connections primarily die off as they learn and mature, much more than new neurons and connections are formed. And yet humans can still learn despite connection removal, and possibly because of it.
Great observation. That could simply be pruning, which doesn't decrease performance, and improves energy efficiency for the organism. But it could be something deeper and more important.
Thanks a lot for this video. It explains the essentials of the paper very well, and it is easy to follow for a non-native speaker, which is important as well!
Hi @Yannic Kilcher!
Isn't it possible that the weights that are not close to zero (not very small in magnitude) are the ones that should be pruned?
Wouldn't it be a better idea to monitor the weights during the initial training (with the complete network) and prune based on which weights travelled furthest during that training? 🤔
Kindly enlighten on this!
Very good hypothesis, it makes a lot of sense.
Great video! Looking forward to having a discussion on our street talk podcast!
Hi @Yannic Kilcher!
Can't we control the Random Initialization to keep almost every weight in the network (to get the most out of the original network)?
Can't every weight win the lottery?
I love the idea of sparse neural nets. It feels kinda icky looking at these grossly overparameterized models that are often SOTA and thinking: "Right now, this is the best way of doing this."
Pruning is a good technique for finding sparse neural nets. I thought this was a great paper when I first read it.
But I've been working on my own research that approaches sparse NNs from the other direction. Instead of starting with fully connected layers and pruning, I start with extremely sparse layers and build them up, one edge at a time. It requires quite a different training procedure though. Instead of back-propagation and gradient descent, I take advantage of the piecewise linear properties of ReLU to guarantee a fully piecewise linear neural net. This allows me to explicitly find the optimal next best edge - and its optimal value - in a single optimization step.
I hope to finish implementing my research in the coming weeks, and would be happy to show you in more detail if you're interested.
What happened with this research?
@@jepkofficial Wow was that really 6 months ago? I still haven't finished implementing it. Hard to focus when working alone on independent research. Thanks for the reminder, I should return to that project and get it done.
How's the research going?
checking in on this again, on the off chance you didn't get distracted from this one :)
@@jrkirby93 woohoo another reminder here :P
Very interesting. Does anyone know of software that allows doing this pruning?
Hi @Yannic Kilcher!
It seems that the random initialization is very important before the pruning, right? Because only lucky weights (in terms of random initialization) are kept after pruning. If the random initialization is bad and there are no (or very few) lucky candidate weights, what should we do in that case?
Is there any particular random initialization recommended by the paper or by practice?
There are some commonly recommended initialization methods, like Glorot or He.
hey! have you seen Uber's follow-up work? they basically say that the trick is just to prune weights that are going *towards* 0, not near 0
HappyManStudiosTV Interesting
can you link the paper? :) thanks!
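To make the difference between "near 0" and "going towards 0" concrete, here is a small sketch in plain Python. It is only one reading of the comment above: the scoring functions `large_final` and `magnitude_increase`, the helper `keep_mask`, and the example weights are all assumptions chosen for illustration, not code from any paper.

```python
def keep_mask(scores, keep_fraction):
    """Keep the top-scoring fraction of weights; 1 = keep, 0 = prune."""
    n_keep = max(1, round(len(scores) * keep_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]

def large_final(w_init, w_final):
    """Score by final magnitude: prunes weights that END UP near zero."""
    return [abs(f) for f in w_final]

def magnitude_increase(w_init, w_final):
    """Score by change in magnitude: prunes weights that MOVED towards zero."""
    return [abs(f) - abs(i) for i, f in zip(w_init, w_final)]

# Made-up weights before and after training.
w_init = [0.90, 0.10, -0.50, 0.05]
w_final = [0.60, 0.40, -0.70, 0.01]

mask_near_zero = keep_mask(large_final(w_init, w_final), 0.5)
mask_toward_zero = keep_mask(magnitude_increase(w_init, w_final), 0.5)
print(mask_near_zero)    # [1, 0, 1, 0]
print(mask_toward_zero)  # [0, 1, 1, 0]
```

The first weight is the interesting one: it is still fairly large after training, so the magnitude criterion keeps it, but it shrank during training, so the movement criterion prunes it.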
What if, instead of pruning the weights, you assume the low-magnitude weights were initialized incorrectly, and re-train the dense network where the high-magnitude weights are kept at their initial initialization and the low-magnitude weights get new values?
I've never heard this idea. Nice, might be worth a try. I doubt you're gonna get a massive improvement, but it might be interesting to analyze whether you could find an even smaller winning hypothesis.
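A rough sketch of what that re-initialization step could look like, in plain Python. The function name `reinit_low_magnitude`, the Gaussian re-draw and its `scale` are assumptions for illustration; the comment does not say how the new values should be chosen.

```python
import random

def reinit_low_magnitude(w_init, w_trained, keep_fraction, rng, scale=0.1):
    """Build a new dense starting point for re-training.

    Weights that ended up with high magnitude after training go back to their
    ORIGINAL initial values; the rest get fresh random values (assumed Gaussian).
    """
    n_keep = round(len(w_init) * keep_fraction)
    order = sorted(range(len(w_init)), key=lambda i: abs(w_trained[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w_init[i] if i in keep else rng.gauss(0.0, scale)
            for i in range(len(w_init))]

rng = random.Random(0)
w_init = [0.05, -0.12, 0.08, 0.02]
w_trained = [1.50, -0.01, 0.90, 0.03]  # made-up values after training

new_start = reinit_low_magnitude(w_init, w_trained, keep_fraction=0.5, rng=rng)
print(new_start)  # positions 0 and 2 keep their initial values, 1 and 3 are re-drawn
```

The network stays dense here, so unlike pruning this does not save any parameters; it only tests whether the "unlucky" weights do better with a second draw.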
How do you read and understand any paper so fast? Does it come with practice, or is there a particular way to read the different sections? I want to do that. Uploading a video on how to read a paper might help :)
After you've read a bunch, the structure, the methods and the ideas become repetitive across the entire field, and that speeds up the reading process a lot. I guess I can do a video on that, but it will be pretty straightforward and obvious.
@@YannicKilcher it'll be helpful if you make a video. Thanks a lot
did they check if those initial weights already tend to be relatively large?
Good discussion. The sound is a bit too soft.
For the initial conditions that work, has anybody looked at how much wiggle room you have? Is there an epsilon-neighborhood of the initial state you can safely start from, and how small is epsilon?
Question: suppose we have a network N that we train up to a certain accuracy on some data, prune p% of the weights using some algorithm (one-shot, IMP, etc.) and revert the remaining weights to their initial values. Now, is there any way to ensure that the resulting pruned network will always perform better than the original when trained for the same number of iterations? I mean, is there any pruning algorithm which can guarantee finding a lottery ticket within the network every time we use it? Or is it just trial and error (which is why, I guess, the term lottery ticket is used)?
Reminds me of dropout for some reason. Except we are throwing away the dropped out neurons.
I watched it 2x but I think the connections are thrown out not the neurons.
What's interesting here though:
1. The weights are what are important.
2. Pruning involves throwing out both weight AND structure.
Why not keep the structure but choose new weights? Perhaps it just randomly started at a plateau of a local min, or the randomization ended up creating redundancies. Jump the weights really far away and try again.
I think the motivation is that activations are not the biggest source of memory access and energy loss. If we can get rid of 90% of weights, then it could mean speed and energy improvements
Interesting.. Just need to take the sick out of your mouth next time.