Really interesting. Whether overfitting is involved definitely matters, but if it turns out that it isn't the case, or at least that it doesn't *have* to be (i.e. that hurdle can be overcome), this might lead to really nice compact networks.
From what I recall, there has also been recent work on doing this with a growth step added in, so layers can also become bigger if that turns out to be helpful. In that case the savings aren't quite as dramatic, but presumably (I'm not sure that's quite right?) the benefit is even higher accuracy, while *still* shrinking the network overall.
Could you expand on how the overfitting hurdle for pruned networks can be knowingly overcome? Also, any news on this topic you've come across in the year since this comment?
@@tdk99-i8n Do you think a Pareto front can be generated? Would it necessarily be weights vs. accuracy across the different networks generated?
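For what it's worth, such a front is easy to compute once you have (remaining weights, test accuracy) pairs for a set of pruned networks. A minimal sketch, with purely hypothetical numbers:

```python
def pareto_front(results):
    """results: list of (num_weights, accuracy) pairs; fewer weights and higher accuracy are better."""
    front, best_acc = [], float("-inf")
    for n_weights, acc in sorted(results, key=lambda t: (t[0], -t[1])):
        if acc > best_acc:                 # not dominated by any smaller network
            front.append((n_weights, acc))
            best_acc = acc
    return front

# hypothetical (remaining weights, test accuracy) pairs for a few pruned networks
candidates = [(266_000, 0.981), (54_000, 0.983), (27_000, 0.980), (13_000, 0.972)]
print(pareto_front(candidates))   # the 266k network is dominated by the 54k one
```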
I am a complete novice in this area, but I feel that the fact that they report accuracy on the test dataset means we can get an idea about overfitting. Likewise, accuracy on the training dataset gives a good idea about underfitting. Please do correct me if I am wrong; I am a learner here. :)
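To make that concrete, the check being described is just a comparison of the two accuracies. A minimal sketch with hypothetical values and arbitrary illustrative thresholds:

```python
# Hypothetical accuracies for a pruned network; thresholds are illustrative only.
train_acc, test_acc = 0.995, 0.962

if train_acc < 0.90:
    print("Low training accuracy -> the model is likely underfitting.")
elif train_acc - test_acc > 0.05:
    print("Large train/test gap -> the model is likely overfitting.")
else:
    print("High training accuracy and a small gap -> no strong sign of either.")
```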
The thing is, pruning existed before this paper and already showed good accuracy, so I don't see how this paper is revolutionary. Why would anyone want to retrain an already fully trained pruned network from scratch, instead of just fine-tuning it and saving time on training?
It's not like you know the sparse weight initialisations beforehand, which is what this presentation makes it seem. You would still need to train the fully connected larger network first, so this doesn't help much for any practical purpose (unless it somehow beats the accuracy of fine-tuning).
Seems like more hype than actual worth. Or maybe I'm getting something wrong.
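For reference, the procedure being debated is, roughly: train the full network, prune by weight magnitude, rewind the surviving weights to their original random initialization, and retrain. A simplified single-shot sketch in PyTorch terms (the `train`/`evaluate` helpers and the fixed pruning fraction are hypothetical stand-ins, not the paper's exact iterative setup):

```python
import copy
import torch

def lottery_ticket(model, train, evaluate, prune_fraction=0.8):
    """Sketch of train -> prune -> rewind-to-init -> retrain."""
    init_state = copy.deepcopy(model.state_dict())      # remember the random init

    train(model)                                         # 1. train the full network

    # 2. build binary masks keeping the largest-magnitude weights per layer
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:                              # prune weight matrices only
            k = int(param.numel() * prune_fraction)
            threshold = param.detach().abs().flatten().kthvalue(k).values
            masks[name] = (param.detach().abs() > threshold).float()

    # 3. rewind the surviving weights to their ORIGINAL initial values
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

    # 4. retrain the sparse subnetwork; the masks would have to be re-applied
    #    after every optimizer step to keep the pruned weights at zero
    train(model, masks=masks)
    return evaluate(model)
```

The distinction the comment questions is exactly step 3: rewinding to the original initialization instead of fine-tuning the already-trained weights.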
@@siddharthagrawal8300 You are right; the value of the work rests on the assumption that adding to scientific understanding facilitates future practical improvements. I haven't kept up with the literature since this paper, so I don't know whether that assumption held true in this case.
I'm not sure whether I'm getting this right and would really like some elaboration. Assuming that the function to be learned is actually far less complex than the network architecture used for training, of course there is going to be one sub-network that performs best when looked at in isolation, simply because of random initialization. If my function can be modeled optimally with a single weight but I use 10, random initialization will make one weight learn 'the fastest/best'. So isn't the only question here how to find this weight, i.e. the pruning strategy? Or is my assumption flawed?
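As a toy version of that picture (my own illustration, not from the paper): give ten weights the exact same single useful feature. They all receive identical gradients, so their ordering is fixed at initialization, and pruning down to one weight simply recovers whichever weight happened to start largest:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1))
y = 3.0 * x[:, 0]                     # the true function needs only ONE weight

w0 = rng.normal(scale=0.1, size=10)   # random init of 10 redundant weights
w = w0.copy()
X = np.repeat(x, 10, axis=1)          # every weight sees the exact same feature

for _ in range(300):                  # plain gradient descent on squared error
    w -= 0.05 * (X.T @ (X @ w - y)) / len(y)

# All ten weights receive identical updates, so their ordering never changes:
# together they sum to ~3 (fitting y = 3x), and the single 'winning' weight is
# whichever one happened to start largest, i.e. decided purely by the init.
print(round(w.sum(), 3))
print(np.argmax(w) == np.argmax(w0))  # True
```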
I think you have it right. The idea that you can prune early is interesting, but if you have enough compute power to buy all the lottery tickets, why not buy all of them?
I think the question "[will] random initialization make one weight learn 'the fastest/best'" should still be explored. Is random really the best? What if all the weights for a neuron (either all input or all output) end up positive?
Perhaps the more interesting point is that it's not the number of nodes or connections that is bad; it's the initialization. Why throw out the connection along with the weight? Couldn't we throw out the weight and then reinitialize it?
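If anyone wants to try that last idea, a minimal sketch of "reinitialise instead of remove" might look like the following, assuming a PyTorch model and a {parameter name: 0/1 mask} dictionary from some earlier pruning step (both hypothetical here):

```python
import torch

def reinitialise_pruned(model, masks, std=0.01):
    """Keep the surviving weights, but give the pruned positions fresh random values."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                fresh = torch.randn_like(param) * std
                # surviving weights stay as they are; pruned positions get re-drawn values
                param.copy_(param * masks[name] + fresh * (1.0 - masks[name]))
```

Training would then continue with every connection active again, which is a different experiment from the paper's keep-the-mask setup.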
Bottom line: have you been able to predict any winning numbers? Consistently? I would assume your program scores its predicted numbers against known winning numbers and tries to improve that score. Can you give us any info regarding how well the system learns and the training required?
There's work going in this direction: arxiv.org/abs/1909.11957
@@robvdm Thanks, keep up the good work.
13:53 The young guy is really rude.
Hello, do you like to play the Pick 3 lottery?
@@justinking5964 Never played it. Why?
@@hangchen I assumed you are not American. Just asking randomly.
@@justinking5964 Lol, yeah, your assumption is correct. Do most Americans play the Pick 3 lottery?
Very beautiful and convincing, but no result.
Because in the lottery it's important to guess the numbers and the date of the event; here it's just brute force.
This statement doesn't take into account that this is a first step which enables work that spins off the presented idea; one can argue that this counts as a result as well. Notable works relevant to your point might be:
a) subnetworks generalise to similar tasks and can act as an initialisation scheme, see Morcos et al.
b) there was recent work on how to find these subnetworks more efficiently than brute force, see You et al. and Tanaka et al.
I acknowledge that depending on when this comment was made this work might not have been available yet.