The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  • Published: 28 Dec 2024

Comments

  • @JackofSome 4 years ago +17

    Yannic you're spoiling us. I hope you're able to keep your pace once (if???) this virus dies down a bit.

  • @jivan476 3 years ago +13

    Could it be that the "winning tickets" can be identified after only a handful of training epochs instead of after a full training (e.g. 50 epochs or more)? If yes, it would mean that we can train for 3-4 epochs, prune 50% of the weights, then re-start the training on these weights only (with same initialisation as before), rinse and repeat. In theory it could allow faster training.

    • @wenhanzhou5826 1 year ago

      I think yes, because there is a paper that discusses how different weight initializations create different local minima in the loss landscape for the same data. What you can do is start with a really big network and a large learning rate. The network will find one of the local minima quickly, and then you just start pruning to get to the lowest point of that minimum.
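
A minimal sketch of the train-briefly, prune, reset, repeat loop proposed in this thread. Everything here is an illustrative assumption rather than the paper's setup: a toy linear model stands in for a network, and `train`, `prune_smallest`, and `find_ticket` are hypothetical helper names. It only shows the bookkeeping (a fixed initialization plus a shrinking mask), not whether short training is enough to identify winning tickets in real networks.

```python
import random

def train(init, mask, data, epochs, lr=0.05):
    """Plain SGD on squared error for a linear model; masked-out weights stay at zero."""
    w = [wi * mi for wi, mi in zip(init, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [(wi - lr * err * xi) * mi for wi, xi, mi in zip(w, x, mask)]
    return w

def prune_smallest(trained, mask, fraction):
    """Remove the given fraction of surviving weights with the smallest trained magnitude."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(len(alive) * fraction)
    new_mask = list(mask)
    for i in sorted(alive, key=lambda i: abs(trained[i]))[:n_prune]:
        new_mask[i] = 0
    return new_mask

def find_ticket(init, data, rounds=3, epochs_per_round=4, fraction=0.5):
    """Repeat: train for a few epochs from the SAME init, prune, and start over."""
    mask = [1] * len(init)
    for _ in range(rounds):
        trained = train(init, mask, data, epochs_per_round)
        mask = prune_smallest(trained, mask, fraction)
    return mask

rng = random.Random(0)
n_features = 16
# Toy data (assumption): only the first two features influence the target.
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1) for _ in range(n_features)]
    data.append((x, 2.0 * x[0] - 3.0 * x[1]))

init = [rng.gauss(0, 0.1) for _ in range(n_features)]
mask = find_ticket(init, data)
print(sum(mask), "of", n_features, "weights survive")
```

The key design point is that `train` always starts from `init`, so the only thing carried between rounds is the mask.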

  • @sayakpaul3152 4 years ago +5

    Thanks for the wonderfully detailed walkthrough :)
    It might be worth mentioning that while training neural nets it's also possible to train them in a pruning-aware fashion, with all the good stuff like pruning schedules, maximum achievable sparsity, etc.
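
To make "pruning schedule" concrete, here is a hedged sketch of one common shape: sparsity ramps from an initial to a final value along a polynomial curve, so most pruning happens early while the network can still recover. The function name `sparsity_at_step`, its parameters, and the default values are assumptions chosen for illustration, not something stated in the video or this comment.

```python
def sparsity_at_step(step, begin_step, end_step,
                     initial_sparsity=0.0, final_sparsity=0.9, power=3):
    """Target fraction of weights to have pruned at a given training step.

    Before begin_step the target is initial_sparsity, after end_step it is
    final_sparsity, and in between it follows a polynomial ramp.
    """
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power

# Print the target sparsity at a few points of a 100-step ramp.
for step in (0, 25, 50, 75, 100):
    print(step, round(sparsity_at_step(step, 0, 100), 4))
```

A training loop would call this every few steps and prune by magnitude until the current target is met.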

  • @milkteamx7183 1 year ago +1

    Amazing explanation! Thank you so much! I just looked through your channel and am excited to find that you have many of these videos. Just subscribed!

  • @MrNightLifeLover 4 years ago +4

    Very well explained, thanks! Please keep reviewing papers!

  • @chesstanay 8 months ago

    Where can I read more about the related finding at 17:16?

  • @nbrpwng 4 years ago +6

    This is actually reminiscent of how human brains develop from childhood to adulthood. At birth, humans have far more connections between their neurons and connections primarily die off as they learn and mature, much more than new neurons and connections are formed. And yet humans can still learn despite connection removal, and possibly because of it.

    • @gorgolyt 4 years ago +2

      Great observation. That could simply be pruning, which doesn't decrease performance, and improves energy efficiency for the organism. But it could be something deeper and more important.

  • @wolfgangmitterbaur3942 2 years ago +1

    Thanks a lot for this video. It explains the essentials of the paper very well and is easy to follow for a non-native speaker, which is important as well!

  • @HassanBinHaroon 1 year ago

    Hi @Yannic Kilcher!
    Isn't it possible that the weights that are not close to zero (i.e. not very small in magnitude) are the ones that should be pruned?
    Wouldn't it be a better idea to monitor the weights during the initial training (with the complete network) and prune based on which weights have traveled furthest during that training? 🤔
    Kindly enlighten me on this!
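
A small sketch contrasting the two scoring rules in this question: keeping weights with the largest final magnitude versus keeping weights that traveled furthest from their initial value. The weight values are made up for illustration and `keep_top_fraction` is a hypothetical helper; this shows that the two rules can select different weights, not which one trains better.

```python
def keep_top_fraction(scores, keep_fraction):
    """Return a 0/1 mask keeping the highest-scoring fraction of weights."""
    n_keep = int(round(len(scores) * keep_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]

# Made-up initial and trained values for six weights.
w_init = [0.50, -0.02, 0.30, -0.40, 0.05, 0.01]
w_final = [0.55, -0.60, 0.28, 0.10, 0.04, 0.45]

# Rule 1: score by final magnitude (large weights survive).
magnitude_mask = keep_top_fraction([abs(f) for f in w_final], 0.5)

# Rule 2: score by distance traveled during training.
movement_mask = keep_top_fraction(
    [abs(f - i) for i, f in zip(w_init, w_final)], 0.5)

print("magnitude:", magnitude_mask)
print("movement: ", movement_mask)
```

In this example the first weight is large at the end but barely moved, so the magnitude rule keeps it and the movement rule drops it; the fourth weight is the opposite case.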

  • @freemind.d2714 3 years ago +1

    Very good hypothesis, it makes a lot of sense.

  • @TimScarfe 4 years ago +5

    Great video! Looking forward to having a discussion on our street talk podcast!

  • @HassanBinHaroon 1 year ago

    Hi @Yannic Kilcher!
    Can't we control the Random Initialization to keep almost every weight in the network (to get the most out of the original network)?
    Can't every weight win the lottery?

  • @jrkirby93 4 years ago +10

    I love the idea of sparse neural nets. It feels kinda icky looking at these grossly overparameterized models that are often SOTA and thinking: "Right now, this is the best way of doing this."
    Pruning is a good technique for finding sparse neural nets. I thought this was a great paper when I first read it.
    But I've been working on my own research that approaches sparse NNs from the other direction. Instead of starting with fully connected layers and pruning, I start with extremely sparse layers and build them up, one edge at a time. It requires quite a different training procedure though. Instead of back-propagation and gradient descent, I take advantage of the piecewise linear properties of ReLU to guarantee a fully piecewise linear neural net. This allows me to explicitly find the optimal next best edge, and its optimal value, in a single optimization step.
    I hope to finish implementing my research in the coming weeks, and would be happy to show you in more detail if you're interested.

    • @jepkofficial 4 years ago +3

      What happened with this research?

    • @jrkirby93 4 years ago +3

      @@jepkofficial Wow was that really 6 months ago? I still haven't finished implementing it. Hard to focus when working alone on independent research. Thanks for the reminder, I should return to that project and get it done.

    • @Leibniz_28 3 years ago +1

      How's the research going?

    • @laurenpinschannels 3 years ago +3

      checking in on this again, on the off chance you didn't get distracted from this one :)

    • @Poof57 2 years ago +1

      @@jrkirby93 woohoo another reminder here :P

  • @thejll 9 months ago

    Very interesting. Does anyone know of software that allows doing this pruning?

  • @HassanBinHaroon 1 year ago

    Hi @Yannic Kilcher!
    It seems that the random initialization is very important before the pruning, right? Because only the lucky weights (in terms of random initialization) are kept after pruning. If the random initialization is so bad that there are no (or very few) lucky candidate weights, what should be done in that case?
    Is there any particular random initialization recommended by the paper or by practice?
    There are some recommended random initialization methods, like Glorot or He.

  • @HappyManStudiosTV 4 years ago +2

    hey! have you seen uber's follow up work? they basically say
    that the trick is just to prune weights that are going *towards* 0,
    not near 0
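
A hedged sketch of the distinction as this comment states it: pruning weights whose magnitude shrank during training ("going towards 0") versus pruning weights that simply end up small ("near 0"). The scoring rule `abs(final) - abs(init)` is one plausible reading of the comment, not a verified reproduction of the follow-up work, and the weight values and function names are made up.

```python
def prune_moving_toward_zero(w_init, w_final, prune_fraction):
    """Prune the weights whose magnitude decreased the most during training."""
    change = [abs(f) - abs(i) for i, f in zip(w_init, w_final)]
    n_prune = int(len(change) * prune_fraction)
    pruned = set(sorted(range(len(change)), key=lambda i: change[i])[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(change))]

def prune_near_zero(w_final, prune_fraction):
    """Prune the weights with the smallest final magnitude."""
    n_prune = int(len(w_final) * prune_fraction)
    pruned = set(sorted(range(len(w_final)), key=lambda i: abs(w_final[i]))[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(w_final))]

# Made-up initial and trained values for six weights.
w_init = [0.80, 0.05, -0.60, 0.02, -0.30, 0.10]
w_final = [0.40, 0.06, -0.10, 0.30, -0.35, 0.01]

toward_zero_mask = prune_moving_toward_zero(w_init, w_final, 0.5)
near_zero_mask = prune_near_zero(w_final, 0.5)

print("toward 0:", toward_zero_mask)
print("near 0:  ", near_zero_mask)
```

Here the first weight has the largest final magnitude but shrank a lot, so only the "towards 0" rule prunes it; the second weight is small but did not shrink, so only the "near 0" rule prunes it.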

  • @joirnpettersen 4 years ago +6

    What if, instead of pruning the weights, you assume the low-magnitude weights were initialized incorrectly, and re-train the dense network with the high-magnitude weights kept at their initial values while the low-magnitude weights get new values?

    • @YannicKilcher 4 years ago +1

      I've never heard this idea. Nice, might be worth a try. I doubt you're gonna get a massive improvement, but it might be interesting to analyze whether you could find an even smaller winning hypothesis.
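
The idea in this thread can be sketched as follows: rank weights by trained magnitude, keep the original initial value for the high-magnitude ones, and redraw the rest before retraining the dense network. This is only an illustration of the proposal; the helper `reinit_low_magnitude`, the Gaussian redraw, and its `scale` are assumptions, and nothing here says whether the approach helps.

```python
import random

def reinit_low_magnitude(w_init, w_trained, fraction, rng, scale=0.1):
    """Build a new dense initialization.

    Weights in the lowest `fraction` by trained magnitude get a fresh random
    value; all other weights keep their ORIGINAL initial value.
    """
    n_redraw = int(len(w_init) * fraction)
    order = sorted(range(len(w_trained)), key=lambda i: abs(w_trained[i]))
    redraw = set(order[:n_redraw])
    return [rng.gauss(0.0, scale) if i in redraw else w_init[i]
            for i in range(len(w_init))]

rng = random.Random(0)
# Made-up initial and trained values for six weights.
w_init = [0.12, -0.03, 0.08, -0.15, 0.01, 0.07]
w_trained = [0.90, -0.02, 0.05, -1.10, 0.60, 0.01]

new_init = reinit_low_magnitude(w_init, w_trained, 0.5, rng)
print(new_init)
```

The returned list is still dense, so the next training run uses the full network rather than a masked one.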

  • @vishwajitkumarvishnu3878 4 years ago +4

    How do you read and understand any paper so fast? Does it come with practice, or is there a particular way to read the different sections? I want to do that. Uploading a video on how to read a paper might help :)

    • @YannicKilcher 4 years ago +13

      After you've read a bunch, the structure, the methods and the ideas become repetitive across the entire field, and that speeds up the reading process a lot. I guess I can do a video on that, but it will be pretty straightforward and obvious.

    • @vishwajitkumarvishnu3878 4 years ago +2

      @@YannicKilcher it'll be helpful if you make a video. Thanks a lot

  • @kevalan1042 3 years ago

    did they check if those initial weights already tend to be relatively large ?

  • @eugening 4 years ago

    Good discussion. The sound is a bit too soft.

  • @araldjean-charles3924 1 year ago

    For the initial conditions that work, has anybody looked at how much wiggle room you have? Is there an epsilon-neighborhood of the initial state you can safely start from, and how small is epsilon?

  • @MrSb192 3 years ago +2

    Question: suppose we have a network N that we train up to a certain accuracy on some data, prune p% of the weights using some algorithm (one-shot, IMP, etc.) and revert the remaining weights to their initial values. Now, is there any way to ensure that the resulting pruned network will always perform better than the original when trained for the same number of iterations? I mean, is there any pruning algorithm that can guarantee finding a lottery ticket within the network every time we use it? Or is it just trial and error (which is why, I guess, the term lottery ticket is used)?
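
On the one-shot versus iterative distinction mentioned in this question, the bookkeeping can be written down directly. Assuming each round removes a fraction $p$ of the weights that are still alive (the symbols $p$, $n$, $r_n$ and $r_{\text{target}}$ are introduced here for illustration), the surviving fraction after $n$ rounds and the number of rounds needed to reach a target are as below. This does not address whether a winning ticket is guaranteed; it only shows how the two procedures reach the same sparsity.

```latex
r_n = (1 - p)^n,
\qquad
n = \left\lceil \frac{\ln r_{\text{target}}}{\ln (1 - p)} \right\rceil
% Example: p = 0.5 and n = 3 rounds leave r_3 = 0.5^3 = 0.125 of the weights.
% One-shot pruning reaches the same sparsity by removing 87.5% in a single round.
```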

  • @herp_derpingson 4 years ago +4

    Reminds me of dropout for some reason. Except we are throwing away the dropped out neurons.

    • @JungleEd17 4 years ago +1

      I watched it 2x, and I think it's the connections that are thrown out, not the neurons.
      What's interesting here though:
      1. The weights are what are important.
      2. Pruning involves throwing out both weight AND structure.
      Why not keep the structure but choose new weights? Perhaps it just randomly started at a plateau of a local min, or the randomization ended up creating redundancies. Jump the weight really far away and try again.

    • @fsxaircanada01 4 years ago

      I think the motivation is that activations are not the biggest source of memory access and energy loss. If we can get rid of 90% of weights, then it could mean speed and energy improvements

  • @Blooper1980 3 years ago +1

    Interesting.. Just need to take the sick out of your mouth next time.