At 21:55 I think you should draw horizontal lines (not vertical) to discard architectures since score is on the y-axis. No?
We don't really know what the cutoff score should be. Maybe we can keep the top N highest-scoring models.
ah I was about to say the same! It's confusing when the cause has been plotted on the y axis and the effect on the x axis.
Isn't the score measurement on the left evaluated from actual training? I think he meant to discard architectures before even training them, which I think means that you have to select a vertical threshold on the validation accuracy, like he did.
Funny tho, Yes you can discard any crap with bad validation accuracy... if only there were some way to predict that without having to train and validate it ... :0
@@pladselsker8340 Nope, 'score' is what is determined by eqn (2) for an untrained network, while validation accuracy is for a trained network. Before training, one could calculate 'score' for each network, and they would look like dots plotted on a vertical line. Then, discard all networks below a certain score -- by drawing a horizontal line that cuts this vertical line -- and only train the networks that lie above it.
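The horizontal cut described in this thread, and the "top N" alternative suggested above, can be sketched in a few lines. This is only an illustration: `select_by_score` is a hypothetical helper, and the scores are made-up placeholders, not values from the paper.

```python
import numpy as np

def select_by_score(scores, threshold=None, top_n=None):
    """Return indices of architectures worth training, highest score first.

    scores: one score per *untrained* architecture (hypothetical values here).
    threshold: keep architectures with score >= threshold (the horizontal cut).
    top_n: alternatively, keep only the top_n highest-scoring architectures.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # indices sorted from highest to lowest score
    if top_n is not None:
        return order[:top_n]
    if threshold is not None:
        return order[scores[order] >= threshold]
    return order

# Placeholder scores for five untrained architectures (not real data).
scores = [3.2, 0.5, 2.7, 1.1, 4.0]
keep_threshold = select_by_score(scores, threshold=2.0)  # cut at score 2.0
keep_top2 = select_by_score(scores, top_n=2)             # keep the best two
```

Only the architectures whose indices are returned would then be trained and validated; validation accuracy never enters the selection.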
These long videos are really growing on me. Not just introducing me to papers that I am not familiar with, but also the additional insight of your perspectives. Thank you.
Thank you so much, you save 10s of thousands of people hours of work. The impact of your work is immense even if you don't get hundreds of thousands of views. Please never stop, you're amazing!
my 6-word slogan for this paper:
Neural architecture physiognomy! And it works!
Yet another area I am most interested to hear about. Thank you
Thank you for the efforts.. highly appreciated!!
It is basically an "anti-lottery ticket hypothesis".
.
33:00 For the RL based search models, I think we would still need some negative samples. Otherwise the RL model would keep suggesting bad models for the sake of exploration.
.
Nice paper, easy to implement. Will definitely use this trick.
amazing, always like your inspiring interpretation
If it's true, then pre-filtering (or rejection sampling) based on this average score is a cheap speedup tool for any neural architecture search algorithm too.
Thank you for the very clear explanation!
This is very exciting and as others mention in the comments a super speedy tool to discard faulty architectures, thanks for the video!
Great video! Thanks for your personal interpretation too, it helps think things through. I would argue though that the interpretation of the pytorchcv plot at (25:40) is wrong (admittedly, I don't know if it's your interpretation or the authors', since they seem to have removed this part from their most recent version). But it looks like they're showing that their metric gives high scores to methods that we know do well. That is, architectures that have been found by humans to perform well achieve a high NASWOT score (or whatever they call it).
Thank you for sharing this. It's very interesting. I learned a lot.
Thanks for sharing. Commendable work.
your thumbnails are getting better
I think that a lot of the lack of performance compared to other techniques can be explained by the way the NAS-Bench-201 benchmark is constructed:
We only have 15,625 different architectures: enough for the sample-efficient "train until you're done" NAS systems, but searching without training may just need significantly different architectures. This would also explain why the metric "spreads out" on the more complex tasks: there just aren't enough ways in which the NAS-Bench-201 networks differ to make a meaningful difference that can be observed just by looking at the initial state. Maybe one could combine this approach with something like NEAT to generate a population of architectures and score them pretty much instantly using this. This would allow the system to get away from the "resnet-likes" that make up NAS-Bench-201.
I wonder if this scoring could be improved simply by exchanging regular correlation with distance correlation, since that will also capture non-linearities. It might make a difference in particular in those networks where currently the score no longer tells you much.
I just read about distance correlation. It makes sense.
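For anyone who wants to see the difference between the two measures, here is a minimal NumPy sketch of the sample distance correlation (the Székely et al. definition), compared with Pearson correlation on a toy quadratic relationship. It is an assumption-laden illustration: it does not reproduce the paper's score, and whether swapping it in would help is exactly the open question raised above.

```python
import numpy as np

def _centered_distances(v):
    """Pairwise Euclidean distances between samples, double-centered."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(v.shape[0], -1)  # one row per sample
    d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between paired samples x and y."""
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = (A * B).mean()                            # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())  # product of distance std devs
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

# Toy example: y depends on x, but not linearly.
x = np.linspace(-1.0, 1.0, 201)
y_quadratic = x ** 2
pearson = float(np.corrcoef(x, y_quadratic)[0, 1])  # ~0 by symmetry
dcor = distance_correlation(x, y_quadratic)         # clearly above 0
dcor_linear = distance_correlation(x, 2.0 * x + 1.0)  # exact linear relation
```

Pearson correlation misses the quadratic dependence entirely, while distance correlation picks it up; for an exact linear relation both measures agree that the dependence is perfect.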
The role of the nonlinear function in a neural network can be treated like the if...else statement in a traditional programming language.
LSTM, GRU and attention can be treated the same way: they provide switch-control capability.
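A small sketch of this reading for ReLU, assuming a single dense layer. `relu_layer` and the weights are hypothetical; the boolean mask records which branch of the if...else each unit takes for a given input.

```python
import numpy as np

def relu_if_else(x):
    """ReLU written literally as an if...else on a scalar."""
    if x > 0:
        return x
    else:
        return 0.0

def relu_layer(W, b, x):
    """One dense ReLU layer; also returns the on/off pattern of its units."""
    pre = W @ x + b
    mask = pre > 0  # True where the unit takes the 'if' branch
    return np.where(mask, pre, 0.0), mask

# Hypothetical 2x2 layer applied to one input.
W = np.array([[1.0, 0.0], [0.0, -1.0]])
b = np.zeros(2)
out, mask = relu_layer(W, b, np.array([2.0, 3.0]))
```

For a fixed mask the layer is just a linear map, so two inputs with the same on/off pattern are treated by the same linear function.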
Unrelated dumb comment, but that annotated NAS-Bench-201 diagram at 19:52 roughly looks like a map of Switzerland. Though yeah, I am doing research on network compression and this is a really cool idea. I would like to see more studies on the relationship between these parameters, scores and inference speed, so that we can also optimize NAS to get the "smallest" network, or whatever gives the best inference speed on embedded systems, while still giving reasonably good accuracy.
You have a misunderstanding at #21:42 about the axes. Covariation score is on the vertical axis, and validation accuracy (after training) is on the horizontal axis. So, if you wish to use the described method to filter out "bad" architectures quickly, you should cut by score (draw a horizontal line at some threshold level) instead of drawing a vertical line as at #22:21. That actually means this method is even further from being precise by itself...
The idea is interesting, thanks for making it so accessible. The big question is: Is it useful? I read fig 3 differently to you and so I come to a different conclusion. You think that this method weeds out most/90% of the bad architectures, I think it weeds out very few. If we had an uninformative method then the correlation line would be vertical in fig 3, that is, the score would not be correlated with accuracy. By eye I integrate vertically to get the distribution of accuracies that would happen with a useless method. I then do the same again for (say) the top 10% of scores. Scatter plots are terrible at showing density, but it looks to me that all the probability mass is at the top of the plot, so the distributions would be very similar. The authors could have done these basic stats.
I think the idea has merit, at least intuitively. I'd like your input on why it wouldn't work.
@@ameetrahane1445 I agree, I do think it's an interesting idea. Please note that my long comment was on fig 3 only, it's easy to have some really bad architectures out of all the 15625 combinations and I believe these are given too much weight by the use of the scatter diagram (which doesn't show density well). Thus I was commenting on specifics, not making a general statement that "it wouldn't work".
True, you have a good point. Maybe it would be worth investigating this quantitatively.
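One way the quantitative check discussed in this thread could look: compare the accuracy distribution of all architectures with that of the top 10% by score. The helper `compare_top_fraction`, the 0.5 "bad" accuracy cutoff, and the data are all assumptions; the arrays below are synthetic, not NAS-Bench-201 results.

```python
import numpy as np

def compare_top_fraction(scores, accuracies, fraction=0.1, bad_accuracy=0.5):
    """Compare accuracy statistics of all architectures vs. the top-scoring fraction."""
    scores = np.asarray(scores, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    cutoff = np.quantile(scores, 1.0 - fraction)  # score needed to be in the top fraction
    top = scores >= cutoff
    return {
        "bad_rate_all": float(np.mean(accuracies < bad_accuracy)),
        "bad_rate_top": float(np.mean(accuracies[top] < bad_accuracy)),
        "median_all": float(np.median(accuracies)),
        "median_top": float(np.median(accuracies[top])),
    }

# Synthetic stand-in data: a score that is a noisy function of accuracy.
rng = np.random.default_rng(0)
accuracies = rng.uniform(0.1, 0.95, size=1000)
scores = accuracies + rng.normal(0.0, 0.2, size=1000)
stats = compare_top_fraction(scores, accuracies)
```

If the score were uninformative, the "all" and "top" statistics would come out nearly identical; running the same comparison on the real benchmark table would settle how much the score actually filters.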
Am kinda disappointed they did not show the scores for the well-performing trained networks after showing that initialisation affects the score significantly. If trained networks tend to have a greater correlation between their score and accuracy, perhaps this method can be made more useful by somehow mitigating the randomness from initialisation. Perhaps train the network for a few rounds on random data to maximise the score and use that?
A tangent: If random data does not affect the score, and the score is correlated with accuracy, what if a network is trained on random data to maximise the score? Would that necessarily increase the accuracy on the actual data? This is one reason I'm skeptical of this method: I don't think the score is a good indication of accuracy, as it does not seem to account for the training data much.
Hi @Tony Robinson, I do not quite understand this sentence in your comment " If we had an uninformative method then the correlation line would be vertical in fig 3, that is the score is not correlated with accuracy.", could you please clarify more? thanks
Is J of shape NxD or DxN (where D is the dimensionality of x)? The shape of JJ^T would be NxN and DxD respectively in these two cases. Clearly the first makes more sense in context but the J_i,n in the second line below (1) seems to indicate otherwise.
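A quick shape check of the two conventions, using a random matrix as a stand-in for the per-input gradients (this is only about shapes, not the paper's actual computation; N and D are arbitrary here):

```python
import numpy as np

N, D = 4, 6  # N inputs in the mini-batch, each of dimensionality D
rng = np.random.default_rng(0)

# Stand-in for the Jacobian: one row per input (the N x D convention).
J = rng.normal(size=(N, D))

K_rows = J @ J.T       # shape (N, N): compares inputs with each other
K_cols = J.T @ J       # shape (D, D): compares input dimensions with each other
corr = np.corrcoef(J)  # shape (N, N): correlation between the rows of J
```

With one row per input, both `J @ J.T` and the row-wise correlation matrix are N x N, which matches the reading that the score compares data points rather than input dimensions.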
I don’t know if you’re aware, but the paper seems to have been edited/updated since you made this video with different graphs, showing correlation matrices instead of histograms, and a different equation for computing the score. Is this common for papers to be changed after publishing? Do you know if the new equation is mathematically equivalent and preferred because it’s easier to calculate? Or is it just a different score that measures approximately the same thing?
What if you first train for let's say 5 epochs and then compute the score?
I think this goes in the direction of something people do, where computing power is saved by only doing as few updates as necessary to see whether the architecture is good/ bad. If it's good after five steps you might decide to continue for another bit since good models are harder to tell apart than bad models.
The paper in this video seems to have found a better predictor of performance at convergence than is the score after five steps.
@@dermitdembrot3091 I know. I was wondering if the scores get more accurate / reliable when training the network for a few epochs and then looking at the correlations. Because if I understand it correctly, the network is just initialized and the correlations are based on the random weights. I just find it hard to understand why correlations of random weights are a good indicator of the final prediction. But I did not read the paper, just watched the video, so maybe I did not fully understand the idea.
Oh sorry I completely misread your comment. Kind of assumed you meant accuracy score. I agree that it's worth investigating whether the "gradient correlation" score improves the evaluation. Quite possibly the authors tried and didn't see an improvement.
Very interesting video once again! I have to say I thought I was going to like the "historical papers" but I have to admit that I found the present word2vec and gan videos boring and did not finish them.
Just wanted to leave that feedback.
I somehow like these one-step methods. What I do not directly understand is how this method can predict the generalization capabilities of a network-architecture (e.g. validation set accuracy) from the linear map histogram of one mini-batch.
Alternative 6-word slogan:
First, sanity check neural architecture expressivity!
I wonder if there's a way to use that score as the reward for the RL algorithms instead of the accuracy. I think that would lower the computation time without necessarily dropping performance much, but I might be horribly wrong haha
Thought so too, should at least be interesting to find out!
Great video as always!
Did you activate ads? I don't mind them at all! I am only asking because you recently mentioned, you didn't plan to enable them.
Yes, I announced that in the latest channel update :)
@Yannick Kilcher. Great job. Thank you! Can you explain EfficientNet/EfficientDet paper?
I guess that if the loss landscape is similar to a random spin glass Hamiltonian, as Yann was saying, then it makes some sense to have some spreading in the orientation of the linearization... To some extent it is sad that we are basically saying that a convex function has to be discarded from the beginning :) I am curious to see some changes in the loss function as well.
Could you please point out the reference for Yann's point? Thank you!!
@@arnoldchen1108 it is a pretty old paper by current standards arxiv.org/abs/1412.0233 .
@@blanamaxima Definitely an interesting read though. Thanks a lot!
Great paper and great review! I wonder if we can replace the gradient w.r.t. input with gradient w.r.t. weights. The updated score can be related to NTK and to the metric over the function space. Would such change produce a better expressiveness score? Insights anyone?
interesting idea. I know too little about ntk to have an informed opinion :)
I think the correlation could be better after training for a few batches on the hard tasks (ImageNet). The lottery ticket also had a similar problem with harder tasks and needed a bit of training.
Does it make sense?
Yes, totally.
Am I right in thinking that people always initialize networks with random weights (or weights from a previous training)?
Has anyone done any work looking at what happens if you use some sort of "less" random values as initial ones?
Is all randomness created equally, so to speak, and is it important to be completely random in your starting point?
What happens if you initialize with a regular pattern, does it fail to train at all?
There is work in this area, but I think without significant improvement over random init.
🤔 Add this before HyperBand in keras-tuner + add Bayesian opt after HyperBand.
1. search without train to get rid of 50-80 % of the really bad architectures
2. HyperBand to quickly abandon poor architectures
3. Bayes opt (with all of the HyperBand runs as input) for the fine-search
Thanks!
I can't believe they've trained ImageNet 50000 times 😝
Has anyone used this? Does it actually work? Please let me know.
I think that this article implies some kind of contradiction if we look at it in the context of manifold mixup. In that article they claimed (if I am right) that they reduced the number of meaningful eigenvalues, making the manifold itself more linear-like; here I am hearing exactly the opposite thing.
Have observed this in some places. At least in simple datasets, too much non-linear behaviour also allows over-fitting which might cause unexpected behaviours for high scores.
So we use an AI to build another AI... why does that feel so spooky...
Why do guys in ML love to rename already established concepts?... Why "linear map of data" instead of simply the first-order Taylor expansion?