This Physics Technology Trains Machine Learning 10X Faster

  • Published: 22 Dec 2024
  • Science

Comments • 52

  • @CompuFlair
    @CompuFlair  27 days ago +2

    Hello friends.
    🚀 Join the CompuFlair Community! 🚀
    📈 Sign up on our website to access exclusive Data Science Roadmap pages - a step-by-step guide to mastering the essential skills for a successful career.
    💪As a member, you’ll receive emails on expert-engineered ChatGPT prompts to boost your data science tasks, be notified of our private problem-solving sessions, and get early access to news and updates.
    👉 compu-flair.com/user/register
    Also, here is the link to the Jupyter notebook presented in this video (on google colab):
    colab.research.google.com/drive/1jHJ-LevD34f1Rgf5YsuYaGD8SzD8hwNJ?usp=sharing

  • @vidal9747
    @vidal9747 27 days ago +3

    I am convinced that if computer scientists and physicists talked to each other, we would both achieve incredible things.

    • @CompuFlair
      @CompuFlair  27 days ago

      This is encouraging. Thanks for that!

  • @turun_ambartanen
    @turun_ambartanen 28 days ago +8

    Stochastic gradient descent is used to train neural networks because it scales surprisingly well with the number of parameters/weights. For example, the higher the number of parameters, the fewer local minima and flat spots are present in the loss landscape. This was surprising to me, but it just pops out of the math in higher dimensions. Also, neural networks are trained with activation functions whose derivatives can be calculated easily at any point. So getting the gradient, which for some mathematical functions can be quite costly, is actually dirt cheap for neural networks. The time complexity of the computation is the most important thing nowadays.
    PS: I could not reproduce the results with the code shown on screen. If you have a repo with the python code that would be neat. Pro tip: seed the numpy rng to make the code deterministic.
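
    A minimal sketch of the two points above (a cheap gradient and a seeded numpy rng); the data, learning rate, and batch size below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)                # seeded, so every run gives the same result
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

    w = np.zeros(3)                               # linear model y ~ X @ w
    lr = 0.1
    for _ in range(200):
        idx = rng.integers(0, len(y), size=32)    # random mini-batch
        err = X[idx] @ w - y[idx]
        grad = X[idx].T @ err / len(idx)          # gradient of 0.5 * mean(err**2); cheap to evaluate
        w -= lr * grad
    print(w)                                      # approaches [1.5, -2.0, 0.5]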

    • @CompuFlair
      @CompuFlair  28 days ago

      Yes, for sure, gradient descent is a great method that we can see in action today in LLMs. And no doubt it scales well.
      However, perturbation theory also scales very well. Its main advantage over gradient descent, I believe, is that it comes pre-computed. The whole point of numeric computation is to replace an analytic calculation that would otherwise be impossible. Perturbation theory can reduce computation by providing an analytic solution with only a few parts left unknown.
      Regarding the seed, I intentionally didn't set it so that the speed advantage is tested on random draws, to make sure it's not accidental.

    • @2299momo
      @2299momo 27 days ago +1

      @@CompuFlair Any idea why he couldn't reproduce your results? I don't think your response should've glossed over that talking point. Share a set seed to see if the OP can get an identical result.

    • @CompuFlair
      @CompuFlair  27 days ago

      @@2299momo Thanks for the reminder. I just pinned my comment with the link to my notebook on google colab. Please check that out.

  • @ChaseFreedomMusician
    @ChaseFreedomMusician 28 days ago +6

    This seems fine for basic predictions, but what about very large datasets with thousands of columns and millions or billions of rows? Or models that use state or sequences, like LLMs or time-series prediction? How would we apply this insight?

    • @CompuFlair
      @CompuFlair  28 days ago +4

      To know for sure, we have to give it a try. However, larger models and more complicated datasets have more complicated parameter landscapes, which challenge the performance of minimization methods even more. So my guess is that this method would outperform the minimization approach by an even wider margin.

    • @ChaseFreedomMusician
      @ChaseFreedomMusician 28 days ago +5

      @@CompuFlair I get what you're saying, and sure, trying things out is always a good approach. But I think there’s something fundamental here that we shouldn’t overlook: non-linear relationships.
      Covariance matrices are great for capturing linear dependencies, but real-world data is rarely that straightforward, especially with large datasets or sequences. Think about time series data or models like LLMs. The relationships between inputs and outputs in those cases aren’t just simple, "if X goes up, Y goes up" kind of patterns. There’s context, feedback loops, and interactions across time or space that linear approaches just can’t capture.
      Take text prediction as an example. A covariance-based method might pick up that "The" is often followed by "cat," but it’ll completely miss that the verb in "The cat that chased the mouse" has to agree with "cat," not "mouse." That’s a hierarchical dependency, and it’s non-linear by nature.
      DNNs, especially with things like attention mechanisms, shine here because they don’t just find pairwise relationships. They learn layers of abstraction: simple patterns in the first layer, combinations of patterns in the next, and so on. That’s what lets them handle things like long-range dependencies in text or complex non-linear trends in time series.
      I’m not saying covariance matrices can’t be useful; they’re fast, interpretable, and great for simpler problems. But when the data is messy, interactive, and non-linear, you’re going to hit a wall where the relationships you need to model just don’t fit into a linear framework. That’s where DNNs pull ahead, even if they’re computationally heavier.
      So yeah, I’m all for trying it out. But I think the performance gap you’re suggesting would actually get worse, not better, as the dataset gets more complex.

    • @CompuFlair
      @CompuFlair  28 days ago +4

      @@ChaseFreedomMusician Thanks for the comment. Non-linear interactions are the whole point of this video. Perturbation theory, with 300 years of history behind it, was developed precisely because of the non-linear interactions that you (correctly) emphasized.
      I think performance gets much better in the presence of non-linear interactions, because we can still find analytic approximations using perturbation theory, and an analytic solution usually means a significant reduction in computation.

    • @ChaseFreedomMusician
      @ChaseFreedomMusician 28 days ago +6

      ​@@CompuFlair Thanks for explaining. Perturbation theory is definitely powerful, and I can see how it might help approximate non-linear interactions. But I’m struggling to see how it connects to the example you gave.
      The method you described, using:
      cov = np.cov(data.T)
      cov_inv = np.linalg.inv(cov)
      beta = -cov_inv[:-1, -1] / cov_inv[-1, -1]
      seems to focus purely on linear relationships. The covariance matrix captures how variables linearly co-vary, and the inverse isolates direct linear dependencies. Without explicitly adding higher-order terms or interaction features (like x^2 or x * y), it doesn’t seem like this approach would naturally handle non-linear relationships. Am I missing a step where those non-linear interactions are introduced?
      Also, on the computational side, inverting a large covariance matrix becomes challenging as the dataset scales. For thousands of columns or millions of rows, this step could become a bottleneck. In contrast, while DNNs are computationally heavy, they scale well across distributed systems and handle non-linearity without requiring feature engineering.
      How would you incorporate perturbation theory into the method you showed to model non-linear relationships? Or are you suggesting a way to add those higher-order terms directly into the covariance matrix? I’m curious how this would look in practice.
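
      A self-contained sketch of the quoted estimator, assuming the target is the last column of data; the synthetic data and the cross-check against np.linalg.lstsq are only for illustration:

      import numpy as np

      rng = np.random.default_rng(1)
      n, p = 5000, 4
      X = rng.normal(size=(n, p))
      y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.3 * rng.normal(size=n)
      data = np.column_stack([X, y])                  # target y in the last column

      cov = np.cov(data.T)                            # (p+1) x (p+1) covariance matrix
      cov_inv = np.linalg.inv(cov)                    # precision matrix
      beta = -cov_inv[:-1, -1] / cov_inv[-1, -1]      # regression slopes from its last column

      # Cross-check against ordinary least squares with an intercept
      beta_ols = np.linalg.lstsq(np.column_stack([X, np.ones(n)]), y, rcond=None)[0][:p]
      print(np.allclose(beta, beta_ols))              # True: the slopes agree up to rounding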

    • @CompuFlair
      @CompuFlair  28 days ago +7

      @@ChaseFreedomMusician Yes, the example I presented is linear, but that is the essence of perturbation theory: solve the linear part first, then add corrections one by one.
      And yes, most of the time the probability (the F in the exp) explicitly contains x^3 and higher terms. These are exactly the corrections perturbation theory was invented to handle approximately. So, in their presence, we have to calculate corrections to my code and add them; the covariance matrix alone won't be enough. The form of the corrections depends on the model. For linear regression, they are zero.
      Regarding high dimensions and the inverse of a matrix in such large spaces, I'm not that worried because in field theory, where we use perturbation theory, the dimensions are not merely high, they are infinite, and we have techniques to find the inverse in infinite dimensions.
      Regarding how we add corrections: we first ignore them and solve the linear version. That is what I did. Then we add the largest correction, the x^3 terms, and update the previous answer. Then we add the next largest correction, x^4, and update the answer of the previous step, and this loop goes on until we are satisfied with the accuracy of the model.
      This whole loop has a systematic mathematical machinery that I am going to cover in future videos.
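
      A generic toy illustration of that loop (not the video's derivation): solve the linear problem first, then repeatedly feed the answer back into a small non-linear correction term. The equation x = 1 + eps * x**3 and the value eps = 0.1 are made up for the sketch:

      eps = 0.1                                    # small parameter multiplying the non-linear term
      x = 1.0                                      # zeroth order: solution of the linear problem x = 1
      for order in range(1, 6):
          x = 1.0 + eps * x**3                     # add the correction, re-using the previous answer
          print(order, x, x - (1.0 + eps * x**3))  # current value and its residual
      # The residual shrinks with each pass, so the loop stops once the accuracy is good enough.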

  • @bojanbernard180
    @bojanbernard180 27 days ago +2

    You hint at Feynman diagrams as an example of perturbation theory. That works because you expand in a power series of the fine-structure constant (~1/137); it works much worse for the strong force, etc. What kind of real-world problems could be formulated in a way that assures fast convergence? Otherwise there is little or no gain compared to SGD.

    • @CompuFlair
      @CompuFlair  27 days ago +1

      This is a great comment and you have mentioned a critical point. We have the max-entropy principle in information theory (which applies to ML) and the 2nd law of thermodynamics (which applies to physics systems), and both guarantee the existence of an equilibrium state. ML works only after the system has reached this state (otherwise the probability distribution keeps evolving after we collect data, and that data can't predict future events). In this equilibrium state, perturbations are small by definition. I have covered this in 2 of the earlier videos in this playlist.

  • @drdca8263
    @drdca8263 28 days ago +2

    3:40 : [removed thing saying essentially “I’ve heard that the problem is more often the gradient being flat, rather than gradient being zero with second derivative being positive”]
    3:51 : oh nvm you addressed it being flat as well
    3:54 : computing the gradient is costly?
    This surprises me.
    Well, I suppose computing the loss for the entire dataset is computationally costly,
    but for the loss for a single datapoint, my impression was that it was roughly as expensive as 2x the forward pass?
    You run the forward pass, computing the gradient of each neuron with respect to its inputs and its parameters, evaluated at its current inputs and parameters, storing these for each layer as a pair of matrices, right? And like, for sigmoid or tanh, the nonlinearity’s contribution to the gradient is readily computed from the activation, I thought?
    23:25 : hmm…
    One thing I’m a bit unclear on is how we determine, in general, how the J variables should fit into the F variable.
    Should it always be a J_i w_i term for each parameter w_i of the model? Or a J_i x_i term for each input x_i of the model?
    I saw you wrote something about combining the J with x in the derivation you showed…
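
    On the 3:54 point above, a tiny sketch of why the backward pass is roughly as cheap as the forward pass for a sigmoid layer: the derivative is recovered from the stored activation. The layer sizes and the toy loss L = sum(s) are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))           # one toy layer: s = sigmoid(W @ x + b)
    b = rng.normal(size=4)
    x = rng.normal(size=3)

    a = W @ x + b
    s = 1.0 / (1.0 + np.exp(-a))          # forward pass; s is stored

    # Backward pass for L = sum(s): the sigmoid's derivative s * (1 - s)
    # comes straight from the stored activation, no extra function evaluation.
    dL_da = np.ones_like(s) * s * (1.0 - s)
    dL_dW = np.outer(dL_da, x)            # gradient w.r.t. the weights
    dL_db = dL_da                         # gradient w.r.t. the bias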

    • @CompuFlair
      @CompuFlair  28 days ago +3

      "computing the gradient is costly?"
      Well, we are comparing: costly with respect to what? We need to find the 1st and 2nd derivatives to locate a minimum, and that is extra work.
      "One thing I’m a bit unclear on, is how we determine in general how J variables should fit into the F variable"
      It must be the variable we are summing over, so that when we take the derivative with respect to J, that variable falls down from the argument of the exp and returns the expectation value.
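
      In symbols, the source-term trick described above (written for a generic F; the exact form of F depends on the model):

      Z[J] = \sum_x e^{-F(x) + \sum_i J_i x_i}, \qquad
      \langle x_i \rangle = \left. \frac{\partial \ln Z[J]}{\partial J_i} \right|_{J=0}, \qquad
      \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle = \left. \frac{\partial^2 \ln Z[J]}{\partial J_i \, \partial J_j} \right|_{J=0}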

    • @drdca8263
      @drdca8263 28 days ago +2

      @ Ah, so whatever variables we want to take moments of. Makes sense, thanks!
      (Err… where by “moments” I also mean things like the expectation of e.g. (x_1^2 times x_3) , even if that might not technically be a “moment”(?))
      I will need to think more about how that all works in the case of hidden layers.

  • @llbrunollllbll9347
    @llbrunollllbll9347 26 days ago +1

    Could you look into Liquid Neural Networks? They also use analytic solutions to speed up training. There might be complementary ideas in the air that could make your work even more impactful. Very interesting! Keep pushing forward!

    • @CompuFlair
      @CompuFlair  26 days ago

      Thanks for the comment. I'll take a look.

  • @NLPprompter
    @NLPprompter 28 days ago +2

    Hello sir, I'm wondering: how does the computational complexity of this perturbation-theory approach scale with increasing dataset size and model complexity (e.g., deeper neural networks)? Are there limits beyond which gradient descent becomes more efficient?

    • @CompuFlair
      @CompuFlair  28 days ago +2

      This is a great question to explore. I don't have the answer, though. Just know that perturbation theory in field theory handles infinite-dimensional systems, and the datasets it is applied to are usually very large. But that is in physics; in ML we just need to explore it.

    • @NLPprompter
      @NLPprompter 28 days ago +2

      @CompuFlair Ah, I see... thank you for your kind reply. Such wonders... this always keeps me wanting to learn more. Again, thanks.

    • @CompuFlair
      @CompuFlair  28 days ago +2

      @@NLPprompter you are welcome

  • @hjups
    @hjups 28 days ago +7

    I think you are making a false assumption that this method can practically scale to arbitrary distributions. It seems feasible for simple linear regression with a Gaussian distribution, but what about non-Gaussian distributions, as in classification problems? If I understand correctly, your recipe depends on 1) computing the partition function through perturbation approximations, and 2) solving the system of equations for the unknown parameters.
    How do you intend to compute the partition function for a transformer with 8 billion free parameters? Or, even in the simpler classification case, for a ResNet model with 50 million free parameters? Sure, you can do it in theory, just as a Fourier series can theoretically approximate any compact or repeating function, but in practice it would be equivalent to solving an infinite series.
    And then in regards to solving the system... this will either become far too expensive (where traditional SGD is faster), or you will need to approximate the solution using something like SGD. My guess is that the issue comes down to the dataset itself, where you considered a very small problem. Even consider something as simple as MNIST: each data point is 784-dimensional, and you have 60k of them. That is not a computationally feasible system to solve directly.
    That said, if you have some thoughts on how to deal with these more general cases, perhaps you should focus on the relatively simple classification problems of MNIST and then CIFAR10. Both can be solved with simpler MLPs, but you would have to show that you're able to produce equivalent or better classification accuracy, while also training faster than with SGD.

    • @CompuFlair
      @CompuFlair  28 days ago +4

      Thanks for the comment and the suggestions. They are mostly on my to-do list.
      Classification would be much easier to handle. For example, see this video where I derive the partition function of logistic regression:
      ruclips.net/video/H-ydRnSZbyw/видео.html
      Also, I have already shown the mathematical equivalence of neural nets and the Ising model in physics, where non-Gaussian interactions are fully investigated. See the video here:
      ruclips.net/video/T69vbMkl_uI/видео.html
      In general, I am not worried about the large dimensions or large datasets of neural nets, because perturbation theory was developed to handle infinite dimensions, and where it is used in practice, about 1 billion events (spreadsheet rows) are created every second and the experiment goes on for months. So one can imagine how large the datasets it handles are.
      Check this link out for example:
      home.cern/science/computing/storage#:~:text=Up%20to%20about%201%20billion,out%20all%20of%20these%20events.

  • @supreetsahu1964
    @supreetsahu1964 20 days ago

    Relation to Boltzmann factor! Very smart

    • @CompuFlair
      @CompuFlair  20 days ago

      Thanks for the comment!

  • @volpir4672
    @volpir4672 29 days ago +2

    you rock, this is great!!!!

    • @CompuFlair
      @CompuFlair  29 days ago

      Thanks! Glad you liked it.

  • @eloitorrents2439
    @eloitorrents2439 28 days ago +2

    Is this approach written up anywhere?

    • @CompuFlair
      @CompuFlair  28 days ago +1

      In physics, yes: any book on statistical field theory or quantum field theory. But applied to ML, I guess not. The trick is to convert the ML model to P = e^-F/Z, which is mathematically what those books start with, and then follow what those books prescribe.
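
      For instance (a standard statistics fact, not a claim about the video's derivation), linear regression with Gaussian noise already has that form:

      P(y \mid x) = \frac{e^{-F}}{Z}, \qquad F = \frac{(y - x^\top \beta)^2}{2\sigma^2}, \qquad Z = \sqrt{2\pi\sigma^2}

      so maximizing the likelihood is the same as minimizing F, which is consistent with the earlier reply that for linear regression the perturbative corrections vanish and the covariance matrix alone suffices.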

  • @minecraftermad
    @minecraftermad 27 days ago +1

    How's the memory usage? AFAIK that's one of the main bottlenecks in larger ML problems.

    • @CompuFlair
      @CompuFlair  27 days ago

      That is a great question. Honestly, I haven't checked, but will keep an eye on it next time.

  • @mircorichter1375
    @mircorichter1375 27 days ago +3

    I think the most convincing argument for another training method is not speed, but the possibility of overcoming non-optimal local minima, saddle points, and such. Any method that converges better to global optima is interesting.
    However, there is more. The optima that backpropagation finds are often "robust" (the term is not strictly defined yet, but it roughly means "stable under small deformations of the weights")... This is important, so any other method must have that property too.

    • @CompuFlair
      @CompuFlair  27 days ago +1

      Thanks for the comment. Totally agree that "any other method must have that property too."

  • @TiborVass
    @TiborVass 4 days ago

    Is there a way this could be applied to inference as well?

    • @CompuFlair
      @CompuFlair  4 days ago +2

      In principle, yes, but there is some work to do. For more complex models, the details of training and inference haven't been derived (as far as I know). For linear regression (and multinomial regression), the prediction is just what I showed in the video.

    • @TiborVass
      @TiborVass 3 days ago

      @@CompuFlair Is there a valid reason why scikit doesn't use this faster, analytically derived algorithm for linear and logistic regression?

    • @CompuFlair
      @CompuFlair  3 days ago +1

      @@TiborVass For logistic regression, the details are not out yet. For linear regression, I just published them in the RUclips video a month ago (original, to the best of my knowledge), so they might not know about it, or the minimization method is already fast enough for linear regression and a 10X speedup isn't worth the effort. Just my thoughts.

    • @TiborVass
      @TiborVass 3 days ago +1

      @@CompuFlair wow thanks! I'm pretty sure Scikit would welcome a 10x improvement :) And sorry I thought "multinomial" referred to logistic regression. Thank you for sharing all your amazing work and explanations, it's truly mind-blowing.

    • @CompuFlair
      @CompuFlair  3 days ago

      @@TiborVass Thanks for the comment.

  • @TheRayhaller
    @TheRayhaller 29 days ago +1

    Great video, I really appreciate the python notebook example!

  • @josephmargaryan
    @josephmargaryan 18 days ago

    also do it for classification

    • @CompuFlair
      @CompuFlair  18 days ago

      Thanks for the comment! That is on my to-do list

  • @volpir4672
    @volpir4672 29 days ago +3

    are you on twitter or discord? your work here is really good, it would be nice to discuss

    • @CompuFlair
      @CompuFlair  29 days ago +1

      Not active there but can be reached here or on LinkedIn.