Why the world NEEDS Kolmogorov Arnold Networks

  • Published: 27 Sep 2024

Comments • 75

  • @TranquilSeaOfMath
    @TranquilSeaOfMath 4 months ago +6

    Great video and interesting topic. I like that you had part of the paper visible for viewers to read. You have a nice library there.

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +4

      Thank you! I’m glad you enjoyed the video

    • @esantirulo721
      @esantirulo721 3 months ago

      With some Marvel books! (to the right of the Matlab book). Anyway, nobody is perfect. Thanks for the video!!!

  • @mohdherwansulaiman5131
    @mohdherwansulaiman5131 3 months ago +1

    I have tried KAN in 2 prediction problems: SOC estimation and chiller energy consumption prediction. It works.

  • @Jay-di7nl
    @Jay-di7nl 4 months ago +24

    I have doubts about KAN’s ability to solve real-world or complex problems. One concern is that, although this method shows its capability with simple known functions, real-world problems might involve more complex equations or may not have a clearly defined equation at all.

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +23

      I think the interpretability claims of the paper are largely bogus for this exact reason. That’s why I didn’t mention it here in this video. However, the real contribution of this method is towards resolving the curse of dimensionality, and since every continuous function has a representation like this, there is a real chance it can work out.

    • @rjbaw
      @rjbaw 4 months ago +1

      @@JoelRosenfeld why not just use kernels instead

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +6

      @@rjbaw kernels work great for approximations using small to moderate sized data sets. For instance, interpolation using Gaussian RBFs converges extremely fast as the data points become more and more concentrated.
      When it comes to truly large data sets, training kernel-based networks can get prohibitively expensive. That’s why machine learning engineers have turned to deep neural networks with ReLU activation functions, where good weights can be identified much faster.
      KANs promise to be faster to train than neural networks by orders of magnitude. They aren’t yet, but we have just started using them.

    • @rjbaw
      @rjbaw 4 months ago +1

      @@JoelRosenfeld I am curious as to what the problem is with very large datasets. Doesn't this largely depend on the approximation methods used to compute the kernel? Especially as the dimension increases, the data concentrates.
      Although it is true that ReLU activation functions are easy to compute, the main reason people switched to ReLU is the vanishing gradient problem.

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +3

      @@rjbaw Computing the kernel functions themselves is often not really an issue, unless you are working directly with features. That's actually their advantage: the computation of a kernel itself is simple, you just evaluate a function.
      The complication is when you are trying to do approximations in kernel spaces (using linear combinations of kernel functions), like for SVMs. Ultimately, you will need to invert a Gram matrix to get your weights, and that matrix inversion is costly. Moreover, as your data gets denser, you are going to get some ill-conditioning in your linear algebra problem, and that is going to make it difficult to find accurate weights (see the sketch below).
      So for really big data problems, you have to invert really large, dense, and possibly ill-conditioned matrices.
      Large-scale problems use a combination of neural networks and sparsity to find good solutions. Those aren't always available in the kernel setting. The linear systems you get from kernels are often not sparse, and that leads to difficulties.
      It could be I'm overlooking something right now. Just trying to get you a response right before a meeting. But that's my quick two cents on it.
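
      A minimal numpy sketch of the cost described above (assuming a Gaussian RBF kernel; the data and names are illustrative, not from the video): interpolating n points means solving a dense n-by-n, often ill-conditioned Gram system.

          import numpy as np

          def gaussian_kernel(X, Y, sigma=0.5):
              # K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))
              d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
              return np.exp(-d2 / (2 * sigma ** 2))

          rng = np.random.default_rng(0)
          X = rng.uniform(0, 1, size=(200, 2))            # n = 200 sample points
          y = np.sin(2 * np.pi * X[:, 0]) * X[:, 1]       # target values

          G = gaussian_kernel(X, X)                       # dense n x n Gram matrix
          print("condition number:", np.linalg.cond(G))   # grows as points cluster

          # Weights of the interpolant f(x) = sum_i w_i k(x, x_i); an O(n^3) solve.
          w = np.linalg.solve(G + 1e-8 * np.eye(len(X)), y)   # tiny ridge for stability

          x_new = rng.uniform(0, 1, size=(5, 2))
          print(gaussian_kernel(x_new, X) @ w)            # predictions at new points
          print(np.sin(2 * np.pi * x_new[:, 0]) * x_new[:, 1])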

  • @agranero6
    @agranero6 1 month ago

    I have read Arnold's book a few (a lot of) times. It seems trivial in the first pages, and then you notice that it is not. It is dense. It makes you think.
    And I realize I must read it again.

  • @mlliarm
    @mlliarm 4 months ago +18

    Thank you for this. I haven't noticed this paradigm shift, but upon hearing the magic names Kolmogorov & Arnold you surely got my attention. 👀👀👀

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +5

      It’s a pretty neat approach to learning theory. It’s still in the early adoption stage, but exciting nonetheless!

    • @lahirurasnayake5285
      @lahirurasnayake5285 3 months ago +1

      I think the pun "got my attention" was intended 😄

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +1

      @@lahirurasnayake5285 really, it’s all you need

  • @kev2582
    @kev2582 4 months ago +2

    "may not be learnable" is the core problem as you seem to be well aware. Function approximation at collapsed dimension with nice properties is very very hard. There are many problem beyond this level as well.

  • @morrning_group
    @morrning_group 3 months ago

    Thank you for this incredibly informative video on Kolmogorov Arnold Networks! 🤯💻 It's such a deep dive into machine learning concepts, and I appreciate how you break down complex ideas into understandable explanations. 🌟
    I'm curious about the future direction of your channel. 🚀 Are there plans to delve deeper into specific machine learning architectures like Kolmogorov Arnold Networks, or will you explore a broader range of topics within the field? 🤔📈 Additionally, do you have any upcoming collaborations or special projects in the pipeline that your viewers can look forward to? 🌐🔍 Keep up the fantastic work!

  • @nias2631
    @nias2631 4 months ago +3

    Maybe I am somehow pattern matching, but the KAN has a strong similarity to the encoding layer of a convolutional conditional neural process (CCNP), which is built upon the reproducing kernel theorem. In particular, I have the sense that it is related to the Lorentz variant of the K-A theorem. I'd be curious about your thoughts on this. Good channel btw!

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +2

      I’ll have to take some time to look into it. I haven’t encountered CCNP, but I do love kernels!

  • @beaverbuoy3011
    @beaverbuoy3011 4 months ago +1

    Awesome, thank you.

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago

      You’re welcome! Thanks for watching!

  • @tablen2896
    @tablen2896 3 months ago +1

    Hey, great video.
    If you take constructive(?) criticism (and you may have already noticed when editing): you should reduce the sensitivity of the focus adjustment on your recording device, or set it to a fixed value. It tries to focus on the background (computer screen) and foreground (your hands) instead of on you.

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +1

      lol yeah, I noticed. I actually put B-roll over some especially bad spots. I almost recorded it again. I'll look into it; it's something I struggle with.

    • @agranero6
      @agranero6 1 month ago

      Oh, that's what was making the background blink and the focus continuously change. I was puzzled by it.

  • @Lilina3456
    @Lilina3456 3 months ago +1

    Hi, thank you so much for the video. Can you please do an implementation of KANs?

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago

      That’s the plan! Grant writing and travel have slowed me down this summer, but it’s coming. I have three videos in the pipeline, and I think that is the third one.

  • @johnfinlayson7559
    @johnfinlayson7559 4 months ago +1

    Awesome video man

  • @kabuda1949
    @kabuda1949 4 months ago +1

    Great video. Can you do an example of solving, let's say, PDEs using kernel methods? For example, with the Gaussian function as the kernel?

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +1

      Collocation methods for PDEs are in the plans. First, though, we should build up convergence rates for kernel approximations.

    • @kabuda1949
      @kabuda1949 4 months ago +1

      @@JoelRosenfeld makes sense. Can't wait

  • @Charliethephysicist
    @Charliethephysicist 3 months ago +2

    KAN draws inspiration from the Kolmogorov-Arnold representation theorem, though it significantly diverges from and falls below the theorem's original intent and content. It confines its form to compositions of sums of single-variable smooth functions, representing only a tiny subset of all possible smooth functions. This confinement eliminates, by design, the so-called curse of dimensionality. However, there is no free lunch. It is seriously doubtful that this subset is dense within the entire set of smooth functions --- though I have not come up with an example yet. If it is indeed not dense, KAN will not serve as a universal function approximator, unlike the multilayer perceptron. Nonetheless, it may prove valuable in fields such as scientific research, where many explicitly defined functions tend to be simple, even if they do not approximate all possible smooth functions.

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago

      Since I saw your message last night, I have been thinking about this. I think universality is going to be fine. The inner functions in the Kolmogorov Arnold representation are each continuous. Splines can approximate continuous functions arbitrarily well over a compact set, so we could approximate each one to within some epsilon > 0. The triangle inequality tells us that the overall error between those approximations is bounded by n times epsilon, where n is the dimension of the space.
      So the approximation of the inner functions is fine.
      The only twist comes with the outer functions. The inner functions map compact sets to compact sets, so if we look at the outer functions restricted to these images, we are looking for an approximation on a compact set. We can get a spline approximation of the outer functions there as well, to within some prescribed epsilon.
      To make sure everything meshes together, you need something like Lipschitz continuity on the outer functions. That has never been included in their description, because the theorems are for general continuous functions rather than being restricted to smooth functions or other classes. Picking through the proofs, I think it'd be straightforward to get Lipschitz conditions on the outer functions when the function you are representing is also Lipschitz.
      With all of that together, I think you basically have what you would need for universality (sketched below).
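
      For concreteness, here is how that error bookkeeping might look, written out as a sketch under the assumed Lipschitz condition on the outer functions (an illustration of the argument above, not a result from the KAN paper):

          % Kolmogorov-Arnold representation:
          %   f(x_1,\dots,x_n) = \sum_{q=0}^{2n} \Phi_q\Big( \sum_{p=1}^{n} \phi_{q,p}(x_p) \Big)
          % Suppose splines \hat\phi_{q,p} and \hat\Phi_q satisfy, on the relevant compact sets
          % (with the approximation of \Phi_q valid on a slightly enlarged set that contains the
          % perturbed inner sums),
          %   |\phi_{q,p} - \hat\phi_{q,p}| \le \varepsilon,   |\Phi_q - \hat\Phi_q| \le \varepsilon,
          % and that each \Phi_q is L-Lipschitz. Then, by the triangle inequality,
          \Big| f(x) - \sum_{q=0}^{2n} \hat\Phi_q\Big( \sum_{p=1}^{n} \hat\phi_{q,p}(x_p) \Big) \Big|
            \le \sum_{q=0}^{2n} \Big( L \Big| \sum_{p=1}^{n} \big( \phi_{q,p}(x_p) - \hat\phi_{q,p}(x_p) \big) \Big| + \varepsilon \Big)
            \le (2n+1)\,(Ln + 1)\,\varepsilon .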

    • @Charliethephysicist
      @Charliethephysicist 3 months ago +2

      @@JoelRosenfeld Your rationale is based on multidimensional smooth function approximation, and that is precisely what does not apply in this situation. Each function in KAN, no matter which layer, is one-variable and smooth. The latter property prevents the original proof of the Kolmogorov-Arnold theorem from going through, and the former prevents the Taylor-expansion proof for approximating Sobolev-space functions, which I think is what you are talking about. Moreover, unlike in the KA theorem, there is no essential distinction in KAN between the inner and outer functions; the layers are simply recursive stacking. You are trading off the curse of dimensionality that comes with universality against simplicity. There is no free lunch.

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago

      @@Charliethephysicist OK, I'll give it some more thought. I personally think that there is a good chance of universality pulling through here. But you never know until you have a proof or a counterexample.

    • @Charliethephysicist
      @Charliethephysicist 3 months ago +1

      @@JoelRosenfeld Examine Theorem 2.1 of the KAN paper, which is the crux theorem, and look at its premise as well as its proof steps to see that universality is nowhere to be found. To be honest, the authors of the paper should have made this point much clearer instead of letting only the experts decipher their claim.

  • @pierce8308
    @pierce8308 4 months ago +2

    Hi, what prevents us from using a linear ReLU layer (deep/wide) instead of splines? That seems much simpler and more efficient, right? Specifically, why did the authors choose splines over simple linear layers (ReLU activated), given that the authors themselves were concerned about computational efficiency?
    It seems to me that this simply adds feature expansion to deep nets.

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +5

      To be honest, nothing stops you there. Splines are just really good one dimensional approximators, have clear convergence rates, and form a partition of unity.
      Another reason is probably a concern of messaging. If they included a neural network within the approximation scheme of KAN, it could have led to a lot of confusion between the NN layers and the KAN layers.

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +3

      Also, if you are working with ReLU activation functions in a single layer, then that is just a first-order spline (see the sketch below).
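
      A minimal numpy illustration of that equivalence (the numbers and names are illustrative, not from the video): a weighted sum of shifted ReLUs reproduces a first-order (piecewise-linear) spline on a knot grid.

          import numpy as np

          relu = lambda t: np.maximum(t, 0.0)

          # Knots and values of the piecewise-linear function we want to match.
          knots = np.array([0.0, 1.0, 2.0, 3.0])
          vals  = np.array([0.0, 2.0, 1.0, 3.0])

          # Slope of each linear piece; the ReLU weights are the slope changes (kinks).
          slopes  = np.diff(vals) / np.diff(knots)
          weights = np.diff(slopes, prepend=0.0)

          def relu_expansion(x):
              # f(x) = vals[0] + sum_k weights[k] * relu(x - knots[k])
              return vals[0] + sum(w * relu(x - k) for w, k in zip(weights, knots[:-1]))

          x = np.linspace(0.0, 3.0, 7)
          print(relu_expansion(x))          # matches the linear interpolant below
          print(np.interp(x, knots, vals))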

    • @pierce8308
      @pierce8308 4 months ago

      @@JoelRosenfeld I see, thanks !

    • @elirane85
      @elirane85 3 months ago

      Because you need a novel idea to publish papers. That's the only reason why it's even a thing.

    • @agranero6
      @agranero6 1 month ago +1

      Splines make it more numerically stable. With other functions, like polynomials, a small change in the x's causes large changes; if you try to apply backpropagation to that, you probably won't get it to converge. The choice of splines was not arbitrary: numerical stability and splines are intimately connected, and this is a well-known property of splines. B-splines are differentiable, and so are their derivatives. More importantly, B-splines can be controlled locally: you can change the curve in one region only (you can see why they are stable?) by changing just the control point related to that part (see the sketch below). This is important because it avoids the problem of catastrophic forgetting, where global parameter changes can propagate explosively to far regions of the network. They can also easily be written in a parametric form, which is ideal for this purpose because the parameters shape the curves and can be trained. Using splines also lets the networks be sparse, which is a very important property, nearer to biological neural networks, useful for saving resources, and it allows you to retrain parts of the network. MLPs are a big monolith, and it is difficult to make local, pointwise changes or to find internal representations (it is all in the article, with a few additions by myself). Understand that the activation functions are learnable through their parameters, not the weights.
      I learned many of these things because I was writing a video game and fell into a rabbit hole of numerical stability, Catmull-Rom splines, etc., but that is beyond the scope of my answer.
      Just a thought experiment to make you think: take an angle, mark equally spaced points along each of its two sides, and draw lines between the points on one side and the points on the other. Why do you get a smooth curve (a parabola, for instance) from doing that?
      Ignore the cynical trolls.
      The main problem was ignored in the video and the comments (including by the trolls): you can't run this on GPUs.
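
      A small SciPy sketch of that local-control property (assuming scipy.interpolate.BSpline; an illustration, not code from the KAN paper): perturbing one B-spline coefficient changes the curve only on the few knot spans where that basis function is supported.

          import numpy as np
          from scipy.interpolate import BSpline

          k = 3                                         # cubic B-splines
          t = np.arange(-k, 10 + k + 1, dtype=float)    # padded uniform knot vector
          c = np.zeros(len(t) - k - 1)                  # coefficients (the trainable parameters)

          spline_before = BSpline(t, c.copy(), k)

          c[5] = 1.0                                    # perturb a single coefficient
          spline_after = BSpline(t, c, k)

          x = np.linspace(0.0, 10.0, 11)
          print(np.round(spline_after(x) - spline_before(x), 3))
          # The difference is nonzero only inside the support of basis function 5
          # (roughly x in (2, 6) here); everywhere else it is exactly 0.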

  • @DJWESG1
    @DJWESG1 3 months ago

    I thought the point was to embed a core logic to reduce size.

  • @chanjohn5466
    @chanjohn5466 4 months ago +1

    Do you have a playlist on machine learning series?

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +1

      There is an older playlist for Data Driven Methods in Dynamical Systems. There I talk about how to use machine learning and operator theory to understand time series.
      I have a newer playlist focused on the Theory of Machine Learning. So far, it has a bunch of videos going into various aspects of Hilbert spaces and Fourier Series. But it's the one I'm working on right now.
      ruclips.net/p/PLldiDnQu2phtAR82SxYoEB46_3BUuvJKe&si=yiyvY-BZydP3tUAg

  • @sashayakubov6924
    @sashayakubov6924 3 months ago +1

    Hold on, I'm reading Wikipedia, and it turns out that Karatsuba (the one who invented his multiplication algorithm) was a student of Kolmogorov!!!

  • @petersuvara
    @petersuvara 3 months ago +3

    Who needs maths when AI does it all for us. Oh... yeah... it's not AI. :) It's MATHS! :)

  • @InstaKane
    @InstaKane 3 months ago

    Right…..

  • @sashayakubov6924
    @sashayakubov6924 3 months ago +3

    Both Russian scientists? Kolmogorov is OK, but "Arnold" does not sound like a Russian surname at all.

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +3

      Vladimir Arnold was indeed a citizen of the USSR. The Soviet government actually interceded to prevent him from getting a Fields Medal because Arnold spoke out against their treatment of dissidents.

    • @mishaerementchouk
      @mishaerementchouk 7 days ago

      @@sashayakubov6924 It’s of Prussian origin. They had been citizens of the Russian Empire since (at least) the first half of the 19th century.

  • @tamineabderrahmane248
    @tamineabderrahmane248 3 months ago +2

    I think that KANs will be stronger than MLPs in physics-informed neural networks.

  • @peterhall6656
    @peterhall6656 3 months ago +2

    This sort of high-level discussion is always seductive. I look forward to the nuts and bolts to see whether the first-date performance can be sustained.....

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +1

      Absolutely, we are still seeing the beginnings of this method. I’m optimistic, but we will see!

  • @Adventure1844
    @Adventure1844 3 months ago +1

    It's an old method from 2021: ruclips.net/video/eS_k6L638k0/видео.html

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +6

      The use of Kolmogorov Arnold Representations as a foundation for neural networks goes back at least as far as the 1980s and 1990s. In fact, I think the Kolmogorov Arnold Representation as a two-layer neural network appeared in the first volume of the journal Neural Networks. You can find more if you look into the work of David Sprecher, who has worked on this problem for 60 years.
      The innovation of this work comes in the form of layering, which positions it as an alternative to deep neural networks, but with learnable activation functions.

  • @naninano8813
    @naninano8813 3 months ago +1

    4:19 I have the exact same book in my library, but tbh I never made the connection that the current KAN hype in the AI world is connected to the same Arnold.

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago

      It’s really a great textbook. Arnold did a lot of great work and was nominated for a Fields Medal back in 1974. I’ll talk more about it in my next video.

  • @foreignconta
    @foreignconta 1 month ago +1

    Got recommended and immediately subscribed.

  • @avyuktamanjunathavummintal8810
    @avyuktamanjunathavummintal8810 3 months ago +3

    Great video! Eagerly awaiting your next one! :)

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +3

      Me too lol! I’ve been digging through the proofs to find one that is digestible for a YouTube video. I actually had a whole recording done, only to realize it wouldn’t work too well. Trying to give the best video possible :)

    • @avyuktamanjunathavummintal8810
      @avyuktamanjunathavummintal8810 3 months ago +1

      @@JoelRosenfeld , (I understand your desire for perfection, but) you know, I'd rather you upload that recorded video. 😅:)

    • @JoelRosenfeld
      @JoelRosenfeld  3 months ago +1

      @@avyuktamanjunathavummintal8810 I appreciate that. I should have a new video up this weekend. Still need time to make that video breaking down the theory, but I have something that’ll hopefully bridge the gap a little.

  • @mrpocock
    @mrpocock 3 months ago

    I don't think there's anything in principle preventing a KAN layer or layers from being put inside a normal deep network. So there may be a space of interesting hybrids that do interesting things. For example, the time-wise sum of a time-wise convolution with a (2-layer?) KAN can learn to perform attention, without needing all those horrible Fourier features.

    • @agranero6
      @agranero6 1 month ago

      They say that in the article.

  • @jmw1500
    @jmw1500 4 months ago

    Interesting. What are your thoughts on neural operators and operator learning?

    • @JoelRosenfeld
      @JoelRosenfeld  4 months ago +1

      I haven't looked too deeply into Neural Operators, so it's hard for me to really say much in that direction. As far as Operator Learning goes, that's pretty much what I have been doing for the past 6 years or so. I think there are a lot of advantages of working operators into an approximation framework, where it can give you more levers to pull from.

  • @RSLT
    @RSLT 4 months ago +3

    Very Cool!