FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained)

  • Published: 12 Sep 2024

Comments • 100

  • @YannicKilcher
    @YannicKilcher  3 years ago +25

    Do we even need Attention? FNets completely drop the Attention mechanism in favor of a simple Fourier transform. They perform almost as well as Transformers, while drastically reducing parameter count, as well as compute and memory requirements. This highlights that a good token mixing heuristic could be as valuable as a learned attention matrix.
    OUTLINE:
    0:00 - Intro & Overview
    0:45 - Giving up on Attention
    5:00 - FNet Architecture
    9:00 - Going deeper into the Fourier Transform
    11:20 - The Importance of Mixing
    22:20 - Experimental Results
    33:00 - Conclusions & Comments
    Paper: arxiv.org/abs/2105.03824
    ADDENDUM:
    Of course, I completely forgot to discuss the connection between Fourier transforms and Convolutions, and that this might be interpreted as convolutions with very large kernels.
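
    For readers who want something concrete: below is a rough numpy sketch of the encoder block described in this comment (my own reading of the paper, not the authors' code; the layer sizes, ReLU instead of GELU, and the absence of embeddings are simplifications chosen for brevity).

```python
import numpy as np

def fourier_mixing(x):
    """Token mixing as described in FNet: 2D DFT over the sequence and
    hidden dimensions, keeping only the real part. x: (seq_len, d_model)."""
    return np.real(np.fft.fft2(x))

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward sublayer (ReLU here; the paper uses GELU)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def fnet_encoder_block(x, w1, b1, w2, b2):
    # Fourier sublayer with residual connection and layer norm ...
    x = layer_norm(x + fourier_mixing(x))
    # ... followed by the usual feed-forward sublayer, also with residual + norm.
    return layer_norm(x + feed_forward(x, w1, b1, w2, b2))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 8, 16, 32
x = rng.normal(size=(seq_len, d_model))
out = fnet_encoder_block(
    x,
    rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
    rng.normal(size=(d_ff, d_model)), np.zeros(d_model),
)
print(out.shape)  # (8, 16)
```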

    • @aspergale9836
      @aspergale9836 3 years ago

      4:30 - Which papers did you have in mind here?

    • @emilzakirov5173
      @emilzakirov5173 3 years ago

      @@aspergale9836 He mentions some of them later. AFAIK they're probably the "MLP-Mixer", "Pay Attention to MLPs" and "ResMLP ..." papers.

    • @RobertWeikel
      @RobertWeikel 3 years ago

      KISS always wins.

    • @bediosoro7786
      @bediosoro7786 3 years ago

      Are they saying that position encoding was useless? I'm not seeing that part in the paper. Can someone explain what I missed?

    • @ibrahimaba8966
      @ibrahimaba8966 2 years ago

      Hello Yannic, what do you think about mixing strategies like a Fourier transform for one layer and attention for the next layer, and so on?

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 3 years ago +16

    Thank you Yannic. I know what I'm watching for Friday evening now, no more scrolling! :D

  • @antraxuran9
    @antraxuran9 3 years ago +17

    @17:20 You need the imaginary part to do the reverse; it contains the phase information!

    • @Sam-pd1be
      @Sam-pd1be 3 years ago +2

      Yeah, I don't know why the authors talk about the reverse when they drop the imaginary part. That fact makes me wonder just how relevant the Fourier transform's properties could possibly be to these results. I feel like the main reason to use it might be that we have readily available fast implementations.
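
      To make the phase point concrete, a quick numpy check (my own illustration, not from the paper): once the imaginary part is discarded, the inverse transform no longer recovers the signal, only its even (circularly symmetric) part.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)

X = np.fft.fft(x)
print(np.allclose(np.fft.ifft(X), x))        # True: the full spectrum inverts cleanly
print(np.allclose(np.fft.ifft(X.real), x))   # False: without the phase, x is gone
# Re(X) only pins down the even (circularly symmetric) part of the signal:
print(np.allclose(np.fft.ifft(X.real), 0.5 * (x + np.roll(x[::-1], 1))))  # True
```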

  • @samernoureddine
    @samernoureddine 3 years ago +70

    "I know I'm a bit late with this one" - paper hasn't even been up for two weeks

  • @galchinsky
    @galchinsky 3 years ago +43

    I'm a guy with DSP roots, and this paper causes a lot of cringe. There was an elephant in the room (the convolution theorem) and it was totally missed. The FFT gives you circular convolution, which makes little sense in an NLP context, so a natural way to improve the results would be to try padding the values. Also, cuDNN already performs FFT-based convolution for large kernels, breaking news. But that's not the interesting part. The only interesting part is taking the real part of the FFT: this looks like quite an exotic type of non-linearity, resetting the phase. I wonder if it's only additional noise, or really adds something new. Instead of this, there are soulless words about "mixing tokens". I hope LeCun will kick their ass not only on Twitter, because these "convolutions in a mustache" things are starting to be frightening.
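
    For reference, a small numpy sketch of the convolution-theorem point being made here (my own illustration): a pointwise product of DFTs is a circular convolution, and zero-padding turns it into ordinary linear convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

# Pointwise product of DFTs == circular convolution in the signal domain.
circ = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
circ_direct = np.array([sum(a[m] * b[(n - m) % 8] for m in range(8)) for n in range(8)])
print(np.allclose(circ, circ_direct))  # True

# Zero-pad to length len(a) + len(b) - 1 to get ordinary linear convolution instead.
n = len(a) + len(b) - 1
lin = np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n)).real
print(np.allclose(lin, np.convolve(a, b)))  # True
```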

    • @narsimhar.chilkuri7290
      @narsimhar.chilkuri7290 3 years ago

      so true lol

    • @narsimhar.chilkuri7290
      @narsimhar.chilkuri7290 3 years ago +3

      Although I must say that having the kernel weights analytically defined allows us to "see" the whole sequence in just one layer. This is, I believe, different from how CNN architectures do it, i.e. they use small kernels but very deep architectures to get a large receptive field.

    • @aspergale9836
      @aspergale9836 3 years ago +6

      Doesn't feel like LeCun would be too opposed.
      This seems to be the style of ML papers that's generally accepted nowadays. More rigorous works are harder to read and apply, unless they come with a ready-to-use, open source framework. So the easier incremental works end up taking the spotlight, more often than not.
      I didn't have that many reservations about it when reading it, but one especially stood out when you mentioned it: "there are soulless words like 'mixing tokens'". That ties back to my "unsubstantiated" claim on lack of rigor in most well-publicized papers recently.

    • @aniruddhadatta925
      @aniruddhadatta925 3 years ago

      I agree with you my comrade

    • @alexeifigueroa
      @alexeifigueroa 3 years ago +3

      Same cringe here. I wonder if they tried ABS() first, then RE(), and settled for the best number. Plus, this intuition of multi-layered FTs inverting each other in the sense of "duality" sounds plain wrong to me.

  • @machinelearningdojowithtim2898
    @machinelearningdojowithtim2898 3 years ago +4

    I was intrigued by Yannic's comments about the existence of a universal attention (or information routing) matrix. Try visualising the DFT matrix, it's got a cool repeating circular pattern. The most interesting thing here is that results that are not (that much) worse can be obtained this way, and that such information-transfer regularities exist in language. We can almost certainly learn a better universal matrix, but then we lose the FFT efficiency boost. Why don't we do something with neural program synthesis in this attention layer? We could learn a replacement for the FFT! Great video as always
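
    If anyone wants to look at that "universal mixing matrix", it is easy to build and plot; a rough sketch (my own code, not from the paper or the video):

```python
import numpy as np
import matplotlib.pyplot as plt

N = 64
k = np.arange(N)
# Unitary DFT matrix: W[j, k] = exp(-2*pi*i*j*k/N) / sqrt(N)
W = np.exp(-2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)

# Sanity check: multiplying by W matches numpy's FFT with "ortho" normalization.
x = np.random.default_rng(0).normal(size=N)
print(np.allclose(W @ x, np.fft.fft(x, norm="ortho")))  # True

# The repeating circular pattern shows up in the real (or imaginary) part.
plt.imshow(W.real, cmap="coolwarm")
plt.title("Re(DFT matrix)")
plt.show()
```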

  • @julianke455
    @julianke455 3 years ago +2

    Your videos are so valuable. Seriously, I've learned way more watching your explanations than listening to my profs at college. Thank you Yannic!

  • @ce6535
    @ce6535 3 years ago +13

    In response to the comment "I'm even open to the idea that the Fourier transform might even be the optimal mixing technique." I think the actual optimum is something that could be trained, or at least investigated empirically. Other functions, such as Legendre polynomials or Bessel functions, are 'a' Fourier basis for different differential operators/measures. It's easy to find the functional bases using Sturm-Liouville theory. It is possible that you could optimize the basis by allowing the functions that define the ODE to become parameters to the model.
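
    As a toy illustration of this suggestion (entirely my own sketch, nothing from the paper): one can build a fixed, orthonormal token-mixing matrix from sampled Legendre polynomials and apply it in place of the DFT. The QR step is my own workaround, since sampling on a discrete grid breaks the exact orthogonality of the continuous polynomials.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_mixing_matrix(seq_len):
    """Fixed mixing matrix whose rows come from Legendre polynomials sampled on
    [-1, 1], re-orthonormalized with QR because discrete sampling breaks the
    exact orthogonality of the continuous polynomials."""
    grid = np.linspace(-1.0, 1.0, seq_len)
    # Column j holds P_j evaluated on the grid (one-hot coefficient vector selects P_j).
    V = np.stack([legendre.legval(grid, np.eye(seq_len)[j]) for j in range(seq_len)], axis=1)
    Q, _ = np.linalg.qr(V)
    return Q.T  # rows form an orthonormal "Legendre-like" basis

M = legendre_mixing_matrix(16)
print(np.allclose(M @ M.T, np.eye(16)))  # True: orthonormal, hence an invertible mixing

x = np.random.default_rng(0).normal(size=(16, 32))  # (seq_len, d_model)
mixed = M @ x                                       # mixes information across tokens
```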

    • @ce6535
      @ce6535 3 years ago +3

      To be clear, the advantage of this method is that once you have your basis, you can bake it into the architecture. The method is trained, but the parameters you are training don't ship with the final model.

    • @alpers.2123
      @alpers.2123 3 years ago +3

      We need more weird math:)

    • @st0a
      @st0a 6 months ago

      Wow, I'm dumb.

  • @carlossegura403
    @carlossegura403 3 years ago +4

    I was getting ready to read the paper, but then I said, "oh, I'll just wait for Yannic to explain it." Thank you!

  • @user-qu2oz2ut2h
    @user-qu2oz2ut2h 3 years ago +7

    I think that we should investigate the application of a fractional Fourier transform instead of a regular one.
    Fractional FT is a generalization of FT and is parametrized by an angle alpha.
    If alpha=0, the function doesn't change (skip connection).
    If alpha=pi/2, it performs the regular FT.
    If alpha=-pi/2, it performs the inverse FT.
    If alpha=pi, it gives a mirror reflection of the function.
    In consecutive transforms, the angles of the individual transforms add: pi/2 + (-pi/2) = 0 corresponds to an FT followed by an inverse FT, which is the identity transform.
    So we could use linear operations in different domains, parametrized by alpha. It could be organized as, say, a 12-channel tensor with alpha ranging from 2*pi*0/12 to 2*pi*11/12,
    then we normalize and apply a fully connected layer to all these 12 channels and get a skip connection, a convolution, an order change, and 9 other promising operations,
    or we could just use a linear operation and then collapse these 12 channels into one by element-wise addition.
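
    A rough numpy sketch of this idea (my own code, with alpha rescaled so that alpha=1 plays the role of pi/2 above): one way to get a discrete fractional Fourier operator is to take a fractional power of the unitary DFT matrix through its eigendecomposition. This is only one of several possible definitions; the canonical fractional Fourier transform additionally pins the eigenbasis to Hermite-Gauss vectors, which this quick version does not.

```python
import numpy as np

def unitary_dft(N):
    k = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)

def fractional_dft(N, alpha):
    """A fractional power F^alpha of the unitary DFT matrix, built from its
    eigendecomposition: alpha=0 is the identity, alpha=1 the DFT, alpha=-1 the
    inverse DFT, alpha=2 an index reversal, and exponents add under composition."""
    eigvals, V = np.linalg.eig(unitary_dft(N))
    return V @ np.diag(eigvals ** alpha) @ np.linalg.inv(V)

N = 16
x = np.random.default_rng(0).normal(size=N)
F = unitary_dft(N)

print(np.allclose(fractional_dft(N, 1.0) @ x, F @ x))                # alpha=1: plain DFT
print(np.allclose(fractional_dft(N, 2.0) @ x, np.roll(x[::-1], 1)))  # alpha=2: mirror reflection
half = fractional_dft(N, 0.5)
print(np.allclose(half @ (half @ x), F @ x))                         # 0.5 + 0.5 = 1: angles add
```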

    • @user-qu2oz2ut2h
      @user-qu2oz2ut2h 3 years ago

      If I get it right, element-wise multiplication in the fractional Fourier domain attenuates various chirps in the signal, and in the case of 2D images (tensors) you could use different transform angles for the x and y directions: for example, transform only the rows and not the columns (pi/2 along x and 0 along y).

  • @LukaszWiklendt
    @LukaszWiklendt 3 years ago +16

    Why do we even need a position embedding if the DFT over the tokens already provides this? Is it because they drop the imaginary component? If so, why not keep both imaginary and real and drop the position embedding?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Good question! Maybe worth a try.

    • @machinelearningdojowithtim2898
      @machinelearningdojowithtim2898 3 years ago +1

      I think we would lose the information after the inverse is computed

    • @EyedMoon
      @EyedMoon 3 years ago

      That was my thought too, since the Imaginary component tells you where to shift the sines

  • @CristianGarcia
    @CristianGarcia 3 years ago +4

    There are 2 properties I think are still missing from these Attention-less alternatives:
    * Variable length inputs and outputs.
    * To be order agnostic.

  • @kirillnovik8661
    @kirillnovik8661 3 years ago +6

    Wow, you make reading papers entertaining 👏

  • @TheShadyStudios
    @TheShadyStudios 3 years ago +10

    wassup Yannic youre awesome

  • @lesumsi
    @lesumsi 3 years ago +1

    It has come to the point where my first reaction to a hyped paper is to check whether Yannic has already published a video.

  • @sangramgaikwad7663
    @sangramgaikwad7663 3 years ago

    Thanks Yannic. Great explanation of the tradeoff between accuracy and speedup. Linear transforms instead of attention mechanisms seem more practical to deploy for small-to-mid-scale datasets.

  • @sharp7j
    @sharp7j 1 year ago

    As soon as I learned about attention I thought, "Pretty sure the main benefit is the weight sharing per token, just like in convolution, combined with looking at the entire sentence per token." Turns out I was fucking right, since the linear model is about as good as BERT. I'm sure if you added a few more layers / made it fancier it would perform better. This paper is awesome and your summary is amazingly intuitive

  • @freasy123
    @freasy123 3 years ago +2

    How about a multi-scale approach by using wavelet transform?

  • @SimonJackson13
    @SimonJackson13 3 years ago

    The transform is a 2D version, as the axis pair are different. Yes, the FFT is a self-inverse if the scaling is done right when the same data axis is used. The fact that convolution becomes a multiplicative filter also likely helps to extract data.

  • @andrewgrebenisan6141
    @andrewgrebenisan6141 3 years ago +1

    I read this last weekend! Really interesting work

  • @user-wg6ch7wx2x
    @user-wg6ch7wx2x 6 months ago

    Fantastic work.

  • @sampruden6684
    @sampruden6684 3 years ago +1

    Do we expect this to work equally well with ViTs, or is this Fourier magic likely to be limited to the NLP domain? That might be an obvious next paper.

  • @ashoknp
    @ashoknp 3 years ago +2

    Thank you Yannic!!!

  • @SDKIM0211
    @SDKIM0211 3 years ago

    Hi Yannic, thanks for your nice explanation. May I ask which tool and device you used to record this video? (e.g., copy and paste the pdf into OneNote and scribble on it)

  • @rabago85
    @rabago85 3 years ago +1

    Was waiting for this.

  • @Mrbits01
    @Mrbits01 2 years ago +1

    A very nitpicky error in the video: at around 10:44, when talking about Fourier and inverse Fourier transforms, you say "the inverse transform is simply if you don't do the negative sign right here (in the complex exponential)".
    That's not entirely correct: in the inverse DFT, x_n and X_k are also interchanged, and the summation runs over k = 0, ..., N-1 instead. (Although this summation detail doesn't matter much, especially when doing the DFT of a real-valued signal, it's necessary to keep the notation uniform.)
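
    For reference, the standard definitions being contrasted here (with the common convention that puts the 1/N factor on the inverse):

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N},
\qquad
x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{+2\pi i k n / N}
```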

  • @akimtsvigun8783
    @akimtsvigun8783 3 years ago

    Thank you, Yannic! Amazing video.

  • @AIology2022
    @AIology2022 3 years ago +1

    So maybe wavelets can be used instead of Fourier, with the difference that the wavelet parameters can be tuned by the network.

  • @ankitaharwal5886
    @ankitaharwal5886 3 years ago +3

    I think the attention mechanism is much more than just information sharing between data points. The weights in attention are computed at run time and are input-dependent, which makes attention much more general.
    So how about we apply a Fourier transform and then an attention layer on top of that?

  • @kushasareen3259
    @kushasareen3259 3 years ago +2

    Someone can correct me if I'm wrong, but applying the DFT twice and taking the real part will absolutely not give you back the real signal; however, it does something close. Taking the real part immediately throws out all of the phase information. Taking the FT twice actually returns f(-x), essentially reversing the signal (this can be seen pretty clearly from the definition of the FT and its inverse). Taking the FT four times, however, gives you back the signal, but I don't think this reversing really plays a role in the learning, since the signal is identically reversed each time. I think once you take the real part only, the transformation will roughly approximate this, with the phase information lost.
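
    A quick numpy check of the reversal claim (my own illustration): applying the unnormalized DFT twice gives N times the index-reversed signal, and applying it four times gives N^2 times the original.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
N = len(x)

twice = np.fft.fft(np.fft.fft(x))
print(np.allclose(twice, N * np.roll(x[::-1], 1)))  # True: DFT twice -> N * x[(-n) mod N]

four_times = np.fft.fft(np.fft.fft(twice))
print(np.allclose(four_times, N**2 * x))            # True: DFT four times -> N^2 * x
```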

  • @MsFearco
    @MsFearco 3 years ago +1

    I love attention, but I've been saying Fourier is the key to everything since like forever. Do we need attention? Idk, probably yes. But Fourier is also needed. In what context can we combine these? I have no idea.

  • @ericadar
    @ericadar 3 years ago +1

    why throw away half the Fourier output by only considering the real part? What would be the effect of doubling the number of tokens/nodes of the Fourier output later by splitting into real and imaginary parts?

  • @JTMoustache
    @JTMoustache 3 years ago

    Whole sequence + fully connected + conditioned meta-parameters is all you need.

  • @Coolguydudeness1234
    @Coolguydudeness1234 3 years ago +7

    instead of doing 2 1D fourier transforms like total virgins they should just stack the vectors and do a 2D fourier transform

    • @Coolguydudeness1234
      @Coolguydudeness1234 3 years ago +11

      which is equivalent to just using a CONV layer with a large kernel 🤡

    • @bertchristiaens6355
      @bertchristiaens6355 3 years ago +1

      isn't it equivalent but easier to compute? "a 2D Fourier transform is achieved by first transforming each row, i.e. replacing each row with its 1D Fourier transform. This first step yields an intermediary 'picture' in which the horizontal axis is frequency f and the vertical axis is space y. The second step is to apply 1D Fourier transform individually to the vertical line of the intermediate image. This new image will be the 2D Fourier transformation of the initial image."
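
      That separability is easy to confirm numerically (my own check, not from the paper): two stacked 1D FFTs, one along each axis, give exactly the 2D FFT.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 16))  # e.g. (seq_len, d_model)

rows_then_cols = np.fft.fft(np.fft.fft(x, axis=1), axis=0)
print(np.allclose(rows_then_cols, np.fft.fft2(x)))  # True: two 1D DFTs == one 2D DFT
# The order of the two 1D transforms doesn't matter either:
print(np.allclose(np.fft.fft(np.fft.fft(x, axis=0), axis=1), np.fft.fft2(x)))  # True
```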

  • @chihochan7345
    @chihochan7345 3 years ago

    Eq. 3 is basically a 2D FFT, and it only keeps the real part. I guess it simplifies the computation, and the real part of the FFT is related to the power spectrum. In fact, the power spectrum of the FFT is the autocorrelation. Self-attention is the softmax of the cross-correlation of signal pairs. Therefore, I think they are equivalent in some sense.
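
    The power-spectrum/autocorrelation statement is the discrete Wiener-Khinchin theorem, which is easy to check in numpy (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
N = len(x)

power = np.abs(np.fft.fft(x)) ** 2
autocorr_via_fft = np.fft.ifft(power).real
autocorr_direct = np.array([sum(x[m] * x[(m + lag) % N] for m in range(N)) for lag in range(N)])
print(np.allclose(autocorr_via_fft, autocorr_direct))  # True: |FFT|^2 <-> circular autocorrelation
```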

  • @sarvagyagupta1744
    @sarvagyagupta1744 3 years ago

    The Nx in the architecture diagram: does it work in a serial fashion or in parallel? Is it replacing multi-head attention or just increasing the number of encoder layers? Because in the implementations it works in a serial fashion. So do let me know.

  • @sieyk
    @sieyk 3 years ago +1

    Glad to see attention go. I never could make sense of it! 😂

  • @tinski
    @tinski 3 years ago +2

    Sorry, if this has been asked a million times already, but does anyone know what pdf annotation software Yannic is using? Looks so clean!

  • @Guesstahw
    @Guesstahw 3 years ago

    The way I see it, the success of attention in transformers and similar models lies in the explicitly "bilinear" nature of the information flow between tokens, and hence the O(n^2) issue. I don't see how replacing such a nonlinear interaction with a weighted sum (they might as well have used an MLP) could bring the same expressive power.
    On a different note, the frequencies for sequences of different lengths would mean different things, so one would probably have to resort to STFT-like transformations, which would not resolve the variable sequence length.

  • @MrMIB983
    @MrMIB983 3 years ago +1

    Great video

  • @CristianGarcia
    @CristianGarcia 3 years ago +2

    gMLP is also interesting!

  • @abhilashnandy
    @abhilashnandy 3 years ago

    Thanks, very cool! Is there any huggingface model on this?

  • @mbunooo
    @mbunooo 3 years ago +2

    Excellent video!! We should note here that they are using the discrete Fourier transform… It only remains for someone to give an interpretation of how exactly that FT mixes the tokens vs. linear mixing… how does time vs. frequency apply to tokens? What does frequency mean for discrete tokens? But it looks like not even the authors have figured that out 😂😂😂

  • @sarvagyagupta1744
    @sarvagyagupta1744 3 years ago

    What is the "Type" embedding? Can I get more information on that?

  • @alpers.2123
    @alpers.2123 3 years ago +2

    Instead of learning to route information at quadratic complexity, can't we train a layer to output an index into a permutation table?

    • @alpers.2123
      @alpers.2123 3 years ago +1

      Partially structured data like natural language would probably have a small subset of possible routes. So it can be discretised.

    • @dandan-gf4jk
      @dandan-gf4jk 3 years ago +1

      @@alpers.2123 No because it's not differentiable

    • @herp_derpingson
      @herp_derpingson 3 years ago

      I haven't found a good implementation of an efficient differentiable permutation algorithm.

    • @alpers.2123
      @alpers.2123 3 years ago

      Permutation is a matrix multiplication with a binary matrix
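
      A tiny numpy illustration of that claim, and of the differentiability objection above (my own sketch):

```python
import numpy as np

x = np.arange(5) * 10             # a toy "sequence": [0, 10, 20, 30, 40]
perm = np.array([2, 0, 4, 1, 3])  # the routing we want

P = np.eye(5)[perm]               # binary permutation matrix: one 1 per row/column
print(P @ x)                      # [20.  0. 40. 10. 30.] == x[perm]

# The objection above: producing P from a network output usually ends in an
# argmax / hard selection step, which has zero gradient almost everywhere.
```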

    • @dandan-gf4jk
      @dandan-gf4jk 3 years ago +1

      @@alpers.2123 Oh and where will you get the binary matrix from?

  • @ecitslos
    @ecitslos 3 years ago +2

    People investigated predefined basis functions back in the early days of deep learning (the paper even points that out). They performed badly (even against regular dense layers). This paper does the same thing, and again the predefined basis lost to dense layers. This paper is just getting attention (pun intended) it does not deserve, due to it being from Google. I think the various works on local attention should be much better than the idea in this paper.

  • @pratik245
    @pratik245 2 years ago

    Although this is game theory, I can see that for each token the complexity is O(1), and for each attention layer above it is O(1), but real-world concepts are not so neatly segregated. But that is the mysterious part.

    • @pratik245
      @pratik245 2 years ago

      That's infinite-dimensional space

    • @pratik245
      @pratik245 2 years ago

      If it were possible, we could do an inverse Fourier transform of the real world; it would tell us why we are what we are. The cure for all diseases, the solution to all mysteries

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    I realize yourself what it could be this O(n² × log14²) attention matrix work in superposition difference equation by VGA.

  • @Skythedragon
    @Skythedragon 3 years ago +1

    Next paper: Fourier transforms are all you need
    I actually wonder how long it will take to go full circle and have a "neural network training thing" that literally generates code that does a specific task?

  • @leecarraher
    @leecarraher 5 months ago

    Feels like we want locality and sparsity, so why not trade the floating-point complex FFT for a Walsh-Hadamard transform?

  • @herp_derpingson
    @herp_derpingson 3 years ago +1

    26:30 Why is FF-only worse than random?
    .
    Also, I wonder if someone tried a similar idea before. It sounds like a very obvious thing to do.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      I guess any mixing is still preferred. It's random, but always the same random afaik.

  • @aspergale9836
    @aspergale9836 3 years ago

    4:30 - Which papers?

  • @mathematicalninja2756
    @mathematicalninja2756 7 months ago

    The wavelet transform is the most effective non-parametric mixing, I bet. Wavelets are mathematical microscopes.

  • @christianc8265
    @christianc8265 1 year ago

    did they completely drop the whole decoder part?

  • @aniruddhadatta925
    @aniruddhadatta925 3 years ago

    Now it's my time to shine with my ECE degree

  • @NeoShameMan
    @NeoShameMan 3 years ago +2

    Isn't the attention mechanism with query, key, and value the evolution of the Neural Turing Machine architecture? Can someone clue me in on this?

  • @natthaphonhongcharoen
    @natthaphonhongcharoen 3 years ago +1

    3:47 Anything but n squared? Let's go n power n.

  • @justinwhite2725
    @justinwhite2725 3 years ago

    'For the longest time'. Wasn't 'Attention Is All You Need' published in 2015? And wasn't FNet published in 2019? We're talking 4 years.

  • @user-vx1fj9uf6w
    @user-vx1fj9uf6w 1 year ago

    17:20 If you don't transform out to infinity, it cannot be perfectly restored.

  • @adamweishaupt7074
    @adamweishaupt7074 3 years ago

    "The verb at the end attends to the second word."
    Human language was a huge mistake.

  • @Ronnypetson
    @Ronnypetson 3 years ago

    30:45 LOL

  • @XOPOIIIO
    @XOPOIIIO 3 years ago +1

    So attention is not what you need after all.

  • @pascalhuber157
    @pascalhuber157 3 years ago

    I don't really get the hype about this paper. Their results are apparently much worse than with attention. In my opinion this is an interesting approach, but ultimately nothing but hot air.