TransformerFAM: Feedback attention is working memory

  • Published: 27 Apr 2024
  • Paper: arxiv.org/abs/2404.09173
    Abstract:
    While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
    Authors: Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science

Comments • 135

  • @dongseonghwang7870
    @dongseonghwang7870 16 days ago +213

    I'm the author :) As a self-taught ML practitioner, when I started studying deep learning, Yannic was my favorite teacher. I've learned so much from you. I'm so honored that Yannic featured my paper today. After watching your review, I made some changes to the paper and re-uploaded it to arXiv.
    I'd like to add some supplementary explanations to Yannic's comments:
    1. Feedback loop vs. Recurrent
    As Yannic mentioned, it's essentially incorporating recurrent features into the Transformer architecture. RNNs were the first to implement a feedback loop, in a sequence-by-sequence manner, while the paper implements the feedback using attention (see the sketch right after this comment). When the paper refers to feedback in the RNN sense, it means something closer to a Markov process or an autoregressive concept. The paper mentions that LSTMs and GRUs successfully implemented feedback loops within the RNN framework.
    2. Stop Gradient and Gradient Checkpointing
    Sorry for the bad English; I rewrote that part. The idea is that previous research suggested using stop gradient to improve computational performance. However, with gradient checkpointing, all computations need to be redone anyway, so whether or not you use stop gradient, there is no noticeable impact on speed or memory usage. This is why I recommend eliminating the use of stop gradient.
    3. 18:00 simpler algorithm
    Thank you, Yannic, for the compliment on Appendix C, "Don't". In C.3, I address the simpler algorithm that Yannic suggested. Actually, this was the very first method I tried. Unfortunately, it failed to remember the PassKey over long sequences, which is why the final algorithm became a bit more convoluted, as Yannic mentioned.
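
A minimal sketch of what "feedback implemented with attention" (point 1 above) can look like, in PyTorch. This is not the authors' code: the class name, shapes, hypothetical learnable initial memory, and the single shared attention module are all made up for illustration; it only shows the recurrence pattern being discussed.

```python
# Toy sketch: block-wise attention with a small set of feedback-memory (FAM-like) tokens.
# Each block attends to itself plus the current memory; the memory tokens then attend to
# the processed block and to their own previous state, which is the feedback loop.
import torch
import torch.nn as nn

class FeedbackBlockAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_fam: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical learnable initial memory tokens (initialization details are made up here).
        self.fam_init = nn.Parameter(torch.zeros(1, n_fam, d_model))

    def forward(self, blocks):
        # blocks: list of tensors, each of shape (batch, block_len, d_model)
        fam = self.fam_init.expand(blocks[0].size(0), -1, -1)
        outputs = []
        for x in blocks:
            ctx = torch.cat([fam, x], dim=1)
            y, _ = self.attn(x, ctx, ctx)              # block tokens read memory + block
            mem_ctx = torch.cat([fam, y], dim=1)
            fam, _ = self.attn(fam, mem_ctx, mem_ctx)  # memory reads old memory + block
            # Re point 2 above: if each block step is wrapped in gradient checkpointing
            # (torch.utils.checkpoint), detaching `fam` here would not save much anyway,
            # because activations are recomputed during the backward pass.
            outputs.append(y)
        return torch.cat(outputs, dim=1), fam
```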

    • @frankyvincent366
      @frankyvincent366 16 days ago +10

      Congrats on your work. Feel free to keep in touch if you need some kind of independent review / advice on your strategy.

    • @arnebinder1406
      @arnebinder1406 15 days ago +2

      Are you aware of ERNIE-DOC: A Retrospective Long-Document Modeling Transformer (Ding et al., 2021)? They also introduce recurrence by passing the state to the next step, but one layer below. Found it via lucidrains' x-transformers library :)

    • @omarnomad
      @omarnomad 15 days ago +9

      This is science at its best

    • @roupen66
      @roupen66 15 days ago +6

      Having studied neuroscience in undergrad and now being an ML practitioner, I really appreciate how your team tried to relate the architecture of your transformer to neuroscience! There should be a huge push for this, given the huge success we have seen from even the simple adoption of neurons and neural networks.

    • @dongseonghwang7870
      @dongseonghwang7870 14 days ago +1

      @@arnebinder1406 I didn't know about ERNIE-DOC, but the idea is essentially the same as "C.2. Feedback Memory Segment" in the "Appendix C. Don't" section. Thank you for letting me know; I'll cite the paper. BTW, this architecture failed to remember the PassKey. In my opinion, an individual token gets confused when it has to carry both local and global information at once.

  • @Aziqfajar
    @Aziqfajar 16 days ago +41

    At this point, Yannic is in desperate need of someone who will not reinvent but instead create a new invention. Here, a pat on the back for your contribution to sharing academic papers with us.

  • @kacperkubicki1101
    @kacperkubicki1101 16 days ago +78

    I think in the next two to three years the moment will finally come when someone reinvents attention and calls it something like "self-aware in-context persistent memory", sprinkling it with some neuroscience mumbo-jumbo.

    • @adama7752
      @adama7752 16 days ago +15

      Persistently
      Engaging
      Neurological
      Initiatives
      Stimulation

    • @syncrossus
      @syncrossus 16 days ago +1

      I think your timeline is optimistic. I give it at least 5 years.

    • @marshallmcluhan33
      @marshallmcluhan33 16 days ago +3

      Mamba #5

    • @Wobbothe3rd
      @Wobbothe3rd 16 days ago +4

      No, you're wrong. No one moment will ever come where an AI is suddenly, discretely self-aware. Self-awareness is one of those vague concepts that either has no concrete meaning or, to the extent it can be well defined, will be a matter of degrees (i.e. "model A is more self-aware than model B"). Your comment is ironic in that you attack an actual scientific paper's terminology as "mumbo jumbo" - I would argue the phrase "self-aware" ITSELF is mumbo jumbo!

    •  16 days ago +4

      @@marshallmcluhan33 selective scan structured state space sequence model, AKA a linear RNN

  • @omarnomad
    @omarnomad 16 days ago +23

    All paths lead to Schmidhuber

  • @blocksofwater4758
    @blocksofwater4758 15 days ago +2

    RNN: "You cannot live with your own failure. Where did that lead you? Right back to me."

  • @BerntGranbacke
    @BerntGranbacke 15 days ago +1

    Thank you Yannic for making these papers understandable and breaking them down with your insight and understanding. Very interesting. 🙏

  • @michalchik
    @michalchik 16 days ago +11

    So I think you make a pretty convincing argument that this is a repackaged form of recurrent neural network. And yes, you're right that a long time ago people were using these as neurally inspired architectures that weren't anywhere near as successful as transformers. What I'm wondering now is whether they failed because they didn't have the transformer architecture underneath them, which was more similar to organized long-term memory and learning. Maybe recurrent neural networks are almost useless by themselves, but built on top of transformers they provide a powerful equivalent of what we in neuroscience call working memory, and the architecture combined in this way can take things to the next level.
    I know that performance metrics can be gamed and can be very misleading, but ultimately it doesn't matter if we're doing something similar to what we did in the past: if this particular arrangement leads to significant performance gains on certain kinds of tasks, it still might be valuable even if it's recycled. A lot of technological progress occurs through the repurposing of old inventions in a new context.
    I apologize if this is a very off-base comment; I'm just starting to learn this stuff and there are a lot of holes in my background, because I'm coming at this from more of a neuroscience perspective. I can say that interrupting the cortico-thalamic loop produces humans that might know a lot of stuff and even give appropriate responses, but it leads to knowledgeable people who are just reactive entities, can't get anything done, and lose track of where they are all the time. Those kinds of problems seem present in current entities like ChatGPT 4 and Claude Opus, which are what I have the most experience with.

    • @edu833
      @edu833 15 days ago

  • @KevinHorecka
    @KevinHorecka 14 days ago +1

    It's interesting how they describe layer-wise interaction this way for working memory (at least according to these authors). In flexible, hippocampally dependent memory we see multiple recurrences across many system levels, as well as layer-wise specialization in things like pattern separation (DG/CA3), pattern completion (CA3), and relational binding (CA1). We know the hippocampus is critical in humans for creative, flexible reasoning. I feel like that's the brain region and memory system we're missing, not PFC-mediated working memory...

  • @DanielCardenas1
    @DanielCardenas1 15 days ago

    Your style is very entertaining. I laughed at the "feature, not a bug" section, ~34 minutes into the video.

  • @drdca8263
    @drdca8263 16 days ago +9

    So if the paper framed itself as, “because of neuroscience reasons, the way to improve transformers should be to combine them with an RNN (specifically, in a way like this)”, specifically describing it as a form of, or variation on, RNNs, would that have alleviated most of your issues with it?

  • @joseponce9567
    @joseponce9567 13 days ago

    Wow, great paper review. You explain the key mechanisms of the network so well; great job in scientific dissemination.

  • @mriz
    @mriz 16 days ago

    I feel like I'm being heard by you! Thank you!

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 16 days ago +29

    I want models that can modify or add weights on the fly. I want them to have better long-term memory without having to go back and read over the RAG retrievals and hope they got what I want.

    • @chispun2
      @chispun2 16 days ago +13

      I want a unicorn 🦄

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 16 days ago +9

      ​@@chispun2 But only if it has wings.

    • @drdca8263
      @drdca8263 16 days ago +2

      @@zyxwvutsrqponmlkh I think that's called a pegasus?

    • @jsalsman
      @jsalsman 16 days ago +2

      LoRA is kind of like that (a minimal sketch follows this reply), but interfaces to update it on the fly are in their infancy, because researchers are still sorting out the right way to do it.
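
For context on the LoRA mention above, here is a bare-bones sketch of the idea with made-up hyperparameters; it is not the interface of any particular library. The pretrained weight stays frozen and only a small low-rank update is trained, which is why it is only "kind of" on-the-fly weight editing.

```python
# Toy LoRA-style adapter: freeze the base linear layer, train a low-rank correction B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # original weights stay fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```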

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 16 days ago +2

      ​@@drdca8263 Pegasus doesn't have the horn.

  • @MonkeySimius
    @MonkeySimius 16 days ago +5

    Thanks for the explanation.
    Total aside... I downloaded Llama-3 and set the context window to 8k, and it worked fine. I boosted it up to 9k and got pure gibberish. That was the first time using too big a context size caused that; the closest I'd had before was an error for using too big a context size, where my program simply crashed as it ran out of memory.
    Happily, fine-tuned(?) models have come out with longer context lengths. But I found it interesting. I obviously don't understand what's going on under the hood well enough to know exactly why that happened... but videos like this give me a better foundation for making guesses.

  • @lone0017
    @lone0017 13 days ago

    Funny that I thought of the same idea right after watching your last review, on Infini-attention, lol

  • @AndreAmorim-AA
    @AndreAmorim-AA 16 days ago

    The biological point about single neurons / small groups of neurons in the loop diagram at 2:17 reminded me of how we remember the past and imagine the future, as well as of the sci-fi movie ‘Inception’ (2010).

  • @longinjanlatecki4025
    @longinjanlatecki4025 16 days ago +1

    This framework reminds me of a [CLS] token added at the end of each block. As shown in the following paper, adding a few [CLS] tokens, called registers, at the beginning improves performance: "Vision Transformers Need Registers", ICLR 2024.

  • @Balorng
    @Balorng 16 days ago

    Well, I do think all those new papers have one thing in common: they create "a hierarchy of memory" with increasing granularity and compute from less recent to more recent data. What's missing, I think, is going not just from linear to quadratic in the hierarchy, but from linear to quadratic to *cubic*.

  • @bernardoramos9409
    @bernardoramos9409 15 days ago

    The problem with the simple algorithm you proposed is that when generation reaches the block (or context) length in the middle of a sentence, it is not so straightforward to generate the next token without the previous ones while maintaining the sentence structure.

  • @ScottVanKirk
    @ScottVanKirk 16 days ago +1

    Hey Dr Kilcher, you should put your reputation where your snark is!😮😂 show us how you would architect your recurrent neural network to act as short-term memory. Snark aside, I really would like to see how that might work. Would you just prepend an RNN to a transformer?

  • @johanlarsson9805
    @johanlarsson9805 15 days ago

    Well, I can instantly say they are onto something, though I'm not sure it's the correct way.
    I've been experimenting with ANNs for 15 years. For the last 5 I've been working on a structure where the signal never dies. I want the "loop", the amalgamation of the last states, to pulse through the net, meaning different things to different areas, preserving the "state" and working memory until the next input comes, so that the input is affected by the current "mood" of the net.
    I'm still struggling with this approach, but it is what is needed for real AI. In us, the signal never dies; it needs to be there to propagate more signal.
    I was blown away when people solved my difficult task in a simple way with LLMs: "just have many agents interacting with each other". Yes, together they then achieve what I want my single net to do, the signal never dying, but I want it in a single net.

    • @MarkDStrachan
      @MarkDStrachan 13 days ago +1

      I've been contemplating similar ideas. My intuition is that the big limitation of the transformer design is the linear request/response structure. What's needed is a physical loop of layers with an input structure and an output structure woven into the loop, creating a signal of updates continually propagating around the loop. You want to establish a standing-wave state around the loop that can be perturbed by the inputs and monitored by the outputs. The signal perturbation would be the communication, i.e. when humans talk we're modulating frequency: send a signal in, let it integrate into the feedback loop, and monitor the perturbations at the exit. Toss out the single-pass forward propagation and replace it with these self-oscillating feedback loops at large scale, the entire structure a loop of layers, which would need an inhibitory function to prevent runaway feedback. Imagine speaking into it and hearing a voice coming out of it.

    • @johanlarsson9805
      @johanlarsson9805 13 days ago

      @@MarkDStrachan EXACTLY! That sounds like my thinking

  • @AM-yk5yd
    @AM-yk5yd 15 days ago +1

    It's closer to Block-Recurrent Transformers than to Transformer-XL. Transformer-XL reused the output of the previous layer rather than a layer's own output, which makes sense for anything other than ALBERT, where each layer is the same. Also, the only reason Transformer-XL stops the backprop is the publishing year: they simply had no resources to propagate that far.
    The BRT authors, on the other hand, were like "oh yeah, we have VRAM now, so we let backprop run through a larger context window"; they called it the "Slide" model.
    OK, after watching the whole video: it's closer to RMT, except they route memory to the same layer. I don't understand why they didn't go the RMT route and just compute the memory tokens normally (by appending memory) without the second attention.

  • @caimansaurus5564
    @caimansaurus5564 16 days ago +21

    Why does it bother you so much that these papers are basically reprising RNNs? I mean, yes, that's what they're doing, but they're doing it in different ways (there are countless variations on the RNN itself, after all), so what's the problem?
    RNNs were always an intuitively good idea anyway, held back by vanishing gradients and information loss over thousands of tokens. All these papers, by "recurring" over big chunks rather than token by token, basically solve this. I think it's really exciting.

    • @nnnik3595
      @nnnik3595 16 days ago

      That is extremely slow though. RWKV is a better approach

    • @eliaweiss1
      @eliaweiss1 16 days ago +9

      To answer your question:
      1. They don't refer to RNNs; instead they blabber about neuroscience.
      2. They don't compare to RNNs; instead they compare to block-wise attention, which clearly performs worse, and even then the improvement is minor.
      Like Yannic says, it is clear that they missed something crucial and are blurring it with blah-blah.

    • @kayemni
      @kayemni 16 days ago +5

      The problem is that they don't acknowledge that their approach is basically an RNN. If they had presented it as an RNN + attention variant, it would have garnered less attention but been more honest; and if they had actually compared to RNNs and shown even a slight increase in performance, it would have been really good. The problem here is that they obscure their contribution and how it relates to previous knowledge in the field, and don't even compare to RNNs... And don't get me started on the redundancy they are introducing, which is never good.

    • @andreaterlizzi
      @andreaterlizzi 16 days ago

      Also, they do all of this fancy talk about working memory in neuroscience, which is basically BS, since that isn't what working memory actually is, not even close. Real working memory is unbounded with respect to the input size, (theoretically) similar to a Turing machine or the RAM of a von Neumann machine; the kind of "working memory" in RNNs is linearly bounded with respect to the input size, which is much more similar to linear-bounded Turing machines and stack automata.

    • @AM-yk5yd
      @AM-yk5yd 15 days ago +1

      When Mamba, an RNN, comes with claims of outperforming transformers, I kind of like seeing benchmarks against a proper RNN.

  • @JoeTaber
    @JoeTaber 16 days ago +1

    Transformers use masking during training and are trained to predict the next token given the previous tokens. I wonder if the memory mechanism / RNN node should also be trained to do the opposite: mask off the prefix and, given the current tokens and the previous memory, predict the previous tokens.
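
As a small illustration of the masking described in this comment (and of the suggested "reverse" objective), here is a sketch; the tensor names and the anti-causal variant are made up for illustration, not anything from the paper.

```python
# Standard causal mask for next-token prediction, plus an "anti-causal" variant in the
# spirit of the comment above (a position sees only itself and later tokens).
import torch

T = 6                                                        # toy sequence length
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))      # position i attends to j <= i
anti_causal = torch.triu(torch.ones(T, T, dtype=torch.bool)) # position i attends to j >= i

scores = torch.randn(T, T)                                   # dummy attention logits
next_token_scores = scores.masked_fill(~causal, float("-inf"))
prev_token_scores = scores.masked_fill(~anti_causal, float("-inf"))
# Softmax over next_token_scores gives the usual left-to-right attention; the second
# variant is one hypothetical way to train a memory to reconstruct earlier tokens.
```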

  • @LunaProtege
    @LunaProtege 15 days ago

    I think I've actually thought about a similar idea of a handful of output tokens being paired with input tokens, and it sounds like that's what's happening here... And you say this is basically an RNN, but for this kind of transformer system? Alright, fair enough.
    I often also propose something to pair with it to give the most versatility: a sort of data table, akin to a notebook it can write to. Some of its outputs act as coordinates on this table, some as "here's what to write to it", plus one that simply determines whether or not to write, and another set of coordinates for what to read on the next loop of the neural network (a toy sketch follows this comment). Having this kind of "notepad", as well as a means to form a short-term memory by looping the output straight back to the input, could let it remember both long-term and short-term information at the same time.
    I imagine it's probably simple enough for all of this to be implemented in a single system, especially one where this "RNN" functionality, as you've described it, is already in place.
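
Purely as a toy illustration of the "notepad" described in the comment above (soft write/read addressing plus a write gate), assuming the addresses, value, and gate come from the model's outputs. It is essentially a simplified NTM/memory-network-style table, not anything from the paper.

```python
# Toy differentiable notepad: a table of slots the model writes to and reads from
# via soft (softmax) addresses; a gate in [0, 1] decides whether to write at all.
import torch
import torch.nn.functional as F

class Notepad:
    def __init__(self, slots: int = 16, width: int = 64):
        self.table = torch.zeros(slots, width)

    def write(self, address_logits, value, gate):
        addr = F.softmax(address_logits, dim=-1)         # (slots,) soft one-hot address
        current = addr @ self.table                      # (width,) what is stored there now
        # Move the addressed slot(s) toward the new value, scaled by the write gate.
        self.table = self.table + gate * addr.unsqueeze(-1) * (value - current)

    def read(self, address_logits):
        addr = F.softmax(address_logits, dim=-1)
        return addr @ self.table                         # (width,) weighted sum of slots
```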

  • @YasserAder
    @YasserAder 15 days ago

    Can you do a video on how you analyze papers, how to find the limitations of papers, something like that? Critical analysis of computer science papers, if that's possible?

  • @-mwolf
    @-mwolf 16 days ago +1

    What you are drawing at 17:00, isn't that exactly what "Vision Transformers Need Registers" (DINOv2) proposed?

  • @seraphiusNoctis
    @seraphiusNoctis 15 days ago +2

    Isn’t reprompting with a model’s own output just this, but at a higher level? (it’s effectively concatenating a state to a transformer, that came from a transformer…)

    • @oncedidactic
      @oncedidactic 15 days ago

      This is what I’m pickling on too…. next token generation would seem to supersede and generalize hidden state juggling. The working memory is the “let’s think step by step”.

  • @fox_7765
    @fox_7765 16 days ago

    This field bounces back and forth between pure engineering and the cognitive sciences: first it was inspired by parallel distributed processing and brain-like, neuron-like units; then it was all about optimisation and infrastructure (cognitive science was irrelevant); now they've realised they'll have to go back and get more inspiration from the cognitive/biological sciences to achieve AGI. IMO feedback loops were inevitable from the start. How much impact will this paper really have?

  • @P1XeLIsNotALittleSquare
    @P1XeLIsNotALittleSquare 15 days ago +1

    Just ask ChatGPT to write a summary of the conversation after each answer and call it a day lol

  • @d_b_
    @d_b_ 15 days ago

    The proposal at 17:00 seems natural; has it been done? It's just as parallelizable as a regular attention mechanism, isn't it?

  • @dimitriognibene8945
    @dimitriognibene8945 10 days ago

    Maybe the limited dimensionality is a form of regularization?

  • @Neomadra
    @Neomadra 14 days ago

    My suspicion for why so many people are reinventing RNNs is the lack of proper academic education and peer review. People just learn about transformers and the basic math and believe they have seen it all. Just a hunch, but nowadays everyone can upload stuff to arXiv without any quality control.

    • @erongjoni3464
      @erongjoni3464 13 days ago

      I'd be surprised if Google researchers weren't well aware of RNNs.
      I think it's more likely that a lot of people feel that SOME form of recurrence is going to be necessary for a model capable of system 2 thinking.

  • @TravellingTheWorldWideAndLarge
    @TravellingTheWorldWideAndLarge 14 days ago

    I think it is outrageous when journals don't require submission of the code for an algorithm to be accepted.
    What if, as part of the price of publication, the journal offered cloud compute resources to anyone who wants to test the algorithm?

  • @john_blues
    @john_blues 16 days ago +1

    What's up FAM?

  • @MasamuneX
    @MasamuneX 16 days ago

    The shaved head makes the LLMs work better; the power is building.

    • @clray123
      @clray123 13 days ago

      but wait, wasn't shaving your head supposed to deprive you of power?

  • @user-uc2qy1ff2z
    @user-uc2qy1ff2z 15 days ago

    Okay, we get it: transformers need some sort of latent representation to be able to think coherently about huge chunks of data.
    Okay, expected.
    But why are there four works that imply this, each calling it by a different name?

  • @lexer_
    @lexer_ 16 days ago +10

    To some degree I kind of get why you harp on the point that this is just reinventing recurrent neural networks, and it's really quite strange that nobody actually talks about these basically just being RNN architectures. But on the other hand, these are at least novel in that they manage to combine the magic of transformer attention with the magic of RNNs in a way that supposedly actually works well. Does it really matter if the "new" component they are introducing has been invented before, separately? The only real benefit of seeing this connection is that you might be able to transfer some of the RNN experience over to these new hybrid architectures. Or is it just the annoyance that it seems like they haven't properly studied the older ML literature and are kind of transformer kiddies who claim to have invented things out of ignorance?
    I am not trying to start a fight here! I am just curious where this frustration comes from.

    • @btnt5209
      @btnt5209 16 days ago +11

      Combining RNNs + attention is the precursor to the Attention Is All You Need paper (hence the "All You Need" part...). The commonality in all these papers seems to be "we claim good results on this particular dataset in this particular setting", but in reality it's very hard to reproduce the same good results in real-world settings, or even on other datasets or environments.

    • @axelmarora6743
      @axelmarora6743 16 days ago +2

      @@btnt5209 I still don't understand why this is still an active area of research when State Space Models have solved the Quadratic scaling problem (or so I thought). SSMs allow for optimal linear transfer of information between adjacent FFN, which is what this paper tries to do.

    • @kayemni
      @kayemni 16 days ago +1

      Going back to attention + RNN is not bad in itself; if they improve upon things and show that attention isn't all you need after all (at least not for all use cases), then it's great. But not acknowledging the huge similarities between their approach and what already exists (and was the default for some time) is quite disingenuous and should be pointed out. Not only does it introduce redundancy into the field's research, it also obscures the contribution to overall knowledge and the comparison to existing approaches, and yes, it also inflates their contribution. Just take a look at OpenReview and how harsh reviewers are on papers that don't properly cite similar work and compare their contribution against it.

    • @esalexander5807
      @esalexander5807 16 days ago +1

      Research as a process is reliant on building from and comparing with the prior art - "the point" is to improve the existing understanding. By ignoring well-established concepts from the literature (intentionally or through ignorance) the novelty and merit of the presented work are unnecessarily hard to determine, and doing so requires effort from each reader that the authors could reasonably have been expected to do (once) themselves.
      If the paper had instead been presented as a neuro-focused take on combining RNNs with transformers, with examples illustrating how and when that is successful, there would be less redundant information and likely more valuable insights and/or comparisons.

    • @descai10
      @descai10 16 days ago +1

      @@axelmarora6743 I'm wondering this myself as well. SSMs come out boasting 100x speed-ups for large models, and now it's just crickets, with everyone still using regular transformers. That being said, I did hear that they perform poorly at copying.

  • @syncrossus
    @syncrossus 16 days ago +3

    i thought attention was all i needed lol

  • @egor.okhterov
    @egor.okhterov 15 days ago

    Please start reviewing papers without backpropagation

  • @spencerfunk6697
    @spencerfunk6697 1 day ago

    We need to focus on KANs

  • @CharlesVanNoland
    @CharlesVanNoland 16 days ago

    They're on the right track, except for the fact that it's all predicated on backprop / gradient descent / automatic differentiation and thus totally incapable of online learning! It can only work with what it has been trained on.

  • @nathan9771
    @nathan9771 16 days ago

    woah

  • @johnnytshi
    @johnnytshi 16 days ago +2

    Let's put RNNs back

  • @gregmattson2238
    @gregmattson2238 16 days ago

    I get his frustration, but really, IMO the era of "totally new" approaches is likely dead. We are now in the refinement stage, with low-level techniques being minor variants of each other and the innovations being tweaks on existing paradigms.
    This may change, and there may be totally novel inventions down the road, but I'd be unsurprised if nothing new comes for years. What's important is the performance. If the performance is there, it is well worth publishing and reviewing.

  • @sharannagarajan4089
    @sharannagarajan4089 16 days ago +4

    If it isn’t a big deal, why are you reviewing it?

  • @SimonJackson13
    @SimonJackson13 16 days ago

    Integral and differential terms?

    • @SimonJackson13
      @SimonJackson13 16 days ago

      I said it kind of before. "The thing about an integral is the gradient is related."

  • @erickmarin6147
    @erickmarin6147 16 days ago +1

    I think there should be more work on identifying redundancy in the field, probably using AI itself

    • @erickmarin6147
      @erickmarin6147 16 days ago +1

      Maybe an RLHF dataset filled in by academics only

  • @scottmiller2591
    @scottmiller2591 15 days ago

    Unrolling with backprop is a mistake - more local learning, more sums of geometric series.

  • @anishbhanushali
    @anishbhanushali 16 days ago

    this is roast + tutorial ... a Roastorial !!

  • @juanjesusligero391
    @juanjesusligero391 16 days ago +4

    It's 23:24 where I live, please let me sleep XD

  • @meselfobviouslyme6292
    @meselfobviouslyme6292 16 days ago +1

    Thank you, Mr. Yannic, for your explanation of TransformerFAM: Feedback attention is working memory.

  • @asnaeb2
    @asnaeb2 16 days ago +2

    No cap on a stack fr

    • @ryzikx
      @ryzikx 16 days ago

      no hat on a tower

  • @dinogodor7210
    @dinogodor7210 14 days ago

    Hello, I haven't finished the video yet, but I already want to add that it's unfair to complain about people using different terms for neural architectures when they use a known concept in a different place in the architecture. Compare it to chip design: if you design a minimal chip that is Turing complete, all other designs around it just seem like distortions of it, yet a processor isn't just a minimal Turing machine; it contains a lot of machines that are optimized in some way, like ALUs or FPUs, which you could in turn use as fundamental building blocks to implement a Turing machine. My point is that what seems the same from a theoretical standpoint has very different consequences in practice, which makes it necessary, from an engineering point of view, to give it another name.

  • @naninano8813
    @naninano8813 16 days ago

    i am a strange cortical thalamic loop

  • @ArnaldurBjarnason
    @ArnaldurBjarnason 16 days ago +3

    Another episode of Yannic explaining how a paper is just an RNN 😆

  • @Sirmrmeowmeow
    @Sirmrmeowmeow 15 days ago +1

    Could you please do a video on the Larimar paper by IBM? It also has a bit of brain lingo in it.
    I think it could be used to carry important, dense context across inferences in a way that is almost semi-stateful, instead of just doing "knowledge updates"? 🧐
    Because if the next inference could be informed by the memory unit's stored context, that might help with long-term planning and more stateful, coherent responses, better stitching inferences together so that important context survives across them.
    ~It also makes me wonder if they could have further RLHF'd it to use the memory unit's stored context, and refined its use appropriately, i.e. learning to maintain important information relevant to the current tasks and goals across inferences as needed. :x
    arxiv pdf/2403.11901
    arxiv abs/2403.11901

  • @mennovanlavieren3885
    @mennovanlavieren3885 16 days ago +4

    I'm 7:48 in and you've convinced me not to waste my time by continuing to watch.
    I get your point, but if you want to increase the view count you need to sell the paper better. 🙃

    • @cajampa
      @cajampa 16 days ago

      Thanks for saving me the time.
      Seems like another "it's just an RNN" paper video, judging from the comments.

    • @kayemni
      @kayemni 16 days ago +1

      I'd rather he didn't, and kept his content honest, especially considering its academic nature. Baiting people into watching useless content shouldn't be normalized; it should be punished!

    • @ChaoticNeutralMatt
      @ChaoticNeutralMatt 16 days ago

      I don't think that's per se his goal to begin with? You don't have to watch.

  • @yakmage8085
    @yakmage8085 15 days ago

    He don’t care

  • @zerotwo7319
    @zerotwo7319 16 days ago

    I like the attempt to suggest some biological inspiration, "working memory", but it could be that it just isn't that. It would be nice to see something truly biologically inspired.

    • @mennovanlavieren3885
      @mennovanlavieren3885 16 days ago +1

      Look into the work of Jeff Hawkins (A Thousand Brains Theory). He's been interviewed by Lex Fridman a couple of times and has a research company, Numenta, which publishes a lot of their work.

    • @drdca8263
      @drdca8263 16 days ago +1

      Has anyone gotten transformers to work with spiking neural networks?

    • @zerotwo7319
      @zerotwo7319 16 days ago

      @@mennovanlavieren3885 There are a lot of presuppositions in 'the brain makes a model of the world' (it is just Platonism for the masses). I personally don't work with that theory, because our models could be wrong (for example, magic, faith) and people would still be considered intelligent. It has to have something to do with motion, because the motor cortex sits right beside the prefrontal cortex, and it has to involve more ancient parts of the brain... The cortex is overrated: if you look at other brains, they don't need such a large cortex and are still intelligent. (Also, most research in this area conveniently has something to do with old research, or with a math model they can tweak or update, not something new.)
      Also, all that information is not only encoded in the cortex; many other parts of the brain have super-rigid functions, or double and triple functions. It is a mess of wires down there.
      Such 'systems of systems' are more like what intelligence would look like, not 'conveniently, the cortex does everything'; it is not feasible for it to have a model of an object in each column. Lex Fridman is just a podcast host; they shut up and let the interviewee talk about whatever. I can't blame him, it is a peaceful life.
      TL;DR: it has to have something to do with the cerebellum-thalamus-motor cortex circuit first.

    • @bernardoramos9409
      @bernardoramos9409 16 days ago

      @@drdca8263 yes, search for SpikeFormer. There is more than 1 implementation

  • @transquantrademarkquantumf8894
    @transquantrademarkquantumf8894 16 days ago +1

    Great show. Your truth and brutal honesty are incredibly refreshing, and your noting of origins does justice to the original work; once again you're giving proper credit for work and participation. One of your best shows of the year. Even though the discoveries may not seem to be the greatest, I thank you very much for bringing illumination and putting it on the table.
    Sincerely, Michael Crosby, CRYPTEWORLD

  • @CM-mo7mv
    @CM-mo7mv 16 days ago +1

    Wonder when they'll finally reinvent ART 🙄 (no, not artistic pictures)

    • @martinschulze5399
      @martinschulze5399 16 days ago

      hehe. People have come up with DL architectures for that as well.