I'm the author :) As a self-taught ML practitioner, when I started studying deep learning, Yannic was my favorite teacher. I've learned so much from you. I'm so honored that Yannic featured my paper today. After watching your review, I made some changes to the paper and re-uploaded it to arXiv.
I'd like to add some supplementary explanations to Yannic's comments:
1. Feedback loop vs. Recurrent
As Yannic mentioned, it's essentially incorporating recurrent features into the Transformer architecture. RNNs were the first to implement a feedback loop, in a sequential, step-by-step manner, while the paper implements feedback using attention. When the paper frames feedback in RNN terms, it means something closer to a Markov process or an autoregressive concept. The paper mentions that LSTMs and GRUs successfully implemented feedback loops within the RNN framework.
2. Stop Gradient and Gradient Checkpointing
Sorry for the bad English; I rewrote that part. The idea is that previous research suggested using stop gradient to improve computational performance. However, with gradient checkpointing, all computations need to be redone anyway. So, whether or not you use stop gradient, there is no noticeable impact on speed or memory usage. This is why I recommend eliminating the use of stop gradient. (A rough sketch of this interaction follows below this comment.)
3. 18:00 simpler algorithm
Thank you, Yannic, for the compliment on Appendix C, "Don't". In C.3, I addressed the simpler algorithm that Yannic suggested. Actually, this was the very first method I tried. Unfortunately, it failed to remember the PassKey for long sequences, and the algorithm became a bit more convoluted, as Yannic mentioned.
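To illustrate points 1 and 2 for anyone skimming the thread, here is a minimal PyTorch sketch (my own illustration, not code from the paper; `SegmentBlock`, `run`, and all shapes are made up). It carries a small memory across blocks via attention, and wraps each block in gradient checkpointing, so the block's forward pass is recomputed during backward whether or not the memory is detached, which is why the stop gradient buys essentially nothing here.

```python
# Minimal sketch, not the paper's code. Hypothetical names: SegmentBlock, run.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SegmentBlock(nn.Module):
    """One block that attends over [carried memory + current segment] (feedback via attention)."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, segment, memory):
        kv = torch.cat([memory, segment], dim=1)              # memory tokens join the keys/values
        out, _ = self.attn(segment, kv, kv)
        new_memory = self.ff(out.mean(dim=1, keepdim=True))   # compressed state carried forward
        return out, new_memory

def run(segments, block, memory, stop_gradient: bool):
    for segment in segments:                                  # iterate over context blocks
        if stop_gradient:
            memory = memory.detach()                          # what prior work suggests
        # Checkpointing drops activations and recomputes this forward pass during
        # backward regardless of the detach, so the detach saves little time/memory.
        _, memory = checkpoint(block, segment, memory, use_reentrant=False)
    return memory

block = SegmentBlock()
segments = [torch.randn(2, 16, 64) for _ in range(4)]         # 4 blocks of 16 tokens each
memory0 = torch.zeros(2, 1, 64)
loss = run(segments, block, memory0, stop_gradient=False).pow(2).mean()
loss.backward()
```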
Congrats on your work. Feel free to keep in touch if you need some kind of independent review / advice on your strategy.
Are you aware of ERNIE-DOC: A Retrospective Long-Document Modeling Transformer (Ding et al, 2021)? They also introduce recurrence by passing the state to the next step, but one layer below. Found it via lucidrains x-transformers library :)
This is science at its best
Having studied neuroscience in undergrad and now working as an ML practitioner, I really appreciate how your team tried to relate the architecture of your transformer to neuro! There should be a huge push for this, given that we have seen such huge success with even a simple adoption of neuron-like units and neural networks.
@@arnebinder1406 I didn't know ERNIE-DOC, but the idea is very similar to "C.2. Feedback Memory Segment" in the "Appendix C. Don't" section. Thank you for letting me know; I'll cite the paper. BTW, this architecture failed to remember the PassKey. In my opinion, an individual token struggles to carry both local and global info at the same time.
At this point, Yannic is in desperate need of someone who doesn't reinvent but instead creates a new invention. Here, a pat on the back for your contribution of sharing academic papers with us.
I think in the next two to three years, the moment will finally come, when someone reinvents attention and calls it something like "self-aware in-context persistent memory", sprinkling it with some neuroscience mumbo-jumbo gibberish.
Persistently
Engaging
Neurological
Initiatives
Stimulation
I think your timeline is optimistic. I give it at least 5 years.
Mamba #5
No, you're wrong. No one moment will ever come where an AI will suddenly be discretely self-aware. Self-awareness is one of those vague concepts that either has no concrete meaning or, to the extent it can be well defined, will be a matter of degrees (i.e., "model A is more self-aware than model B"). Your comment is ironic in that you attack an actual scientific paper's terminology as "mumbo jumbo" - I would argue the phrase "self-aware" ITSELF is mumbo jumbo!
@@marshallmcluhan33 selective-scan structured state-space sequence model, AKA a linear RNN
Thank you, Yannic, for making these papers understandable and breaking them down with your insight and understanding. Very interesting. 🙏
So if the paper framed itself as, “because of neuroscience reasons, the way to improve transformers should be to combine them with an RNN (specifically, in a way like this)”, specifically describing it as a form of, or variation on, RNNs, would that have alleviated most of your issues with it?
So I think you make a pretty convincing argument that this is a repackaged form of recurrent neural network. And yes, you're right that a long time ago people were using these as neurally inspired architectures that weren't anywhere near as successful as transformers. Now what I'm wondering is whether they failed because they didn't have the transformer architecture underneath them, which was more similar to organized long-term memory and learning. Maybe recurrent neural networks are almost useless by themselves, but built on top of transformers they provide the powerful equivalent of what we in neuroscience call working memory, and the architecture, combined in this way, can take things to the next level. I know that performance metrics can be gamed and can be very misleading, but ultimately it doesn't matter if we're doing something similar to what we did in the past; if this particular arrangement leads to significant performance gains on certain kinds of tasks, it still might be valuable even if it's recycled. A lot of technological progress occurs through the repurposing of old inventions in a new context.
I apologize if this is a very off-base comment; I'm just really starting to learn this stuff and there are a lot of holes in my background because I'm coming at this from more of a neuroscience perspective. I can say that interrupting the cortico-thalamic loop produces humans that might know a lot of stuff and even give appropriate responses, but it leaves knowledgeable people that are just reactive entities, can't get anything done, and lose track of where they are all the time. Those kinds of problems seem present in current entities like ChatGPT 4.0 and Claude Opus, which is what I have the most experience with.
❤
RNN: "You cannot live with your own failure. Where did that lead you? Right back to me."
All paths lead to Schmidhuber
26:53 - "I have a different suspicion of why all of this, or any of this might work... there's a lot of things carried by the residual connections... you have a completely pass-through path..."
BOOOoOOOM!
Your style is very entertaining. I laughed at the "feature, not a bug" section, ~34 minutes into the video.
I want models that can modify or add weights on the fly. I want them to have better long-term memory without having to go back and read over the RAG results and hope it got what I want.
I want a unicorn 🦄
@@chispun2 But only if it has wings.
@@zyxwvutsrqponmlkh I think that's called a pegasus?
LoRA is kind of like that, but interfaces to update it are in their infancy because researchers are still sorting out the right way to do it.
@@drdca8263 Pegasus doesn't have the horn.
It's interesting how they describe working memory as this kind of layer-wise interaction (at least according to these authors). In flexible, hippocampally dependent memory we see multiple recurrence across many system layers as well as layer-wise specialization in things like pattern separation (DG/CA3), pattern completion (CA3), and relational binding (CA1). We know the hippocampus is critical in humans for creative, flexible reasoning. I feel like it's the brain region and memory system we're missing, not PFC-mediated working memory...
It's closer to Block-Recurrent Transformers than Transformer-XL. Transformer-XL reused the output of the previous layer rather than a layer's own, which makes sense for anything other than ALBERT, where each layer is the same. Also, the only reason XL stops backprop is its publishing year: they simply had no resources to propagate that far back.
Also, the BRT authors were like "oh yeah, we have VRAM now, so we let backprop through more of the context window"; they called it the "Slide" model.
OK, after watching the whole video, it's closer to RMT, except they route memory to the same layer. I don't understand why they didn't go the RMT route and just compute the memory tokens normally (by appending the memory) without a second attention. (A rough sketch of the routing difference is below.)
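For readers following along, a tiny sketch of the routing difference described here (my own illustration, not code from Transformer-XL, BRT, RMT, or this paper; all names are made up). With `same_layer=False`, a layer attends to the previous segment's states from one level below, Transformer-XL style; with `same_layer=True`, it attends to its own previous output, the same-layer / feedback-style routing.

```python
# Illustrative sketch only; not code from any of the cited papers.
import torch
import torch.nn as nn

def segment_forward(layers, x, cache, same_layer: bool):
    # cache[0]     = previous segment's embeddings
    # cache[i + 1] = previous segment's output of layer i
    states = [x]
    h = x
    for i, layer in enumerate(layers):
        mem = cache[i + 1] if same_layer else cache[i]   # own output vs. one level below
        kv = torch.cat([mem.detach(), h], dim=1)         # stop-gradient on the cache, as Transformer-XL does
        h, _ = layer(h, kv, kv)                          # attend over cached memory + current segment
        states.append(h)
    return h, states                                     # `states` becomes the next segment's cache

d, n_layers = 64, 4
layers = nn.ModuleList([nn.MultiheadAttention(d, 4, batch_first=True) for _ in range(n_layers)])
x = torch.randn(2, 16, d)
cache = [torch.zeros(2, 16, d) for _ in range(n_layers + 1)]
out, cache = segment_forward(layers, x, cache, same_layer=False)  # Transformer-XL routing
out, cache = segment_forward(layers, x, cache, same_layer=True)   # same-layer (RMT/FAM-style) routing
```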
Thanks for the explanation.
Total aside... I downloaded Llama-3 and set the context window to 8k and it worked fine. I boosted it up to 9k and got pure gibberish. That was the first time using too big a context size had that happen; the closest I'd gotten before was that with too big a context size my program simply crashed as I ran out of memory.
Happily, fine-tuned(?) models have come out with longer context lengths. But I found it interesting. I obviously don't know enough about what's going on under the hood to fully understand why, exactly, that happened (a quick check is sketched below)... but videos like this give me a better foundation to make guesses.
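Not a full explanation of the gibberish, but one likely factor is asking for more context than the model was trained/configured for. A quick sanity check (assuming you have the Hugging Face checkpoint; the model name here is just an example):

```python
# Quick check of the configured/trained context length; purely illustrative.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")  # or your local path
print(cfg.max_position_embeddings)  # requesting much more context than this often degrades output
```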
Wow, great paper review. You shed light so well on the key mechanism of the net. Great job in scientific dissemination.
The biological point about single neurons / small groups of neurons in the loop diagram at 2:17 reminded me of how we remember the past and imagine the future, as well as the sci-fi movie 'Inception' (2010).
To some degree I kind of get why you harp on this point a lot, that this is just reinventing recurrent neural networks. And it's really quite strange that nobody actually talks about these basically just being RNN architectures. But on the other hand, these are at least novel in that they manage to combine the magic of transformer attention with the magic of RNNs in a way that supposedly actually works well. Does it really matter if the "new" component they are introducing has been invented before separately? The only real benefit of seeing this connection is that you might be able to transfer some of the RNN experience over to these new hybrid architectures. Or is it just the annoyance that it seems like they haven't properly studied the older ML literature and are kind of these transformer kiddies that claim to have invented stuff out of ignorance?
I am not trying to start a fight here! I am just curious where this frustration comes from.
Combining RNNs + attention is the precursor to the Attention Is All You Need paper (hence the "All You Need" part...). The commonality in all these papers seems to be "we claim good results on this particular dataset in this particular setting", but in reality, it's very hard to reproduce the same good results in real-world settings or even on other datasets or environments.
@@btnt5209 I still don't understand why this is still an active area of research when state space models have solved the quadratic scaling problem (or so I thought). SSMs allow for optimal linear transfer of information between adjacent FFNs, which is what this paper tries to do.
Going back to attention + RNN is not bad in itself; if they improve upon things and show that attention isn't all you need after all (at least not for all use cases), then that's great. But not acknowledging the huge similarities between their approach and what already exists (and was the default for some time) is quite disingenuous and should be pointed out. Not only does it introduce redundancy into the research in the field, it also obfuscates the contribution to overall knowledge and the comparisons to existing approaches, and yes, it does also inflate their contribution. Just take a look at OpenReview and how harsh they are on papers that don't effectively cite similar work and compare their contribution to it.
Research as a process is reliant on building from and comparing with the prior art - "the point" is to improve the existing understanding. By ignoring well-established concepts from the literature (intentionally or through ignorance) the novelty and merit of the presented work are unnecessarily hard to determine, and doing so requires effort from each reader that the authors could reasonably have been expected to do (once) themselves.
If the paper had instead been presented as a neuro-focused take on combining RNNs with transformers, with examples illustrating how and when that is successful, there would be less redundant information and likely more valuable insights and/or comparisons.
@@xxlvulkann6743 I'm wondering this myself as well. SSMs came out boasting 100x speed-ups for large models, and now it's just crickets, with everyone still using regular transformers. That being said, I did hear that they perform poorly at copying tasks.
This framework reminds me of a [CLS] token added at the end of each block. As shown in the following paper, adding a few [CLS]-like tokens, called registers, at the beginning improves performance: "Vision Transformers Need Registers", ICLR 2024.
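A minimal sketch of the registers idea from that paper (my own illustration, not the authors' code; the class name and dimensions are made up): a few learnable tokens are prepended to the patch sequence, take part in attention as scratch slots, and are simply dropped at the output.

```python
# Illustrative sketch of "registers"; not the paper's implementation.
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, d_model=384, n_registers=4, n_layers=6, n_heads=6):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(1, n_registers, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_registers = n_registers

    def forward(self, patch_tokens):                     # (B, n_patches, d_model)
        reg = self.registers.expand(patch_tokens.size(0), -1, -1)
        x = torch.cat([reg, patch_tokens], dim=1)        # registers act as extra scratch tokens
        x = self.encoder(x)
        return x[:, self.n_registers:]                   # discard registers at the output

model = ViTWithRegisters()
out = model(torch.randn(2, 196, 384))                    # 14x14 patches -> (2, 196, 384)
```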
What you are drawing at 17:00, isn't that exactly what "Vision Transformers Need Registers" (DINOv2) proposed?
Transformers use masking during training and are trained to predict the next token given previous tokens. I wonder if the memory mechanism / RNN node should also be trained to do the opposite: mask off the prefix and, given the current tokens and the previous memory, predict the previous tokens.
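A rough sketch of the auxiliary objective this comment suggests (purely illustrative, nothing from the paper; the bag-of-words style reconstruction head and all names are my own simplification): given the carried memory and a crude summary of the current block, a small head is also trained to reconstruct the previous block's tokens.

```python
# Illustrative auxiliary "predict the previous tokens" loss; a real setup would
# decode position-wise instead of this crude bag-of-words reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 1000, 64
embed = nn.Embedding(vocab, d)
reverse_head = nn.Linear(2 * d, vocab)                   # scores over previous-block tokens

def reverse_loss(memory, current_tokens, previous_tokens):
    # memory: (B, d); current_tokens, previous_tokens: (B, T)
    cur = embed(current_tokens).mean(dim=1)               # crude summary of the current block
    feats = torch.cat([memory, cur], dim=-1)              # (B, 2d)
    logits = reverse_head(feats)                          # (B, vocab), shared across positions
    logits = logits.unsqueeze(1).expand(-1, previous_tokens.size(1), -1)
    return F.cross_entropy(logits.reshape(-1, vocab), previous_tokens.reshape(-1))

B, T = 2, 8
loss = reverse_loss(torch.randn(B, d),
                    torch.randint(0, vocab, (B, T)),
                    torch.randint(0, vocab, (B, T)))
```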
Isn’t reprompting with a model’s own output just this, but at a higher level? (it’s effectively concatenating a state to a transformer, that came from a transformer…)
This is what I’m pickling on too…. next token generation would seem to supersede and generalize hidden state juggling. The working memory is the “let’s think step by step”.
Well, I can instantly say they are onto something, but I'm not sure it's the correct way.
I've been experimenting with ANNs for 15 years. For the last 5 I've been working on a structure where the signal never dies. I want the "loop", the amalgamation of the last states, to pulse through the net, meaning different things to different areas, preserving the "state" and working memory until the next input comes, so that the input is affected by the current "mood" of the net.
I'm still struggling with this approach, but it is what is needed for real AI. In us, the signal never dies; it needs to be there to propagate more signal.
I was blown away when people solved my difficult task in a simple way with LLMs: "just have many agents interacting with each other". Yeah, together they then achieve what I want my single net to do, the signal never dying, but I want it in a single net.
I've been contemplating similar ideas--my intuition is that the big limitation with the transformer design is the linear request/response structure. What's needed is a physical loop of layers with an input structure and an output structure woven into the loop, creating a signal of updates continually propagating around the loop. You want to establish a standing-wave state around the loop that can be perturbed by the inputs and monitored by the outputs. The signal perturbation would be the communication--i.e., when humans talk we're modulating frequency, so send a signal in, let it integrate into the feedback loop, and monitor the perturbations at the exit. Toss out the single forward-propagation pass and replace it with these self-oscillating feedback loops on a large scale--the entire structure a loop of layers, which would need an inhibitory function to prevent runaway feedback. Imagine speaking into it, and hearing a voice coming out of it. (A toy sketch of this is below.)
@@MarkDStrachan EXACTLY! That sounds like my thinking
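A toy sketch of the "loop of layers with a persistent, damped signal" idea in these two comments (illustrative only; `LoopNet`, the damping factor, and all sizes are invented): the state keeps circulating between inputs, the input perturbs it, and a damping term stands in for inhibition against runaway feedback.

```python
# Toy illustration of a self-sustaining loop of layers; not from the paper.
import torch
import torch.nn as nn

class LoopNet(nn.Module):
    def __init__(self, d=64, n_layers=4, damping=0.9):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])
        self.readout = nn.Linear(d, d)
        self.damping = damping                    # crude "inhibition" against runaway feedback
        self.state = torch.zeros(1, d)            # persists between inputs

    def step(self, x=None, n_ticks=4):
        for _ in range(n_ticks):                  # the signal keeps circulating even with no input
            h = self.state
            if x is not None:
                h = h + x                         # perturb the loop with the input
            for layer in self.layers:
                h = torch.tanh(layer(h))
            self.state = self.damping * h         # feed back into the loop, damped
        return self.readout(self.state)           # monitor the loop's perturbations

net = LoopNet()
out1 = net.step(torch.randn(1, 64))               # "speak into it"...
out2 = net.step(None)                             # ...and it keeps ringing afterwards
```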
The problem with the simple algorithm you proposed is that when the generation reaches the block (or context) length in the middle of a sentence, it is not so straightforward to generate the next token without the previous ones while maintaining the sentence structure.
Funny that I thought of the same idea right after watching your last review on infini attention lol
Hey Dr. Kilcher, you should put your reputation where your snark is! 😮😂 Show us how you would architect your recurrent neural network to act as short-term memory. Snark aside, I really would like to see how that might work. Would you just prepend an RNN to a transformer?
Do you give RWKV any chance?
This field bounces back and forth between pure engineering and the cognitive sciences: first they were inspired by parallel distributed processing and neuron-like units like the brain, then it was all about optimisation and infrastructure (cognitive science was irrelevant), and now they've realised they'll have to go back and get more inspiration from the cognitive/biological sciences to achieve AGI. IMO feedback loops were inevitable from the start. How much impact will this paper really have?
Well, I do think all those new papers have one thing in common: they create "a hierarchy of memory" with increasing granularity and compute from less recent to most recent data. What's missing, I think, is not just going with a hierarchy from linear to quadratic, but from linear to quadratic to *cubic*.
Maybe the limited dimensionality is a form of regularization?
The proposal at 17:00 seems natural; has it been done? It's just as parallelizable as a regular attention mechanism, isn't it?
Why does it bother you so much that these papers are basically reprising RNNs? I mean yes, that's what they're doing, but they're doing it in different ways (there's countless variations on the RNN itself after all), so what's the problem?
RNNs were always an intuitively good idea anyway, held back by vanishing gradient/info loss over thousands of tokens. All these papers, by "recurring" over big chunks rather than token by token, basically solve this. I think it's really exciting.
That is extremely slow though. RWKV is a better approach
To answer your question:
1. They don't refer to RNNs; instead they are blabbering about neuroscience.
2. They don't compare to RNNs; instead they compare to block-wise attention, which clearly performs worse, and even then the improvement is minor.
Like Yannic says, it is clear that they missed something crucial and they are blurring it with blah-blah.
The problem is that they don't acknowledge that their approach is basically an RNN. If they presented it as an RNN + attention variant, it would have garnered less attention but would be more honest, and if they had effectively compared to RNNs and shown even a slight increase in performance, it would have been really good. The problem here is that they are obfuscating their contribution and how it relates to previous knowledge in the field, and don't even compare to RNNs... And don't get me started on the redundancy they are introducing, which is never good.
Also, they do all of this fancy talk about working memory in neuroscience, which is basically BS, since that isn't what working memory actually is, not even close. Real working memory is unbounded with respect to the input size, (theoretically) similar to a Turing machine or the RAM of a Von Neumann machine; this kind of "working memory" in RNNs is linearly bounded with respect to the input size, which is much more similar to linear-bounded Turing machines and stack automata.
When Mamba, an RNN, comes with claims of outperforming transformers, I kinda like seeing benchmarks against proper RNNs.
It's a form of convergent evolution for ML: Silicon Valley, after some time, either reinvents the train or the RNN.
If it isn’t a big deal, why are you reviewing it?
I think I have actually thought about a similar idea of a handful of output tokens being paired with input tokens, and it sounds like that's what's happening here... And you say this is basically an RNN, but for this kind of transformer system? Alright, fair enough.
I often also propose something to pair with it to give the most versatility: have a sort of data table, akin to a notebook, that it can write to. Some of its outputs act as coordinates on this table, some as "here's what to write", one simply determines whether or not to write at all, and another set of coordinates selects what to read for the next loop of the neural network. Having this kind of "notepad", as well as a means to make a short-term memory by looping output straight back to input, could allow it to better remember both long-term and short-term information at the same time (rough sketch below).
I imagine it's probably simple enough for all this to be implemented in a single system, especially one where this "RNN" functionality, as you've described it, is already implemented.
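A small sketch of the "notepad" described above (my own illustration; this is roughly the soft read/write external-memory pattern of Neural Turing Machines, not anything from the paper, and every name and size is made up): write coordinates, a write gate, content to write, and read coordinates all come from the network's outputs.

```python
# Illustrative soft read/write "notepad"; all names and sizes are invented.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Notepad(nn.Module):
    def __init__(self, slots=16, width=32, ctrl=64):
        super().__init__()
        self.memory = torch.zeros(slots, width)           # the data table / notebook
        self.where_to_write = nn.Linear(ctrl, slots)      # write "coordinates"
        self.what_to_write = nn.Linear(ctrl, width)       # content to write
        self.write_gate = nn.Linear(ctrl, 1)              # whether to write at all
        self.where_to_read = nn.Linear(ctrl, slots)       # read "coordinates"

    def step(self, ctrl_state):                           # ctrl_state: (ctrl,) from the network
        addr = F.softmax(self.where_to_write(ctrl_state), dim=-1)
        gate = torch.sigmoid(self.write_gate(ctrl_state))
        content = self.what_to_write(ctrl_state)
        self.memory = self.memory + gate * addr.unsqueeze(1) * content  # soft, gated write
        read_addr = F.softmax(self.where_to_read(ctrl_state), dim=-1)
        return read_addr @ self.memory                    # read vector fed into the next loop

pad = Notepad()
read_vec = pad.step(torch.randn(64))                      # (32,) to feed back as short-term memory
```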
I feel heard by you! Thank you!
Can you do a video on how you analyze papers, how to find the limitations of papers, something like that? Critical analysis of computer science papers, if that's possible?
Please start reviewing papers without backpropagation
Okay, we get it. Transformers need some sort of latent representation to be able to think coherently about huge chunks of data.
Okay, to be expected.
But why are there four works that imply it, but call it by different names?
They're on the right track, except for the fact that it's all predicated on backprop / gradient descent / automatic differentiation and thus totally incapable of online learning! It can only work with what it has been trained on.
Transformer-XL attends to the layer below, not the same layer.
My suspicion as to why so many people are reinventing RNNs is the lack of proper academic education and peer review. People are just learning about transformers and the basic math and believe they have seen it all. Just a hunch, but nowadays everyone can upload stuff to arXiv without any quality control.
I'd be surprised if Google researchers weren't well aware of RNNs.
I think it's more likely that a lot of people feel that SOME form of recurrence is going to be necessary for a model capable of system 2 thinking.
The shaved head makes the LLMs work better; the power is building.
But wait, wasn't shaving the head supposed to deprive you of power?
I think there should be more work on identifying redundancy in the field, probably using AI itself
Maybe an RLHF dataset filled in by academics only
It is well known that research evolves in a spiral.
Hello, I didn't finish the video yet, but I already want to add that it's unfair to complain about people using different terms for neural architectures when they use a known concept at a different place in the architecture. Compare it to chip design: if you design a minimal chip that is Turing-complete, all other things you'd design around it just seem like distortions of it. Yet a processor isn't just a minimal Turing machine; it has a lot of machines in it that are optimized in some way, like ALUs or FPUs, which you could themselves use as fundamental building blocks to implement a Turing machine. My point is that what seems the same from a theoretical standpoint has absolutely different consequences in effect, and that makes it necessary to give it another name from an engineering point of view.
What's up FAM?
I'm 7:48 in and you've convinced me not to waste my time by continuing to watch.
I get your point, but if you want to increase viewer count you need to sell the paper better. 🙃
Thanks for saving me the time.
Seems like an "it's just another RNN paper" video, judging from the comments.
I'd rather he didn't and kept his content honest, especially considering its academic nature. Baiting people into watching useless content shouldn't be normalized, but punished!
I don't think that's per se his goal to begin with? You don't have to watch.
Thank you, Mr. Yannic, for your explanation of TransformerFAM: Feedback Attention Is Working Memory.
Unrolling with backprop is a mistake: more local learning, more sums of geometric series.
Let's put RNNs back.
It's 23:24 where I live, please let me sleep XD
move!
Same time zone here. How dare he! 😄
lol, another sleepless night here in Italy
I thought attention was all I needed lol
Integral and differential terms?
I kind of said it before: "The thing about an integral is that the gradient is related."
this is roast + tutorial ... a Roastorial !!
Transformers need some new kind of memory?
We need to focus on KANs.
I get his frustration, but really, IMO the era of "totally new" approaches is likely dead. We are now in the refinement stage, with low-level techniques being minor variants of each other and the innovations being tweaks on existing paradigms.
This may change and there may be totally novel inventions down the road, but I'd be unsurprised if nothing new comes for years. What's important is the performance. If the performance is there, it is well worth publishing and reviewing.
Another episode of Yannic explaining how a paper is just RNN 😆
Could you please do a video on the Larimar paper by IBM? It also has a bit of brain lingo in it.
I think it could be used to carry important, dense context across inferences in a way that is almost semi-stateful instead of just doing "knowledge updates". 🧐
Because if the next inference could be informed by the memory unit's important context, that might help with long-term planning and more stateful, coherent responses, better stitching the inferences together so important context survives across inferences.
Makes me wonder if they could have further RLHF'd it to use the memory unit and the context stored in it, and refined its use appropriately, i.e., learning to maintain important information relative to tasks and current goals across inferences as needed. :x
arxiv pdf/2403.11901
arxiv abs/2403.11901
No cap on a stack fr
no hat on a tower
I like the attempt to suggest some biological inspiration ("working memory"), but it could be that it's just not that. It would be nice to see something truly biologically inspired.
Look into the work of Jeff Hawkins (A Thousand Brains Theory). He's been interviewed by Lex Fridman a couple of times and has a research company, Numenta, which publishes a lot of their work.
Has anyone gotten transformers to work with spiking neural networks?
@@mennovanlavieren3885 There are a lot of presuppositions in "the brain makes a model of the world" (it is just Platonism for the masses). I personally don't work with that theory, because our models can be wrong (for example, belief in magic or faith) and people are still considered intelligent. It has to have something to do with motion, because the motor cortex is right beside the prefrontal cortex, and it has to do with more ancient parts of the brain... The cortex is overrated. If you look at other brains, they don't need such a large cortex and are still intelligent. (Also, most research in this area conveniently has something to do with old research or a math model they can improvise on or update, not something new.)
Also, all that information is not only coded in the cortex. Many other parts of the brain have super rigid functions, or double or triple functions; it is a mess of wires down there.
These "systems of systems" are more like what intelligence would look like, not "conveniently the cortex does everything"; it is not feasible that it has a model of an object for each column. Lex Fridman is just a podcast host; they shut up and let the interviewee talk whatever. I can't blame him. It is a peaceful life.
TL;DR: it has to have something to do with the cerebellum - thalamus - motor cortex circuit first.
@@drdca8263 yes, search for SpikeFormer. There is more than 1 implementation
woah
i am a strange cortical thalamic loop
corthaloop
He don’t care
Great show. Your truth and brutal honesty are incredibly refreshing, and your noting of origins, once again giving credit for work and participation, is proper. One of your best shows of the year. Even though the discoveries may not seem to be the greatest, I thank you very much for bringing illumination and putting it on the table.
Sincerely, Michael Crosby, CRYPTEWORLD
I think it is outrageous when journals don't require the submission of code for the acceptance of an algorithm.
What if, as part of the price of publication, the journal offered cloud compute resources to anyone who wants to test the algorithm?
Wonder when they'll finally reinvent ART 🙄 (no, not artistic pictures)
hehe. People have come up with DL architectures for that as well.
Just ask ChatGPT to write a summary of the conversation after each answer and call it a day lol
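Tongue-in-cheek as it is, this pattern is real; a tiny sketch of rolling-summary "memory" (illustrative only; `generate` is a stand-in for whatever LLM call you actually use, not a real API):

```python
# Rolling-summary memory sketch; replace `generate` with your actual LLM call.
def generate(prompt: str) -> str:
    return "..."  # placeholder for a real model/API call

def chat_with_summary_memory(user_turns):
    summary = ""
    for user_msg in user_turns:
        answer = generate(f"Conversation summary so far: {summary}\nUser: {user_msg}\nAssistant:")
        # After each answer, compress the whole exchange back into the "memory".
        summary = generate(f"Briefly summarize this conversation:\n{summary}\n"
                           f"User: {user_msg}\nAssistant: {answer}")
        yield answer

for reply in chat_with_summary_memory(["Hi!", "What did I just say?"]):
    print(reply)
```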