Thanks for discussing our paper on EM-LLM! Glad you found it interesting! I also have a video from the first author here: ruclips.net/video/gWoh_5fsZpA/видео.html - if you are interested in more insights.
glad you enjoyed the video!
@@Tunadorable It was amazing :D
My course on Human Memory was taught by Michael Kahana, one of the names in the citations that kept popping up. Very interesting to see our in-class temporal contiguity effect demonstration playing out in an AI neuroscience context, wow! Small world in academia :)
In theory it works, but not practically. Systems like these need to be coupled with thinking tokens, so that the most semantically relevant segments are retrieved based on attention (more specifically, model reasoning, like humans do) rather than relative segment similarity... BUT there are a lot of ideas I took from this part, like NLL for novel observations and event boundary detection. FYI, this is what I used to actually make Quiet-STaR useful, explicitly but autonomously allowing the model to generate useful thoughts, and I also use it as the basis for a new style of meta self-supervision I created for the offline token re-weighting phase. So, all in all, pretty amazing ideas in this paper; the value of some of the underlying principles is vastly understated. Great vid bro. No paper is safe lol. I see you meant that ha. Keep them coming bro.
ponderocity rambanctious reciprocity segmentation
I personally think that the mechanism behind human episodic memory is far more complicated than this. When humans return to a specific situation, they can instantly recall things that happened decades ago. Does the human brain really store KV caches for decades? I don't believe it.
I think we encode some kind of sparse representation. Also, our dreams etc. seem to really help us form long-term memories, so maybe it requires a kind of dreaming, so to speak, plus a lot of reflection and dynamic updating of our connections and weights. And there are many aspects we don't understand at all yet.
Aren't we also sort of remembering memories of memories? Like the memory gets rewritten every time it comes up. Off the top of my head I can't remember anything I haven't remembered in years, unless specifically triggered by something like a smell or music.
Indeed, memories are not stored "in the physical brain" as we understand locality, physicality and the brain, but instead in the so-called biofield, the magnetosphere and other internally coherent, nested, interpenetrating domains of a spatiotemporally distributed holarchy which comprises the physical correlates of our past, present and future selves. The intricacies of these concepts are well beyond your command, in stark contrast to the straightforward matter of your car's extended warranty, which we'd like to discuss with you now.
@@attilaszekeres7435 Human magnetosphere, biofield? Bullshit. Try science, not magic.
Humans "in-paint" memories from bits and pieces, many studies have demonstrated eye-witnesses are not reliable.
This might solve the 'frame problem' which early, more procedural approaches to AI found difficult. Context is all about working out what is important, and an expanding context window would effectively be a solution to the basic problem of working out what IS relevant information in a given situation.
Thank you!
I tried something that I think is similar to this (without the math part). My idea was to convert conversations into tokens for storage, and when a new prompt was entered it would look up past events and pull things that matched closely, which in theory would act as memories of related topics based on the token vectors. It didn't work because I don't know enough about the intricacies of tokenization and the math (basically it wasn't as plug-and-play as I was hoping for), so I did the next best thing and stored those past conversations as text logs, which I would then look through with each prompt to find similar topics. In the end I actually used the LLM to do this analysis/search first, then pulled the first few random good matches and incorporated them into the prompting.
Even with this much less effective method, it does seem to remember things. I think it only worked because I used an uncensored model that had no limit on input length. I was hoping for a different approach to try, but as you went through the paper a lot of it felt familiar in its general approach. I do think the token-vector direction would work a lot better and faster, since it's a much better way to compare concepts than textual search.
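A minimal sketch of the embedding-lookup idea described above: store each past conversation with a vector and pull the closest matches for a new prompt. The `embed` function here is a toy hashed bag-of-words stand-in (an assumption for illustration, not the commenter's actual setup); swap in a real sentence-embedding model for anything serious.

```python
# Toy embedding-based recall of past conversations (illustrative sketch).
import hashlib
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash each word into one of DIM buckets, then normalize."""
    v = np.zeros(DIM)
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        v[idx] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class ConversationMemory:
    def __init__(self):
        self.texts, self.vecs = [], []

    def store(self, text: str):
        self.texts.append(text)
        self.vecs.append(embed(text))

    def recall(self, prompt: str, k: int = 3):
        if not self.vecs:
            return []
        sims = np.stack(self.vecs) @ embed(prompt)   # cosine similarity (unit vectors)
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

memory = ConversationMemory()
memory.store("We talked about event boundaries and surprise in EM-LLM")
memory.store("User asked for pasta recipes")
print(memory.recall("what did we say about event boundaries and surprise"))
```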
ah sounds like you did a prompt engineering/automation version of this same general (intuition/pattern/structure/idea/methodology), very cool
This sounds pretty simple to implement (at least as this type of paper goes). It would be really useful when writing narrative text simulations. E.g. ... (history of simulation for all characters up to 10:00) ... "What happens between 10:00 and 10:05 from the perspective of <character A>?", ... "What happens between 10:00 and 10:05 from the perspective of <character B>?" ... "Eliminate contradictions" ...
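A rough sketch of the simulation loop that comment describes, assuming a generic `llm(prompt: str) -> str` callable (hypothetical; substitute whatever client you use):

```python
# Per-interval, per-character narrative simulation loop (illustrative sketch).
def simulate_interval(llm, history: str, start: str, end: str, characters: list[str]) -> str:
    perspectives = []
    for name in characters:
        prompt = (f"{history}\n\nWhat happens between {start} and {end} "
                  f"from the perspective of {name}?")
        perspectives.append(f"[{name}]\n{llm(prompt)}")
    merged = "\n\n".join(perspectives)
    # Final pass asks the model to reconcile the per-character accounts.
    return llm(f"{history}\n\n{merged}\n\nEliminate contradictions and write "
               f"the combined events between {start} and {end}.")
```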
this sounds more like efficient RAG-like memories, or a RAG-like successor; I suppose it is a kind of episodic memory, but hmm... it's not using this type of memory to be actively in the now per se. Suppose I'm actually looking for a kind of working memory of sorts.
I feel like continuity and coherence of intent and of tasks/problem solving should be maintained: not just retrieval of past events from the previous inference, but the current inference should also carry the "why" from previous inferences, or some kind of direct "knowledge update" that informs it.
Probably going to be either some kind of autoencoder-like memory-unit-informed inference (like LARIMAR++, but trained for coherence and continuity over time and for knowledge updates, plus storing, retrieving, and properly using them for tasks),
or some kind of stateful, possibly recurrent, complicated system...
True memory, especially episodic memory, is going to be awesome for agents if/when it happens. An inference that doesn't start over every time... one can dream...
Long-term storage of EM-LLM memory segments could probably be managed in a graph structure, similar to vector storage within Neo4j graph databases.
A related development is the release of Falcon Mamba 7B. Apparently, increasing the amount of context included in the prompt does not increase the RAM requirement.
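A minimal sketch of that graph idea: memory segments as nodes, temporal adjacency as edges, retrieval by embedding similarity plus neighbour expansion. This is an in-memory illustration using networkx, not an actual Neo4j integration, and the embeddings are assumed to come from whatever representative vectors you keep per segment:

```python
# Graph-structured memory store with similarity + contiguity retrieval (sketch).
import networkx as nx
import numpy as np

def build_memory_graph(segments):
    """segments: list of (segment_id, text, embedding) tuples in temporal order."""
    g = nx.Graph()
    for seg_id, text, emb in segments:
        g.add_node(seg_id, text=text, emb=np.asarray(emb, dtype=float))
    ids = [s[0] for s in segments]
    for a, b in zip(ids, ids[1:]):        # temporal contiguity edges
        g.add_edge(a, b)
    return g

def retrieve(g, query_emb, k=2):
    query_emb = np.asarray(query_emb, dtype=float)
    def sim(n):
        e = g.nodes[n]["emb"]
        return float(e @ query_emb / (np.linalg.norm(e) * np.linalg.norm(query_emb) + 1e-9))
    seeds = sorted(g.nodes, key=sim, reverse=True)[:k]
    # Pull in temporally adjacent segments too, mimicking contiguity effects.
    hits = set(seeds)
    for s in seeds:
        hits.update(g.neighbors(s))
    return [g.nodes[n]["text"] for n in hits]
```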
I'm not an expert or even competent with graphs & vector storage, but my impression is that that should work and would be a great opportunity to expand the scalability of this technique. Pretty sure the authors vaguely mentioned the potential for significant improvements in that area.
Yes, that's true for any Mamba model, but the problem with those is that the model not only has to decide what info is important (which can be loosely described as a problem of having to predict what info will turn out to be relevant/useful later), but any info deemed unimportant gets permanently lost. In contrast, here the memories not currently in context can be brought back up again if they end up being useful at some future point, meaning the model does not have to guess at what is going to be useful in the future. To be fair, the specifics of this method involve only keeping the surprising memories and actually losing anything outside of the context buffer around said memories, but that could easily be changed by anyone looking to make their own version; they'd just adjust the hyperparameters until practically everything is put into a memory (which would mean a more expensive QK memory lookup, probably not worth it).
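For reference, a minimal sketch of the surprise-based segmentation being described: mark an event boundary whenever a token's negative log-likelihood exceeds a running mean-plus-gamma-times-std threshold. This is a simplified paraphrase of the paper's boundary rule, not the authors' code; `gamma` and `window` are illustrative values.

```python
# Surprise-driven event boundary detection over per-token NLL (sketch).
import numpy as np

def event_boundaries(nll, gamma=1.0, window=128):
    nll = np.asarray(nll, dtype=float)
    boundaries = [0]
    for t in range(2, len(nll)):          # need a couple of tokens of history
        lo = max(0, t - window)
        mu, sigma = nll[lo:t].mean(), nll[lo:t].std()
        if nll[t] > mu + gamma * sigma:   # "surprising" token starts a new event
            boundaries.append(t)
    return boundaries

# Example: the NLL spike at index 5 opens a new segment.
print(event_boundaries([2.0, 2.1, 1.9, 2.0, 2.1, 6.5, 2.1, 2.0], gamma=2.0))  # [0, 5]
```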
Mamba is a state space model; it doesn't work based on attention. Rather, it does something more like learning a differential function that maps the context into a latent space, then integrating that function over long sequence lengths. That's how it scales linearly in context rather than polynomially. Attention, by contrast, compares every token in the sequence to every other token in a giant table.
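A toy illustration of that state-space idea: a fixed-size latent state is carried across the sequence by a linear recurrence, so the cost grows linearly with length instead of quadratically. This is a plain linear SSM for intuition only, not Mamba's selective, input-dependent variant.

```python
# Linear state-space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t (sketch).
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (T, d_in). Returns y: (T, d_out)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # one pass over the sequence: O(T), not O(T^2)
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 4
y = ssm_scan(rng.normal(size=(T, d_in)),
             A=0.9 * np.eye(d_state),
             B=0.1 * rng.normal(size=(d_state, d_in)),
             C=0.1 * rng.normal(size=(d_out, d_state)))
print(y.shape)  # (16, 4)
```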
It's odd they didn't try to accumulate the tokens in an episode and instead chose a single one.
right! and i didn’t look heavily into how they chose said single one, i think it had to do with its “representativeness” according to some metric used in the graph grouping stuff they did to tune the memories. i’d be interested to see ablations and compute comparisons between this single token, a sum & norm, and some more sophisticated pooling mechanism (attention based?)
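A quick sketch of the pooling options being compared, over a segment of token embeddings of shape (T, d): pick one token, sum-and-normalize, or a tiny attention pool. Which single token EM-LLM actually picks follows the paper's representativeness criterion; taking the first token below is just a placeholder.

```python
# Candidate ways to compress one "event" of token embeddings into a single vector (sketch).
import numpy as np

def single_token(seg):                 # placeholder: take the first token
    return seg[0]

def sum_and_norm(seg):
    s = seg.sum(axis=0)
    return s / (np.linalg.norm(s) + 1e-9)

def attention_pool(seg, q):            # q: (d,) query/learned vector
    scores = seg @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ seg                     # convex combination of the segment's tokens

seg = np.random.default_rng(1).normal(size=(32, 64))   # one event of 32 tokens
q = np.random.default_rng(2).normal(size=64)
print(single_token(seg).shape, sum_and_norm(seg).shape, attention_pool(seg, q).shape)
```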
Finally! Hierarchical attention.
Not really if there is only one level of selection
@@deltamico I know, but it's a start
Thanks!
thank you☺️
I don't understand anything, but does this mean LLMs will be able to do more things without needing training or specially created vectors to help them understand what we are trying to do? Coz I could wait for that.
Plz link induction head vid
This needs lots of... visualization... in my mind, and I failed to visualize it, so I failed to understand it.
i do need to get into the habit of pulling out the ipad pencil more often
So it can recommend paragraph, section and chapter breaks? And from that build an index? Finally, an AI boredom graph.
neat
So, technically, the next GPT could have ADHD? And if so, did we just solve the mathematical form of ADHD?
Lol, apparently it's only historically misaligned companies involved, as if tuning is more for censorship and less about letting a congruent line of measure go.
Maybe I just forgot something I said in the video, but I'm curious as to why you drew a relation between this and ADHD. Could you elaborate?
@Tunadorable I don't recall you mentioning chemistry. Lol
Misaligned measurement does exist in this topic, and in public understanding of it; even experts struggle with it.
A lot of it is misaligned:
1. Teaching methodology
2. Poor diagnosis
3. Weak understanding of intelligence, intellectual IQ testing, etc. etc. etc.
@Tunadorable If anything, you're strengthening the evidence that memory is less chemical and more thermodynamic.
If a human doesn't see value in encoding a memory in the first place, it won't ever be remembered or encoded.
No matter the reasons.
@Tunadorable So like, in 3.2 and 3.3 they said something about the positional embedding that could improve the robustness of the model, since it's a fixed positional embedding. Now, instead of a fixed positional embedding, imagine you just have static noise, and sometimes the static repeats in some pattern that can move the tokens around in their latent space (similar to how ADHD can relate seemingly random topics).
And in 3.3 they theorised that the event recalled most efficiently is the one that correlates in some fields, but since our embedding is noise, sometimes the noise moves a token in a way that makes it related (like how "apple" gets moved close to "phone" due to random noise and we get iPhone, or how episodic memory got moved to AI and somehow we got ADHD?). I'm aware that the dimension of the latent space is so large that we can't just deploy numpy.random to move tokens around; we'd need something random but still predictable to some degree to maybe mimic an ADHD brain?
They should change the title from “infinite” context to “unbounded” context, as “infinite context” implies something physically impossible.
ppl do love that word haha
bro who stole my work
I literally own the copyright to the code for this.
lmao relatable
haha i assume simultaneous creation is a real jerk but if you’re being literal about them ripping code from your github i would love to see said code. can’t remember if i checked whether they open sourced theirs for comparison to be able to make that claim
@@Tunadorable yeah definitely, it happens all the time. Super common in AI where the LLMs love to share anything they've picked up and get trained on their own past conversations a lot of the time.