I think we should just ditch Attention as the main information-processing feature entirely. Attention will always require all tokens to be available in memory, so the memory required will always scale linearly with the context size, even if we bring the time complexity of Attention down to O(n) (and that will always imply missing some pairwise token relations, or simply some of the tokens). A smarter replacement would be to use Attention with a smaller window, but let the model "place" the window anywhere it wants in the context, as needed, so the model only needs this subset of tokens in memory. Of course this would require going back to RNNs in order to let the model update the location of the Attention window in the context, and that would increase computation times quite a bit.
Some kind of RNN-Attention composite would be kind of cool, but it's possible that attention is the final feature. A clever enough retrieval system with a vector database or the like might be able to pull off an adequately sophisticated memory system long term.
@@chasebrower7816 RNNs take way longer to train than an equivalently performing Transformer, mostly because attention can be computed in one step, whereas an RNN necessarily needs multiple steps. For RNNs to be viable again I think you need to fix that problem first.
@chasebrower7816 you would still need to make the model learn to use the mechanism for reading and writing from the vector database or the memory system, and that would probably be recurrent anyway. @joeboyle7390 I don't think that is really a problem; there are quite a few methods that have been proposed to make RNN training much more efficient. I imagined one where the model would only require the data of two successive time steps, allowing a lot of parallelism along the batch dimension.
people need to think about why these models work so well; in some ways it's the only true machine learning approach. RNNs are literally just fancy regression analysis, and in hindsight it's hard to believe we relied on least-squared error to make predictions and expected any kind of sophistication. It's important to think of transformers in context. Language is meaning, and rather than word frequency, transformers consider word association. Maybe I'm not explaining that last bit right, but RNNs do not consider the meaning at all, merely where a word belongs in a sentence. Your approach is a little more challenging to put into practice and is what transformers already do. Transformers are actually pretty simple in that they look at the distribution of all tokens in the context and attend to the highest (or around that, depending on temperature), and then again and again. Maybe a dynamic context length? I'm just rambling and talking out of my arse BTW, so forgive me if nothing I'm saying makes sense and is completely wrong, lol.
@@Jay-kb7if I don't think there is any difference in the way meaning is learned in transformers compared to RNNs; they optimise the exact same loss. Both are performing "fancy regression analysis" as you say, they just process the context and retain information differently. I think the issue with RNN-based LLMs is that the state vector is simply too small to store enough relevant information without forgetting it, and that they are difficult to train because of vanishing/exploding gradients. Both of these issues can be solved, and it is important to remember that the human brain is a giant RNN (*not* a transformer), so we know it is possible to make RNNs work.
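For illustration, a toy sketch of the "placeable attention window" idea from the top of this thread, assuming the model (or here, just a function argument) picks where the window sits; all sizes are made up:

```python
import torch

# Instead of attending over the whole context, mask everything outside a small
# window whose start position could in principle be chosen by the model at each
# step (here it is just an argument).
def window_mask(seq_len, window_start, window_size):
    mask = torch.full((seq_len,), float("-inf"))
    mask[window_start:window_start + window_size] = 0.0
    return mask  # added to attention scores before the softmax

scores = torch.randn(1, 100_000)  # one query's scores against a 100k-token context
masked = scores + window_mask(100_000, window_start=42_000, window_size=512)
weights = masked.softmax(dim=-1)
print(int((weights > 0).sum()))   # only the 512 in-window tokens get any weight
```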
What I'm realising over the last few months is that there is ultimately only so much you can do with LLMs. They are very useful and will become even more useful, but in isolation they will always have some limitations. In future (or indeed already) we will have networks of LLMs that work together, and networks of LLMs that decide which LLM to call. The human brain works with symbols; even at the deepest levels of meaning and emotion, it's all symbolic representation of information. Look at savantism/savants. It's almost like they are less finely and/or more finely tuned LLMs. Interesting times...
Obviously bits of information have to be dropped to fit the data into sparser representations. The dropped data might be crucial for the understanding of the whole context. I wonder if the model will be able to direct the attention to the "ground level" when necessary, to obtain and process all relevant details.
Maybe the answer is to tokenize a whole concept. I.e. when I listen to you, I'm not storing every word in my head, I'm filtering for facts and context to form a concept of what you are talking about. So, once you have defined the subject, I store that as a concept and recall it when necessary, not the long waffle getting there. If that whole waffle can be condensed to a single token, you have a vast space opening up. E.g. I only have to say 'Lap-time' for you to be triggered into racing-car mode. Am I right? 8-)
@@MouldySoul well yes, but the point is Lap is the subject (could be lapping at milk, an occupant of Lapland, or your thighs); Time provides context. In your world model that concept leads to hairy trousers, in Harrison's it's hammering a car around a track. It is a shortcut to a place in the model space, from where the model can start navigating towards the next generated token. If the LLM had a way to save and recall a marker, it wouldn't have to navigate all the previous prompts to get back to the current concept. I suppose the real problem is whether such a marker could be made smaller than the array of tokens that led to that position.
what is a concept though? A token shouldn't be seen as a word but as the smallest meaningful unit of information (so, forgetting the actual word, it has its own specific meaning, and in the same context the same word or word segment as one token can be very different).
@@Jay-kb7if see my comments below. I said token because it fits into the input stream like any other token, but this marker token's job is to preset/load the context like a signpost. The pre-prompt gets the model to place-A, your prompt moves it on to place-B, the model navigates to place-C, etc. The idea is that the marker would allow direct access to place-X without having to pass through A-W. As I said in the other comment, it may require the marker to be as large as the sum of tokens that got it there, but if there were a way to compress or shortcut it, there is potential for considerable savings.
I think LongNet should actually do better with this middle-out problem (Silicon Valley), because it's not just doing the additional computations in parallel, it's also the layering: they show a pretty interesting mathematical proof that the layers required for 100% coverage are logarithmic. But I think the more interesting part is that the attention heads themselves can attend to different segments of the graph independently, which should actually solve that middle problem.
Am I missing something? The perplexity score goes down with increasing context size when the batch size is 16… if it continues to go down for larger contexts, doesn't that give us very large context windows without performance drop-off? 12:39
Is it so complicated to make attention iterative though? Like how humans do it: they're aware that something exists, not specifically with all the detail, and if needed they parse it again with a higher level of detail. It's really not that complicated if you make the system dynamic. But then ofc it's RNNs all over again
it would be different to what they do now. I have the same thoughts as you though with dynamic context lengths. Do we really need another iteration of 1 million tokens for highly specific words? It's just going to make 99.99% of them -0.00000000001
I lost my previous comment, so I will split it up. I am working on a code generation evaluation benchmark that will support multiple tasks. And a difficult decision for me is what to allow as model context. And also do I write a variant that works for instruction finetuned models...
Haven't read the research paper regarding remembering information in the middle. But could it be that the stuff in the middle is a lot of "filler" information and therefore not worth remembering? Is it just an inherent property of text that the stuff in the middle is less important than the beginning and end? Not sure
Yup, this is a problem. I think a good attempt is to do what we humans do: incrementally drop irrelevant (= not worth attention) tokens. If you split a 2k attention window into a series of 8×256-token segments, feeding each segment half of the tokens coming out of the previous segment, the "virtual" attention span expands to 256 + 512 + 1024 + ... ≈ 64k tokens.
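Just to sanity-check the arithmetic in that comment, assuming the virtual span roughly doubles with each 256-token segment:

```python
# Each 256-token segment carries forward a compressed half of what the previous
# segment saw, so the "virtual" span roughly doubles per hop; summed over 8 hops
# you land on the ~64k figure quoted above.
segment = 256
spans = [segment * 2**i for i in range(8)]  # 256, 512, 1024, ..., 32768
print(spans)
print(sum(spans))                            # 65280, i.e. ~64k "virtual" tokens
```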
I had this simple idea a while ago to improve attention: just take a normal transformer with a relatively small context and apply it to your whole large context like you would a convolution filter in a CNN, and either by changing the stride or with max pooling or something, reduce the size of your input context. Do that over multiple layers, and you can in theory compress your context, dividing its size by two or four at every step, until it fits in that 2048 window. I wonder if something like this has been tried
@@joeboyle7390 well you replace the filters (simple multiplications) with a whole ass transformer, and have a big transformer at the end instead of the fully connected layer. It's a convolutional transformer
@@YEASTY_COMMIE Aha, I think I see what you're proposing. That sounds like something that people would have experimented with, but if not sounds like an interesting research project!
@@joeboyle7390 every time I have an ML idea, I realize a few months later that it was invented like 2 years ago and was a banger (I thought about something like GANs when I was 16, then realized they had been invented 2 years earlier, same thing happened with ResNets, and a bunch of other ideas). Either that or something similar comes out the next month. Always makes me feel like I missed an opportunity, but on the other hand I probably couldn't have done something that competes with what those teams of researchers produce anyways, so I try to be content with my ideas being vaguely validated
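A rough PyTorch sketch of the "convolutional transformer" idea from this thread, with made-up sizes: a small local encoder is applied window by window, then neighbouring tokens are pooled to halve the sequence, and this repeats until the sequence fits the target context.

```python
import torch
import torch.nn as nn

class ConvTransformerCompressor(nn.Module):
    """Run a small transformer over fixed-size windows of a long sequence,
    pool neighbouring tokens to halve the length, and repeat until the
    sequence fits a target context size. Sizes are illustrative only."""
    def __init__(self, d_model=64, window=128, nhead=4):
        super().__init__()
        self.window = window
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.local_encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x, target_len=2048):
        # x: (batch, seq_len, d_model); seq_len assumed a multiple of window
        while x.size(1) > target_len:
            b, n, d = x.shape
            windows = x.reshape(b * n // self.window, self.window, d)
            encoded = self.local_encoder(windows)             # local attention only
            encoded = encoded.reshape(b, n, d)
            x = encoded.reshape(b, n // 2, 2, d).mean(dim=2)  # stride-2 pooling
        return x

compressor = ConvTransformerCompressor()
long_input = torch.randn(1, 16384, 64)     # a "too long" context
print(compressor(long_input).shape)        # torch.Size([1, 2048, 64])
```

Whether the pooled tokens keep what later layers actually need is the open question, which is basically what the replies above are poking at.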
Could something sorta like a "mipmap" of the context be a possibility, with varying levels of "convolution" (ideally some sort of semantic compression if that's possible), combined with streaming from disk to read individual details at full resolution when needed, sorta analogous to Unreal Engine 5's Nanite?
Do you think liquid neural networks are a marketing move? They seem to be so amazing, but there are almost no GitHub repositories on them. There are some papers here and there. But if they're so revolutionary, why isn't everybody jumping on them?
Very smart and very important indeed. Here are some leads on how to do it:
1: Smart forgetting. GPT-4 seems to have no control over forgetting; it can even tell which info is more important, yet it still loses it when that info sits at the edge of its old token context window, even if only empty text is added. Forgetting the least important tokens first should theoretically increase the density of relevant and important tokens, in effect increasing the relevant context length. Freezing the use of unrelated tokens (i.e. reducing their weight depending on the task) could also help.
2: Sorting and compressing to different degrees of data loss, so context can be re-read/regained from a multitude of context memories sorted in different ways for different purposes. Add RL on top and you have a self-improving system; a mix of hard-coding and learning can increase stability as well as the ability to choose what to use based on self-observation.
3: Dynamic weighting of token importance by attention, based on importance guesses from multiple systems using different metrics and methods, plus meta-systems (small learned decider networks) that choose the percentage of each method depending on results and past experience.
4: (Simple DIY) Just have multiple models that each save some context and then reconstruct it by talking to each other ("Hey, do you know about X? Did the user talk about that?"). Maybe a fine-tuned "memorize facts" network, a "memorize X" network, and so on.
5: Layered categorisation, with zooming in on info that is continually being sorted, etc.
It depends on the use case; understanding the model, and which bottleneck is unlikely to change soon or is getting too little attention, should help in deciding where you can add value. Bonus: self-reminders, or reminders based on context, might be able to re-prompt things outside the context window; the LLM could use this as a kind of universal plugin inside ChatGPT, for example. Weaviate is trying to develop such a plugin, which is in alpha right now; maybe they'd value contributors, since their method in isolation could use help from creative, complementary systems (personally just guessing as to what's under the hood).
I'm thinking: if pretraining is long-term memory, and you could store all the information of a dataset in the weights with perfect memory, it would not be necessary to have long context. Instead you would just "fine-tune" the pretrained model on the 100-page document from your prompt and it would perfectly know the document. In other words, if we overfit the model perfectly during training, and every prompt were a perfectly overfitted fine-tuning, it would solve the problem of short-term memory. The trade-off would then be its reasoning abilities because of the overfitting, but if you have vast amounts of data that could potentially be solved. Perhaps this solution would require more than double-precision weights. I think it is possible with enough data and compute, without altering transformers, to solve AGI. It probably won't happen this way, but it shows that there are many ways to reach it.
Why would LongNet go public if it didn't address those points? Does the sagging attention curve have anything to do with the data? More specifically, what is it empirically related to? If it's the model itself and the calculations, that's one thing; if it's simply a product of the data and the format, that's different. One thing I have noticed is that the "good" data all has a common theme/format. It seems very likely to me that the curve was a learned shortcut. I'm even more convinced of this by the simple inclusion of RLHF. There is a very specific way most people choose to communicate, especially in writing, and the curve you mentioned matches it perfectly. But that is not how educational books or scientific papers are written.
It's kinda crazy that to produce one token, it must pay attention to all of its previous context. If we need to compress information, might we as well just do fine-tuning on the context?
This is imho where the current transformer errs. There's no information gained by comparing some important content later in the document with completely unrelated content in the introduction. We need layered attention that is local to a sentence, paragraph, section/chapter, etc.
They need attention on the level of sentences and sections in a text. It's ridiculous that the whole context is prioritized using only token attention. If we have attention on several layers, we no longer need a big context and could even reduce context size to < 1K for speedier inference. Longer context is NOT the answer.
It will possibly be a human solution. A group of people read a million tokens of text, and the ones with the best comprehension and fastest times could be queried about their technique. I think the Wheel of Time is a good example to try with, at 4.4 million words. The great dictionaries are another, with up to 60 million words, but humans could never read it all, apparently.
"I dilate down to the 1st and last token, so I can do 10^10000 tokens now; it just takes longer than the heat death of the universe to read them in." Is this really useful?
FoT (Focused Transformer) has shown better training with larger context lengths by using positive and negative examples to help with this issue. Check it out and let me know what you think.
Halfway through this video and I feel like I'm watching a Healthy Gamer video on how attention and ADHD work, not a video about AI. I think, even with massively improved hardware, the only solution is to have something like memory plus an information source for the AI to work with (I guess something like the paper said, but I didn't get it since I'm not a science guy). Like a human solving a problem, the AI needs to work with the data to break the task down into chunks it can hold in memory. Split that beginning and end into many more beginnings and ends, like a human working on a todo list involving many research, understanding and execution steps. For this to work, the process would need to move away from running off memory alone to memory + source, as well as creating a specialised checkpoint of the model just for that task.
As stated in the video, there are models that go beyond 16K and 32K. We also see an example from Microsoft that shows you could have 1B tokens. The point is, scaled out, attention just doesn't work well, both in terms of processing time and in the actual output quality you get from that attention.
Splitting attention into segments doesn't make much sense to me. What if in the second segment you needed the context from the first segment to comprehend it?
Wait, so there's O(N^2) complexity when these models process text prompts? Why is there so much hype about GPT-4 while nobody talks about this fact? It's a huge constraint, seriously limiting the capabilities and possible use cases.
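For a sense of scale, a quick back-of-the-envelope script (float32 scores, a single attention head, no tricks):

```python
# Every token attends to every other token, so just the attention-score matrix
# grows with the square of the context length (4 bytes per float32 score).
for n in (1_000, 8_000, 32_000, 128_000):
    scores = n * n
    print(f"context {n:>7}: {scores:>18,} scores  (~{scores * 4 / 1e9:.1f} GB per head)")
```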
All the research trying to do 1-million-token context lengths is crappy; it just removes so much variability and evaluates tokens within a context sparsely or not at all.
Instead of taking every Nth word, maybe some way of only focusing on meaningful words could help. The sentence above would become: "instead every Nth focus meaningful". Although that is still 5 tokens.
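A crude sketch of that filtering idea; the stopword list here is a tiny hand-rolled stand-in for a real one, and actual token savings depend on the tokenizer, so the numbers are only indicative:

```python
# Drop low-content "glue" words before feeding text to the model.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "that", "maybe",
             "some", "way", "could", "and", "or", "but", "for", "with", "only"}

def keep_meaningful(text):
    return " ".join(w for w in text.split() if w.lower().strip(",.") not in STOPWORDS)

sentence = ("Instead of taking every Nth word, maybe some way of only "
            "focusing on meaningful words could help")
compressed = keep_meaningful(sentence)
print(compressed)
print(len(sentence.split()), "->", len(compressed.split()), "words")
```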
Interestingly (or not), every one of the authors of the original Attention Is All You Need paper has since left Google.
Where are they now? Stanford or OpenAI?
@@franklydoodle350 6/8 authors started their own startups. One is at OpenAI. Only one (Llion Jones) is at Google. He is leaving Google later this month to start his own startup.
@@MouliSankarS name of the startups?
@@Kazekoge101 cohere, AdeptAILabs, character_ai, near inc, inceptive
Why are they leaving?
You have my attention.
Good one.
What more does he need.
The parallels to human cognition are really interesting. The "lost in the middle" problem is very much a feature of human memory - you always remember the beginning and end of a sequence best (e.g. the beginning of the book you're reading and the part you've read last, while things in the middle are fuzzier).
Yeah I was thinking the same.
I guess it's some kind of working memory conservation technique
Some research attributes the decay to interference, as there's little time to rehearse before new information is introduced. It would probably look more like a diminishing gradient without interference.
Most of the information in a sequence is "in the middle", i.e. not markedly near the ends.
Recency and primacy effect.
That's because most of the stuff in the middle is either entirely irrelevant filler or filler that just expounds on the actually relevant information or the actual idea being presented. If you are expounding, you aren't doing so so that the person you are talking to can remember that stuff in particular; you are doing it so that they can more easily form their own take on an idea... their own model. Once they have done that, like the training data for modern AIs, it is completely useless unless some other connection to other information that is relevant or intrinsically interesting to the listener can be made (like a funny joke, a fun fact, etc.).
My intuition is that LLMs need to be stateful. That might allow them to pick out relevant information from the input and compress it into their internal representation. Trying to fight the O(N^2) curve for both training and inference isn't gonna lead to much progress. That state could be separable from the core LLM just like the prompt, but the LLM needs to be able to manage it. Kind of like a memory module that you'd pass along with the prompt, but unlike the prompt it isn't converted back to tokens, and the LLM modifies it. Much closer to how humans are able to process entire books' worth of data 'at once'. First internalize, then query. Training something like this would be really hard though.
This is similar to what they're all suggesting (OpenAI etc.), which is to focus on smaller and cleaner data for specific use cases. I feel like the ceiling has been hit with broad models, and it's not like we even need to push that further, because the broad model just needs enough information to know what we mean and to then pipe into the smaller models as you've described. They all suggest it, but all this research seems to have a fetish for some Lord of the Rings-type model to rule them all.
@@Jay-kb7if Any experiments with "intra-layer memory neurons" or "dedicated memory neurons"? For the purpose of remembering previous hidden states, or being able to focus on certain activations more than others? (a little more than just copying old activations into an orthogonal layer or layers)
"Trying to fight the O(N^2) curve for both training and inference" - no
@@Sirmrmeowmeow possibly something like that; it's hard to be specific as it's all theoretical. GPT is a good traffic warden because it has strong inference on language. It really doesn't need to know facts, just what you are trying to do and how to access smaller, specific models with technical information. For instance, I like to dabble in vision stuff, so imagine a single context with the entire OpenCV library/documentation and the most popular GitHub repos as context. It should be pretty good at sourcing any function you want and piecing them together from many scripts. I suspect GPT is probably already doing something like this. This is what OpenAI are promoting people do, but OpenAI are not your friend, so always be suspicious. They are trying to vendor-lock as many businesses as possible by having their infrastructure configured to their APIs, and this proposed solution they promote is also a way to retain GPT as an overarching traffic warden.
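A minimal PyTorch sketch of the "separable memory state" idea from the top comment, in the spirit of recurrent-memory transformers; all names and sizes here are invented for illustration (and the causal mask is omitted to keep it short):

```python
import torch
import torch.nn as nn

class StatefulLM(nn.Module):
    """Reads a chunk of tokens plus a small bank of memory vectors and returns
    both token predictions and an updated memory bank, which is passed to the
    next call instead of re-feeding old tokens."""
    def __init__(self, vocab=1000, d_model=64, n_mem=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.init_mem = nn.Parameter(torch.randn(n_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)
        self.n_mem = n_mem

    def forward(self, tokens, memory=None):
        b = tokens.size(0)
        if memory is None:
            memory = self.init_mem.expand(b, -1, -1)
        x = torch.cat([memory, self.embed(tokens)], dim=1)  # [memory | chunk]
        h = self.encoder(x)
        new_memory = h[:, :self.n_mem]         # updated state, never detokenized
        logits = self.head(h[:, self.n_mem:])  # predictions for the chunk
        return logits, new_memory

model = StatefulLM()
memory = None
book = torch.randint(0, 1000, (1, 4 * 128))    # a "long" input, fed in chunks
for chunk in book.split(128, dim=1):
    logits, memory = model(chunk, memory)       # state carried across chunks
```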
I don't think larger context lengths are necessarily what people want. People want to process longer sequences, and there are other ways to do that, namely memory. Current memory methods leave a lot to be desired, but they are linear in time, and humans manage to do it.
Currently, memory for LLMs is just string concatenation of previous inputs or a vector search for relevant terms.
Both are really useless when conversations grow longer; you cut off information one way or another.
ChatGPT blows my mind and infuriates me at the same time when it spits out completely whack responses
I really don't like RNNs, but my intuition says that to solve the context problems we will need to go back to some sort of RNN. It just looks crazy to me that we are feeding these models entire books in one shot and expecting them to be able to answer all questions about them with 100% accuracy.
RWKV seems interesting tbf, based around an RNN, and riffing off Apple's Attention Free Transformer paper.
I could be wrong cos I'm the dumbest guy in the room here, but to me it seems like a conflict between data aggregation and data weighting. RNNs seem like a purely weighted approach, and transformers an aggregating approach with some weighting too, in a very particular way if at all. I think it can be easily misleading to think they both weight tokens, but transformers to me (and again, I'm stupid) seem to continually adjust positions with the more information they're given. Like a moving average. Thought of a different way: forget the actual word and consider its underlying meaning, which becomes more defined based on its position relative to the meaning of other tokens. RNNs are purely probabilistic in an outcome, like cramming for tomorrow's math test by rote learning. Done well enough, you can cite the exact phrase you repeated ad nauseam. Transformers, on the other hand, are constantly having to reorient what each token means, so they might "fence-sit" comfortably between several "correct" answers, and will always lack that precision.
It is like working with a person who is not thinking very hard but is very smart: asking them about details can result in small errors like numbers, or the answer can just be wrong if they put little thought into it. So you need to ask it to consider the answer it is giving. We do a lot on autopilot from System 1 that is similar to chat, so we should be able to give it larger amounts of context if we reduce the detail except on what we want it to do, and force it to give the needed consideration to what we are looking for.
To add to this, and I hope you read this because I think about this as much as you do: Hinton is big on negative data, and cross-entropy is also not just looking at what is high in attention but gets the confidence for low attention. If they do not assess what has low attention, because they simply do not bother to evaluate all tokens in a context, then it's not going to appropriately stratify tokens within a context.
are we going to be getting any more videos in the neural networks from scratch series?
I solved it.
What we need is to do, is make another AI that simplifies all the context info and then makes a TikTok-style video for the actual model to process and use in generating an actually good answer.
Okay, I just finished the video and these are my thoughts.
Yeah, the O(n^2) nature of attention in transformers is really what’s holding the tech back at this point. If we could somehow get that into even a linear complexity that would open up so many doors of possibilities for LLMs, context length, etc.
I see a lot of people trading space in the form of vector db embeddings as a way to offset the problem without completely addressing it, which works to some extent for long term use cases, but ultimately doesn’t make the problem go away. At the end of the day we’re all essentially needing to chunk things at some level of the LLM pipeline.
Ultimately, I do think a breakthrough with the architecture is possible, especially if we go down the route of trying to scale these models horizontally instead of vertically with techniques like MoE from OpenAI.
I think once we get to the point where we have tiny LLMs automated together with a Kubernetes-like software for managing tiny neural networks we’ll be in better shape.
It feels like embeddings aren't all that special and just wrap around the initial prompt. The process of generating embeddings and having an encoded embedding table through OpenAI's API is no different to what they would do anyway with normal text prompts. It's just to sound fancy.
I completely agree with you
Context almost needs to be stored in a tree architecture instead of a single 1-D line
My second comment was about the overall architecture of the whole model. Do we need the full width of the context length all the way up? Or can you simply make higher layers narrower, somewhat like a pyramid scheme? The one output that matters is either a single CLS token at the front or a probability of the next token near the end. Maybe you just have small transformers and then chain them with LSTMs or something.
What's interesting is that if you go for an interview, they say to be either the first or the last to interview; by doing that you're going to be remembered best. It's weird that this curve happens at the beginning and the end of the context. It makes you wonder how close we are to real human thought.
I wonder if this has to do with an inherent bias in the dataset. Think of any kind of "completed" written text (book, comment, function); I would venture to say that all of these have the most relevant information at the start and at the end.
ima spit out a billion tokens next interview till they hallucinate
@@janbiel900 I think that's exactly what's happening. For predicting one of the last tokens it needs to read the end of the text, and to understand the task it needs to read the beginning. Super obvious in my opinion.
Primacy and recency effect
Aren't the models trained on human conversations, which inherently contain this phenomenon? Shouldn't we expect them to reproduce the patterns they are trained on?
I am happy that as you explained your thoughts on the new attention mechanism they were similar to my thoughts. So I feel reassured that my understanding of it is not total nonsense.
Good to see you back with amazing content
I agree with a lot here; I spent a lot of my working time in the past four years on researching extending the context window. In the end, our solution (I am a co-founder at a startup called Neuralfinity) is indeed a redesigned attention mechanism. I sadly can't reveal how we did it, but we will release a paper at the end of the year / beginning of next, when our next-generation LLM is ready.
Looking forward to it!
arxiv?
Nice informative summary! I've lately been doing structured data mining from chemistry papers with LLMs, and I am not unhappy with the map-reduce hackery on the OpenAI 4k ChatGPT model. In fact I tried feeding the full paper to the 16k models, and the results were far worse. I found the sweet spot for the chunk size fed to the model, for the best extraction, to be around 3k. Some recurrent hybridization, and a differentiably trained retrieval index to automate all this map-reduce and nearest-neighbour-embedding hackery, look like the low-hanging fruit for improvement to me.
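For anyone curious, a bare-bones sketch of that map-reduce extraction loop; `call_llm` is a stand-in for whatever chat-completion client you use, the character-based chunking is a rough token approximation, and the ~3k chunk size is just the commenter's empirical sweet spot:

```python
def chunk_by_tokens(text, max_tokens=3000, approx_chars_per_token=4):
    # crude token estimate: ~4 characters per token
    step = max_tokens * approx_chars_per_token
    return [text[i:i + step] for i in range(0, len(text), step)]

def map_reduce_extract(paper_text, call_llm):
    partials = []
    for chunk in chunk_by_tokens(paper_text):                      # map
        partials.append(call_llm(
            "Extract all compounds and measured yields as JSON:\n" + chunk))
    return call_llm(                                               # reduce
        "Merge these partial extractions, removing duplicates:\n"
        + "\n".join(partials))
```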
Had no idea about the U-shaped attention problem, but I've definitely come across it. That valley is where GPT's hallucinations live, and thrive.
I think something that *might* work is if you took a Mixture Of Experts approach, where each expert has a different attention dilation map.
Probably not ideal for computer architecture (which wants powers of 2) but at least in principle, it might make sense to choose each expert with a dilation factor that's a prime number, so you get nice phase coverage across a wide area of tokens.
Of course that also means you need more memory for each such expert.
But if you have like 8k tokens for each expert, where one sees every single one of the first 8k tokens, one sees the first 4k and then 2k worth of every second token and 1k worth of every fourth and so on, and another expert dilates in steps of three, and five, and seven - you probably get fairly dense coverage even at fairly high numbers.
Or alternatively, you could just stick to powers of 2 but add a "phase shift" between experts so they see "the other half" or "the other three quarters" etc.
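A toy script for the dilation maps described above, just to see how much of a long context a few prime-dilated experts jointly cover; this is index bookkeeping only, not a working MoE layer, and the prefix/sequence sizes are arbitrary:

```python
# Each expert reads the first 8k tokens densely, then every d-th token beyond
# that, with a different prime dilation d per expert.
def expert_indices(seq_len, dense_prefix=8_000, dilation=2):
    return set(range(min(dense_prefix, seq_len))) | set(range(dense_prefix, seq_len, dilation))

seq_len = 64_000
experts = {d: expert_indices(seq_len, dilation=d) for d in (2, 3, 5, 7)}
union = set().union(*experts.values())
for d, idx in sorted(experts.items()):
    print(f"dilation {d}: sees {len(idx):,} of {seq_len:,} tokens")
print(f"jointly covered: {len(union):,} tokens ({len(union) / seq_len:.0%})")
```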
MoE has been proven to be dogshit
These issues remind me of problems from operating systems design. Maybe a few concepts from OS design might be thought-provoking. Swap space is an obvious memory-management technique that might be useful when RAM is limited but a need for larger amounts of memory exists. In the vein of "how long does it run", maybe thinking about how OS design uses context switching could be useful. Just throwing out some food for thought. Got to get those creative juices flowing!
Might be a dumb question, but could we use an LLM to recursively summarize the conversation context over the course of the conversation, and use that summary as the context for a given prompt? Basically just as the conversation progresses, a background process would create a summarized, and therefore a sort of lossy compressed version of the conversation. Obviously might not be the most efficient but maybe a cool idea.
It suffers the same issue of reduced context length, essentially providing less information (albeit more acute) to generate responses, but it seems very plausible to me, but I am dumb. Likely GPT is already doing this stuff though.
yes it's possible, but answers will be dumb sometimes. LangChain has a summarization chain which can be used for your task
@Jay-kb7if true, I'm sure they're working on ways to compress the context themselves. As for the problem, very true that it will reduce the information that it pulls from, however, I'm thinking that there could be different modes. A mode that would make a short summary of every individual message so far, with the goal of being able to understand what has been discussed in the conversation for long periods of time. And a mode that will simply generate a couple paragraphs explaining the essence of the conversation so far, preserving key points and important phrases that were uttered. Different compression modes may yield different results. We'll see though, if I make a repo I'll link it here.
@kasvith true, but to be fair they're already dumb sometimes 🤣
@@aidantilgner even with a small context they are dumb
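A rough sketch of the recursive-summary idea from this thread; `call_llm` stands in for whatever model you call, and the prompts and word budget are only illustrative:

```python
def update_summary(summary, new_messages, call_llm, max_words=300):
    # Fold the latest messages into a running, lossy summary of the conversation.
    prompt = (
        "Current summary of the conversation:\n" + (summary or "(empty)") + "\n\n"
        "New messages:\n" + "\n".join(new_messages) + "\n\n"
        f"Rewrite the summary in at most {max_words} words, keeping key facts, "
        "decisions and important phrasing."
    )
    return call_llm(prompt)

def chat_turn(user_msg, summary, call_llm):
    # Answer using the compressed summary as context, then refresh the summary.
    reply = call_llm(
        "Conversation so far (summarized):\n" + (summary or "(new conversation)")
        + "\n\nUser: " + user_msg + "\nAssistant:"
    )
    summary = update_summary(summary, ["User: " + user_msg, "Assistant: " + reply], call_llm)
    return reply, summary
```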
We are starting to reach a peak in performance. The differences will start to be 1-2% per year moving forward, till something entirely new comes along. Maybe fusion models and transformer mixes. New datasets, more data, better compute units, deeper models, larger models. That's gonna be the game, till the technology saturates.
What I envision is large foundation models spinning up volatile sub-LLMs, generating a training regimen and an abstracted fitness function, as well as a goal and a directive to spend p amount of system power and t amount of time on RLHF (not human, but you know) and to return the results of those fine-tuned models.
They also used post-norm instead of pre-norm for the attention, which is the same implementation as the original transformer architecture, but not what state-of-the-art GPTs use (which is pre-norm). This can affect performance, since post-norm models need to be trained for longer than pre-norm models before they reach similar accuracy. Because they didn't reveal exactly how long they trained the models for, this may not be quite reflective of real-world use.
f*** sake, really, I've been doing post-norm, I didn't realise it was slower to train ffs
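For reference, a minimal PyTorch sketch of the two layouts being discussed; the only difference is where the LayerNorm sits (toy dimensions, feed-forward sub-layer omitted):

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original-Transformer style: LayerNorm AFTER the residual add."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        return self.norm(x + self.attn(x, x, x)[0])

class PreNormBlock(nn.Module):
    """GPT-style: LayerNorm on the sub-layer input, residual path kept clean."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h)[0]

x = torch.randn(2, 16, 64)
print(PostNormBlock()(x).shape, PreNormBlock()(x).shape)
```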
Thanks for the video and for explaining why it's difficult to scale context size
You can upload scientific documents to Code Interpreter. The document size limit is 100 MB. I loaded a whole book into it, and it was able to answer questions for me.
I'm not gonna pretend to understand any of this. But it sounds like we are pushing up against the limits of processing information without it being compressed first... is that right?
I know we aren't computers, and computers aren't us - but we have various levels at which we process information, top level all the way down to the unconscious level.
Are we missing the equivalent with our current tools?
I will never forget the abandoned neural network from scratch project. Some of the best content on this channel but never finished.
The last video (P.9) is chapter 6, which was a year ago. I have the book. Guess we'll just have to do it the hard way and read it all!
Yeah, you and his wallet. Just wondering, did you pledge monthly support to him, or are you one of those people who feel entitled to everything for free?
@@avi7278 lol easy dude no one tryna fight here. He's charging 100 dollars for the book and claimed the videos would be part of the package.
@@avi7278 plus I don't understand where u see the entitlement. I expressed an opinion. I didn't demand anything.
@@CEOofTheHood Dude, you full well know the e-book is $29.00. That's more than a reasonable price for the content.
Love your videos! What do you use to keep track of papers? You mentioned you have a tool that summarises research papers for you
Have you seen the new paper about long convolutions and Toeplitz matrices? I didn't quite get the Toeplitz matrix thing but it sounded interesting
It seems to me that it would be surprising if it were not like that:
Since the very inception of the AI field (e.g. Rosenblatt's perceptron), these systems have been modeled after the human nervous system and trained on human-generated data, so it seems pretty natural that, at least at a high level, they would display human-psychology-like phenomena.
'Tis a great video. It's quite a task to put into context everything that was ever written, but I feel like, with the correct architecture and enough processing, you will find... I guess it's just Gaussian fields of ever-rising dimensions of what works in which situation. But if we have the ability to actually question it well enough, we could evolve our linguistic capabilities as well. I for one would love a book that lists the best allegories for all situations.
Maybe larger context is all we need for even better LLMs. I was thinking that integrating an RNN layer within the Transformer architecture could help achieve this. For example, the input is split into 8k chunks, each one passes through the first attention layer, then the outputs are concatenated and passed through the RNN, and this is repeated again and again until the end, where everything is passed to the dense layer. In this case we get the performance of full attention within each chunk, plus the RNN's ability to process the very long output representation. What do you think?
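Something like this, perhaps; a rough PyTorch sketch of that chunked-attention-plus-RNN hybrid with made-up sizes (a GRU standing in for the RNN, and only one attention/RNN round shown):

```python
import torch
import torch.nn as nn

class ChunkAttentionRNN(nn.Module):
    """Full attention inside each chunk, then a GRU over the concatenated chunk
    outputs so information can flow across chunks, before a final head."""
    def __init__(self, d_model=64, chunk=512, nhead=4):
        super().__init__()
        self.chunk = chunk
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.local_attn = nn.TransformerEncoder(layer, num_layers=1)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        outs = [self.local_attn(piece) for piece in x.split(self.chunk, dim=1)]
        h, _ = self.rnn(torch.cat(outs, dim=1))  # the RNN ties chunks together
        return self.head(h)

model = ChunkAttentionRNN()
print(model(torch.randn(1, 4096, 64)).shape)     # torch.Size([1, 4096, 64])
```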
I think the easiest way to improve the context length situation for now would be a compression of the input tokens. Eg to solve most math problems, you will NOT need the prose or syntax related information of the question. That's just baggage to the AI. So ideally we could design a maximally dense language, which purely contains factual statements, no fluff words, no articles, no superfluous prepositions, etc. We could then convert a user's input into this intermediate language, process and generate output in that intermediate language, and then convert it back to English.
Sure, it would sound dry and boring, but we don't need author level prose in say our meta analysis of papers.
This way we could easily double our context length, and likely improve accuracy along the way.
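A very crude approximation of that idea is to strip function words before the text ever reaches the model; a real intermediate language would need far more than this, but as a sketch (the stop-word list is just an illustrative handful):

# Hypothetical pre-processor: drop "fluff" tokens before sending text to the model.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "that", "and", "in",
             "it", "this", "be", "for", "on", "with", "as"}

def densify(text: str) -> str:
    kept = [w for w in text.split() if w.lower().strip(".,!?") not in STOPWORDS]
    return " ".join(kept)

prompt = "What is the sum of the first ten positive integers, and is it even?"
print(densify(prompt))
# -> "What sum first ten positive integers, even?"

Converting the answer back into fluent English would be the second half of the pipeline.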
Exactly, I also think that gathering useful information after all those distributed attention mechanisms is kinda hard, or impossible. How will the model know which attention outputs were the most important? I think generalizing that will be super hard.
Possibly, if there were better pre-processing, and maybe even some models in front of this big model which separate the semantics of the input and route by semantics, then delegating each input to a specific attention block responsible for that semantic area could lead to some generalization of the model in the end.
Part 10 of Neural Net from Scratch, about analytical derivatives??? Please bring the series back!
Recurrent Memory Transformers and RWKV got up to 1 million tokens. Magic LTM-1 manages 5 million.
They had some pretty interesting optimizations for getting around some of these problems too
Probably has to be the first video where I'm not even slightly annoyed by the ad at the end. Neural nets from scratch for the win, I'll definitely have a dig there thank you!!
I don't know if it is already possible but I think it is time to start using quantum computing for those kinds of things.
Another alternative is to use different architectures, like RMT (recurrent memory transformers, the paper proposes 1M tokens), or GNNs (maybe better, but they will also consume a lot of resources), or LongNet (1 billion tokens). But independent of architecture, I notice most models are not well optimized for GPU use; I've seen models with the same number of params but very different memory usage. So I believe, for a start, there are 3 options that could help:
1 - Improve models for better resource utilization.
2 - Maybe migrate to a faster, less resource-hungry language like C++, Rust, or even Go.
3 - Or, to avoid migrating to another language, the community could come together and help improve Python performance.
Love your content, and got your Machine Learning PDF - awesome stuff good sir. Do you have any local LLM recommendation to help with programming, or a cost-effective way of doing it via something like ChatGPT?
Are you going to continue the neural network from scratch series ? :(
I really think we need to emulate attention at the hardware level. And by this I don't mean an accelerator that operates at the instruction level, but at the architecture level. I don't think there is any other workaround, and what I don't understand is why bigger companies haven't invested in developing this sooner.
I think we should just ditch Attention as the main information-processing feature entirely. Attention will always require all tokens to be available in memory, so the memory required will always scale at least linearly with the context size, even if we bring the time complexity of Attention down to O(n) (and that will always imply missing some pairwise token relations, or simply some of the tokens). A smarter replacement would be to use Attention with a smaller window, but let the model "place" the window anywhere it wants in the context, as needed, so the model only needs this subset of tokens in memory. Of course this would require going back to RNNs in order to let the model update the location of the Attention window in the context, and that would increase computation time quite a bit.
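Roughly what I'm picturing, as a toy sketch in PyTorch (all names are mine, and the window-placement scheme is just the simplest thing that could work, not anything from a paper):

import torch
import torch.nn as nn

class MovableWindowAttention(nn.Module):
    """Toy version of the idea: a recurrent controller predicts where to place a
    fixed-size attention window over a long context, so only that window's tokens
    are ever attended to."""
    def __init__(self, d_model=128, n_heads=4, window=256):
        super().__init__()
        self.window = window
        self.controller = nn.GRUCell(d_model, d_model)   # recurrent state
        self.to_pos = nn.Linear(d_model, 1)              # predicts window centre in [0, 1]
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query, context, state=None):
        # query: (batch, d_model), context: (batch, ctx_len, d_model)
        state = self.controller(query, state)
        ctx_len = context.size(1)
        centre = torch.sigmoid(self.to_pos(state)) * ctx_len        # (batch, 1)
        start = centre.long().clamp(0, max(ctx_len - self.window, 0))
        windows = torch.stack([context[b, s:s + self.window]
                               for b, s in enumerate(start.squeeze(1).tolist())])
        out, _ = self.attn(query.unsqueeze(1), windows, windows)    # attend inside window only
        return out.squeeze(1), state

m = MovableWindowAttention()
q, ctx = torch.randn(2, 128), torch.randn(2, 10_000, 128)
out, state = m(q, ctx)          # memory for attention scores is window-sized, not ctx-sized

The hard (unsolved) part is training the window placement, since the indexing step isn't differentiable as written.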
Some kind of RNN-Attention composite would be kind of cool, but it's possible that attention is the final feature. A clever enough retrieval system with a vector database or the like might be able to pull off an adequately sophisticated memory system long term.
@@chasebrower7816 RNNs take way longer to train than an equivalently performing Transformer, mostly because attention can be computed in one step, whereas an RNN necessarily needs multiple steps. For RNNs to be viable again I think you need to fix that problem first.
@chasebrower7816 you would still need to make the model learn to use the mechanism for reading and writing from the vector database or the memory system, that would probably be recurrent anyways.
@joeboyle7390 I don't think that is really a problem, there are quite a few methods that were proposed to make RNN training much more efficient. I imagined one where the model would only require the data of two successive time steps, allowing a lot of parallelism along the batch dimension.
People need to think about why these models work so well; in some ways it's the only true machine learning approach. RNNs are literally just a fancy regression analysis and, in hindsight, it's hard to believe we relied on least squared error to make predictions and expected any kind of sophistication. It's important to think of transformers in context. Language is meaning, and rather than word frequency, transformers consider word association. Maybe I'm not explaining that last bit right, but RNNs don't consider the meaning at all, merely where a word belongs in a sentence. Your approach is a little more challenging to put into practice and is sort of what transformers already do. Transformers are actually pretty simple in that they look at the distribution over all tokens in the context and attend to the highest (or around that, depending on temperature), then again and again. Maybe a dynamic context length? I'm just rambling and talking out of my arse BTW, so forgive me if nothing I'm saying makes sense and is completely wrong, lol.
@@Jay-kb7if I don't think there is any difference in the way meaning is learned in transformers compared to RNNs; they optimise the exact same loss. Both are performing "fancy regression analysis", as you say; they just process the context and retain information differently. I think the issue with RNN-based LLMs is that the state vector is simply too small to store enough relevant information without forgetting it, and that they are difficult to train because of vanishing/exploding gradients. Both of these issues can be solved, and it is important to remember that the human brain is a giant RNN (*not* a transformer), so we know it is possible to make RNNs work.
Looks like the same issue image data had before convolutions arrived.
Do you have any sources/links to further research the topic of attention's U shaped graph?
What I've realised over the last few months is that there is ultimately only so much you can do with LLMs. They are very useful and will become even more useful, but in isolation they will always have some limitations. In future (or indeed already) we will have networks of LLMs that work together, and networks of LLMs that decide which LLM to call. The human brain works with symbols; even at the deepest levels of meaning and emotion, it's all symbolic representation of information. Look at savantism/savants. It's almost like they are less and/or more finely tuned LLMs. Interesting times...
Obviously bits of information have to be dropped to fit the data into sparser representations. The dropped data might be crucial for the understanding of the whole context. I wonder if the model will be able to direct the attention to the "ground level" when necessary, to obtain and process all relevant details.
Can't models use distributed GPU for inference? I thought that this is already implemented in some frameworks...
Better attention is all I need.. ain't that the truth!
Stop watching shorts 😡
Maybe the answer is to tokenize a whole concept. I.e. when I listen to you, I'm not storing every word in my head, I'm filtering for facts and context to form a concept of what you are talking about. So, once you have defined the subject, I store that as a concept and recall it when necessary, not the long waffle getting there. If that whole waffle can be condensed into a single token, you have a vast space opening up.
E.g. I only have to say 'Lap-time' for you to be triggered into racing car mode. Am I right? 8-)
Lap time sounds like something you'd say to your dog. "Time for some lap time buddy"
@@MouldySoul Well yes, but the point is 'Lap' is the subject (could be lapping at milk, an occupant of Lapland, or your thighs); 'Time' provides the context. In your world model that concept leads to hairy trousers, in Harrison's it's hammering a car around a track. It is a shortcut to a place in the model space, from where the model can start navigating towards the next generated token. If the LLM had a way to save and recall a marker, it wouldn't have to navigate all the previous prompts to get back to the current concept.
I suppose the real problem is whether such a marker could be made smaller than the array of tokens that led to that position.
What is a concept, though? A token shouldn't be seen as a word but as the smallest meaningful unit of information (so, forgetting the actual word, a token has its own specific meaning, and even in the same context the same word or word segment as one token can be very different).
@@Jay-kb7if See my comments below. I said token because it fits into the input stream like any other token, but this marker token's job is to preset/load the context like a sign-post. The pre-prompt gets the model to place A, your prompt moves it on to place B, the model navigates to place C, etc. The idea is that the marker would allow direct access to place X without having to pass through A-W. As I said in the other comment, it may require the marker to be as large as the sum of the tokens that got it there, but if there were a way to compress or shortcut it, there is potential for considerable savings.
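The closest existing handle I know of to that kind of marker is the key/value cache: you can snapshot it after processing a prompt and later continue from it without re-reading the earlier tokens. Rough sketch with the Hugging Face transformers API (gpt2 is just an example model):

# "Bookmark" a point in the conversation by saving the KV cache,
# then resume from it later without re-reading all the earlier tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Lap-time analysis: the driver braked late into turn 3 and"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, use_cache=True)
marker = out.past_key_values          # the "marker": cached keys/values so far

# Later: continue from the marker, feeding only the new tokens.
new_ids = tok(" lost two tenths", return_tensors="pt").input_ids
with torch.no_grad():
    out2 = model(new_ids, past_key_values=marker, use_cache=True)

The catch is that this cache is not smaller than the tokens that produced it, which is exactly the compression question you raise above.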
it's quadratic (n^2), not exponential (a^n)
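A short illustration of where the n^2 comes from (plain PyTorch, sizes arbitrary):

import torch

n, d = 4096, 64
q = torch.randn(n, d)
k = torch.randn(n, d)
scores = q @ k.T                      # shape (n, n): doubling n quadruples this matrix
print(scores.shape, scores.numel())   # torch.Size([4096, 4096]) 16777216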
I think LongNet should actually do better with this middle-out problem (Silicon Valley). It's not just doing the additional computations in parallel, it's also the layering: they show a pretty interesting mathematical proof that the number of layers required for 100% coverage is logarithmic. But I think the more interesting part is that the attention heads themselves can attend to different segments of the graph independently, which should actually solve that middle problem.
I also agree with @talis1063's comment that internal state is likely important to make concepts spatially invariant.
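For anyone curious what "dilated" means concretely, here's my own rough sketch of the token selection (a simplification, not the LongNet code): split the sequence into segments and keep every r-th token within each, so the cost per segment drops, while different heads using different offsets cover the tokens the others skip.

import torch

def dilated_indices(seq_len, segment_len, dilation, offset=0):
    """Indices one dilated-attention pattern would attend over: within each
    segment of `segment_len`, keep every `dilation`-th token, starting at
    `offset`. The union of patterns across heads covers every token."""
    idx = []
    for seg_start in range(0, seq_len, segment_len):
        seg_end = min(seg_start + segment_len, seq_len)
        idx.extend(range(seg_start + offset, seg_end, dilation))
    return torch.tensor(idx)

print(dilated_indices(seq_len=16, segment_len=8, dilation=4, offset=0))
# tensor([ 0,  4,  8, 12])
print(dilated_indices(seq_len=16, segment_len=8, dilation=4, offset=1))
# tensor([ 1,  5,  9, 13])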
Am I missing something? The perplexity score goes down with increasing context size when the batch size is 16... if it continues to go down for larger contexts, doesn't that give us very large context windows without performance drop-off? 12:39
Is it so complicated to make attention iterative, though?
Like how humans do it: they're aware that something exists, not necessarily with all the detail, and if needed they parse it again at a higher level of detail.
It's really not that complicated if you make the system dynamic.
But then of course it's RNNs all over again.
It would be different to what they do now, though. I have the same thoughts as you about dynamic context lengths. Do we really need another iteration of 1 million tokens of highly specific words? It's just going to make 99.99% of them -0.00000000001.
I lost my previous comment, so I will split it up.
I am working on a code generation evaluation benchmark that will support multiple tasks. And a difficult decision for me is what to allow as model context. And also do I write a variant that works for instruction finetuned models...
Haven't read the research paper about remembering information in the middle, but could it be that the stuff in the middle is a lot of "filler" information and therefore not worth remembering?
Is it just an inherent property of text that the stuff in the middle is less important than the beginning and end ? Not sure
Yup, this is a problem. I think a good attempt is to do what we humans do: incrementally drop irrelevant (= not worth attention) tokens. If you split a 2k-span window into a series of 8×256-token segments, feeding into each segment half of the tokens coming out of the previous segment, the "virtual" attention span expands to 256 + 512 + 1024 + ... ≈ 64k tokens.
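Quick sanity check on that arithmetic in plain Python (assuming each segment's effective reach doubles because half of its input summarizes everything before it):

segment = 256
segments = 8

span = 0
reach = segment
for _ in range(segments):
    span += reach     # this segment effectively "sees" twice as far back as the last
    reach *= 2
print(span)           # 65280, i.e. ~64k tokens of virtual attention span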
I had this simple idea a while ago to improve attention, just take a normal transformer, with like a relatively small context, and apply it to your whole large context like you would with a convolution filter in a CNN, and either by changing the stride or with max pooling or something, reduce the size of your input context. Do that over multiple layers, and you can in theory compress your context, divide its size by two or four at every step, until it fits in that 2048 window. I wonder if something like this has been tried
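Something like this is what I mean, as a toy sketch (window and pooling sizes are arbitrary, and it assumes the sequence length divides evenly):

import torch
import torch.nn as nn

class TransformerPoolLayer(nn.Module):
    """Toy 'convolutional transformer' layer: run a small transformer over fixed
    windows of the sequence, then pool within each window so the sequence length
    halves at every layer."""
    def __init__(self, d_model=128, n_heads=4, window=64, pool=2):
        super().__init__()
        self.window, self.pool = window, pool
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=1)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, s, d = x.shape
        x = x.reshape(b * (s // self.window), self.window, d)
        x = self.encoder(x)                     # full attention inside each window only
        x = x.reshape(b, s, d)
        # average-pool pairs of tokens -> sequence length halves
        return x.reshape(b, s // self.pool, self.pool, d).mean(dim=2)

x = torch.randn(1, 4096, 128)
print(TransformerPoolLayer()(x).shape)          # torch.Size([1, 2048, 128])
# stack a few of these until the sequence fits a normal 2048-token transformer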
That just sounds like a convolutional network to me, how is it different?
@@joeboyle7390 well you replace the filters (simple multiplications) with a whole ass transformer, and have a big transformer at the end instead of the fully connected layer. It's a convolutional transformer
@@YEASTY_COMMIE Aha, I think I see what you're proposing. That sounds like something that people would have experimented with, but if not sounds like an interesting research project!
@@joeboyle7390 every time I have an ML idea, I realize a few months later that it was invented like 2 years ago and was a banger (I thought about something like GANs when I was 16, then realized they had been invented 2 years earlier, same thing happened with ResNets, and a bunch of other ideas). Either that or something similar comes out the next month. Always makes me feel like I missed an opportunity, but on the other hand I probably couldn't have done something that competes with what those teams of researchers produce anyways, so I try to be content with my ideas being vaguely validated
@@YEASTY_COMMIE This is what I want to be. Wanna swap brains?
Could something sorta like a "mipmap" of the context, with varying levels of "convolution" (ideally some sort of semantic compression, if that's a possibility), combined with streaming from disk to read individual details at full resolution when needed (perhaps somewhat analogous to Unreal Engine 5's Nanite), be a possibility?
Do you think Tesla's dojo will enable building much larger models? Maybe not initially, because it will be used just for Tesla needs, but in general.
The perceiver model is not a potentially viable solution?
This is a good video, thank you very much.
Attention scales quadratically, not exponentially. Other than that, great video!
@04:30 Bidens' auto prompt?
I was thinking of extending nnfsip to wrap each attention and plug them into the context(s)?
...
Do you think liquid neural networks are a marketing move? They seem so amazing, but there are almost no GitHub repositories on them. There are some papers here and there. But if they're so revolutionary, why isn't everybody jumping on them?
I think multi-query attention works fine if you're trying larger contexts, but yes, the current attention needs to change.
Very smart and very important indeed. Here are some leads on how to do it:
1: *Smart forgetting*. GPT-4 seems to have no control over forgetting: it can even tell which info is more important, yet still loses it, even when only empty text is added, if the important info sits at the edge of its old token context window. Forgetting the least important tokens first should theoretically increase the density of relevant and important tokens, in effect increasing the relevant context length. Freezing the use of unrelated tokens, i.e. reducing their weight depending on the task, could also help. (A toy sketch of this follows at the end of this comment.)
2: Sorting and compressing to different degrees of data loss, for re-reading/regaining context from a multitude of context memories sorted in different ways for different purposes. Add RL on top and you have a self-improving system; a mix of hard-coding and learning can increase stability, as well as the ability to choose what to use based on self-observation.
3: Dynamic weighting of token importance by attention, based on guesses of importance, with multiple systems that do this via different metrics and methods, and meta-systems that choose the percentage of each method depending on results and past experience (small decider learning networks).
4: (Simple DIY) Just have multiple models that each save some context and then reconstruct it by talking to each other ("hey, do you know about X? Did the user talk about that?").
Maybe a finetuned "memorize facts" network, or a "memorize X" network.
5: Layered categorisation, with zooming in on info that is continually being sorted.
etc.
It depends on the use case; understanding the model, and which bottlenecks are likely not to change soon or are being paid too little attention, should help in deciding where you can add value.
Bonus: self-reminders, or reminders based on context, might be able to re-prompt things outside the context window; the LLM could use this as a "universal plugin" inside ChatGPT, for example. Weaviate is trying to develop such a plugin, which is in alpha right now. Maybe they value contributors, since their method in isolation could use help from creative systems in symbiosis that complement each other, I think, personally guessing as to what is under the hood.
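For lead 1, a toy sketch of what "forgetting the least important tokens" could look like, using mean attention received as the importance score (my own illustration in PyTorch, not how GPT-4 manages its window):

import torch

def forget_least_attended(tokens, attn_weights, keep_ratio=0.5):
    """tokens: (seq_len, d_model) token representations.
    attn_weights: (seq_len, seq_len) attention matrix (rows = queries).
    Keep only the tokens that receive the most attention overall."""
    importance = attn_weights.mean(dim=0)                 # attention each token receives
    n_keep = max(1, int(tokens.size(0) * keep_ratio))
    keep = importance.topk(n_keep).indices.sort().values  # preserve original order
    return tokens[keep], keep

seq_len, d_model = 8, 4
tokens = torch.randn(seq_len, d_model)
attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
kept_tokens, kept_idx = forget_least_attended(tokens, attn)
print(kept_idx)   # indices of the 4 most-attended tokens, in their original order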
You had me at “Better Attention” my dude.
I'm thinking: if pretraining is long-term memory, and you could store all the information of a dataset in the weights with perfect recall, it would not be necessary to have a long context. Instead you would just "fine-tune" the pretrained model with the 100-page document from your prompt and it would know the document perfectly.
In other words, if we overfitted the model perfectly during training, and every prompt were a perfectly overfitted fine-tuning, it would solve the problem of short-term memory. The trade-off would then be its reasoning abilities, because of the overfitting, but with vast amounts of data that could potentially be solved. Perhaps this solution would require more than double-precision weights. I think it is possible, with enough data and compute and without altering transformers, to solve AGI. It probably won't happen this way, but it shows there are many ways to reach it.
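As a toy of that idea, using the Hugging Face transformers API (model, learning rate and step count are arbitrary; whether a few steps actually "store" the document reliably is exactly the open question):

# Prompt-time fine-tuning as memory: a few gradient steps on the user's document,
# then query the model with a short prompt instead of a huge context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

document = "Example document text goes here. " * 50   # stand-in for the 100-page document
ids = tok(document, return_tensors="pt", truncation=True).input_ids

model.train()
for _ in range(3):                      # a few steps of "memorization"
    out = model(ids, labels=ids)        # causal LM loss on the document itself
    out.loss.backward()
    opt.step()
    opt.zero_grad()

model.eval()
query = tok("According to the document, ", return_tensors="pt").input_ids
answer = model.generate(query, max_new_tokens=30)
print(tok.decode(answer[0]))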
Why would longnet go public if it didn't address those points? Does the sagging attention curve have anything to do with the data? More specifically, what is it empirically related to? If it's the model itself and the calculations that's one thing if it's simply a product of the data and the format that's different. One thing I have noticed is that the "good" data all has a common theme/format. It seems very likely to me that the curve was a learned shortcut. I'm even more convinced of this by the simple inclusion of RLHF. There is a very specific way most people choose to communicate, especially in writing, and that curve that you mentioned matches it perfectly. But that is not how educational books or scientific papers are written.
I'm unable to find the code interpreter in my GPT-4. I'm from India. Why is this an issue?
Can we have links to all the papers mentioned?
Damn... if only I'd paid attention to what the video was about. Probably something awesome with Python.
It's kinda crazy that to produce one token, the model must pay attention to all of its previous context. If you need to compress information, we might as well do finetuning with the context?
This is imho where the current transformer errs. There's no information gained by comparing some important content later in the document, with completely unrelated content in the introduction. We need layered attention that is local to a sentence, paragraph, section/chapter etc..
Yeah, I think those summarization techniques aren't a real option for something like code, or anything that is sensitive to data loss.
How does Claude work with the 100k context window?
It's a technique called ALiBi, I think (Attention with Linear Biases).
They need attention on the level of sentences and sections in a text. It's ridiculous that the whole context is prioritized using only token attention. If we have attention on several layers, we no longer need a big context and could even reduce context size to < 1K for speedier inference. Longer context is NOT the answer.
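Roughly the shape of what I mean, as a toy sketch in PyTorch (sizes are arbitrary and the mean-pooling is just a placeholder):

import torch
import torch.nn as nn

class TwoLevelAttention(nn.Module):
    """Sketch of layered attention: token-level attention inside each sentence,
    then sentence-level attention over pooled sentence vectors."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sent_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, sentences):                 # (batch, n_sents, sent_len, d_model)
        b, n, s, d = sentences.shape
        toks = sentences.reshape(b * n, s, d)
        toks, _ = self.token_attn(toks, toks, toks)     # local, cheap: O(s^2) per sentence
        sent_vecs = toks.mean(dim=1).reshape(b, n, d)   # one vector per sentence
        sent_vecs, _ = self.sent_attn(sent_vecs, sent_vecs, sent_vecs)  # global, O(n^2) over sentences
        return sent_vecs

x = torch.randn(1, 100, 32, 128)                  # 100 sentences of 32 tokens
print(TwoLevelAttention()(x).shape)               # torch.Size([1, 100, 128])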
Well, you can always introduce another model to summarize the entire context window into 8k-ish tokens for the primary model
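A minimal sketch of that pipeline; summarize and primary_model below are placeholders of mine standing in for two separate model calls:

def summarize(chunk: str, budget_chars: int) -> str:
    return chunk[:budget_chars]            # placeholder for a cheap summarization model

def primary_model(context: str, question: str) -> str:
    return f"[answering {question!r} over {len(context)} chars of summaries]"

def answer_long_document(document: str, question: str,
                         chunk_chars: int = 20_000, budget_chars: int = 2_000) -> str:
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    summaries = [summarize(c, budget_chars) for c in chunks]
    return primary_model("\n\n".join(summaries), question)

print(answer_long_document("some very long document " * 10_000, "what is it about?"))

The obvious downside is the one raised a few comments down: summarization throws away detail, which is a problem for code or anything else that is sensitive to data loss.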
Each segment has its own middle in dilated attention. Just a way of knowing which attention to reference as far as I’m aware.
He said the thing!
awesome vid! Loving NNFS as well :D
The solution will possibly be a human one. A group of people could read a million tokens of text, and the ones with the best comprehension and fastest times could be queried about their technique. I think the Wheel of Time is a good example to try, with 4.4 million words. The great dictionaries are another, with up to 60 million words, but humans could never read it all, apparently.
Smaller models hyper-tuned to specific tasks might actually solve this problem.
"I dilate down to the 1st and last token, so I can do 10^10000 tokens now; it just takes longer than the heat death of the universe to read them in." Is this really useful?
OpenAI has a gpt-4 32K model.
Yep. Still has the lost in the middle problem. A model existing doesn't mean it doesn't have drawbacks
There is a gpt 4 32k model, Claude has 100k, larger context is coming!
FoT focused transformer has shown better training with larger context length by using positive and negative examples to help with this issue. Check it out and let me know what you think.
Very interesting development
I kind of got big bird flashbacks reading this paper.
Where can I get some better attention?
Halfway through this video and I feel like I'm watching a Healthy Gamer video on how attention and ADHD work, not a video about AI.
I think, massively improved hardware aside, the only solution is to have something like memory and an information source for the AI to work with (I guess something like the paper said, but I didn't get it since I'm not a science guy). Like a human solving a problem, the AI needs to work with the data to break the task down into chunks it can hold in memory.
Split that beginning and end into many more beginnings and ends, like a human working through a to-do list involving many research, understanding and execution steps. For this to work, the process would need to move away from running on memory alone to memory + source, as well as creating a specialised checkpoint of the model just for that task.
Claude is 100k tokens already
Like stated in the video, there are models that go beyond 16K and 32K. We also see an example from Microsoft that shows you could have 1B tokens. The point is, scaled out, attention just doesn't work well, both in terms of processing time but also in the actual output quality from that attention.
Easy fix: middle-out compression
If LongNet were being honest, they'd use a log y scale.
Splitting attention into segments doesn't make much sense to me. What if in the second segment you needed the context from the first segment to comprehend it?
Wait, so there's O(N^2) complexity when these models process text prompts? Why is there so much hype about GPT-4 but nobody talks about this fact? It's a huge constraint, seriously limiting the capabilities and possible use cases.
All the research trying to do 1-million-token context lengths is crappy; it just removes so much variability, and it evaluates tokens within a context only sparsely, or not at all.
Instead of taking every Nth word, maybe some way of only focusing on meaningful words could help
The above would become:
“ instead every Nth focus meaningful “
Although that is still 5 tokens