Better Attention is All You Need

  • Published: 23 Nov 2024

Comments • 220

  • @hummuswithpitta
    @hummuswithpitta Год назад +158

    Interestingly (or not), every one of the authors of the original Attention Is All You Need paper has since left Google.

    • @franklydoodle350
      @franklydoodle350 Год назад +6

      Where are they now? Stanford or OpenAI?

    • @MouliSankarS
      @MouliSankarS Год назад +31

      ​@@franklydoodle350 6/8 authors started their own startups. One is at OpenAI. Only one (Llion Jones) is at Google. He is leaving Google later this month to start his own startup.

    • @Kazekoge101
      @Kazekoge101 Год назад +1

      @@MouliSankarS name of the startups?

    • @MouliSankarS
      @MouliSankarS Год назад +22

      @@Kazekoge101 cohere,
      AdeptAILabs,
      character_ai, near inc, inceptive

    • @AndrewTateTopG1
      @AndrewTateTopG1 Год назад +2

      Why are they leaving?

  • @dan.brandao
    @dan.brandao Год назад +61

    You have my attention.

  • @vene
    @vene Год назад +77

    The parallels to human cognition are really interesting. The "lost in the middle problem" is very much a feature of human memory - you always remember the beginning and end of a sequence the best (e.g. the beginning of the book you're reading and the part you've read last, but things in the middle are fuzzier).

    • @bonumonu5534
      @bonumonu5534 Год назад +5

      Yeah I was thinking the same.
      I guess it's some kind of working memory conservation technique

    • @Jay-kb7if
      @Jay-kb7if Год назад +1

      Some research attributes the decay to interference, as there's little time to rehearse before new information is introduced. It would probably look more like a diminishing gradient without interference.

    • @Golnarth
      @Golnarth Год назад +2

      Most of the information in a sequence is "in the middle", i.e. not markedly near the ends.

    • @MrChaluliss
      @MrChaluliss Год назад +1

      Recency and primacy effect.

    • @akam9919
      @akam9919 Год назад

      That's because most of the stuff in the middle is either entirely irrelevant filler or filler that just expounds on the actually relevant information or idea being presented. If you are expounding, you aren't doing it so that the person you are talking to can remember that stuff in particular; you are doing it so that they can more easily form their own take on an idea... their own model. Once they have done that, like the training data for modern AIs, it is completely useless unless some other connection can be made to information that is relevant or intrinsically interesting to the listener (like a funny joke, a fun fact, etc).

  • @talis1063
    @talis1063 Год назад +31

    My intuition is that LLMs need to be stateful. That might allow them to pick out relevant information from the input and compress it to their internal representation. Trying to fight the O(N^2) curve for both training and inference isn't gonna lead to much progress. That state could be separable from the core LLM just like the prompt, but the LLM needs to be able to manage it. Kind of like a memory module that you'd pass along with the prompt, but unlike the prompt it isn't converted back to tokens and the LLM modifies it. Much closer to how humans are able to process entire books' worth of data 'at once'. First internalize and then query. Training something like this would be really hard though.

    • @Jay-kb7if
      @Jay-kb7if Год назад +6

      This is similar to what they're all suggesting (OpenAI etc.), which is to focus on smaller and cleaner data for specific use cases. I feel like the ceiling is hit with broad models, and it's not like we even need to push further there, because it just needs enough information to know what we mean and to then pipe into the smaller models as you've described. They all suggest it, but all this research seems to have a fetish for some Lord of the Rings type of model to rule them all.

    • @Sirmrmeowmeow
      @Sirmrmeowmeow Год назад +3

      @@Jay-kb7if Any experiments with "intra-layer memory neurons" or "dedicated memory neurons"? For the purpose of remembering previous hidden states or being able to focus on certain activations more than others? (A little more than just copying old activations into an orthogonal layer(s).)

    • @hola-kx1gn
      @hola-kx1gn Год назад +2

      "Trying to fight the O(N^2) curve for both training and inference" - no

    • @Jay-kb7if
      @Jay-kb7if Год назад +4

      @@Sirmrmeowmeow Possibly something like that; it's hard to be specific as it's all theoretical. GPT is a good traffic warden because it has strong inference on language. It really doesn't need to know facts, just what you are trying to do and how to access smaller, specific models with technical information. For instance, I like to dabble in vision stuff, so imagine a single context with the entire OpenCV library/documentation and the most popular GitHub repos as context. It should be pretty good at sourcing any function you want and piecing them together from many scripts. I suspect GPT is probably already doing something like this. This is what OpenAI is promoting people do, but OpenAI is not your friend, so always be suspicious. They are trying to vendor-lock as many businesses as possible by having their infrastructure configured to their APIs, and this proposed solution they promote is also a way to retain GPT as an overarching traffic warden.

  • @EdanMeyer
    @EdanMeyer Год назад +29

    I don't think larger context lengths are necessarily what people want. People want to process longer sequences, and there are other ways to do that, namely memory. Current memory methods leave a lot to be desired, but they are linear in time, and humans manage to do it.

    • @kasvith
      @kasvith Год назад +2

      Currently, memory for LLMs is just string concatenation of previous inputs or a vector search for relevant terms.
      Both are really useless when conversations grow longer; you cut off information one way or another. (Both approaches are sketched below.)
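
      A minimal sketch of both approaches, in plain Python. The embed() function and both memory_by_* helpers are made up for illustration; embed() is a toy stand-in for a real sentence-embedding model:

        import numpy as np

        def embed(text):
            # Toy stand-in for a real embedding model: hash characters into a vector.
            vec = np.zeros(64)
            for i, ch in enumerate(text):
                vec[i % 64] += ord(ch)
            return vec / (np.linalg.norm(vec) + 1e-9)

        history = []  # full conversation turns

        def memory_by_concatenation(max_chars=2000):
            # Approach 1: concatenate previous turns and truncate to fit the window.
            joined = "\n".join(history)
            return joined[-max_chars:]  # older turns silently fall off the front

        def memory_by_vector_search(query, top_k=2):
            # Approach 2: embed each turn and return the ones most similar to the query.
            if not history:
                return ""
            sims = [float(embed(turn) @ embed(query)) for turn in history]
            best = sorted(range(len(history)), key=lambda i: sims[i], reverse=True)[:top_k]
            return "\n".join(history[i] for i in sorted(best))  # keep original order

        history += ["We talked about dilated attention.", "Lunch is at noon.",
                    "LongNet claims 1B-token contexts."]
        print(memory_by_concatenation())
        print(memory_by_vector_search("what did we say about attention?"))

      Either way, whatever doesn't survive the truncation or the similarity cutoff is simply gone once the conversation grows.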

  • @MakeDataUseful
    @MakeDataUseful Год назад +10

    ChatGPT blows my mind and infuriates me at the same time when it spits out completely whack responses

  • @DeepTylerDurden
    @DeepTylerDurden Год назад +26

    I really don't like RNN's, but my intuition says that to solve the context problems we will need to go back to some sort of RNN. It just looks crazy to me that we are feeding those models entire books in one shot and expecting them to be able to answer all questions about it with 100% accuracy.

    • @MouldySoul
      @MouldySoul Год назад +4

      RWKV seems interesting tbf, based around a RNN, and riffing off Apple's Attention Free Transformer paper.

    • @Jay-kb7if
      @Jay-kb7if Год назад +1

      I could be wrong cos I'm the dumbest guy in the room here, but to me it seems like a conflict between data aggregation and data weighting. RNNs seem like a purely weighted approach and transformers like an aggregating approach, but also some weighting, in a very particular way if at all. I think it can be easily misleading to think they both weight tokens, but transformers to me (and again I'm stupid) seem to continually adjust positions with the more information they're given. Like a moving average. Thought of in a different way: forget the actual word and consider its underlying meaning, which becomes more defined based on its position relative to the meaning of other tokens. RNNs are purely probabilistic in an outcome, like cramming for tomorrow's math test by rote learning. Done well enough you can cite the exact phrase you repeated ad nauseam. Transformers on the other hand are constantly having to reorientate what each token means, so they might "fence-sit" comfortably between several "correct" answers, so they will always lack that precision.

  • @CitizensCommunity
    @CitizensCommunity Год назад +2

    It is like working with a person who is very smart but not thinking very hard: asking them about details can result in small errors, like wrong numbers, or answers that are just wrong if they put little thought into it. So you need to ask it to consider the answer it is giving. We do a lot on autopilot from System 1 that is similar to chat, so we should be able to give it larger amounts of context if we reduce the detail except on what we want it to do, and force it to give the needed consideration to what we are looking for.

  • @Jay-kb7if
    @Jay-kb7if Год назад +1

    To add to this (and I hope you read this, because I think about this as much as you do): Hinton is big on negative data, and cross-entropy is also not just looking at what is high in attention but gets the confidence for low attention. If they do not assess what has low attention, because they simply do not bother to evaluate all tokens in a context, then it's not going to appropriately stratify tokens within a context.

  • @goldenbananas1389
    @goldenbananas1389 Год назад +1

    Are we going to be getting any more videos in the neural networks from scratch series?

  • @akam9919
    @akam9919 Год назад +1

    I solved it.
    What we need to do is make another AI that simplifies all the context info and then makes a TikTok-style video for the actual model to process and use in generating an actually good answer.

  • @MaJetiGizzle
    @MaJetiGizzle Год назад +4

    Okay, I just finished the video and these are my thoughts.
    Yeah, the O(n^2) nature of attention in transformers is really what's holding the tech back at this point (sketched below). If we could somehow get that down to even linear complexity, that would open up so many doors of possibility for LLMs, context length, etc.
    I see a lot of people trading space in the form of vector DB embeddings as a way to offset the problem without completely addressing it, which works to some extent for long-term use cases, but ultimately doesn't make the problem go away. At the end of the day we're all essentially needing to chunk things at some level of the LLM pipeline.
    Ultimately, I do think a breakthrough with the architecture is possible, especially if we go down the route of trying to scale these models horizontally instead of vertically with techniques like MoE from OpenAI.
    I think once we get to the point where we have tiny LLMs automated together with Kubernetes-like software for managing tiny neural networks, we'll be in better shape.
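
    To make the O(n^2) point concrete, here is a minimal NumPy sketch of vanilla single-head self-attention (not any particular paper's variant); the n x n score matrix is the part that blows up as the context grows:

      import numpy as np

      def self_attention(X, Wq, Wk, Wv):
          # Vanilla single-head attention over n token vectors.
          Q, K, V = X @ Wq, X @ Wk, X @ Wv
          scores = Q @ K.T / np.sqrt(K.shape[-1])      # shape (n, n): the quadratic part
          weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
          weights /= weights.sum(axis=-1, keepdims=True)
          return weights @ V

      n, d = 2048, 64                                   # doubling n quadruples the score matrix
      X = np.random.randn(n, d)
      Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
      print(self_attention(X, Wq, Wk, Wv).shape)        # (2048, 64), via a (2048, 2048) intermediate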

    • @Jay-kb7if
      @Jay-kb7if Год назад

      It feels like embeddings aren't all that special and just wrap around the initial prompt. The process of generating embeddings and having an encoded embedding table through OpenAI's API is no different from what they would do anyway with normal text prompts. It's just to sound fancy.

    • @kasvith
      @kasvith Год назад +1

      I completely agree with you

  • @keco185
    @keco185 Год назад +1

    Context almost needs to be stored in a tree architecture instead of a single 1-D line

  • @Veptis
    @Veptis Год назад +2

    My second comment was about the overall architecture of the whole model. Do we need the full width of the context length all the way up? Or can you simply make higher layers narrower? Somewhat like a pyramid scheme? The one output that matters is either a single CLS token at the front or a probability of the next token near the end. Maybe you just have small transformers and then chain them with LSTMs or something.

  • @kevinmaillet8017
    @kevinmaillet8017 Год назад +42

    What's interesting is that if you go for an interview, they say to be either the first or the last one to interview. By doing that you are going to be remembered the best. It's weird that this curve happens at the beginning and the end of the context; it makes you wonder how close we are to real human thought.

    • @janbiel900
      @janbiel900 Год назад +18

      I wonder if this has to do with an inherent bias in the dataset. Think of any kind of "completed" written text (book, comment, function); I would venture to say that all of these have the most relevant information at the start and at the end.

    • @gigiopincio5006
      @gigiopincio5006 Год назад +17

      ima spit out a billion tokens next interview till they hallucinate

    • @chickenp7038
      @chickenp7038 Год назад +6

      @@janbiel900 I think that's exactly what's happening. For predicting one of the last tokens it needs to read the end of the text, and to understand the task it needs to read the beginning. Super obvious in my opinion.

    • @FreestyleTraceur
      @FreestyleTraceur Год назад +4

      Primacy and recency effect

    • @EricWalisko
      @EricWalisko Год назад +3

      Aren't the models trained on human conversations, which inherently contain this phenomenon? Shouldn't we expect them to reproduce the patterns they are trained on?

  • @serta5727
    @serta5727 Год назад +1

    I am happy that as you explained your thoughts on the new attention mechanism they were similar to my thoughts. So I feel reassured that my understanding of it is not total nonsense.

  • @60hit99
    @60hit99 Год назад +2

    Good to see you back with amazing content

  • @jannikmeissner
    @jannikmeissner Год назад +1

    I agree with a lot here; I spent a lot of my working time in the past four years on researching extending the context window. In the end, our solution (I am co-founder at a startup called Neuralfinity) is indeed a redesigned attention mechanism. I sadly can't reveal how we did it, but we will release a paper end of the year/ beginning of next, when our next generation LLM is ready.

  • @jaimejaime9800
    @jaimejaime9800 Год назад +1

    Nice informative summary! I've lately been doing structured data mining from chemistry papers with LLMs, and I am not unhappy with the map-reduce hackery with OpenAI's 4k ChatGPT. In fact I tried to feed in the full paper with the 16k models, and the results were far worse. I found the sweet spot for the chunk size I fed to the model to get the best extraction to be around 3k. Some recurrent hybridization and a differentiably trained retrieval index, to automate all this map-reduce and nearest-neighbour-embedding hackery, look like the low-hanging fruit of improvement to me.

  • @labeardod
    @labeardod Год назад +1

    Had no idea about the U-shaped attention problem, but I've definitely come across it. That valley is where GPT's hallucinations live, and thrive.

  • @Kram1032
    @Kram1032 Год назад +15

    I think something that *might* work is if you took a Mixture of Experts approach, where each expert has a different attention dilation map.
    Probably not ideal for computer architecture (which wants powers of 2), but at least in principle it might make sense to give each expert a dilation factor that's a prime number, so you get nice phase coverage across a wide area of tokens.
    Of course that also means you need more memory for each such expert.
    But if you have like 8k tokens for each expert, where one sees every single one of the first 8k tokens, another sees the first 4k and then 2k worth of every second token and 1k worth of every fourth and so on, and other experts dilate in steps of three, and five, and seven - you probably get fairly dense coverage even at fairly high numbers.
    Or alternatively, you could just stick to powers of 2 but add a "phase shift" between experts so they see "the other half" or "the other three quarters" etc. (A rough coverage check is sketched below.)
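
    A rough way to sanity-check the coverage idea. This is purely a toy count of which positions each hypothetical expert would see, using a dense prefix plus a single dilation factor per expert rather than the full geometric schedule described above; expert_positions() is made up for illustration:

      def expert_positions(dilation, budget=8192, dense=1024, context=100_000):
          # Each "expert" reads the first `dense` tokens, then every `dilation`-th
          # token after that, until it has used up its `budget` of positions.
          seen = set(range(min(dense, context)))
          pos = dense
          while len(seen) < budget and pos < context:
              seen.add(pos)
              pos += dilation
          return seen

      dilations = [1, 2, 3, 5, 7]          # prime-ish factors, as suggested above
      covered = set()
      for d in dilations:
          covered |= expert_positions(d)
      print(f"{len(covered)} of 100000 positions seen by at least one expert")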

    • @stxnw
      @stxnw Год назад

      MoE has been proven to be dogshit

  • @gandoffboss197
    @gandoffboss197 Год назад +1

    These issues remind me of problems from operating systems design. Maybe a few concepts from OS design might be thought-provoking. Swap space is an obvious memory management technique that might be useful when RAM is limited but a need for larger amounts of memory exists. In the vein of how long it runs, maybe thinking about how OS design uses context switching could be useful. Just throwing out some food for thought. Got to get those creative juices flowing!

  • @aidantilgner
    @aidantilgner Год назад +4

    Might be a dumb question, but could we use an LLM to recursively summarize the conversation context over the course of the conversation, and use that summary as the context for a given prompt? Basically just as the conversation progresses, a background process would create a summarized, and therefore a sort of lossy compressed version of the conversation. Obviously might not be the most efficient but maybe a cool idea.
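
    A minimal sketch of that rolling-summary idea; summarize(), chat_turn(), and the echo_llm stand-in are hypothetical placeholders for whatever LLM calls you would actually use:

      def summarize(text, max_words=150):
          # Hypothetical stand-in for an LLM summarization call; plain truncation
          # is used here only so the sketch runs on its own.
          return " ".join(text.split()[:max_words])

      def chat_turn(llm, running_summary, user_message):
          # Build the prompt from the compressed history instead of the full transcript.
          prompt = (f"Conversation so far (summary): {running_summary}\n"
                    f"User: {user_message}\nAssistant:")
          reply = llm(prompt)
          # Fold the new exchange back into the summary (the "background process").
          running_summary = summarize(
              f"{running_summary}\nUser: {user_message}\nAssistant: {reply}")
          return reply, running_summary

      echo_llm = lambda prompt: "noted."   # dummy model so the loop runs
      summary = ""
      for msg in ["My project is about dilated attention.",
                  "Remind me what my project is about?"]:
          reply, summary = chat_turn(echo_llm, summary, msg)
      print(summary)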

    • @Jay-kb7if
      @Jay-kb7if Год назад +1

      It suffers the same issue of reduced context length: essentially providing less information (albeit more acute) to generate responses. But it seems very plausible to me, though I am dumb. Likely GPT is already doing this stuff anyway.

    • @kasvith
      @kasvith Год назад

      yes its possible, but answers will be dumb sometimes. Langchain has a summarization chain which can be used for your task

    • @aidantilgner
      @aidantilgner Год назад

      @Jay-kb7if true, I'm sure they're working on ways to compress the context themselves. As for the problem, very true that it will reduce the information that it pulls from, however, I'm thinking that there could be different modes. A mode that would make a short summary of every individual message so far, with the goal of being able to understand what has been discussed in the conversation for long periods of time. And a mode that will simply generate a couple paragraphs explaining the essence of the conversation so far, preserving key points and important phrases that were uttered. Different compression modes may yield different results. We'll see though, if I make a repo I'll link it here.

    • @aidantilgner
      @aidantilgner Год назад

      @kasvith true, but to be fair they're already dumb sometimes 🤣

    • @kasvith
      @kasvith Год назад

      @@aidantilgner even with a small context they are dumb

  • @TheAero
    @TheAero Год назад

    We are starting to reach a peak in performance. The differences will start to be 1% - 2% per year moving forward, until something entirely new comes along. Maybe fusion models and transformer mixes. New datasets, more data, better compute units, deeper models, larger models. That's gonna be the game until the technology saturates.

  • @phobosmoon4643
    @phobosmoon4643 Год назад

    What I envision is large foundation models spinning up volatile sub-LLMs, generating a training regimen and an abstracted fitness function, as well as a goal and a directive to spend p amount of system power and t amount of time on RLHF (not human, but you know), and to return the results of those fine-tuned models.

  • @Jackson_Zheng
    @Jackson_Zheng Год назад +1

    They also used post-norm instead of pre-norm for the attention, which matches the original transformer architecture design but not what state-of-the-art GPTs use (which is pre-norm). This can affect performance, since post-norm models need to be trained for longer than pre-norm models before they reach similar accuracy. Because they didn't reveal exactly how long they trained the models for, this may not be quite reflective of real-world use.
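
    For anyone unsure what that means in code, here is a minimal PyTorch-style sketch of the two orderings (assuming standard attention and feed-forward sublayers; masking and dropout are omitted). The original Transformer used the post-norm form; GPT-style models typically use pre-norm:

      import torch.nn as nn

      def sublayers(d_model, n_heads):
          attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
          return attn, ff

      class PostNormBlock(nn.Module):
          # Original-Transformer ordering: sublayer, then add, then normalize.
          def __init__(self, d_model, n_heads):
              super().__init__()
              self.attn, self.ff = sublayers(d_model, n_heads)
              self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

          def forward(self, x):
              x = self.norm1(x + self.attn(x, x, x)[0])
              return self.norm2(x + self.ff(x))

      class PreNormBlock(nn.Module):
          # GPT-style ordering: normalize first, then sublayer, then add.
          def __init__(self, d_model, n_heads):
              super().__init__()
              self.attn, self.ff = sublayers(d_model, n_heads)
              self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

          def forward(self, x):
              h = self.norm1(x)
              x = x + self.attn(h, h, h)[0]
              return x + self.ff(self.norm2(x))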

    • @MouldySoul
      @MouldySoul Год назад

      f*** sake, really, I've been doing post-norm, I didn't realise it was slower to train ffs

  • @notfaang4702
    @notfaang4702 Год назад

    Thanks for the video and for explaining why it's difficult to scale context size.

  • @smicha15
    @smicha15 Год назад

    You can upload scientific documents to the code interpreter. The document size limit is 100MB. I uploaded a whole book to it, and it was able to answer questions for me.

  • @adempc
    @adempc Год назад +1

    I'm not gonna pretend to understand any of this. But it sounds like we are pushing up against the limits of processing information without it being compressed first.. is that right?
    I know we aren't computers, and computers aren't us - but we have various levels at which we process information, top level all the way down to the unconscious level.
    Are we missing the equivalent with our current tools?

  • @CEOofTheHood
    @CEOofTheHood Год назад +17

    I will never forget the abandoned neural network from scratch project. Some of the best content on this channel but never finished.

    • @willsheffield2000
      @willsheffield2000 Год назад +1

      The last video (P.9) is chapter 6. Which was a year ago. I have the book. Guess we'll just have to do it the hard way, read it all !

    • @avi7278
      @avi7278 Год назад +1

      Yeah you and his wallet. Just wondering did you pledge a monthly support to him or are you one of those people who feel entitled to everything for free?

    • @CEOofTheHood
      @CEOofTheHood Год назад +5

      @@avi7278 lol easy dude no one tryna fight here. He's charging 100 dollars for the book and claimed the videos would be part of the package.

    • @CEOofTheHood
      @CEOofTheHood Год назад +4

      @@avi7278 plus I don't understand where u see the entitlement. I expressed an opinion. I didn't demand anything.

    • @hEmZoRz
      @hEmZoRz Год назад

      @@CEOofTheHood Dude, you full well know the e-book is $29.00. That's more than a reasonable price for the content.

  • @zoahmed8923
    @zoahmed8923 11 месяцев назад

    Love your videos! What do you use to keep track of papers? You mentioned you have a tool that summarises research papers for you

  • @cw9249
    @cw9249 Год назад

    Have you seen the new paper about long convolutions and toeplitz matrices? I didn’t quite get the toeplitz matrix thing but it sounded interesting

  • @freedom_aint_free
    @freedom_aint_free Год назад +1

    It seems to me that it would be surprising if it was not like that:
    Since the very inception of the AI field (e.g. Rosenblatt's perceptron), the systems have been modeled after the human nervous system and have been trained on human-generated data, so it seems pretty natural that the systems, at least in a high-level view, would display human-psychology-like phenomena.

  • @punkkap
    @punkkap 3 месяца назад

    'Tis a great video. It's quite a task to put into context everything that was ever written, but I feel like, with the correct architecture and enough processing you will find ... I guess it's just gaussian fields of ever rising dimensions of what works in which situation. But if we have the ability to actually question it well enough, we could evolve our linguistic capabilities as well. I for one would love a book that lists the best allegories for all situations.

  • @FREELEARNING
    @FREELEARNING Год назад +2

    Maybe Larger Context is all we need for Even Better LLMs. I was thinking that maybe integrating an RNN layer within the Transformer architecture could help in achieving this. For example, the input is split into 8k chunks, each one passes through the first attention layer, then the outputs are concatenated and passed through the RNN, and this is repeated again and again until the end, where everything is passed to the dense layer. So in this case we have the performance of full attention for each chunk and we have the performance of the RNN for processing the very long output representation. What do you think?

  • @NevelWong
    @NevelWong Год назад

    I think the easiest way to improve the context length situation for now would be a compression of the input tokens. E.g. to solve most math problems, you will NOT need the prose- or syntax-related information of the question. That's just baggage to the AI. So ideally we could design a maximally dense language, which purely contains factual statements, no fluff words, no articles, no superfluous prepositions, etc. We could then convert a user's input into this intermediate language, process and generate output in that intermediate language, and then convert it back to English.
    Sure, it would sound dry and boring, but we don't need author-level prose in, say, our meta-analysis of papers.
    This way we could easily double our context length, and likely improve accuracy along the way. (A crude sketch of the idea follows below.)
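
    As a very crude stand-in for that "maximally dense language" (a real version would need a learned translation step, not a hand-written word filter; FLUFF and densify() below are made up for illustration), just dropping obvious filler words already shrinks the token count noticeably:

      import re

      FLUFF = {"the", "a", "an", "of", "to", "is", "are", "that", "this",
               "please", "very", "really", "just", "in", "on", "and"}

      def densify(text):
          words = re.findall(r"[A-Za-z0-9']+", text.lower())
          return " ".join(w for w in words if w not in FLUFF)

      q = ("Please tell me what is the sum of the first ten positive integers "
           "and explain it very briefly")
      print(densify(q))
      print(len(q.split()), "->", len(densify(q).split()), "words")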

  • @8eck
    @8eck Год назад

    Exactly, I also think that gathering some useful information after all those distributed attention mechanisms is kinda hard or impossible. How will the model know which attentions were the most important... I think generalizing it will be super hard.
    Possibly if there were better pre-processing, and maybe even some models in front of this big model that would separate the semantics of the input and distribute by semantics, then delegate input by semantics to a specific attention responsible for that semantic, that would possibly lead to some generalization of the model in the end.

  • @ander300
    @ander300 Год назад

    Part 10 of Neural Net from Scratch, about analytical derivatives??? Please bring the series back!

  • @steve_jabz
    @steve_jabz Год назад

    Recurrent Memory Transformers and RWKV got up to 1 million tokens. Magic LTM-1 manages 5 million.
    They had some pretty interesting optimizations for getting around some of these problems too

  • @MouldySoul
    @MouldySoul Год назад

    Probably has to be the first video where I'm not even slightly annoyed by the ad at the end. Neural nets from scratch for the win, I'll definitely have a dig there thank you!!

  • @KodandocomFaria
    @KodandocomFaria Год назад

    I don't know if it is already possible, but I think it is time to start using quantum computing for those kinds of things.
    Another alternative is to use different architectures like RMT (Recurrent Memory Transformers - the paper proposes 1M tokens), or GNNs (maybe better, but they will also consume a lot of resources), or LongNet (1 billion tokens). But independent of architecture, I notice most models are not well optimized to use the GPU; I've seen many models with the same number of params but with different memory usage. So I believe, for starters, there are 3 options that can help:
    1 - Improve models for better resource utilization
    2 - Maybe migrate to another language that is faster and uses fewer resources, like C++ or Rust or even Go.
    3 - Or, so that it isn't necessary to migrate to another language, the community could come together and help improve Python performance.

  • @MKBergins
    @MKBergins Год назад

    Love your content, and got your Machine Learning PDF - awesome stuff good sir. Do you have any local LLM recommendation to help with programming, or a cost-effective way of doing it via something like ChatGPT?

  • @DevWSJ
    @DevWSJ Год назад +1

    Are you going to continue the neural network from scratch series ? :(

  • @_XoR_
    @_XoR_ Год назад

    I really think we need to emulate attention at the hardware level. And by this I don't mean an accelerator that operates at the instruction level, but at the architecture level. I don't think there is any other workaround and what I don't understand is why bigger companies haven't invested in the development of this sooner..

  • @Bencurlis
    @Bencurlis Год назад +6

    I think we should just ditch Attention as the main information processing feature entirely. Attention will always require having all tokens available in memory, so the memory required will always scale linearly with the context size, even if we bring down the time complexity of Attention to O(n) (and that will always imply missing some pairwise token relations or simply some of the tokens). A smarter replacement would be to use Attention with a smaller window, but let the model "place" the window anywhere it wants in the context, as needed, and the model will only need this subset of tokens in memory. Of course this would require going back to RNNs in order to let the model update the location of the Attention window in the context, and that would increase computation times quite a bit.

    • @chasebrower7816
      @chasebrower7816 Год назад +3

      Some kind of RNN-Attention composite would be kind of cool, but it's possible that attention is the final feature. A clever enough retrieval system with a vector database or the like might be able to pull off an adequately sophisticated memory system long term.

    • @joeboyle7390
      @joeboyle7390 Год назад +3

      @@chasebrower7816 RNN's take way longer to train than the equivalent performing Transformer, mostly because attention can be computed in one step, whereas RNN necessarily needs multiple steps. For RNN's to be viable again I think you need to fix that problem first.

    • @Bencurlis
      @Bencurlis Год назад

      @chasebrower7816 you would still need to make the model learn to use the mechanism for reading and writing from the vector database or the memory system, that would probably be recurrent anyways.
      @joeboyle7390 I don't think that is really a problem, there are quite a few methods that were proposed to make RNN training much more efficient. I imagined one where the model would only require the data of two successive time steps, allowing a lot of parallelism along the batch dimension.

    • @Jay-kb7if
      @Jay-kb7if Год назад

      People needa think about why these models work so well, and in some ways it's the only true machine learning approach. RNNs are literally just a fancy regression analysis, and in hindsight it's hard to believe how we relied on least-squares error to make predictions and expected any kind of sophistication. It's important to think of transformers in context. Language is meaning, and rather than word frequency, transformers consider word association. Maybe I'm not explaining that last bit right, but RNNs do not consider the meaning at all, merely where it belongs in a sentence. Your approach is a little more challenging to put into practice and is what transformers already do. Transformers are actually pretty simple in that they look at the distribution of all tokens in the context and attend to the highest (or around that, depending on temperature), and then again and again. Maybe a dynamic context length? I'm just rambling and talking out of my arse BTW, so forgive me if nothing I'm saying makes sense or it's completely wrong, lol.

    • @Bencurlis
      @Bencurlis Год назад

      @@Jay-kb7if I don't think there is any difference of the way meaning is learned in transformers compared to RNN, they optimise the exact same loss. Both are performing "fancy regression analysis" as you say, they just process the context and retain information differently. I think the issue with RNN based LLM is that the state vector is simply too small to store enough relevant information without forgetting it, and that they are difficult to train because of vanishing/exploding gradient. Both of these issues can be solved, and it is important to remember that the human brain is a giant RNN (*not* a transformer), so we know it is possible to make RNN work.

  • @lincolt
    @lincolt Год назад

    Looks like the same kind of issue image data had before convolutions arrived.

  • @thorchh
    @thorchh Год назад

    Do you have any sources/links to further research the topic of attention's U shaped graph?

  • @lostpianist
    @lostpianist Год назад

    What I've been realising the last few months is that there is ultimately only so much you can do with LLMs. They are very useful and will become even more useful, but in isolation they will always have some limitations. In future (or indeed already) we will have networks of LLMs that work together and networks of LLMs that decide which LLM to call. The human brain works with symbols; even at the deepest levels of meaning and emotion, it's all symbolic representation of information. Look at savantism/savants. It's almost like they are less finely and/or more finely tuned LLMs. Interesting times...

  • @minimal3734
    @minimal3734 Год назад +1

    Obviously bits of information have to be dropped to fit the data into sparser representations. The dropped data might be crucial for the understanding of the whole context. I wonder if the model will be able to direct the attention to the "ground level" when necessary, to obtain and process all relevant details.

  • @8eck
    @8eck Год назад

    Can't models use distributed GPU for inference? I thought that this is already implemented in some frameworks...

  • @adempc
    @adempc Год назад +2

    Better attention is all I need.. ain't that the truth!

    • @ashu-
      @ashu- Год назад

      Stop watching shorts 😡

  • @wktodd
    @wktodd Год назад +3

    Maybe the answer is to tokenize a whole concept. I.e. when I listen to you, I'm not storing every word in my head; I'm filtering for facts and context to form a concept of what you are talking about. So, once you have defined the subject, I store that as a concept and recall it when necessary, not the long waffle getting there. If that whole waffle can be condensed to a single token, you have a vast space opening up.
    E.g. I only have to say 'Lap-time' for you to be triggered into racing car mode. Am I right? 8-)

    • @MouldySoul
      @MouldySoul Год назад +1

      Lap time sounds like something you'd say to your dog. "Time for some lap time buddy"

    • @wktodd
      @wktodd Год назад

      @@MouldySoul Well yes, but the point is Lap is the subject (could be lapping at milk, an occupant of Lapland, or your thighs); Time provides context. In your world model that concept leads to hairy trousers; in Harrison's it's hammering a car around a track. It is a shortcut to a place in the model space, from where the model can start navigating towards the next generated token. If the LLM had a way to save and recall a marker, it wouldn't have to navigate all the previous prompts to get back to the current concept.
      I suppose the real problem is whether such a marker could be made smaller than the array of tokens that led to that position.

    • @Jay-kb7if
      @Jay-kb7if Год назад

      What is a concept though? A token shouldn't be seen as a word but as the smallest meaningful unit of information (so forgetting the actual word, it has its own specific meaning, and in the same context the same word or word segment as 1 token can be very different).

    • @wktodd
      @wktodd Год назад

      @@Jay-kb7if See my comments below. I said token because it fits into the input stream like any other token, but this marker token's job is to preset/load the context like a sign-post. The pre-prompt gets the model to place-A, your prompt moves it on to place-B, the model navigates to place-C, etc. The idea is that the marker would allow direct access to place-X without having to pass through A-W. As I said in the other comment, it may require the marker to be as large as the sum of tokens that got it there, but if there were a way to compress or shortcut it then there is potential for considerable savings.

  • @HoriaCristescu
    @HoriaCristescu Год назад +1

    it's quadratic (n^2), not exponential (a^n)

  • @ChaseFreedomMusician
    @ChaseFreedomMusician Год назад

    I think LongNet should actually do better with this middle-out problem (Silicon Valley), because it's not just doing the additional computations in parallel, it's also the layering: they show a pretty interesting mathematical proof that the layers required for 100% coverage are logarithmic. But I think the more interesting part is that the attention heads themselves can attend to different segments of the graph independently, which should actually solve that middle problem.

    • @ChaseFreedomMusician
      @ChaseFreedomMusician Год назад

      I also agree with @talis1063's comments; internal state is likely important to make concepts spatially invariant.

  • @nathank5140
    @nathank5140 Год назад

    Am I missing something? The perplexity score goes down with increasing context size when the batch size is 16… if it continues to go down for larger contexts, doesn't that give us very large context windows without performance drop-off? 12:39

  • @yorailevi6747
    @yorailevi6747 Год назад +2

    Is it so complicated to make attention iterative though?
    Like how humans do it: they're aware that something exists, not necessarily with all the detail, and if needed they parse it again with a higher level of detail.
    It's really not that complicated if you make the system dynamic.
    But then ofc it's RNNs all over again

    • @Jay-kb7if
      @Jay-kb7if Год назад

      It would be different to what they do now. I have the same thoughts as you though about dynamic context lengths. Do we really need another iteration over 1 million tokens for highly specific words? It's just going to make 99.99% of them -0.00000000001.

  • @Veptis
    @Veptis Год назад

    I lost my previous comment, so I will split it up.
    I am working on a code generation evaluation benchmark that will support multiple tasks. And a difficult decision for me is what to allow as model context. And also do I write a variant that works for instruction finetuned models...

  • @adi331
    @adi331 Год назад

    I haven't read the research paper regarding remembering information in the middle. But could it be that the stuff in the middle is a lot of "filler" information and therefore not worth remembering?
    Is it just an inherent property of text that the stuff in the middle is less important than the beginning and end? Not sure.

  • @pisoiorfan
    @pisoiorfan Год назад

    Yup, this is a problem. I think a good attempt is to do what we humans do: incrementally drop irrelevant (= not worth attention) tokens. If you split a 2k-span window into a series of 8x256-token segments, feeding each segment 1/2 of the tokens coming out of the previous segment, the "virtual" attention span expands to 256 + 512 + 1024 + ... =~ 64k tokens.
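
    A quick back-of-the-envelope check of that arithmetic (this is just counting, not a model): if each 256-token segment keeps only half of what survived the previous one, every later segment effectively stands in for twice as many original tokens.

      seg_len, n_segments = 256, 8
      virtual_span = sum(seg_len * 2**k for k in range(n_segments))
      print(virtual_span)   # 65280, i.e. roughly 64k original tokens seen through a 2k window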

  • @YEASTY_COMMIE
    @YEASTY_COMMIE Год назад +1

    I had this simple idea a while ago to improve attention, just take a normal transformer, with like a relatively small context, and apply it to your whole large context like you would with a convolution filter in a CNN, and either by changing the stride or with max pooling or something, reduce the size of your input context. Do that over multiple layers, and you can in theory compress your context, divide its size by two or four at every step, until it fits in that 2048 window. I wonder if something like this has been tried
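
    A rough sketch of the shape of that idea; small_block() below is just a stub standing in for a real small-context transformer, so only the halve-until-it-fits control flow is meaningful:

      import numpy as np

      def small_block(window):
          # Stub for a small transformer over one window of token vectors: mix the
          # window, then keep every second vector (stride-2 pooling) to halve it.
          mixed = window + window.mean(axis=0, keepdims=True)   # placeholder "attention"
          return mixed[::2]

      def compress_context(tokens, window=512, target=2048):
          x = tokens
          while len(x) > target:
              chunks = [x[i:i + window] for i in range(0, len(x), window)]
              x = np.concatenate([small_block(c) for c in chunks])
          return x

      ctx = np.random.randn(16384, 64)        # 16k token vectors of width 64
      print(compress_context(ctx).shape)      # (2048, 64) after three halvings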

    • @joeboyle7390
      @joeboyle7390 Год назад

      That just sounds like a convolutional network to me, how is it different?

    • @YEASTY_COMMIE
      @YEASTY_COMMIE Год назад

      @@joeboyle7390 well you replace the filters (simple multiplications) with a whole ass transformer, and have a big transformer at the end instead of the fully connected layer. It's a convolutional transformer

    • @joeboyle7390
      @joeboyle7390 Год назад

      ​@@YEASTY_COMMIE Aha, I think I see what you're proposing. That sounds like something that people would have experimented with, but if not sounds like an interesting research project!

    • @YEASTY_COMMIE
      @YEASTY_COMMIE Год назад +2

      @@joeboyle7390 every time I have an ML idea, I realize a few months later that it was invented like 2 years ago and was a banger (I thought about something like GANs when I was 16, then realized they had been invented 2 years earlier, same thing happened with ResNets, and a bunch of other ideas). Either that or something similar comes out the next month. Always makes me feel like I missed an opportunity, but on the other hand I probably couldn't have done something that competes with what those teams of researchers produce anyways, so I try to be content with my ideas being vaguely validated

    • @d-star491
      @d-star491 Год назад

      @@YEASTY_COMMIE this is what I want to be. Wanna swap brains?

  • @tiagotiagot
    @tiagotiagot Год назад

    Could something sorta like a "mipmap" of the context, with varying levels of "convolution" (ideally some sort of semantic compression if that's a possibility), combined with streaming from disk to read individual details at full resolution when needed, perhaps something sorta analogous to Unreal Engine 5's Nanite, perhaps be a possibility?

  • @gunrage
    @gunrage Год назад +1

    Do you think Tesla's dojo will enable building much larger models? Maybe not initially, because it will be used just for Tesla needs, but in general.

  • @jahcane3711
    @jahcane3711 Год назад

    Is the Perceiver model not a potentially viable solution?

  • @opusdei1151
    @opusdei1151 Год назад

    This is a good video, thank you very much.

  • @KeepingUp_withAI
    @KeepingUp_withAI Год назад +1

    Attention scales quadratically, not exponentially. Other than that, great video!

  • @MrGeordiejon
    @MrGeordiejon Год назад

    @04:30 Bidens' auto prompt?
    I was thinking of extending nnfsip to wrap each attention and plug them into the context(s)?
    ...

  • @opusdei1151
    @opusdei1151 Год назад

    Do you think that liquid neural networks are a marketing move? They seem to be so amazing, but there are almost no GitHub repositories on them. There are some papers here and there. But if it's so revolutionary, why isn't everybody jumping on it?

  • @erfanzarechavoshi909
    @erfanzarechavoshi909 Год назад

    I think multi-query works fine if you're trying larger ctx, but yes, the current attention needs to change.

  • @hanskraut2018
    @hanskraut2018 Год назад

    Very smart and very important indeed. Here are some leads on how to do it:
    1: *Smart forgetting*: GPT-4 seems to have no control over forgetting; it can even tell what info is more important but loses it anyway, even when empty text is added, if the important info is on the edge of its old token context window. Forgetting the least important tokens should theoretically increase the density of relevant and important tokens, in effect increasing the relevant token context length. Freezing the use of unrelated tokens, aka reducing their weight depending on the task, could also help.
    2: Sorting and compressing to different degrees of data loss, for reread/regaining of context based on a multitude of context memories sorted in different ways for different purposes. (Add RL on top and you have a self-improving system; a mix of hard-coding and learning can increase stability as well as the ability to choose what to use based on self-observation.)
    3: Dynamic weighting of token importance by attention, depending on guesses of importance, with multiple systems that do that by different metrics and methods, and meta-systems that choose the % of each method depending on results and past experience (small decider learning networks).
    4: (Simple DIY) Just have multiple models that each save some context and then reconstruct it by talking to each other ("hey, do you know about x? Did the user talk about that?").
    Maybe a finetuned "memorize facts" network, a "memorize X" network.
    5: Layered categorisation with zooming in on info that is continually being sorted.
    etc.
    It depends on the use case; understanding the model and which bottleneck is likely not to change soon, or has had too little attention paid to it, should help in deciding where you can add value.
    Bonus: self-reminders, or reminders based on context, might be able to re-prompt things outside of the context window; the LLM could use it as a "unversifyed" plugin inside of ChatGPT, for example. Weaviate is trying to develop such a plugin, which is in alpha right now; maybe they value contributors, since their method in isolation could use help from creative systems in symbiosis that complement each
    other, I think, personally guessing as to what is under their hood.

  • @MaJetiGizzle
    @MaJetiGizzle Год назад

    You had me at “Better Attention” my dude.

  • @JazevoAudiosurf
    @JazevoAudiosurf Год назад

    I'm thinking: if pretraining is long-term memory, and if you could store all the information of a dataset in the weights and had a perfect memory, it would not be necessary to have long context. Instead you would just "fine-tune" the pretrained model with the 100-page document from your prompt and it would perfectly know the document.
    In other words, if we overfit the model perfectly during training, and every prompt were a perfectly overfitted fine-tuning, it would solve the problem of short-term memory. The trade-off would then be its reasoning abilities, because of overfitting. But if you have vast amounts of data, that could potentially be solved. Perhaps this solution would require more than double-precision weights. I think it is possible with enough data and compute, without altering transformers, to solve AGI. It probably won't happen this way, but it shows that there are many ways to reach it.

  • @bigphab7205
    @bigphab7205 Год назад

    Why would LongNet go public if it didn't address those points? Does the sagging attention curve have anything to do with the data? More specifically, what is it empirically related to? If it's the model itself and the calculations, that's one thing; if it's simply a product of the data and the format, that's different. One thing I have noticed is that the "good" data all has a common theme/format. It seems very likely to me that the curve was a learned shortcut. I'm even more convinced of this by the simple inclusion of RLHF. There is a very specific way most people choose to communicate, especially in writing, and that curve that you mentioned matches it perfectly. But that is not how educational books or scientific papers are written.

  • @siddharthaudayakumar9444
    @siddharthaudayakumar9444 Год назад

    I'm unable to find the code interpreter in my GPT-4. I'm from India; why am I having this issue?

  • @vikranthkanumuru8900
    @vikranthkanumuru8900 Год назад

    Can we have links to all the papers mentioned?

  • @IronCandyNotes
    @IronCandyNotes Год назад +1

    Damn... if only I'd paid attention to what the video was about, probably something awesome with Python.

  • @CapsAdmin
    @CapsAdmin Год назад +1

    It's kinda crazy that to produce one token, it must pay attention to all of its previous context. If you need to compress information we might as well do finetuning with the context?

    • @sgramstrup
      @sgramstrup Год назад

      This is imho where the current transformer errs. There's no information gained by comparing some important content later in the document, with completely unrelated content in the introduction. We need layered attention that is local to a sentence, paragraph, section/chapter etc..

  • @8eck
    @8eck Год назад

    Yeah, i think that those summarization techniques are not a real use case for something like code or something that is sensitive to data loss.

  • @diadetediotedio6918
    @diadetediotedio6918 Год назад

    How does Claude work with the 100k context window?

    • @MouldySoul
      @MouldySoul Год назад +1

      it's a technique called ALiBi I think (attention with linear bias)

  • @sgramstrup
    @sgramstrup Год назад +1

    They need attention on the level of sentences and sections in a text. It's ridiculous that the whole context is prioritized using only token attention. If we have attention on several layers, we no longer need a big context and could even reduce context size to < 1K for speedier inference. Longer context is NOT the answer.

  • @lucasa8710
    @lucasa8710 Год назад

    Well, you can always introduce another model to summarize the entire context window into 8k-ish tokens for the primary model

  • @MaxGuides
    @MaxGuides Год назад

    Each segment has its own middle in dilated attention. Just a way of knowing which attention to reference as far as I’m aware.

  • @countofst.germain6417
    @countofst.germain6417 Год назад

    He said the thing!

  • @Artorias920
    @Artorias920 Год назад

    awesome vid! Loving NNFS as well :D

  • @hewhointheearthlydomainsee1272

    It will possibly be a human solution. A group of people reads a million tokens of text and the ones with the best comprehension and fastest times could be queried about their technique. I think the Wheel of Time is a good example to try with, with 4.4 million words. The great dictionaries are another, with up to 60 million words, but humans could never read it all, apparently.

  • @thisisnotramansmusic1045
    @thisisnotramansmusic1045 Год назад

    Smaller models hyper-tuned to specific tasks might actually solve this problem.

  • @scottmiller2591
    @scottmiller2591 Год назад

    "I dilate down to the 1st and last token, so I can do 10^10000 tokens now; it just takes longer than the heat death of the universe to read them in." Is this really useful?

  • @calcs001
    @calcs001 Год назад

    OpenAI has a gpt-4 32K model.

    • @sentdex
      @sentdex  Год назад

      Yep. Still has the lost in the middle problem. A model existing doesn't mean it doesn't have drawbacks

  • @CitizenWarwick
    @CitizenWarwick Год назад

    There is a gpt 4 32k model, Claude has 100k, larger context is coming!

  • @JebBradwell
    @JebBradwell Год назад

    FoT (Focused Transformer) has shown better training with larger context length by using positive and negative examples to help with this issue. Check it out and let me know what you think.

  • @serta5727
    @serta5727 Год назад

    Very interesting development

  • @sandraviknander7898
    @sandraviknander7898 Год назад

    I kind of got big bird flashbacks reading this paper.

  • @americanbagel
    @americanbagel Год назад

    Where can I get some better attention?

  • @dawre3124
    @dawre3124 Год назад

    Halfway through this video and I feel like I'm watching a Healthy Gamer video on how attention and ADHD work, not a video about AI.
    I think with massively improved hardware, the only solution is to have something like memory and an information source for the AI to work with (I guess something like the paper said, but I didn't get it since I'm not a science guy). Like a human solving a problem, the AI needs to work with the data to break down the task into chunks it can hold in memory.
    Split that beginning and end into many more beginnings and ends, like a human working on a todo list involving many research, understanding, and execution steps. For this to work the process would need to move away from running off memory alone to memory+source, as well as creating a specialised checkpoint of that model just for that task.

  • @NickWindham
    @NickWindham Год назад

    Claude is 100k tokens already

    • @sentdex
      @sentdex  Год назад +1

      Like stated in the video, there are models that go beyond 16K and 32K. We also see an example from Microsoft that shows you could have 1B tokens. The point is, scaled out, attention just doesn't work well, both in terms of processing time but also in the actual output quality from that attention.

  • @JohnVandivier
    @JohnVandivier Год назад

    Easy fix: middle-out compression

  • @scottmiller2591
    @scottmiller2591 Год назад

    If LongNet were being honest, they'd use a log y scale.

  • @Linck192
    @Linck192 Год назад

    Splitting attention into segments doesn't make much sense to me. What if in the second segment you needed the context from the first segment to comprehend it?

  • @fpsmeter
    @fpsmeter Год назад

    Wait, so there's O(N^2) complexity when those models process text prompts? Why is there so much hype about GPT-4 but nobody talks about this fact? It's a huge constraint, seriously limiting the capabilities and possible use cases.

  • @Jay-kb7if
    @Jay-kb7if Год назад +1

    All the research trying to do 1-million-token context length is crappy; it just removes so much variability and evaluates tokens within a context only sparsely, or not at all.

  • @FuZZbaLLbee
    @FuZZbaLLbee Год назад

    Instead of taking every Nth word, maybe some way of only focusing on meaningful words could help
    The above would become:
    “ instead every Nth focus meaningful “
    Although that is still 5 tokens