RWKV: Reinventing RNNs for the Transformer Era (Paper Explained)

  • Published: 2 Oct 2024

Comments • 131

  • @YannicKilcher
    @YannicKilcher  1 year ago +14

    Fully Connected (June 7th in SF) Promo Link: www.fullyconnected.com/?promo=ynnc
    OUTLINE:
    0:00 - Introduction
    1:50 - Fully Connected In-Person Conference in SF June 7th
    3:00 - Transformers vs RNNs
    8:00 - RWKV: Best of both worlds
    12:30 - LSTMs
    17:15 - Evolution of RWKV's Linear Attention
    30:40 - RWKV's Layer Structure
    49:15 - Time-Parallel vs Sequence Mode
    53:55 - Experimental Results & Limitations
    58:00 - Visualizations
    1:01:40 - Conclusion
    Paper: arxiv.org/abs/2305.13048

    • @xxdaggerxx5
      @xxdaggerxx5 1 year ago

      stop with these 1hr videos and summarize this shit

    • @akashkarnatak6581
      @akashkarnatak6581 1 year ago +3

      @@xxdaggerxx5 this is for people who want to understand the paper in depth. If you want a summary, read the abstract.

  • @TTTrouble
    @TTTrouble  1 year ago +112

    Jesus, keeping up with the literature in this field for those of you that actually work in it must be absolutely exhausting.

    • @arshzahed1970
      @arshzahed1970 1 year ago +20

      In the past year, my backlog of papers to go through has grown exponentially. Just staying up to date is a full time job now

    • @victoraranda3349
      @victoraranda3349 1 year ago +3

      Sometimes it do be like that

    • @mlopolis
      @mlopolis 1 year ago +11

      You should use LLMs to get the most important points from each paper and then you can stay on top 😊

    • @Will-kt5jk
      @Will-kt5jk 1 year ago +6

      @@mlopolis at a human-machine system level, that sounds like a self-improving augmentation. Maybe it helps explain the exponential increase in papers…😅

    • @raynhardtvanzyl4729
      @raynhardtvanzyl4729 1 year ago +1

      Yup...

  • @Fanney3
    @Fanney3 1 year ago +57

    Can't believe someone is just doing this work and sharing it. Amazing.

  • @MindFactoryAI
    @MindFactoryAI 1 year ago +61

    Always impressed how you record these in a single take. Great explanation, thanks!

  • @johnnypeck
    @johnnypeck 1 year ago +3

    This is awesome. Seen the use of RNNs percolating on Twitter for a bit. Glad you're covering it. That is a lot of authors.

  • @mgostIH
    @mgostIH 1 year ago +29

    Keep in mind that at 21:00, regarding memory usage of attention, current approaches like "FlashAttention" and "Attention doesn't need O(N^2) memory" have drastically reduced the memory needed for transformers to run, which is what allows approaches like ChatGPT to have such a long context.

    • @erickmacias5153
      @erickmacias5153 1 year ago

      But attention in GPT does use N^2 memory doesn't it?

    • @mgostIH
      @mgostIH 1 year ago +9

      @@erickmacias5153 in older public models like GPT-2, yes, but the papers I mentioned above provide implementations that are mathematically equivalent to the standard way of doing attention; you can use them as drop-in replacements and get improved performance during training and inference.
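      For anyone wondering what attention without O(N^2) memory can look like, here is a minimal sketch under simplifying assumptions (no causal mask, no online-softmax trick, so this is not the actual FlashAttention kernel): queries are processed in blocks, so the full N x N score matrix is never materialized. The function name and block size are illustrative choices.

      ```python
      import torch

      def blockwise_attention(q, k, v, block_size=128):
          """q, k, v: (N, d). Returns softmax(q @ k.T / sqrt(d)) @ v with O(block_size * N) extra memory."""
          n, d = q.shape
          out = torch.empty_like(v)
          scale = d ** -0.5
          for start in range(0, n, block_size):
              qb = q[start:start + block_size]      # (B, d) block of queries
              scores = (qb @ k.T) * scale           # (B, N): only one block of score rows exists at a time
              out[start:start + block_size] = torch.softmax(scores, dim=-1) @ v
          return out

      # Sanity check against dense attention:
      # q, k, v = torch.randn(3, 1024, 64).unbind(0)
      # dense = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
      # assert torch.allclose(blockwise_attention(q, k, v), dense, atol=1e-5)
      ```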

  • @adi-ee8zj
    @adi-ee8zj 1 year ago +2

    CNN is all you need?

  • @YvesQuemener
    @YvesQuemener 1 year ago +15

    Can't say thank you enough! Diving into RWKV has been on my todo list for two months at least and when I saw the YouTube alert I immediately felt relieved that instead of a full day of trying to understand the paper and the code, you would provide the important parts in one hour. And it delivered! I agree that it is kind of stretching the definition of attention to call what they are doing "linear attention". I am not sure that calling it a ConvNet is actually less stretchy btw :-) But anyway thanks a lot!

  • @howuhh8960
    @howuhh8960 1 year ago +5

    I really don't like the very strong statements in the paper, such as "surpasses the capabilities of any existing RNN". lol, ZERO comparisons with other new RNNs based on S4 for example...

    • @iOhadRubin
      @iOhadRubin 1 year ago

      There are no public open source S4 models of this size

  • @justfoundit
    @justfoundit 1 year ago +6

    If we revisit ideas, maybe we could try shared-weight transformers. It worked for CNNs. The model has a minor memory footprint, and it's easy to get the effective capacity of hundreds of billions of parameters by just repeating the same layer multiple times.
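    A toy sketch of the weight-sharing idea (roughly the ALBERT / Universal Transformer recipe; all sizes and the repeat count are arbitrary illustrative choices): one set of layer weights is reused at every depth, so the stored parameter count stays at a single layer.

    ```python
    import torch.nn as nn

    class SharedWeightEncoder(nn.Module):
        def __init__(self, d_model=512, n_heads=8, n_repeats=24):
            super().__init__()
            # a single Transformer layer whose weights are reused at every depth
            self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.n_repeats = n_repeats

        def forward(self, x):  # x: (batch, seq, d_model)
            for _ in range(self.n_repeats):
                x = self.layer(x)  # same parameters applied n_repeats times
            return x
    ```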

  • @hachembetrouni6731
    @hachembetrouni6731 1 year ago +3

    😅Yannic makes LSTMs sound like prehistory

  • @sheevys
    @sheevys 1 year ago +4

    Haha, that's a quick reaction, your "all you need" pun was defo not intended.

  • @Veptis
    @Veptis 5 months ago +1

    Google tried to train a 500B LSTM - so one of those "first" claims might be incorrect.

  • @killers31337
    @killers31337 1 year ago +2

    If recalling information in long contexts is the problem, perhaps throwing in a few transformer layers would solve that?
    E.g. something like language parsing can be done using just an RNN, as the information is largely local.
    E.g. if you have 20 layers in total, layers 1..10 would be RNNs, then layer 11 is a transformer, then 12..20 are RNNs again. Then the "quadratic" part is only 1/20th of the NN.
    Yes, it would route only 1/20th of the information a full transformer would, but if only a few important pieces of the context are necessary, that might be enough.
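    A rough sketch of that layout, assuming a GRU can stand in for the recurrent/RWKV-style blocks (layer count, sizes, and the position of the attention layer are all illustrative):

    ```python
    import torch.nn as nn

    class HybridStack(nn.Module):
        """Mostly recurrent layers with one full-attention layer in the middle,
        so the quadratic cost applies to only 1 of the 20 layers."""
        def __init__(self, d_model=512, n_layers=20, attn_layer_idx=10, n_heads=8):
            super().__init__()
            self.layers = nn.ModuleList()
            for i in range(n_layers):
                if i == attn_layer_idx:
                    self.layers.append(nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True))
                else:
                    self.layers.append(nn.GRU(d_model, d_model, batch_first=True))

        def forward(self, x):  # x: (batch, seq, d_model)
            for layer in self.layers:
                x = layer(x)[0] if isinstance(layer, nn.GRU) else layer(x)
            return x
    ```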

  • @dairin0d
    @dairin0d 1 year ago +4

    @YannicKilcher would be interesting to hear your take on hyperdimensional computing / vector symbolic architectures :-)
    It seems like a really cool idea, though I can't quite wrap my head around (or maybe I wasn't able to find a clear explanation of) how it's actually supposed to interface with non-symbolic inputs (e.g. images) or learn complex structured concepts from data.

  • @spoonikle
    @spoonikle 1 year ago +2

    We need to focus on more complex multi-step models. Humans take notes, humans ruminate, humans speak aloud.
    Multi-modality is key, and tools built into the model to compensate for shortfalls are key. Design the model with a calculator, let it train on the tool, design the model with outputs hooked into dozens of tools and reward correct tool use, force the influence of tools on the output, and train a model that no longer wastes time reinventing calculators.
    We trained a model to make paintings instead of making a model that calls Adobe APIs to paint - now that LLMs exist we have seen the light, we see the true power of AI… calculators are better left to the programmers.

  • @jondo7680
    @jondo7680 1 year ago +2

    Just to give feedback, that example with "I'm the word cat" was just great. It helped me make sure whether I understood you right or not.

  • @toddnedd2138
    @toddnedd2138 1 year ago +3

    Perform badly, this approach will, if speak like Yoda you do. ;) Thank you for the detailed explanation of the paper and the effort you put into this.

  • @andres_pq
    @andres_pq 1 year ago +1

    please make a video about RetNet :)

  • @strawberryfield891
    @strawberryfield891 1 year ago

    Thank you very much for the great video!!
    Are channels roughly equivalent to multi-heads in transformers?

  • @hansdietrich1496
    @hansdietrich1496 1 year ago +4

    The best in-depth AI channel out there, chapeau!

  • @NeoShameMan
    @NeoShameMan 1 year ago +2

    Based on my experiments, you won't need NNs for long; the distribution is the same as input and output, and fine-tuning is just a skew of that distribution towards the fine-tuning corpus. Better, there is a very high probability that it won't be a black box for long and we can extract optimal entropy encoding, no more weird sparsity. I'm just waiting for a new hard drive to test more.

  • @antonioILbig
    @antonioILbig 1 year ago +1

    Yannic, good guess! Scalability could be the real deal. Deep architectures have different "lego blocks" (transformers, LSTM, conv, residual, ...). When you build a big model, the meaning of its pieces is lost. What stays is computational efficiency, scalability and optimization behaviour.

  • @LostMekkaSoft
    @LostMekkaSoft 1 year ago +1

    44:10 "so thats what they mean by states, if they say... they dont mean the united states, im sorry, they mean these values."
    this is such a wonderful feynman moment ^^
    on a more serious note: i wonder if those approaches could be combined... like a standard transformer based model part that is really good short term and one like this that is really good long term that somehow complement each other?
    i think my best idea for that so far would be if you let the RWKV model "summarize" the long context and produce not a sequence of output tokens, but a sequence of internal representation values that act as a kind of compressed version of the complete context. then the transformer model could go to town with its superior capabilities but with a shorter context window and pick out which parts of the summary it wants to attend to.
    would that be feasible, or am i thinking out of my ass? :D

  • @kimchi_taco
    @kimchi_taco 1 year ago

    I'm not sure. It looks like a complicated MLP-Mixer.

  • @anglikai9517
    @anglikai9517 1 year ago

    Tested it today, too slow compared to Llama2 GGML; hope the GGML version of RWKV gets more user-friendly.

  • @yilei1051
    @yilei1051 1 year ago

    I lost interest halfway through the explanation... The most profound results are often simple and coherent architectures; this work requires so much explanation that it feels like it's just playing with scalability and performance, without revealing much raw science.

  • @oneman7094
    @oneman7094 1 year ago +2

    Can you do S4?

  • @CppExpedition
    @CppExpedition 1 year ago +6

    what's next? a transformer model of RNN subunits composed of a stack of transformer models designed as an RNN structure of transformer units aligned with a convolutional recurrent Boltzmann gate.

    • @guillaumevermeillesanchezm2427
      @guillaumevermeillesanchezm2427 1 year ago

      do it! do it!

    • @erickmarin6147
      @erickmarin6147 1 year ago +1

      Learning in a layer wise manner

    • @filoautomata
      @filoautomata 1 year ago +1

      Quantum Multi Modal Transformer Model LSTM using Ensemble of Neutrosophic Logic based Attention Model for an Interpretable 'Human Extinction Capable' Military Grade Artificial General Intelligence.

    • @erickmarin6147
      @erickmarin6147 1 year ago +1

      With active dendrite modeling for multi-task approaches

    • @vivienseguy
      @vivienseguy 1 year ago +1

      Yes, all learned end-to-end

  • @michael05242002
    @michael05242002 1 year ago +3

    🎯 Key Takeaways for quick navigation:
    00:14 🔄 RWKV is a highly scalable model architecture with properties of both Transformers and RNNs.
    01:21 ⚖️ In some settings, RWKV matches the performance of large Transformer models.
    03:29 📚 RWKV is a language-modeling architecture that predicts the next word or token in a text.
    05:15 🧠 RNNs only need a fixed amount of memory for inference, but each inference step can only consider the current memory and the previous token.
    10:17 📈 RWKV is the first non-Transformer architecture to scale to tens of billions of parameters.
    11:12 🧠 LSTMs are a kind of RNN that use gating to address the vanishing-gradient problem and provide long-term memory.
    14:44 🚪 LSTMs use gates, including a forget gate and an input gate, to control updates to the hidden state and the memory state.
    16:09 📊 LSTM updates involve several nonlinear computations, forcing sequential computation and preventing parallelization.
    18:37 🤔 The attention mechanism can dynamically assign attention weights to aggregate information, but it is computationally expensive.
    20:33 🔄 Attention-free Transformers try to redefine attention without token-to-token interactions in order to reduce memory requirements.
    23:43 ⚖️ RWKV uses a fixed attention pattern that applies to all data points, but it can be modulated by adding the key, giving the attention pattern some flexibility.
    24:37 🔄 RWKV's attention is modulated additively rather than multiplicatively; compared to the Transformer's multiplicative interactions, this effect is weaker.
    25:30 🔍 RWKV's fixed attention cannot take the meaning of the current token into account and only defines a fixed attention pattern; the original attention mechanism is more powerful and flexible by comparison.
    28:10 💡 RWKV proposes a new attention-like mechanism that modulates the attention pattern with a vector W, thereby taking past information into account.
    30:09 📝 RWKV builds a model with a repeated structure by applying the model across a sequence of tokens, so it can process sequential data.
    33:02 🔍 In RWKV, each block's computation keeps part of the information and passes it on to the next layer, similar to the state passing in an LSTM but done layer by layer.
    34:01 💡 RWKV's channel-mixing block uses linear layers, nonlinearities, and element-wise multiplication to mix the channels.
    36:25 📝 RWKV performs a time/token shift by adding the previous time step's input to the current one and linearly interpolating between them.
    38:14 🌟 RWKV's time-mixing block uses a Transformer-like computation involving linear layers and weighted sums.
    42:40 ✨ Through an unbounded weighted sum, RWKV can aggregate over the entire past rather than being limited by a fixed-size attention matrix.
    44:05 📝 RWKV uses linear interpolation and weighted sums to implement the time/token shift.
    45:10 🌟 RWKV's hidden state is computed by linear functions with linear interpolation and no intervening nonlinearity, so training can be parallelized.
    46:03 ✨ The token-shift operation lets each element access the element before it, giving a receptive field that grows with depth.
    47:36 💡 RWKV implements a linear aggregation that takes a weighted sum over past values, allowing it to look back at the past efficiently.
    51:49 🚀 Compared with Transformers and LSTMs, RWKV sits in the middle in its ability to look back at the past and to perform complex computation, but its expressiveness can be increased by stacking layers.
    54:54 ⚡ Compared with Transformers and LSTMs, RWKV's advantages in handling complex computation and longer contexts are less pronounced.
    55:31 ✨ Increasing the context length lowers the language-modeling loss.
    56:27 📉 The linear attention mechanism may limit the model's performance on tasks with long contexts.
    56:53 🏗️ RWKV depends more on carefully designed prompts than standard Transformer models do; the related questions need further exploration and confirmation.
    58:15 🌍 RWKV considers past information per channel and per layer; higher layers tend to consider information over longer time spans.
    Made with HARPA AI

  • @edhofiko3168
    @edhofiko3168 1 year ago +1

    I unironically love this paper even though it absolutely lacks theoretical analysis. I've been following RWKV since before they wrote the paper. I would really love it if PyTorch implemented a discounted cumulative sum, since this is exactly what RWKV attention uses and it is what people in RL also use.

    • @alexeykrylov9995
      @alexeykrylov9995 1 year ago +1

      I agree that it'd be good to have it as a primitive. But as long as it's unavailable, it can be implemented in O(N log N) time (instead of O(N) if it was a primitive) by decomposing it into a convolution of several dilated exponential kernels (I mean, for example: 1st conv: dilation 1, kernel size 4, geometric progression factor k; 2nd: dilation 4, size 4, factor k^4; 3rd: dilation 16, size 4, factor k^16; etc.). It worked well in practice (I did this trick for my colleague's project once).
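      For reference, a minimal sketch of the primitive being discussed, written as the plain O(T) recurrence (the dilated-kernel decomposition above is an optimization of this same operation; the function name is made up):

      ```python
      import torch

      def discounted_cumsum(x, k):
          """x: (T,) tensor, k: scalar discount in [0, 1).
          Returns y with y[t] = sum_{i <= t} k**(t - i) * x[i], i.e. y[t] = x[t] + k * y[t-1]."""
          y = torch.empty_like(x)
          running = x.new_zeros(())
          for t in range(x.shape[0]):
              running = x[t] + k * running
              y[t] = running
          return y

      # discounted_cumsum(torch.tensor([1., 0., 0., 1.]), 0.5)
      # -> tensor([1.0000, 0.5000, 0.2500, 1.1250])
      ```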

  • @thntk
    @thntk 1 year ago

    Didn't Schmidhuber do this already in the 90s?

  • @itayatelis2898
    @itayatelis2898 1 year ago +1

    Amazing! Thank you for doing this! You're amazing! I hope you keep doing it weekly.

  • @erickmarin6147
    @erickmarin6147 1 year ago +1

    Balding king you dropped this 👑

  • @alivecoding4995
    @alivecoding4995 1 year ago

    Two months later. Have you seen adoption of these ideas?

  • @OperationDarkside
    @OperationDarkside 1 year ago +1

    53:50 for 5s summary of the paper

  • @agsystems8220
    @agsystems8220 1 year ago +2

    So it specifically chooses the representation of internal state to be already decomposed into its eigenvectors with respect to time decay, meaning that we can infer relevance forward with a simple fixed matrix? That is pretty cool. I guess you could do something similar with any transformer where you have some natural definition of distance that can be precomputed. For the initial layers at least they seem to be very interested in nearby features (both in language and images), so this definitely seems a natural specialisation/optimisation. If it is going to be doing something like this anyway we might as well give it an architecture that does it well. Later layers don't seem to care about those features though, so this technique would cease to be valuable pretty fast I think. For more abstract inferences the order in which the pieces of information are fed in is not relevant, so the exponential term would tend to one and the whole system would collapse to fully attention-free. You cannot build something able to make abstract inferences with a compact representation using this architecture. A nice piece of work, but a local optimisation rather than an improvement IMO.

    • @jackhe4336
      @jackhe4336 1 year ago

      IMO, it's hard to extract a compact representation of the data without hurting expressiveness and generality of the representation. Could you recommend some papers that address this issue?

    • @schwajj
      @schwajj 1 year ago

      Great comment. A question: if you’re correct that this would work OK on early layers, but less well on later layers which deal with more abstract concepts, could you use a hybrid where some layers use RWKV and later layers use classic attention? I suppose that asymptotically it would still use O(n**2) space; this would only improve things by a constant factor (e.g. if only half of the layers use classic attention, the memory savings will asymptotically approach 50%).
      Do you see any value in such an approach?

  • @fo.c.horton
    @fo.c.horton 1 year ago

    please do like 10% more work making the annotations neater

  • @mattanimation
    @mattanimation 1 year ago +2

    was waiting for this one, thanks!

  • @rolfengstrand9838
    @rolfengstrand9838 1 year ago +1

    THANK YOU for pointing out that the use of the word "attention" in the context of transformers has strayed far away from the meaning of "attention" in other contexts. We have to accept that this is happening, of course. But it is important that introductory material explains clearly what "attention" is intended to mean here. It would be wrong to assume that a newbie reader has the same concept associated with the word "attention".

  • @andres_pq
    @andres_pq 1 year ago +2

    Great to see you do paper explanations again!

  • @sortysciaofiscia
    @sortysciaofiscia 1 year ago +1

    I have a question at the halfway mark of the video:
    if the importance of attention to tokens linearly decreases based on how far back it is, does that mean that by the end of the answer, it will forget what it started talking about? What stops this approach from repeating itself?
    I'm trying to wrap my head around: "The brown fox jumped over a lazy old dog, and then ...." in this example the next word will be computed based on the dog reference MORE than the fox one?
    I'd assume the transformers look at every other token in this sentence, and compute. Whereas from your explanation I gather that importance drops off the further back the token is. right?
    sorry, I'm new to this.

    • @zhenyuanzhang
      @zhenyuanzhang 1 year ago +1

      Not really. Almost half of the channels in the middle-to-high layers do not decay at all (after training). The important information stored there could last forever, in theory. As long as the model is aware that this piece of information is important, it won't forget it easily.

  • @alles_moegliche73
    @alles_moegliche73 1 year ago +1

    Can you also take a look at the Meta Megabyte Paper?

  • @debanjandas7738
    @debanjandas7738 1 year ago

    In the AFT attention equation, the weight associated with token i for input token t is given by w(t,i)+k(i) => how do we add a scalar to a vector? Wouldn't it have been more appropriate to do w(t,i)*k(i)?
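    For context, here is the AFT-full update as I read it from the "An Attention Free Transformer" paper (a reading of their formula, not RWKV's): the position bias w_{t,i} is a learned scalar per pair of positions, while K_i and V_i are d-dimensional, so w(t,i)+k(i) adds the same scalar to every channel of K_i by broadcasting rather than mixing types.

    ```latex
    % AFT-full (causal/decoder variants restrict the sums to i <= t):
    % w_{t,i} is a scalar position bias, K_i, V_i \in \mathbb{R}^d, \odot is element-wise.
    Y_t = \sigma(Q_t) \odot
          \frac{\sum_{i=1}^{T} \exp\!\left(K_i + w_{t,i}\right) \odot V_i}
               {\sum_{i=1}^{T} \exp\!\left(K_i + w_{t,i}\right)}
    ```

    One consequence of keeping the bias additive inside the exponential is that exp(K_i + w_{t,i}) factors into exp(w_{t,i}) * exp(K_i), which is the property RWKV later exploits when it replaces the per-pair bias with a simple time decay.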

  • @danylaley
    @danylaley 1 year ago

    This is just a fancy convolutional lstm

  • @사이보그-i6p
    @사이보그-i6p 1 year ago

    17:35 (EDIT: and 20:50) That is just a matrix multiplication, therefore an inner product instead of an outer product, right?

  • @noagarnett
    @noagarnett 1 year ago

    Thanks Yannic for (another) great video! Really amazing that you do all this work and share it. Worth a lot for me and the likes of me. Also the paper discussed is very impressive.
    I might be wrong, but I think there is a confusion in the explanation. On 24:53, 30:21, you claim that the k_i modulation is defined by the current token ("if I am "cat" I should probably look 3 tokens behind"), but if I understand correctly, it is defined by the referred token, and is the same for all following positions ("if I am "cat" I should probably have a big influence on all the tokens following me").
    Did I mix it up?
    Thanks!

  • @ChaseFreedomMusician
    @ChaseFreedomMusician 1 year ago

    THANK GOD! Somebody is finally talking about RWKV!

  • @djfl58mdlwqlf
    @djfl58mdlwqlf 1 year ago

    hi, I am not convinced that the absence of non-linearity helps parallelization (45:00).
    The paper asserts that this is possible along two different dimensions (batch, time).
    Can anyone give me a brief explanation of this?

  • @smnt
    @smnt 1 year ago

    Hey Yannic, quick question
    What do you mean when you say RNNs don't scale well, or that you might "just need models that scale"? What does it mean for a model to scale, in your view? I've definitely seen people stack RNNs and it seemingly works just fine. I thought the issue with RNNs was that they lose context pretty quickly even though their context length is "infinite".
    Thanks for the video as always, love it!

  • @haraldtopfer5732
    @haraldtopfer5732 1 year ago

    53:41 my model has a linear scaling where everything else goes *Brrrrrrruummm* .... story of my life

  • @schwajj
    @schwajj 1 year ago

    Halfway through, but I have a question. Transformers have been applied outside the domain of language modeling (or even more generally, outside of sequence modeling), e.g. Vision Transformers. In building our intuition, Yannic talks in terms of how much RWKV pays attention to the past for each internal feature learned by the model. Does this imply that RWKV is more specialized to sequence modeling than classic Transformers? i.e. would RWKV *not* work well if you try to apply it to image-based input? Or is this an open question? Is there reason to lean one way or the other?
    (probably most people who would answer this already saw the video a month ago, but fingers crossed for an answer)

    • @clehaxze
      @clehaxze 1 year ago

      No answer, but RWKV-4-neo supports image input by basically slapping a CLIP as input into one of its layers. This way it can use the representations as an understanding during a conversation.

    • @giuliavirgili1660
      @giuliavirgili1660 5 months ago

      LineRWKV

  • @cassandrasinclair8722
    @cassandrasinclair8722 1 year ago

    transformers too are convnets ;) they do convolution over a graph :D Attention is just one instance of graph convolution.

  • @deeplerg7913
    @deeplerg7913 1 year ago

    I can't understand anything here but I'm sure it's something very interesting :P

  • @addoul99
    @addoul99 1 year ago

    Hi, are the weights for the linear layer Wv tied between a pair of channel- and time-mixing blocks?

  • @gunale925
    @gunale925 9 months ago

    I still don't get how the time-mixing can be trained in parallel. It must depend on the previous state.

    • @summer_tree3821
      @summer_tree3821 8 months ago

      Me too. Do you know it now? 😘

    • @gunale925
      @gunale925 8 months ago

      @@summer_tree3821 yep. During training, the previous state can be computed directly from the actual data.
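      A minimal sketch of that point, under simplifying assumptions (constant per-channel decay, no separate bonus term for the current token, so this is not the paper's actual CUDA kernel): during training, every k and v is known up front, so the decayed weighted average for all positions can be formed directly from the inputs with no loop over a hidden state. The price is O(T^2) intermediate memory, whereas the recurrent form used at inference folds the same math into a running numerator and denominator.

      ```python
      import torch

      def wkv_training_mode(k, v, w):
          """k, v: (T, C); w: (C,) positive per-channel decay.
          out[t, c] = sum_{i<=t} exp(-(t-i)*w[c] + k[i, c]) * v[i, c]
                    / sum_{i<=t} exp(-(t-i)*w[c] + k[i, c])"""
          T, C = k.shape
          t = torch.arange(T)
          lag = (t[:, None] - t[None, :]).float()                # lag[t, i] = t - i
          causal = (lag >= 0).float()                            # only positions i <= t contribute
          logw = -lag.clamp(min=0)[:, :, None] * w[None, None, :] + k[None, :, :]  # (T, T, C)
          weights = torch.exp(logw) * causal[:, :, None]
          num = (weights * v[None, :, :]).sum(dim=1)
          den = weights.sum(dim=1)
          return num / den                                       # (T, C)
      ```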

  • @bertobertoberto242
    @bertobertoberto242 1 year ago

    the convnet explanation reminds me a lot of WaveNet from DeepMind...

  • @Will-kt5jk
    @Will-kt5jk 1 year ago

    1:02:05 - The Matrix looks different to what I remember from the movie

  • @serta5727
    @serta5727 1 year ago +1

    General idea for transformers: evolutionary attention heads come to mind. Instead of training multiple attention heads in a transformer, how about just having one that branches off, and the best evolved version gets merged back into the original? That way there is only one attention head at inference time, to save compute.

    • @AntoshaPushkin
      @AntoshaPushkin 1 year ago +1

      Assume your task is to add numbers like
      123456 + 987654 = ?
      You will need at least 2 attention heads to attend to two numbers.
      Not saying that you should use transformers to add up numbers, but it's just a random example of a situation where it's clear that you need multiple attention heads

    • @schwajj
      @schwajj 1 year ago +2

      @@AntoshaPushkinThat doesn’t sound right to me: you’re essentially saying that a separate attention head would self-assign to each number. It’s not completely implausible, but I’d like to see some rigorous analysis that indicates that transformers have been observed to operate in that manner. Are you aware of such research? I’d be grateful for any pointers you could provide.

  • @halocemagnum8351
    @halocemagnum8351 1 year ago

    Amazing explanation! Great video. I had been reading all of the RWKV posts on the r/MachineLearning subreddit but I don't think I fully grasped it till this review.

  • @panofilossas6564
    @panofilossas6564 1 year ago

    Looks like a good candidate for running on low-spec hardware

  • @almoni127
    @almoni127 1 year ago

    Why do people still claim that transformers require memory that is quadratic in the sequence length when it was shown to be avoidable? (See the work on flash attention for example)
    It is still true, however, that it requires quadratic time.

    • @schwajj
      @schwajj 1 year ago

      Flash attention is still quadratic in the sequence length (more precisely, context length). It just massively improves the constant factor via more efficient use of the GPU memory hierarchy.

  • @HD-Grand-Scheme-Unfolds
    @HD-Grand-Scheme-Unfolds 1 year ago +2

    Just a very loose and wild thought that came to mind: what if "transformer architectures" could be employed as an imitation of "System 2" (more logical, critical and decisive), while "RWKV" is used for "System 1" (a bit fuzzy in accuracy, but capturing the essence of the lifelong experience of the AI agent, hence a derived ability to exercise intuition or instinct-like thinking and responses to situations)? Both could be combined in a pseudo-cognitive-architecture approach to tackle the AGI challenge. Wouldn't that be something to see 😄

  • @danplt
    @danplt 1 year ago +5

    so many authors with many different institutions

    • @marshallmcluhan33
      @marshallmcluhan33 1 year ago

      Neko institute of Science and The waifu research department are still on the top of the charts on hugging face. I'm not sure these archaic institutions are as mobile so they have to team up to stay relevant.

    • @TheThunderSpirit
      @TheThunderSpirit 1 year ago

      need more institutions. can u give me?

  • @edwardfanboy
    @edwardfanboy 1 year ago

    It looks like it would be possible to parallelize the WKV step across time using something akin to a parallel prefix sum.
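    That intuition can be made concrete. Ignoring RWKV's numerical-stability tricks, the WKV state update has the form s_t = d * s_{t-1} + u_t, and such affine steps compose associatively, which is exactly what a parallel prefix scan needs. A small sketch of the combine operator (the scan itself is written as a plain loop here; a parallel implementation could hand this operator to something like jax.lax.associative_scan):

    ```python
    def combine(step_a, step_b):
        """Each step is (decay, update), representing the map s -> decay * s + update.
        Applying step_a then step_b is again a map of the same form,
        and this composition is associative, so a prefix scan applies."""
        da, ua = step_a
        db, ub = step_b
        return (da * db, ua * db + ub)

    def inclusive_scan(steps):
        """Sequential reference scan; a tree-structured parallel scan
        computes the same prefixes in O(log T) depth."""
        out = [steps[0]]
        for s in steps[1:]:
            out.append(combine(out[-1], s))
        return out

    # Decay 0.5 per step, updates 1, 2, 3, starting from s_0 = 0:
    # the "update" half of each prefix equals s_t -> 1.0, 2.5, 4.25
    print([u for _, u in inclusive_scan([(0.5, 1.0), (0.5, 2.0), (0.5, 3.0)])])
    ```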

  • @alyzst
    @alyzst 1 year ago

    How does it compare with Hyena?

  • @clementdato6328
    @clementdato6328 1 year ago

    Does it explain how the error signal is passed through time, or is it implicitly assumed that BPTT is used?

  • @sebastianp4023
    @sebastianp4023 5 months ago

    53:53

  • @hanskraut2018
    @hanskraut2018 1 year ago

    It should be trained on equations and text embedded in irrelevant numbers and text, where the equation/text from far back is needed to calculate the final number/text result. That way a neural net would automatically learn to selectively pay attention based on the output.

    • @schwajj
      @schwajj 1 year ago

      That’s the whole trade-off here. The classic 2017 transformer model does what you say (to a certain extent). The model being discussed here is worse at the sort of task you’re proposing, but has the benefit of not using O(n**2) space.

  • @banseoklee392
    @banseoklee392 1 year ago

    Awesome!! Thank you sooooooooo much

  • @lancemarchetti8673
    @lancemarchetti8673 1 year ago

    Really excited about this! Love from Su∩∩y South Africa

  • @dreamphoenix
    @dreamphoenix 1 year ago

    Thank you.

  • @simonstrandgaard5503
    @simonstrandgaard5503 1 year ago

    Great explanation

  • @Sciencehub-oq5go
    @Sciencehub-oq5go 1 year ago

    Thankful for your work!

  • @corgirun7892
    @corgirun7892 1 year ago

    Amazing!

  • @akashkarnatak3014
    @akashkarnatak3014 1 year ago +3

    If you had uploaded this video 3 days ago, it would have helped me with my assignment as well. Anyway, great video.

  • @triplea657aaa
    @triplea657aaa 1 year ago

    I think RWKV in combination with a transformer model to generate the prompts could be really powerful

  • @chrisBruner
    @chrisBruner 1 year ago

    So a couple of thoughts. 1. For the intelligent prompt generation, you could just use a small transformer dedicated to that task. 2. Because of its parallel nature, you could have one of these things working on a bunch of Raspberry Pis, or.... a world-wide network of computers sharing the task. That would more than make up for the limitations compared to transformers. 3. It seems to me that these guys get fuzzy in recall of "minutiae", but there is no reason you can't have these hooked together so the recall can occur by asking another set. Just some thoughts.

  • @nyyotam4057
    @nyyotam4057 1 year ago

    Was like "Great, so now they'll implement it and my Alpaca will stop consuming so much memory". But then I got to the "tradeoff with computation" part 🙂.

  • @anatalelectronics4096
    @anatalelectronics4096 1 year ago

    ok, one more attempt without pasting the link to the paper: Apple's AFT has been around since 2015, before the 2017 AIAYN paper, remarkable. It seems I can't paste the link, hence no more info I can give. Look for "An Attention Free Transformer" to get to the paper.

    • @schwajj
      @schwajj 1 year ago

      No it’s not. It’s from 2021. Its citations include many papers from 2021, so obviously it wasn’t written in 2015.

  • @7200darkcharm
    @7200darkcharm 1 year ago

    This abstract is summarizing a research paper that presents a new model architecture for natural language processing (NLP) tasks called Receptance Weighted Key Value (RWKV).
    Here's a breakdown of the abstract:
    Problem with Transformers: Transformers are a type of model that have been very successful in NLP tasks. However, they have a major drawback: their memory and computational needs increase quadratically with the length of the sequences they process. This means that as the input data (like a sentence or document) gets longer, the resources needed to process it grow very quickly, which can make them impractical for very large datasets or very long sequences.
    Problem with Recurrent Neural Networks (RNNs): RNNs, another type of model, have memory and computational needs that grow linearly with sequence length, which is more efficient than Transformers. However, they tend to perform worse on NLP tasks because they are harder to train in parallel (meaning, it's harder to split the work of training them across multiple machines or processors), and they don't scale as well (meaning, their performance doesn't improve as much when you add more data or make them bigger).
    The Proposed Solution - Receptance Weighted Key Value (RWKV): The authors propose a new model, the RWKV, that aims to combine the best of both worlds. It can be trained in parallel like a Transformer, which makes it efficient to train, and it has linear memory and computational complexity like an RNN, which makes it efficient to use once trained. This is achieved by using a linear attention mechanism, which is a method for deciding which parts of the input data the model should pay most attention to.
    Results: The authors scaled the RWKV model to tens of billions of parameters (which is a measure of the model's size and complexity) and found that it performs similarly to a Transformer of the same size. This suggests that it could be a useful alternative to Transformers for large-scale NLP tasks.
    Conclusion: This work represents a significant step towards reconciling the trade-off between computational efficiency (how much computing resources a model needs) and model performance (how well the model does its job) in sequence processing tasks. The hope is that future work can build on this to create even more efficient models.
    So in essence, the abstract is saying, "We've developed a new model that combines the best parts of two existing types of models. Our new model can handle large amounts of data and perform as well as the best current models, while using less computational resources. This is a big step forward for the field."

  • @arzigogolato1
    @arzigogolato1 1 year ago

    Why, why didn't they think of a better name? RWKV is really bad marketing...

  • @Sciencehub-oq5go
    @Sciencehub-oq5go 1 year ago

    The paper isn't very well written, and it is too short / confusing in parts.
    What exactly do they mean by "channels"? The components of the embedding vectors?

  • @novelspace
    @novelspace 1 year ago

    Galaxy 🧠 stuff

  • @qwerty123443wifi
    @qwerty123443wifi 1 year ago +2

    Does being an author on an ML paper mean anything anymore? There are so many authors on some of these papers that it seems a bit ridiculous

    • @wujacob4642
      @wujacob4642 1 year ago +2

      That's because the paper was written in an open-source way. The main author, Bo Peng, explained that in his blog.

  • @xxdaggerxx5
    @xxdaggerxx5 1 year ago

    i can't watch a 1hr video man, summarize this shit

  • @girrajjangid4681
    @girrajjangid4681 1 year ago

    Which mic are you using for the video? It's amazing. @YannicKilcher

  • @klammer75
    @klammer75 1 year ago

    Amazing amazing amazing! I’ve been delving sooo much into the code side of implementation I forgot how much I love the maths side of the architecture and this walkthrough so expertly done by Yannic has lit my maths brain on fire once again! I can’t thank you enough for that, was a thrilling explanation and you are by far my favorite technical AI explainer out there! You sir are an asset to humanity and I for one tip my hat to you! And to think that there’s billions if not trillions of these weights/equations/parameters or whatever you want to call them in these models which give rise to the results we see is truly mind boggling….I feel like I just took an address watching that🤪😂🥳🦾🤓🤫