OUTLINE:
0:00 - Intro
1:35 - Paper Overview
7:35 - Offline Reinforcement Learning as Sequence Modelling
12:00 - Input Embedding Alignment & other additions
16:50 - Main experimental results
20:45 - Analysis of the attention patterns across models
32:25 - More experimental results (scaling properties, ablations, etc.)
37:30 - Final thoughts
Paper: arxiv.org/abs/2201.12122
Code: github.com/machelreid/can-wikipedia-help-offline-rl
My Video on Decision Transformer: ruclips.net/video/-buULmf7dec/видео.html
Thanks for bringing this format back. I much prefer listening to Yannic explain-rambling a paper than him preparing me for an interview with the authors. Please keep being critical/opinionated, even if in the back of your mind you know you'll "face" the authors later on.
Really like this format of interviewing the creators of the paper after review.
It's almost like the peer review process, but better
It reminds me of when we discussed the Perceiver paper, which could handle multiple modalities (including RL) but treated them as separate tasks. Maybe the approach we should follow is to throw literally every problem we can think of at transformers and let them learn information across multiple domains.
Good idea, need a lab to do that.
I really like having the medium or long paper overview followed by an interview with the author. Whether it's split into two videos or not doesn't matter too much to me--might as well do whichever is better for making the algorithm happy. ;)
I love that with this new format, you can explore a lot of fringe papers and test the boundaries more, but in a constructive way!
Yannic knows we love to binge his videos over the weekend!
Perfect format, one of a kind on the internet at large. 👌
Since these reviews are so useful, I'd suggest releasing them as soon as they're ready rather than waiting for the authors to watch them first and arranging an interview with them.
The decay on lambda_1 seems reasonable, as the goal is to create linear projections of state, reward, and action and match them to the embedding space of the transformer. So the decay is used here to make sure that the updates to the input projections don't run for the full training loop, ensuring that similar inputs end up with similar input embeddings.
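A minimal sketch of how such a decayed alignment term could look (the cosine-similarity objective, the linear schedule, and the function names here are assumptions for illustration, not the authors' exact setup):

```python
import torch.nn.functional as F

def alignment_weight(step, total_steps, lam_init=1.0):
    # Hypothetical linear decay of lambda_1: strong early on, so the fresh
    # state/return/action projections get pulled toward the pretrained
    # embedding space, then fading out so the main objective takes over.
    return lam_init * max(0.0, 1.0 - step / total_steps)

def alignment_loss(input_embs, lm_token_embs):
    # Hypothetical auxiliary term: push each projected input token (N, d)
    # toward its nearest pretrained language-token embedding (V, d),
    # measured by cosine similarity.
    sim = F.normalize(input_embs, dim=-1) @ F.normalize(lm_token_embs, dim=-1).T
    return (1.0 - sim.max(dim=-1).values).mean()

# Usage sketch:
# total = task_loss + alignment_weight(step, total_steps) * alignment_loss(x, E)
```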
The benefit appears to reside in the initialization values. Pretraining creates a smoother manifold which can then be morphed smoothly to adapt to a new task.
But then why would Image-GPT and CLIP perform worse than GPT?
Shorter videos without authors -- 😍
Yellow pre-comments and green comments -- 🔥
Thanks!
Since they tried it on CLIP and that didn't work so well, I'd love to see how CM3 would do in this regard, since it combines structured seq2seq modelling of website language and images.
Off the bat, this idea sounds fun!
This works because language modelling is sequence modelling. Use a video transformer and it will work well.
Okay, I don't know if you guys know but the main author is only 17 years old and this paper won a best paper award at EMNLP.
I don't think this is correct. EMNLP 2022 hasn't happened yet (it's due in December 2022), and the 2021 winning papers are here: 2021.emnlp.org/blog/2021-10-29-best-paper-awards
Machel Reid and Yutaro Yamada have a nontrivial number of papers from 2020 and are mature enough to not require "age" as a differentiating factor.
This is interesting. According to the attention analysis in Figure 2, the action basically only attends to previous states. So what if we just throw away all previous actions and rewards and keep only the previous states? 😉 23:19
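To make that ablation concrete, a hypothetical sketch (not from the paper) of feeding the model a states-only context, assuming all inputs are already projected to the model's embedding size:

```python
import torch

def build_context(returns_to_go, states, actions, states_only=False):
    # All inputs: (T, d) tensors already projected to embedding size d.
    if states_only:
        # Commenter's suggested ablation: context is just the past states.
        return states                                                # (T, d)
    # Decision-Transformer-style interleaving: r_1, s_1, a_1, r_2, s_2, a_2, ...
    return torch.stack([returns_to_go, states, actions], dim=1).flatten(0, 1)  # (3T, d)
```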
If I remember correctly, there is a paper suggesting that a frozen transformer pre-trained on text also works for image classification?
Yes, which is why it is a bit surprising that they found it didn't work when they froze the transformer, and also that there seems to be limited transfer from iGPT.
Wouldn't that mean that language is a function that models (approximates) reality itself?
Is it possible that the Wikipedia anchors or page tags somehow help supervise the language model, and vice versa? The formulaic, encoded approach to wiki publication may be leading the model in some latent ways.
Yeah this is one of the reasons for the Hutter Prize specifying Wikipedia clear back in 2006. We have been arguing ever since then with people who don't believe that language models are all that relevant to modeling the physical world. But this is a consequence of using algorithmic information (Kolmogorov Complexity) approximation for unsupervised model selection. If any of the big boys were serious they would back the Hutter Prize with orders of magnitude more money.
I grew up with Wikipedia, so I am interested in what the paper will show.
Can you use machine intuition to select a model from a set based on the task, so as to converge cheaply to a solution?
The question that comes to my mind is: what if you pretrain with one of the language models that are primarily trained to fill in words rather than trained autoregressively? Like BERT, I think?
Assuming that would even make any sense, which I don’t know.
Like, can you try to use BERT autoregressively even though it wasn’t trained for it and get something which isn’t completely garbage?
Like, if you just mask out all the future tokens, even though BERT expects to have only one or a few tokens masked out?
What I’m saying here might be confused.
It's not stupid. Many people have looked into using BERT for text generation, either just decoding autoregressively, or actually training it like that, but results are not very good.
@YannicKilcher Thanks!
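For anyone curious, a rough sketch of the idea in this thread: decode BERT left-to-right by repeatedly appending a [MASK] and letting the model fill it greedily (the prompt, length, and crude wordpiece handling are arbitrary; this is not what BERT was trained for, which is why results tend to be poor):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "the robot picked up the"
for _ in range(10):
    # Append a [MASK] after the current text and ask BERT to fill it.
    ids = tok(text + " " + tok.mask_token, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    mask_pos = (ids == tok.mask_token_id).nonzero()[0, 1]
    next_id = logits[0, mask_pos].argmax().item()
    # Crude: wordpieces are just appended with a space.
    text += " " + tok.decode([next_id])
print(text)
```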
If their goal is to assess whether language model pretraining is better than image pretraining, then they should be using the same architecture for both! Comparing GPT-2 to iGPT is useless. The idea is cool, but this paper is a letdown.
These types of papers require so much compute that only places like Google can foot the bill. You'd think they'd spend a little more time ironing out their argument before they crank up all those TPUs...
OK, now I can better imagine that the transformer can treat language similarly to playing games.
AI gets plus 50 buff 💪 points because pretraining with Wikipedia was very effective 🥇
This sounds like one of those ideas born drunk in a pub.
Criticisms are easy to come by, but these results are absolutely unreliable. These guys just thought "eh, what do we have to lose" and made a paper out of it. It's like training 2D CNNs pretrained on ImageNet for audio classification: it works, but it's very unreliable. Hard pass on this paper.
#Blackhistorymonth
The title is clickbait. I imagined ML learning a world model from Wikipedia?
Title should be "Can LM Pretraining Help Offline Reinforcement Learning?"