Let me know if you guys want a dive into the methodologies of TTC, there's a lot of new papers/implementations coming out every day lol (entropix is a cool one)
Check out NVIDIA's suite of Training and Certification here:
[NVIDIA Certification] nvda.ws/3XxkFyj
[AI Learning Essential] nvda.ws/4gvD474
[Gen AI/LLM Learning Path] nvda.ws/4enwYE7
You can use the code “BYCLOUD” at checkout for 10% off!
Entropix video would be appreciated. Keep up the great work!
Please cover entropix
thanks for the video. i would like to see more on methodologies of TTC.
Yes, please.
OpenAI went from extremely secretive closed-source for profit to even more secretive closed-source for profit. Truly revolutionary change.
Your channel is like twitter but only the good part, I love it
so a compilation of sources from other websites ?
he compiles signal from twitter
there is no such thing as "the good part of twitter"
One of the chains of thought felt like doing an A* search over all possible answers
you might be thinking of "tree of thought"
@@TheRyulord In soviet Russia, tree of thought thinks of you. It's quite considerate.
Replace the heuristic function with a value function from reinforcement learning and you get Q*
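A minimal sketch of what that swap would look like: best-first search over partial reasoning chains, with a learned value function standing in for A*'s heuristic. `expand` and `value_fn` are hypothetical stand-ins (not any known API), and this is only an illustration of the idea, not how Q* (if it exists) actually works.

```python
import heapq

def best_first_reasoning(prompt, expand, value_fn, max_steps=64):
    # Max-heap via negated scores; each entry is (-score, chain of steps so far).
    frontier = [(-value_fn([prompt]), [prompt])]
    for _ in range(max_steps):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        if chain[-1].startswith("FINAL:"):        # arbitrary stop marker
            return chain
        for step in expand(chain):                # e.g. k sampled continuations from an LLM
            new_chain = chain + [step]
            heapq.heappush(frontier, (-value_fn(new_chain), new_chain))
    return None                                   # budget exhausted without a final answer
```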
I don't understand why you're so insistent that using RL to learn reasoning can't cause new knowledge to be gained. You're implicitly assuming that if the model knows A and that A implies B then the model must already know B. But that's not true. The model knows the rules of chess, and these rules imply whatever the optimal strategy is, but it definitely doesn't know this optimal strategy. It may come to learn it (or of approximations of it) through RL, though, as alpha zero and similar did.
Yeah, and deductive reasoning isn't the only form of reasoning. If anything, abductive and inductive reasoning are used a lot more than deduction in human cognition. So even without CoT, search methods are incredibly useful here and were key ideas of Cyc and Watson.
will be interesting to see how this plays out, i'm still split about this tbh...
i think the difference between chess and reasoning is the 'why'. in chess, there is no 'correct theory'; over time the model will either get better, or it won't. it doesn't matter how the bot ranks the importance of pieces or aims to control the board, we just care about the result: winning. when we evaluate a model based on the outcome, it may very well 'reason', but it does so in such a weird and wonderful way that we just can't relate to it.
but things break down when, instead of evaluating a model based on the outcome, we evaluate it based on the process. in this case, the process is the steps of reasoning taken to get from a to b. the very reason nns are so powerful, the fact that they 'think' in completely different ways to us, is exactly what makes it difficult for them to conform to a very specific set of human-prescribed ways of thinking. it forces a narrow range of 'correct' ways to think onto a bot that prefers to find its own optimal way. it can't learn its own reasoning, because our evaluation will penalise it every time it tries to be creative.
so, this leaves two possibilities. it either:
1. learns to conform to our definition of reason
2. it can't, and just does its own thing
i think the problem is as follows (take this with a grain of salt, i'm not an expert):
when the models are trained, they mostly learn how to learn however they want, there isn't a prescribed way of thinking forced onto them. this results in them thinking in weird and wonderful ways that likely have no congruence with what we consider 'correct logic'.
so, come time to finetune reasoning into them, or to get them to start doing CoT, they may have learnt how to imitate correct reasoning steps, but deep down they are still doing what they always did, in the weird and wonderful way they always did.
this training paradigm is unlikely to truly embed the 'correct reasoning process' into models, as by their nature they create their own way to reason. either we need more synthetic data to encourage correct reasoning across all training data, or a new hybrid approach is needed that blends the best of everything we've got and can accurately instruct the model to make correct logical assumptions
Glad to see the original editing approach back.
Yeah, this is like 2x slower so I can actually watch it; his videos were getting faster and faster to the point where it was just dopamine noise
9:53 rare anger bycloud moment
😂
@@literailly its fun lmao
Fun fact: I have spent 3-4 days trying to fix a single SQLite bug while I was debugging with AI
cute pfp. very pettable boyo
I agree with arc
that's why you must learn to read errors
@@kowaihana Or know how to program in sql
@@oguzhan.yilmaz i know some people who know how to code but not know how to read basic errors
so basically they found out that giving the layman a bit more time to solve an easier problem can be more cost-effective than giving the smart guy a menial task, and it is also worth giving the smart guy more time to train to more effectively solve harder problems...
haven't we already known this for hundreds if not thousands of years?
You're right, we have. That's why you're out there training and becoming the best you could ever be, instead of writing things we've already known for thousands of years, right?
@@leoym1803 just reiterating to learn. as the proomters say, "read it again"
RLHF, or in other words: LGTM, ship it to prod.
kinda reminds me of how chess bots like stockfish are able to view multiple potential outcomes to find the best move possible
I just hope this kick starts inference backends like ollama, kobold, ooba, tabby or any other into having native support for any test-time compute approaches. Would be nice to query some fast small model like a 12B Mistral and get it to take longer but think through a better answer.
"Bart say the line!"
*Sigh* "The bitter lesson strikes again"
Thanks! Very interesting about eng not improving.
Totally agree, mid. Deep mind already did the most on this
Okay, this explains why higher temp and top_p give better results sometimes 😮
Also, what is interesting about silly things like counting the number of r's in strawberry is that it can easily be done if you instead give the AI something more solid to work with, such as telling it to use its code interpreter/generation capabilities. This means 4o right now can technically count r's better than o1, because it can run simple Python code. That's the difference between running a nondeterministic model vs asking it to leverage a tool specifically made to be completely deterministic. 4o being able to use something like code generation and an interpreter is more useful than what o1 can do with its limited capabilities. Instead, OpenAI will need to implement tools for o1 to interact with that can give more solid, deterministic outcomes, so that when o1 does its chain of thought it can simply think, "hey, I'm unsure, let me query a tool that outputs something reliable or touch a verifiable database of information."
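For the curious, this is the kind of trivially deterministic snippet a code-interpreter tool would run instead of having the model "reason" over tokens (nothing here beyond the Python standard library):

```python
word = "strawberry"
print(word.count("r"))                                   # 3, every time
print({ch: word.count(ch) for ch in sorted(set(word))})  # full letter tally
```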
Really nice stuff, the most informative take I've seen so far on the o1 models, thank you!
Do the studies that compare o1 vs GPT-4 use a chain-of-thought prompt for the latter? Because if not, the discrepancy in performance seems arbitrary.
They didn’t and they shouldn’t have
@@avraham4497 You'll have to explain why. Having COT baked in from training doesn't tell anyone if the model is strictly better at reasoning than another model given a COT prompt.
@@johnmcclane4430 When you test the reasoning of humans with exams, do you try to prompt engineer your question to maximize the performance of the ones being tested? Or do you simply write the question as clearly as possible? The answer is the latter, and the same should be true for testing AI systems.
Your second sentence is true about the underlying LLMs, but not about the models as a whole; if you add CoT to a model, it becomes a different model, and it shouldn't be looked at as the same model. Are you telling me that an AI scientist from 10 years ago couldn't compare the reasoning abilities of GPT-4o and OpenAI o1 if they were given to him as black boxes without any explanation of how they work?
@@avraham4497 Humans are the ones with the reasoning skills trained in, you dunderhead. As for your scientist question, I sure hope that anyone who does actual research quickly realises that they don't need to confine their AI to a singular line of questioning. Seriously, did you think any part of this through before you made your comment?
@@johnmcclane4430 Your response makes no sense to me
From what I see, they do the same with answers in o1 that an LLM does with tokens: predict the most likely answer in the chain of thought as if the answers were tokens in a sentence. The hallucination problem is still prevalent, and we could essentially end up with new hallucination types eventually (the whack-a-mole situation is still strong). Also, this makes the cost per answer increase way more, since essentially you run multiple GPTs at once (hence the higher price). As you excellently showed, it has some crucial flaws, like small "copying errors" in those chains [strawberrrry].

If we are somewhat sceptical, one could say it's futile because the Apple paper is correct and LLMs can't reason; by breaking complex tasks down into subtasks, you only raise the probability of a pattern match in that smaller context, and there is still no proper reasoning going on (based on logic rather than on raw pretrained statistics - the performance drop even in o1 when switching some small parameters in the benchmark questions hints at that).

My problem especially with OpenAI is the insane (ungrounded) valuation, which creates insane pressure to perform and thus destroys not only their working culture but also their honesty about what works and what doesn't. If you have the incentive to always announce "GPT6AGI next month confirmed" and stir up artificial hype to raise more cash than your burn rate, you will stifle any scientific progress. I think that's why Claude has made more progress: their team has more "ease of mind" while developing.

In my opinion, Meta, OpenAI, Google, as well as Anthropic etc., would fare way better if they worked on one big closed model intended for scientific progress rather than a product, while giving the community the chance to improve upon it, since it's a global effort towards safe AI [yep, too late for that one]. As fun as these local models are, the only true use I see at the moment is manipulative AI slop everywhere (either to manipulate opinions or to make a quick and dirty buck or two with low effort). The only benefit I see is that it raised public awareness of the AI research field, but the overheated bubble behaviour will do some harm; we will see what's left once the dust settles.
love these paper summaries. thank you 🙏 🎊
squad mentioned!!!!!!!
The game squad
Now I think the performance increase of the o1 models is only because of new knowledge added during this CoT-based RL training. Also, the training data will mostly be composed of maths and coding problems, as it's much easier to create CoT-based examples for them, which is reflected in the performance increase showing up only in these categories.
I most definitely got o1 talking to itself for MORE than 60 seconds, but it does seem to hit 59 seconds most of the time when given complex or longer tasks.
I really resonate with you as a human during your mini meltdown at 9:00
Great video, keep up the good work.
the strawberry test is hard because the word gets encoded to : [302, 1618, 19772] == [st, raw, berry] or something similar. The model doesn't reason in letters but in tokens which removes some of the information necessary to, for example, count the number of letters.
People also do not think in letters.
The latest research showed that human neurons store whole words, much like an LLM does.
@@mirek190 The difference is, humans can then spell out the word letter by letter, which a model will not do internally unless you use a CoT
@@ZintomV1 Using CoT only shows that LLMs can do that but are lazy ;).
Also, LLMs can spell most words easily; "strawberry" is one of the exceptions.
Most LLMs currently use stage-1 thinking: easy thinking without looping over the problem, which is what stage-2 thinking does.
Next-generation LLMs will probably learn this way in the background, so even a word like "strawberry" will be easy once the LLM uses stage-2 thinking.
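(Side note on the tokenization claim at the top of this thread: you can see the token-vs-letter mismatch directly. This assumes the `tiktoken` package is installed; the exact IDs and sub-word splits depend on which tokenizer you load, so treat the numbers quoted above as approximate.)

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer family used by GPT-4-era models
ids = enc.encode("strawberry")
print(ids)                                   # a few token IDs, not ten letters
print([enc.decode([i]) for i in ids])        # sub-word pieces, e.g. something like ['str', 'aw', 'berry']
```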
The Google paper on test-time compute evaluated how Process Reward Model or Self-revision performed as it scaled. Given OpenAI's approach of training on millions of synthetic reasoning chains, you can't simply use this paper to claim it doesn't scale as OpenAI described, since it involves a very different approach in what the model does post-training. At least as far as I understand.
This.
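For context, the PRM half of that comparison boils down to something like the sketch below: sample N reasoning chains, score every step with a process reward model, and keep the chain whose weakest step scores best. `sample_chain` and `prm_step_score` are hypothetical stand-ins for whatever the real stack provides, and min-aggregation is just one common choice, not necessarily what either paper used.

```python
def best_of_n_with_prm(question, sample_chain, prm_step_score, n=16):
    best_chain, best_score = None, float("-inf")
    for _ in range(n):
        steps = sample_chain(question)        # one sampled reasoning chain (list of steps)
        # Score every prefix; a chain is only as strong as its weakest step.
        scores = [prm_step_score(question, steps[:i + 1]) for i in range(len(steps))]
        score = min(scores) if scores else float("-inf")
        if score > best_score:
            best_chain, best_score = steps, score
    return best_chain
```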
Again... what is up with the bad prompting?
o1-mini gets 100% on the 20x20 table simply by using:
# Goal
Create a comprehensive multiplication table for numbers 1 through 20, including detailed calculations for each product.
# Method
1. For each pair of numbers from 1x1 to 20x20:
a. Calculate the product
b. For products involving numbers 11-20, show the step-by-step calculation
2. Format the calculations as follows:
- For simple products (1-10):
5x7 = 35
- For products involving 11-20:
13x20 = 10x20 + 3x20
10x20 = 200
3x20 = 60
200 + 60 = 260
3. Create a grid displaying all products from 1x1 to 20x20
# Additional Notes
- Be thorough and show all steps for products involving numbers 11-20
- Ensure accuracy in all calculations
- Present the information in a clear, organized manner
- The grid should have both horizontal and vertical headers from 1 to 20
# Example Output Snippet
1x1 = 1
1x2 = 2
...
13x20 = 10x20 + 3x20
10x20 = 200
3x20 = 60
200 + 60 = 260
...
20x20 = 400
[Include the grid here]
By following this method, you'll create a detailed and educational multiplication table that shows not just the results, but also the process of calculating more complex products.
Took me not even 5 minutes to come up with this prompt, and no reprompting or anything was needed. This works zero-shot.
I think 5 minutes of human time is more expensive than 5 minutes of compute.
Maybe I don't get the assignment, but I tested it on Llama 8B with a generic system assistant prompt and "# Goal
Create a comprehensive multiplication table for numbers 1 through 20, including detailed calculations for each product." user prompt and it did just fine.
the problem with further prompt engineering is that it is not generalizable. you don't want to be thinking up additional prompts for each and every new unique problem/task you want to solve.
They are talking about 20 digits multiplied by 20 digits. You are talking about 2 digits by 2 digits, which the graph in the video shows the model can do.
Dude. Not 20x20. 20 DIGITS. As in 95800904631026778660 x 25684705875830852248
All equations till 20x20 are guaranteed to be in the training dataset somewhere anyway..
Where can I find these papers?!
Thanks for this. TTC has the potential to be great, not like this though.
Internally, the model needs to be able to execute loops to refine and transform information until it has determined it has solved the question. Using generated tokens to sort-of accomplish that is a lot of unnecessary work. It requires all thoughts to be translated from language to thoughts to language to thoughts over and over again. If we want reasoning, we will need models that can memorize DURING inference and modify that memory until a mechanism signals that the memory is in a suitable state to answer the question. Perhaps this functionality can be trained independently before being grafted on.
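A toy version of the idea, just to make it concrete: a memory vector gets refined in a loop at inference time until a learned halting head signals it is good enough to decode an answer from. This is a conceptual sketch assuming PyTorch, not any existing architecture.

```python
import torch
import torch.nn as nn

class LatentRefiner(nn.Module):
    def __init__(self, dim=512, max_iters=16):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)   # updates the working memory each iteration
        self.halt = nn.Linear(dim, 1)      # learned signal: "memory is good enough"
        self.max_iters = max_iters

    def forward(self, question_embedding):            # shape: (batch, dim)
        memory = torch.zeros_like(question_embedding)
        for _ in range(self.max_iters):
            memory = self.cell(question_embedding, memory)
            if torch.sigmoid(self.halt(memory)).mean() > 0.5:   # halting signal fires
                break
        return memory                                  # a downstream decoder reads this out
```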
This was a very interesting analysis of the o1 model, on par with ai explained
10:00 went from ClosedAI to SigmaAI.
wait... they got the weights?
this was a joke right?
ya they sent me on discord
@@bycloudAI Seriously?... They shared the weights?...
@@bycloudAI . . . are they sharing with anyone else? Like, somewhere anyone can access them?
@@DefaultFlame Yes, the weights are available to Talk Tuah Plus subscribers and above.
@@voxelsilk8462 The hell is "Talk Tuah"?
So, in my current system that uses an LLM: after watching this video, I added a setTimeout that flips a bool to true after 8 seconds, and a while loop that runs inference over and over for a "thought" given the current environment state while the bool is false. So it thinks for about 8 seconds and spits out about 4 "thoughts" in that time. After stuffing my speaker agent's context with those thoughts generated in those 8 seconds, it really does improve the quality of the final output. I'm just curious, did anyone catch how they calculate how long to "think" for?
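In case it helps anyone else, a rough Python version of the same trick looks like this; `generate_thought` is a hypothetical stand-in for whatever inference call your system makes.

```python
import time

def think_for(seconds, environment_state, generate_thought):
    # Keep generating "thoughts" until the time budget runs out,
    # then hand them to the speaker agent as extra context.
    deadline = time.monotonic() + seconds
    thoughts = []
    while time.monotonic() < deadline:
        thoughts.append(generate_thought(environment_state, thoughts))
    return thoughts

# e.g. roughly an 8-second budget, like the setup described above:
# thoughts = think_for(8, current_state, generate_thought)
```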
I've been waiting for OpenAI to have this since it was introduced to the public by babyAGI on Twitter. Twitter is always light-years ahead with the new beauties of AI. At first, when they said their business has proprietary methods, I thought it already had chain of thought. I guess I was wrong. GPT might go to the next level with this, since it showed potential even BEFORE this was in their pipeline! I used to prompt the AI to correct, review, etc., itself, since I naturally assume it could be making mistakes if it does too many over coding or advanced topics.
a bit ironic to choose elon musk as the "for profit" person, considering that when openAI was founded he helped choose the non-profit format and left due to disagreements about for-profit deals with microsoft.
of course! yapping straight away is never the best way to talk, you have to think first and choose your words carefully. why did those people realize that so late?
0:32
they named it strawberry because of a glitch in the AI chat: if you ask it how many R's are in strawberry, it will respond with 2
I just realized that some people with high-level knowledge tend to be overconfident in their answers. Seems the same shows up in these large-parameter LLMs
Why not make an MBRL chatbot? (Do an MCTS over its tokens.) I know it's unrelated but still food 4 thought
The reasoning trace is not aligned, so they don't want anyone to see it for that reason. I don't think there is a secret sauce. Anthropic beat them to the punch on the inference reasoning trace. This is catch-up to prior art.
OpenAI explicitly states that they hide it for a competitive advantage. This was also what they said when they released their model card on GPT-4, which didn't include any training or architecture details, on why they aren't revealing their processes.
Video on entropix when
4:36 ah yes, how you are supposed to do that
Croatian politicians 0:07 memefied
4:25 no way i am taking a certificate from a co whose chief said no coders needed & ttteeet signde
I wouldn't trust ChatGPT or any other LLM to do creative writing; most of the "tone" ChatGPT writes with is the same across the genres I ask for.
Tbh, LLMs are not capable of writing anything literary... yet. Soon they will, and humanity will be doomed
Let's see if you are right
So it's giving LLMs anxiety and overthinking
It's doing top-k to generate sentences and then top-p on sentences? xd
great cover. To be honest that strawberry was just a big disappointment to me.. after all the drama at openAI I really expected a game changer.. the opposite of what we got.
I'm sure all the others are already implementing a version of it as we speak as it doesn't seem like much of a MOAT (or they wouldn't try so hard to hide the reasoning from us)..
Also LLMs will NEVER create a freaking cancer drug or something.. god I'm exhausted about those claims, LLMs are dumb AF when it comes to creating new things, by definition.
So unless openAI are actually working on some form of AGI that is capable of learning and experimenting by itself like alpha fold, LLM will just be some kinda useful assistant with major flaws
ruclips.net/video/pi7LF-OpO6k/видео.html
what's the music that starts here?
Monte Carlos tree search
14:10 😂 exp on x axis... That's log/diminishing on y axis 😂😂😂
i am an unemployed dev who switched industries... i can build a CoT system w/ LangChain... I don't know why any serious software org would need this as a SaaS product when open source has had this for literally 4 years
I really do wish oai shills would defluff their hype a bit. It makes it come across as so disingenuous
Monte Carlos tree search lol
Why not hire experts and fact-checkers in respective fields to build a fully human-generated dataset and use that to train the model, while applying massive penalties for whatever the model gets wrong/makes up?
It’s funny how so many focus on grammar drills but forget that real fluency comes from actually using the language daily. Totally changed the game for me.
Monte carlos lmao
First Comment =)
Entropix for small models looks promising
6666 view 6 hours ago...........
this is indeed glorpshit
Lol, AI is now suffering from the Dunning-Kruger effect: as their model grows, their ego grows.
This is now my head canon interpretation of what's happening.
AI is not sentient, so it doesn't have 'ego'... yet
Now, the owners of the AI, that's a different story
@@gerardo.arroyo.s my man. That was a joke. That's why it's my head canon, my own story I like enough to pretend it's what's happened.
Graces. People these days can't even distinguish humour from serious statements.
@@arandomfox999 it was so unfunny though
The stress in "backfire" should be on the first syllable
It's BACKfire, not backFIRE 😊
I feel like you're trying to prove a point that is off from the start. TTC was never intended to be a substitute for model scale - it was meant to unlock a new level of quality not feasible without it, on top of proper parameter scaling. Both DeepMind and OpenAI show similar results of generation scores being log-proportional to the TTC budget - at least in some section of the scale. So it's not that OpenAI made this graph up; different teams back it up. I find your statement here a bit biased and misleading :(
Fartface
Maybe their CoT method is that Q* leak? -- Great videos. My gf thinks i'm just watching that Daily Dose of Internet guy but for tech
Seems fake and gaeh, but i feel like the way forward is to let the tokens yap for a whole minute before giving you an answer.
I think we should have made specific modifiers or addons, like LoRAs but for ChatGPT, instead of forcing ChatGPT to be an all-in-one solution.
One of the reasons GPT and other LLMs are so generalisable is precisely because of their large training corpora. Besides, OpenAI's API does allow you to run fine-tuning, presumably using some adapter because it's inexpensive. And we know Azure OpenAI fine-tuning in particular uses LoRA.