Let me know if you guys want a dive into the methodologies of TTC, there's a lot of new papers/implementations coming out every day lol (entropix is a cool one)
Check out NVIDIA's suite of Training and Certification here:
[NVIDIA Certification] nvda.ws/3XxkFyj
[AI Learning Essential] nvda.ws/4gvD474
[Gen AI/LLM Learning Path] nvda.ws/4enwYE7
You can use the code “BYCLOUD” at checkout for 10% off!
Entropix video would be appreciated. Keep up the great work!
Please cover entropix
thanks for the video. i would like to see more on methodologies of TTC.
Yes, please.
OpenAI went from extremely secretive closed-source for profit to even more secretive closed-source for profit. Truly revolutionary change.
Your channel is like twitter but only the good part, I love it
so a compilation of sources from other websites ?
he compiles signal from twitter
there is no such thing as "the good part of twitter"
One of the chains of thought felt like doing an A* search over all possible answers
you might be thinking of "tree of thought"
@@TheRyulord In soviet Russia, tree of thought thinks of you. It's quite considerate.
Replace the heuristic function with a value function from reinforcement learning and you get Q*
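A minimal sketch of what that swap would look like: best-first search over partial reasoning chains, with a learned value function standing in for A*'s heuristic. `expand` and `value_fn` are hypothetical stand-ins (not any known API), and this is only an illustration of the idea, not how Q* (if it exists) actually works.

```python
import heapq

def best_first_reasoning(prompt, expand, value_fn, max_steps=64):
    # Max-heap via negated scores; each entry is (-score, chain of steps so far).
    frontier = [(-value_fn([prompt]), [prompt])]
    for _ in range(max_steps):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        if chain[-1].startswith("FINAL:"):        # arbitrary stop marker
            return chain
        for step in expand(chain):                # e.g. k sampled continuations from an LLM
            new_chain = chain + [step]
            heapq.heappush(frontier, (-value_fn(new_chain), new_chain))
    return None                                   # budget exhausted without a final answer
```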
I don't understand why you're so insistent that using RL to learn reasoning can't cause new knowledge to be gained. You're implicitly assuming that if the model knows A and that A implies B then the model must already know B. But that's not true. The model knows the rules of chess, and these rules imply whatever the optimal strategy is, but it definitely doesn't know this optimal strategy. It may come to learn it (or of approximations of it) through RL, though, as alpha zero and similar did.
Yeah, and deductive reasoning isn't the only form of reasoning. If anything, abductive and inductive reasoning are used a lot more than deduction in human cognition. So even without CoT, search methods are incredibly useful here and were key ideas of Cyc and Watson.
will be interesting to see how this plays out, i'm still split about this tbh...
i think the difference between chess and reasoning is the 'why'. in chess, there is no 'correct theory'; over time the model will either get better, or it won't. it doesn't matter how the bot ranks the importance of pieces or aims to control the board, we just care about the result: winning. when we evaluate a model based on the outcome, it may very well 'reason', but it does so in such a weird and wonderful way that we just can't relate to it.
but things break down when, instead of evaluating a model based on the outcome, we evaluate it based on the process. in this case, the process is the steps of reasoning taken to get from a to b. the very reason nns are so powerful, the fact that they 'think' in completely different ways to us, is exactly what makes it difficult for them to conform to a very specific set of human-prescribed ways of thinking. it forces a narrow range of 'correct' ways to think onto a bot that prefers to find its own optimal way. it can't learn its own reasoning, because our evaluation will penalise it every time it tries to be creative.
so, this leaves two possibilities. it either:
1. learns to conform to our definition of reason
2. it can't, and just does its own thing
i think the problem is as follows (take this with a grain of salt, i'm not an expert):
when the models are trained, they mostly learn how to learn however they want, there isn't a prescribed way of thinking forced onto them. this results in them thinking in weird and wonderful ways that likely have no congruence with what we consider 'correct logic'.
so, come time to finetune reasoning into them, or to get them to start doing CoT, they may have learnt how to imitate correct reasoning steps, but deep down they are still doing what they always did, in the weird and wonderful way they always did.
this training paradigm is unlikely to truly embed the 'correct reasoning process' into models, as by their nature they create their own way to reason. either we need more synthetic data to encourage correct reasoning across all training data, or a new hybrid approach is needed that blends the best of everything we've got and can accurately instruct the model to make correct logical assumptions
Glad to see the original editing approach back.
Yeah, this is like 2x slower so I can actually watch it; his videos were getting faster and faster to the point where it was just dopamine noise
9:53 rare anger bycloud moment
😂
@@literailly its fun lmao
Fun fact: I have spent 3-4 days trying to fix a single SQLite bug while I was debugging with AI
cute pfp. very pettable boyo
I agree with arc
that's why you must learn to read errors
@@kowaihana Or know how to program in sql
@@oguzhan.yilmaz i know some people who know how to code but not know how to read basic errors
so basically they found out that giving the layman a bit more time to solve an easier problem can be more cost-effective than giving the smart guy a menial task, and it is also worth giving the smart guy more time to train to more effectively solve harder problems...
haven't we already known this for hundreds if not thousands of years?
You're right, we have. That's why you're out there training and becoming the best you could ever be, instead of writing things we've already known for thousands of years, right?
@@leoym1803 just reiterating to learn. as the proomters say, "read it again"
RLHF, or in other words: LGTM, ship it to prod.
kinda reminds me of how chess bots like stockfish are able to view multiple potential outcomes to find the best move possible
I just hope this kick starts inference backends like ollama, kobold, ooba, tabby or any other into having native support for any test-time compute approaches. Would be nice to query some fast small model like a 12B Mistral and get it to take longer but think through a better answer.
"Bart say the line!"
*Sigh* "The bitter lesson strikes again"
Thanks! Very interesting about eng not improving.
Totally agree, mid. Deep mind already did the most on this
Okay, this explains why higher temp and top_p give better results sometimes 😮
Also, what is interesting about silly things like counting the number of r's in strawberry is that it can easily be done if you instead give the AI something more solid to work with, such as telling it to use its code interpreter/generation capabilities. This means 4o right now can technically count r's better than o1, because it can run simple Python code. That's the difference between running a nondeterministic model vs asking it to leverage a tool specifically made to be completely deterministic. 4o being able to use something like code generation and an interpreter is more useful than what o1 can do with its limited capabilities. Instead, OpenAI will need to implement tools for o1 to interact with that can give more solid, deterministic outcomes, so that when o1 does its chain of thought it can simply think, "hey, I'm unsure, let me query a tool that outputs something reliable or touch a verifiable database of information."
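For the curious, this is the kind of trivially deterministic snippet a code-interpreter tool would run instead of having the model "reason" over tokens (nothing here beyond the Python standard library):

```python
word = "strawberry"
print(word.count("r"))                                   # 3, every time
print({ch: word.count(ch) for ch in sorted(set(word))})  # full letter tally
```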
Really nice stuff, the most informative take I've seen so far on the o1 models, thank you!
Do the studies that compare o1 vs GPT-4 use a chain-of-thought prompt for the latter? Because if not, the discrepancy in performance seems arbitrary.
They didn’t and they shouldn’t have
@@avraham4497 You'll have to explain why. Having COT baked in from training doesn't tell anyone if the model is strictly better at reasoning than another model given a COT prompt.
@@johnmcclane4430 When you test the reasoning of humans with exams, do you try to prompt engineer your question to maximize the performance of the ones being tested? Or do you simply write the question as clearly as possible? The answer is the latter, and the same should be true for testing AI systems.
Your second sentence is true about the underlying LLMs, but not about the models as a whole; if you add CoT to a model, it becomes a different model, and it shouldn't be looked at as the same model. Are you telling me that an AI scientist from 10 years ago couldn't compare the reasoning abilities of GPT-4o and OpenAI o1 if they were given to him as black boxes without any explanation of how they work?
@@avraham4497 Humans are the ones with the reasoning skills trained in, you dunderhead. As for your scientist question, I sure hope that anyone who does actual research quickly realises that they don't need to confine their AI to a singular line of questioning. Seriously, did you think any part of this through before you made your comment?
@@johnmcclane4430 Your response makes no sense to me
From what I see, they do the same with answers in o1 that an LLM does with tokens: predict the most likely answer in the chain of thought as if the answers were tokens in a sentence. The hallucination problem is still prevalent, and we could essentially end up with new hallucination types eventually (the whack-a-mole situation is still strong). Also, this makes the cost per answer increase way more, since essentially you run multiple GPTs at once (hence the higher price). As you excellently showed, it has some crucial flaws, like small "copying errors" in those chains [strawberrrry].

If we are somewhat sceptical, one could say it's futile because the Apple paper is correct and LLMs can't reason; by breaking complex tasks down into subtasks, you only raise the probability of a pattern match in that smaller context, and there is still no proper reasoning going on (based on logic rather than on raw pretrained statistics - the performance drop even in o1 when switching some small parameters in the benchmark questions hints at that).

My problem especially with OpenAI is the insane (ungrounded) valuation, which creates insane pressure to perform and thus destroys not only their working culture but also their honesty about what works and what doesn't. If you have the incentive to always announce "GPT6AGI next month confirmed" and stir up artificial hype to raise more cash than your burn rate, you will stifle any scientific progress. I think that's why Claude has made more progress: their team has more "ease of mind" while developing.

In my opinion, Meta, OpenAI, Google, as well as Anthropic etc., would fare way better if they worked on one big closed model intended for scientific progress rather than a product, while giving the community the chance to improve upon it, since it's a global effort towards safe AI [yep, too late for that one]. As fun as these local models are, the only true use I see at the moment is manipulative AI slop everywhere (either to manipulate opinions or to make a quick and dirty buck or two with low effort). The only benefit I see is that it raised public awareness of the AI research field, but the overheated bubble behaviour will do some harm; we will see what's left once the dust settles.
love these paper summaries. thank you 🙏 🎊
squad mentioned!!!!!!!
The game squad
Now I think the performance increase of the o1 models is only because of new knowledge added during this CoT-based RL training. Also, the training data will mostly be composed of maths and coding problems, as it's much easier to create CoT-based examples for them, which is reflected in the performance increase showing up only in these categories.
I most definitely got o1 talking to itself for MORE than 60 seconds, but it does seem to hit 59 seconds most of the time when given complex or longer tasks.
I really resonate with you as a human during your mini meltdown at 9:00
Great video, keep up the good work.
the strawberry test is hard because the word gets encoded to : [302, 1618, 19772] == [st, raw, berry] or something similar. The model doesn't reason in letters but in tokens which removes some of the information necessary to, for example, count the number of letters.
People also do not think in letters.
The latest research showed that human neurons store whole words, much like an LLM does.
@@mirek190 The difference is, humans can then spell out the word letter by letter, which a model will not do internally unless you use a CoT
@@ZintomV1 Using CoT only shows that LLMs can do that but are lazy ;).
Also, LLMs can spell most words easily; "strawberry" is one of the exceptions.
Most LLMs currently use stage-1 thinking: easy thinking without looping over the problem, which is what stage-2 thinking does.
Next-generation LLMs will probably learn this way in the background, so even a word like "strawberry" will be easy once the LLM uses stage-2 thinking.
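(Side note on the tokenization claim at the top of this thread: you can see the token-vs-letter mismatch directly. This assumes the `tiktoken` package is installed; the exact IDs and sub-word splits depend on which tokenizer you load, so treat the numbers quoted above as approximate.)

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer family used by GPT-4-era models
ids = enc.encode("strawberry")
print(ids)                                   # a few token IDs, not ten letters
print([enc.decode([i]) for i in ids])        # sub-word pieces, e.g. something like ['str', 'aw', 'berry']
```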
The Google paper on test-time compute evaluated how Process Reward Model or Self-revision performed as it scaled. Given OpenAI's approach of training on millions of synthetic reasoning chains, you can't simply use this paper to claim it doesn't scale as OpenAI described, since it involves a very different approach in what the model does post-training. At least as far as I understand.
This.
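For context, the PRM half of that comparison boils down to something like the sketch below: sample N reasoning chains, score every step with a process reward model, and keep the chain whose weakest step scores best. `sample_chain` and `prm_step_score` are hypothetical stand-ins for whatever the real stack provides, and min-aggregation is just one common choice, not necessarily what either paper used.

```python
def best_of_n_with_prm(question, sample_chain, prm_step_score, n=16):
    best_chain, best_score = None, float("-inf")
    for _ in range(n):
        steps = sample_chain(question)        # one sampled reasoning chain (list of steps)
        # Score every prefix; a chain is only as strong as its weakest step.
        scores = [prm_step_score(question, steps[:i + 1]) for i in range(len(steps))]
        score = min(scores) if scores else float("-inf")
        if score > best_score:
            best_chain, best_score = steps, score
    return best_chain
```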
Again... what is up with the bad prompting?
o1-mini gets 100% on the 20x20 table simply by using:
# Goal
Create a comprehensive multiplication table for numbers 1 through 20, including detailed calculations for each product.
# Method
1. For each pair of numbers from 1x1 to 20x20:
a. Calculate the product
b. For products involving numbers 11-20, show the step-by-step calculation
2. Format the calculations as follows:
- For simple products (1-10):
5x7 = 35
- For products involving 11-20:
13x20 = 10x20 + 3x20
10x20 = 200
3x20 = 60
200 + 60 = 260
3. Create a grid displaying all products from 1x1 to 20x20
# Additional Notes
- Be thorough and show all steps for products involving numbers 11-20
- Ensure accuracy in all calculations
- Present the information in a clear, organized manner
- The grid should have both horizontal and vertical headers from 1 to 20
# Example Output Snippet
1x1 = 1
1x2 = 2
...
13x20 = 10x20 + 3x20
10x20 = 200
3x20 = 60
200 + 60 = 260
...
20x20 = 400
[Include the grid here]
By following this method, you'll create a detailed and educational multiplication table that shows not just the results, but also the process of calculating more complex products.
Took me not even 5 minutes to come up with this prompt, and no reprompting or anything was needed. This works zero-shot.
I think 5 minutes of human time is more expensive than 5 minutes of compute.
Maybe I don't get the assignment, but I tested it on Llama 8B with a generic system assistant prompt and "# Goal
Create a comprehensive multiplication table for numbers 1 through 20, including detailed calculations for each product." user prompt and it did just fine.
the problem with further prompt engineering is that it is not generalizable. you don't want to be thinking up additional prompts for each and every new unique problem/task you want to solve.
They are talking about 20 digits multiplied by 20 digits. You are talking about 2 digits by 2 digits, which the graph in the video shows the model can do.
Dude. Not 20x20. 20 DIGITS. As in 95800904631026778660 x 25684705875830852248
All equations till 20x20 are guaranteed to be in the training dataset somewhere anyway..
Where can I find these papers?!
Thanks for this. TTC has the potential to be great, not like this though.
Internally, the model needs to be able to execute loops to refine and transform information until it has determined it has solved the question. Using generated tokens to sort-of accomplish that is a lot of unnecessary work. It requires all thoughts to be translated from language to thoughts to language to thoughts over and over again. If we want reasoning, we will need models that can memorize DURING inference and modify that memory until a mechanism signals that the memory is in a suitable state to answer the question. Perhaps this functionality can be trained independently before being grafted on.
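A toy version of the idea, just to make it concrete: a memory vector gets refined in a loop at inference time until a learned halting head signals it is good enough to decode an answer from. This is a conceptual sketch assuming PyTorch, not any existing architecture.

```python
import torch
import torch.nn as nn

class LatentRefiner(nn.Module):
    def __init__(self, dim=512, max_iters=16):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)   # updates the working memory each iteration
        self.halt = nn.Linear(dim, 1)      # learned signal: "memory is good enough"
        self.max_iters = max_iters

    def forward(self, question_embedding):            # shape: (batch, dim)
        memory = torch.zeros_like(question_embedding)
        for _ in range(self.max_iters):
            memory = self.cell(question_embedding, memory)
            if torch.sigmoid(self.halt(memory)).mean() > 0.5:   # halting signal fires
                break
        return memory                                  # a downstream decoder reads this out
```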
This was a very interesting analysis of the o1 model, on par with ai explained
10:00 went from ClosedAI to SigmaAI.
wait... they got the weights?
this was a joke right?
ya they sent me on discord
@@bycloudAI Seriously?... They shared the weights?...
@@bycloudAI . . . are they sharing with anyone else? Like, somewhere anyone can access them?
@@DefaultFlame Yes, the weights are available to Talk Tuah Plus subscribers and above.
@@voxelsilk8462 The hell is "Talk Tuah"?
So, in my current system that uses an LLM: after watching this video, I added a setTimeout that flips a bool to true after 8 seconds, and a while loop that runs inference over and over for a "thought" given the current environment state while the bool is false. So it thinks for about 8 seconds and spits out about 4 "thoughts" in that time. After stuffing my speaker agent's context with those thoughts generated in those 8 seconds, it really does improve the quality of the final output. I'm just curious, did anyone catch how they calculate how long to "think" for?
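In case it helps anyone else, a rough Python version of the same trick looks like this; `generate_thought` is a hypothetical stand-in for whatever inference call your system makes.

```python
import time

def think_for(seconds, environment_state, generate_thought):
    # Keep generating "thoughts" until the time budget runs out,
    # then hand them to the speaker agent as extra context.
    deadline = time.monotonic() + seconds
    thoughts = []
    while time.monotonic() < deadline:
        thoughts.append(generate_thought(environment_state, thoughts))
    return thoughts

# e.g. roughly an 8-second budget, like the setup described above:
# thoughts = think_for(8, current_state, generate_thought)
```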
I've been waiting for OpenAI to have this since it was introduced to the public by babyAGI on Twitter. Twitter is always light-years ahead with the new beauties of AI. At first, when they said their business has proprietary methods, I thought it already had chain of thought. I guess I was wrong. GPT might go to the next level with this, since it showed potential even BEFORE this was in their pipeline! I used to prompt the AI to correct, review, etc., itself, since I naturally assume it could be making mistakes if it does too many over coding or advanced topics.
a bit ironic to choose elon musk as the "for profit" person, considering that when openAI was founded he helped choose the non-profit format and left due to disagreements about for-profit deals with microsoft.
of course! yapping straight away is never the best way to talk, you have to think first and choose your words carefully. why did those people realize that so late?
0:32
they named it strawberry because of a glitch in the AI chat: if you ask it how many R's are in strawberry, it will respond with 2
I just realized that some people with high-level knowledge tend to be overconfident in their answers. Seems the same shows up in these large-parameter LLMs
Why not make an MBRL chatbot? (Do an MCTS over its tokens.) I know it's unrelated but still food 4 thought
The reasoning trace is not aligned, so they don't want anyone to see it for that reason. I don't think there is a secret sauce. Anthropic beat them to the punch on the inference reasoning trace. This is catch-up to prior art.
OpenAI explicitly states that they hide it for a competitive advantage. This was also what they said when they released their model card on GPT-4, which didn't include any training or architecture details, on why they aren't revealing their processes.
Video on entropix when
4:36 ah yes, how you are supposed to do that
Croatian politicians 0:07 memefied
4:25 no way i am taking a certificate from a co whose chief said no coders needed & ttteeet signde
I wouldn't trust ChatGPT or any other LLM to do creative writing; most of the "tone" ChatGPT writes with is the same across the genres I ask for.
Tbh, LLMs are not capable of writing anything literary... yet. Soon they will, and humanity will be doomed
Let's see if you are right
So it's giving LLMs anxiety and overthinking
It's doing top-k to generate sentences and then top-p on sentences? xd
great cover. To be honest that strawberry was just a big disappointment to me.. after all the drama at openAI I really expected a game changer.. the opposite of what we got.
I'm sure all the others are already implementing a version of it as we speak as it doesn't seem like much of a MOAT (or they wouldn't try so hard to hide the reasoning from us)..
Also LLMs will NEVER create a freaking cancer drug or something.. god I'm exhausted about those claims, LLMs are dumb AF when it comes to creating new things, by definition.
So unless openAI are actually working on some form of AGI that is capable of learning and experimenting by itself like alpha fold, LLM will just be some kinda useful assistant with major flaws
ruclips.net/video/pi7LF-OpO6k/видео.html
what's the music that starts here?
Monte Carlos tree search
14:10 😂 exp on x axis... That's log/diminishing on y axis 😂😂😂
i am an unemployed dev who switched industries... i can build a CoT system w/ LangChain... I don't know why any serious software org would need this as a SaaS product when open source has had this for literally 4 years
I really do wish oai shills would defluff their hype a bit. It makes it come across as so disingenuous
Monte Carlos tree search lol
Why not hire experts and fact-checkers in respective fields to build a fully human-generated dataset and use that to train the model, while applying massive penalties for whatever the model gets wrong/makes up?
It’s funny how so many focus on grammar drills but forget that real fluency comes from actually using the language daily. Totally changed the game for me.
Monte carlos lmao
First Comment =)
Entropix for small models looks promising
6666 view 6 hours ago...........
this is indeed glorpshit
Lol, AI is now suffering from the Dunning-Kruger effect: as their model grows, their ego grows.
This is now my head canon interpretation of what's happening.
AI is not sentient, so it doesn't have 'ego'... yet
Now, the owners of the AI, that's a different story
@@gerardo.arroyo.s my man. That was a joke. That's why it's my head canon, my own story I like enough to pretend it's what's happened.
Graces. People these days can't even distinguish humour from serious statements.
@@arandomfox999 it was so unfunny though
The stress in "backfire" should be on the first syllable
It's BACKfire, not backFIRE 😊
I feel like you're trying to prove a point that is off from the start. TTC was never intended to be a substitute for model scale - it was meant to unlock a new level of quality not feasible without it, on top of proper parameter scaling. Both DeepMind and OpenAI show similar results of generation scores being log-proportional to the TTC budget - at least in some section of the scale. So it's not that OpenAI made this graph up; different teams back it up. I find your statement here a bit biased and misleading :(
Fartface
Maybe their CoT method is that Q* leak? -- Great videos. My gf thinks i'm just watching that Daily Dose of Internet guy but for tech
Seems fake and gaeh, but i feel like the way forward is to let the tokens yap for a whole minute before giving you an answer.
I think we should have made specific modifiers or addons, like LoRAs but for ChatGPT, instead of forcing ChatGPT to be an all-in-one solution.
One of the reasons GPT and other LLMs are so generalisable is precisely because of their large training corpora. Besides, OpenAI's API does allow you to run fine-tuning, presumably using some adapter because it's inexpensive. And we know Azure OpenAI fine-tuning in particular uses LoRA.