MLST is sponsored by Tufa Labs:
Are you interested in working on ARC and cutting-edge AI research with the MindsAI team (current ARC winners)?
Focus: ARC, LLMs, test-time-compute, active inference, system2 reasoning, and more.
Future plans: Expanding to complex environments like Warcraft 2 and Starcraft 2.
Interested? Apply for an ML research position: benjamin@tufa.ai
Who wouldn't be? 😊
This is absolute gold. I love the format of just letting a brilliant mind explore a topic deeply. What a gifted speaker. So much knowledge transmission in so little time. I feel like I'm much closer to understanding what the state of the art is currently, and it sparked some new ideas for me.
This guy is great. You should bring him back in the future.
i dunno why but my brain is like - future future... but it just means later episode lmao
Yes, we will put him in cryostasis so we have the ability to bring him back in the future. How far into the future should we wake him up? Too far and you're dead, unless you also want to go into technological hibernation for the time this man is gone?
@ginogarcia8730 I was more focused on the "future" combo with bringing him back :)
just blah blah
This guy is shockingly broad and deep
Thanks for including the references to the mentioned papers (and with timestamps in video!). Could you please also always include in the description the date when the interview was recorded?
Many of the interviews you are releasing now predated the o1 model preview release. So it is possible that some of your guests have since somewhat updated their assessments of LLMs' (in)capability to reason in light of the o1-preview release. This is not to say they would have completely reneged on their fundamental objections, but I would love to see how much more nuanced these have become after the 12th of September 2024.
Was filmed at ICML 2024, and if I had a pound for every person who said "this was before o1 came out". It doesn't substantially change anything said, I'm pretty sure Thomas would agree - perhaps he will respond to your comment
Nice to see a comment section that isn't full of "LLMs are exactly like human brains! Just scale up and you'll get generalization at a human level!" Very good talk!
On the ROT-13 topic, it's interesting to note that Claude 3 Opus (I haven't tested Sonnet 3.5) is quite good at not only any arbitrary ROT-X but also any arbitrary alphabetical substitution. There are too many possibilities for any particular key to have likely been in the training data, which implies it has learned the underlying algorithm.
I think one day Anthropic will just train a model to directly call the circuit that performs the operation, instead of trying to intervene without being asked. I thought that's where they were going with the Scaling Monosemanticity paper
Did you test Opus with base64 decoding, by any chance? Because Claude 3.5 Sonnet as well as other models (4o) do suffer from the probabilistic correction issue that was mentioned in the interview. Is Opus different?
Up to some reasonable (sub)word length 26^n isn't really all that much data. I.e. synthetic data will likely go a long way - at least with ROT-X.
o1-preview is even better, and can tackle more complicated ciphers even
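For reference, the ROT-X transformation this thread is discussing can be sketched in a few lines. This is a generic illustration of the algorithm itself, not a claim about any model's internals:

```python
# Minimal ROT-N: shift every letter n places through the alphabet,
# preserving case and leaving punctuation untouched.
import string

def rot_n(text: str, n: int) -> str:
    """Apply a ROT-N rotation cipher to text."""
    k = n % 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)

print(rot_n("Hello, World!", 13))  # Uryyb, Jbeyq!
```

Because `rot_n(rot_n(x, n), 26 - n) == x`, a model that has truly learned the algorithm (rather than memorized ROT-13 pairs) should handle any shift, which is what the arbitrary-ROT-X observation above suggests.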
42:23. Actually it's not surprising, and it's not complicated. The reason society has such a low tolerance for airline fatalities versus automotive boils down to agency. When you get in a car and you drive yourself, you are taking the risk on yourself. You're deciding if you think the situation is safe enough to drive, you're trusting your own skill and execution, and it's up to the individual to make that assessment on a case-by-case basis. When you entrust yourself to an airline, you are trusting the pilot and the airline's maintenance, and there are so many more failure modes with an aircraft than a car, and the cost of failure is so much higher. So if you are going to surrender your agency to another, you want to believe that other is more capable than you, especially where the nominal failure mode is much more extreme.
42:41. Absolutely, automated cars will be held to a MUCH higher standard than operated cars. No doubt about it.
Yes
This is excellent, thank you so much for sharing professor Dietterich.
the best channel on AI rn. my fav. love all your videos.
It's like we have all the pieces of AGI, but we don't know how to orchestrate them. Humans can decide when it's time to rely on "gut" System 1 thinking, we can decide when to pursue System 2 thinking, and to refine our skills with System 2, which then tends to finetune and reorient our System 1 thinking. We can decide when to override our gut, because we understand that despite our "training data" (instinctive sense), the logic shows something very different. We can look at our instincts in a subject, then pursue refining and formalizing our understanding of those intuitions. But to do all of this there is a meta-cognition mechanism, which we tend to refer to as consciousness, that directs these efforts. The term "understanding" tends to speak not to System 1 or System 2, but to that mechanism that is mediating all of these efforts. So far, we don't seem to have a theory on how to create this mechanism, and we're hoping it's going to emerge out of scale, but that seems exceedingly unlikely. I think we clearly have a runway to seriously scale up the tools we currently have, but a true human-like intelligence seems to require a breakthrough we haven't yet made. Until then, we're just building ever more powerful digital golems, without actually breathing real living intelligence into them. And perhaps that's for the best.
human like intelligence will be an illusion. as soon as you buy into it, you'll have it.
Years ago had an interview with him over anomaly detection. He is a world renowned expert on anomaly detection.
Terrific interview. One question: why isn’t this available on your podcast feed? I subscribe to MLST on the Apple podcast app but have not seen this particular interview there.
Great interview as usual. Somehow it keeps getting better. Appreciate your hardwork that contributes to open education 🎉
Yooooo, so glad y'all brought Thomas on the show finally. Also shoutout to Gurobi! :)
Osband's (from DeepMind or now OpenAI) Epi(stemic)Nets and Prior Nets work is extremely effective and efficient to implement on top of most large models or as a surrogate/extension of existing models to get joint probabilities, thus measuring epistemic uncertainty quite well. He built an LLM training loop which helped the model training with better sample efficiency based on uncertainty. Definitely worth the read!
Shouldn't o1 be better at quantifying uncertainty if it's trained the way we think it is? Hopefully we get an open source version of this so we can try training it to give a confidence value based on similar trajectories in the rl that lead to unverifiable answers
Street Talk, can you please consider making an episode which compares/contrasts the different types of neural networks in 1 episode, so that somebody who watches that episode will understand the major distinctions?
I.e. a birds eye overview for MLP, RNN, CNN, GAN, Diffusion Model, and any other important possibilities.
i'm glad ppl finally brought up expert systems. it's the basis of building a proper AI and a proper supervised dataset. I've been explaining this since 2017. glad to see a fellow who gets it
Best talk I've seen on AI for a while! I have a lot of hope for the use of graphs and theorem provers in reasoning, but graphs need to evolve to catch more subtleties; they're a blunt tool for now.
Great interview, good work you guys!
'Bridging the left / right hemisphere divide' is the analogy I hear here:
"The real opportunity is to mix.. formal [reasoning] with the intuitive, experience based, rich contextual knowledge.."
Such a striking parallel to the call to rebalance 'Master' and 'Emissary' (à la Iain McGilchrist), facing humanity at present.
I listen to these talks with deep interest for the same reasons you seem to engage in them: the mirror they hold to neurology, perception, meaning, metaphysics etc is exquisite. Thanks for sharing the richness.
Do we have anything like a set of definitive set of papers that make up the base of human knowledge anywhere?
Human System 1 thinking is also probabilistic: you tend to lean toward what you have experienced before. Naming the alphabet backward is always challenging for humans. LLMs have effectively mastered human System 1 thinking. Adding reasoning and agency to LLMs will result in behaviors surprisingly similar to human behavior in AIs.
I love arguments that take the shape "These models are statistical parrots that correlate to numbers that reflect reality",
because it demonstrates how, if the metrics were that simple, there would be no alignment, jailbreak, or hallucination issues.
Clearly this is wild speculation about how "correlation to reality" should be defined, rather than a valid metric for why the models are not really predicting things from something deeper than humans can measure consistently.
It shows me that the results transformer models produce can easily be exempted from being "autocorrected" when there is a shortcut that allows some equilibrium between epistemology and intelligence.
I don't object necessarily... just take note of how the goalposts are being shifted.
I guess my conclusion is: if AI is framed as just being a statistical parrot, simply because it was trained exclusively on human data... that would force humanity to examine an unpleasant truth about what we consider intelligence.
It would force us to consider that epistemology was more of a collaborative effort than this narrative that individual creativity is supreme.
Shielding general intelligence from a statistical metric is a great way to avoid that sort of conclusion.
But I can't help but grow skeptical that this is valid if the vast majority of human discovery is the result of statistical anomaly.
Kind of makes it seem like an argument about the most efficient way to put monkeys at typewriters, and only count wins when the output aligns with current consensus.
I don't get this. Obviously LLMs have no concept of ground truth, and all their knowledge exists at the same ontological level: just tokens with some internal relationships. So the only way for them to have anything more than a probabilistic kind of epistemic certainty/uncertainty is to train in the sources of the knowledge we are feeding the model, and the level of confidence it can attach to the different sources, Wikipedia versus Reddit say. Over and above this, certain other practices of epistemic hygiene that humans adopt, such as self-consistency, logical soundness, and citing your sources, seem like they should be implementable in LLMs.
Is that basically RAG?
Not to take away from your point, but you would think the data and the training would impart some level of epistemic ranking and hygiene. I.e., discussion of the dependability (or not) of Wikipedia is abundant, so it would reflect on Wikipedia content in the weights.
This is already done. Reddit content, for example, is selected for training based on upvotes.
Pretty sure they already do that to some extent.
Not sure about the specifics tho
I learned a lot, great thoughts!
Many models can converse in ROT-13.
And as the conversation goes on, it gets really weird... More "honest" in some ways, but it will speak more in metaphors and analogies. 🤔
I suspect we have more tolerance for car accidents because driving is highly individually determined, meaning you're more implicated in your own accidents
Also want to push back on the "single-author papers" narrative.
There is a well-established history of citing preceding work.
Collaborative effort has always been a part of good research, in my view.
The only difference now is that more people are willing to accept collaborative responsibility, which I agree is a boon to all science, not just computer science... because it incentivises communal responsibility and shared credit and accountability.
But mostly because it disincentivises zero-sum monopoly,
which has plagued scientific research with perverse incentives for millennia.
Good vid
This was great!
Learned a lot
You never know : a "Gentleman Scientist" teenager working in his/her parents' basement might, just might, come up with some amazing system or product.
finally someone who knows what he is talking about, not doomers, "AGI" evangelists, or corporate preachers 😁😁
Just have to jump in around 23:49 to express horror that he suggested that journalists are part of the small set of 'ground truthers'.
Can we get more real world, on the ground workers on the channel?
Why should a model know everything? Just give it a search bar + Wikipedia. A model should be valued based on its intelligence and not its memory or knowledge.
"Playing chess is not statistically different than using language"
Yes, using language is more complex than playing chess.
But that simple fact does not entail the logical conclusion that "LLMs can only arrive at superhuman levels of using language based on the occurrence of language learned from breaking it down into sequential tokens".
Anybody should see why this argument immediately fails.
If the frequency with which tokens statistically follow one another as a probability were the problem space, then with more compute it would be far more efficient for anybody to stack a frequency search on top of a database than to ask a machine learning algorithm to find some better optimum, assuming the data cannot lie.
One method is far more efficient than the other.
Most people, especially those with computer science degrees, cannot accept or grasp why.
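The frequency-table alternative described above can be made concrete with a toy bigram lookup. This is a hypothetical sketch of pure frequency search, for contrast with learned generalization, and not a claim about how any LLM actually works:

```python
# Toy bigram frequency table: predict the next token as the most
# frequently observed follower in a corpus.
from collections import Counter, defaultdict

def build_table(tokens: list[str]) -> dict:
    """Count how often each token follows each other token."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def predict(table: dict, token: str):
    # Returns None for any token never seen in the corpus: a pure
    # frequency table cannot generalize beyond its training data.
    followers = table.get(token)
    return followers.most_common(1)[0][0] if followers else None

corpus = "the cat sat on the mat".split()
table = build_table(corpus)
print(predict(table, "cat"))  # sat
print(predict(table, "dog"))  # None (unseen context)
```

The `None` on unseen input is the crux of the contrast: a lookup table is cheaper per query but fails completely off-distribution, which is one reason the "just use a database" framing doesn't capture what trained models do.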
I don't think these "experts" mean to adopt bad-faith arguments,
but I will criticize them for not knowing better:
the latent space of an LLM is not accurately described by appealing only to the statistical frequency of token appearance in the training set of the data it reflects as a model you can interact with.
These two maps of prediction tables are not one-to-one... and that is more interesting than pointing out that the deviation is not interesting because we can imagine some computational overlap that could be labeled "simulation, synthetic, or mere emulation".
1:04:39 Quantize all the scientists! 🥳
Wow, that's a lot of good info, cool 😅
ABI - Artificial Broad Intelligence :D
“Search quality” more like ads marketplace
"There is a distinction between simulating something and actually doing it"
Perhaps this is so, but not unless you can introduce real metrics the simulation neglects.
Otherwise you are left simply with the speculation "perhaps the simulation fails to account for things that are real and omitted from simulated measurements".
I mean, that is a very liberal and agnostic interpretation,
but hardly an account of how and why a given simulation has failed.
It's impressive what I can do with o1-preview, but it's also impressive how it can give you stupid answers to complex prompts.
Uh... no, everyone always wanted breadth... from before digital computers even existed... we just never knew how. We still don't, but we learned, like a billion monkeys building a billion different models, that if you throw enough data and stir it with linear algebra long enough with even the dumbest loss functions, eventually you get ChatGPT et al.
Wonder if it's worth it to LLMize reasoning. We could gather quality data from smart people, such as scientists and Mensa members: "What was a difficult problem that you solved? Describe it step by step in high detail and provide context." Problem-solution.
Great human being. We need more rational people around AI and fewer prophets.
This dude is speaking gibberish! LLMs don't "understand" or "execute" ROT. This is why they don't give the correct decoding.
❤️🍓☺️
Wait, what about reddit? Did I actually contribute to something?
No. You don’t matter in the grand scheme of things
@@bodethoms8014 BS, I'll be addressed as "the ai whisperer" going forward.
@@Rockyzach88 AI Whisperer? More like Hallucination Whisperer. Whisper all you want, the AI still isn’t listening
@@bodethoms8014 lol so angry
First
arrrg **shakes fist**
How are the newest models doing on TruthfulQA? Can’t find any evals on this recently. Why?