Oriol is genius, but, for a moment, I'd love to acknowledge Hannah, specifically her JOY in receiving this geeky information (that we all love), making it so accessible, and orchestrating the flow of this conversation. Kudos. Keep radiating that jubilant smile... the breath of fresh air!
She is def very good
Hannah is such a genuinely outstanding interviewer; she has that rare combination of charisma, intelligence, wit and infectious enthusiastic curiosity.
Best presenter I have ever seen; she really knew what she was saying and actively engaged in the conversation. Oriol Vinyals is great, a good scientist; he doesn't fall into the hype cycle like many AI influencers (looking at you, Sam) and gives us a very clear picture of what is going on now.
She’s so smooth in her interview style. Amazing work
agreed, she's fantastic for this. she comes across as genuinely curious and passionate about learning
These are the best podcasts on the net. It’s so great to witness a host so knowledgeable and intelligent, asking questions to get good answers, rather than trying to show off her own knowledge.
Hannah's amazing at this. Thanks for sharing, it's fascinating
This is the single best podcast series anywhere.
Ngl Google, just waiting for Live Video with Astra. Agents are awesome, but I make home-use robots and repair cars... so a camera would be more versatile for hands-on help than agents.
This IS really cool and helpful for a vast amount of people; I commend you guys.
In AI Studio, you can test "Stream Realtime" and stream your camera to Gemini 2.0 Flash. Worth a try!
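For anyone who wants to poke at the same Live API from code rather than from AI Studio, here is a minimal sketch assuming the google-genai Python SDK; the method names and the experimental model id are from early releases of that SDK and may have changed, so treat it as a starting point rather than a reference.

```python
# Minimal Live API sketch (assumes `pip install google-genai`; names may have
# changed in newer SDK releases). Camera frames would be streamed into the
# same session; here we just open it and send one text turn.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

async def main():
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model="gemini-2.0-flash-exp",
                                       config=config) as session:
        await session.send(input="Describe what you can see.", end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```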
can't recall when's the last time I was this excited
It's when you came too early
It is a rare thing to watch such an amazing interviewer. Very interesting clip, yet even more thanks to the professor ;)
47:02 what do we mean by superintelligence really? Add strong reasoning to the amazing scale of memory and inference that we have now, and surely we are there? Perhaps the ability to continue learning as the test time compute generates new realisations that aren't in the training set?
How long before I can complete Portal 2 co-op mode with an AI partner? I feel like that should be a benchmark.
Sounds like it might be hard to not make it too good.
@@grekiki If the AI is programmed specifically for Portal 2, sure it would be. I more meant an AI that has the kind of situational understanding and adaptability that a game like Portal 2 would require, without being specifically trained for one game. It would be closer to a real AGI at that point and Portal 2 would be a good benchmark to measure progress, IMO.
New benchmark for the Turing test!
Love the conversation; the crispness of the audio really brings it in. What type of microphones are you all using?
Listening to this after Ilya's NeurIPS talk, this is so much more humble and detailed, with tons of new insights and ideas that one can pursue.
Ilya might still create another groundbreaking GPT-like innovation for sure, but this level of innovation, engineering and integration reminds us that the Google ecosystem is vibrant in a way we had forgotten for some time.
They were just steering their ship these last couple of years and seem to be catching up, if not overtaking, on innovation.
Hey geniuses, here’s a thought: we don’t think in text, right? Our minds process the world through audio, emotion, and context. So, what if we designed an AI model that doesn’t rely on text as its core but instead thinks in audio and context? A model like this could be trained with richer, multimodal inputs (audio, environmental cues, and simplified contextual relationships) to truly "think" more like a human.
Such a model wouldn’t just generate text; it could produce audio responses or even work directly with sound and context to make decisions. Imagine it analyzing tone, pauses, and environmental noise while responding naturally in real time. It could be more intuitive, faster, and closer to how our brains actually process information: directly and efficiently, skipping the symbolic conversion of text.
Why stick to text? Text processing requires converting symbols into audio, then turning that audio into meaning and context. It’s a multi-step process, wasting energy at every stage. If you want true efficiency, skip the text. Train AI to think and respond directly in audio and context: it’s faster, simpler, and more aligned with human cognition. Thoughts? Could this be the next leap for general AI?
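Sketching the idea for concreteness: audio-native models along these lines already exist (AudioLM-style systems, for example), and the core is a sequence model whose vocabulary is discrete audio-codec tokens rather than text. A toy, illustrative version, with made-up sizes and a stand-in for the codec:

```python
import torch
import torch.nn as nn

class AudioTokenLM(nn.Module):
    """Toy decoder-only LM over discrete audio-codec tokens; no text anywhere."""
    def __init__(self, vocab_size=1024, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)   # predicts next audio token

    def forward(self, tokens):                        # tokens: (batch, time)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)

# A neural audio codec (SoundStream, EnCodec, ...) would supply real tokens.
model = AudioTokenLM()
fake_tokens = torch.randint(0, 1024, (1, 50))
logits = model(fake_tokens)                           # (1, 50, 1024)
```

The model's "thinking" then lives entirely in continuous activations over sound, with text never entering the loop.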
I think you have a great idea with a poor attitude. What if this is the first step and your idea is the 4th or 5th?
Sorry, after rereading, it's mostly your opening that triggered my response. But I still maintain this is a great concept.
You are on the right track, though others have had this idea too, and it's the basis of the multimodal models we have today.
The main reason text has had so much focus is that there is a lot of it easily accessible out there, plus that data has proven to be something these models can very easily generalize from.
That aside, there is a lot of merit to the idea that before the output, models don't think in text, rather as a cloud of firing weights that are a more abstract form of meaning, much like human neurons.
I, for one, welcome our agentic, robotic overlords! 🤖
you are everywhere i am haha
"I'd like to remind them, as a trusted TV personality I can be helpful in rounding up others" - Hanna Fry 😂
Not funny. This is a genuine risk
Nah. It is the starting point of neo-feudalism. A new social contract is needed.
Please kindly paste the address of the primer about AI agents mentioned at the beginning. Thx 😊
Hannah Fry is DeepMind's best asset.
It makes sense to translate speech to text before trying to learn from video. Using the correct abstract representation is important.
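For instance, a common version of that preprocessing step, sketched with the open-source openai-whisper package (the file name is illustrative):

```python
# Sketch: turn a video's speech into text before using it as a training signal.
# Requires `pip install openai-whisper` and ffmpeg on the PATH.
import whisper

model = whisper.load_model("base")         # small general-purpose checkpoint
result = model.transcribe("lecture.mp4")   # ffmpeg extracts the audio track
print(result["text"])                      # abstract text representation
```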
Thank you for adding subtitles
A model trained on a video about a known ground truth could be trained with a reward for recovering that particular ground truth, like E = mc².
PC agents seem like the next big win. So many business processes are carried out on computers.
Best new model for a while
AGI is closer, but they don't want to say it directly.
It's inspiring to see Catalans in these research positions :)
Great interview.
Amazing insights. But we have a long way to go. I will discover AGI my friend!! I will be back to this comment after doing this discovery.
Thank you for sharing the video. Why have you not linked the chatbot/AI to an avatar? Even just shoulders and head could work, with cell phones too. Voice libraries or customized voice options. Deep knowledge sets to draw from (Google Scholar, arXiv, etc.). Reasoning modeling, chains of thought, predictive modeling, psychological modeling, world modeling of sorts, etc.
Keep up the good work.
Great interview!
Such a dearth of good interviewers in the world.
The subtitles are so good!
Yes, language models, with humans in the loop, can extract knowledge and sentiment from unstructured input such as conversations, and store the fact checked statements in a shared graph representation, becoming a form of collective terrestrial intelligence.
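A minimal sketch of that loop, with networkx as the graph store; extract_triples is a hypothetical stand-in for an LLM structured-output call:

```python
import networkx as nx

def extract_triples(utterance: str) -> list[tuple[str, str, str]]:
    """Hypothetical LLM call returning (subject, relation, object) triples."""
    return [("Gemini 2.0", "developed_by", "Google DeepMind")]

graph = nx.MultiDiGraph()
for utterance in ["Gemini 2.0 was developed by Google DeepMind."]:
    for subj, rel, obj in extract_triples(utterance):
        # Human in the loop: only fact-checked statements enter the shared graph.
        if input(f"Accept: {subj} -[{rel}]-> {obj}? [y/n] ").lower() == "y":
            graph.add_edge(subj, obj, relation=rel)
```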
What's the difference between a bot and an agent?
26:03 Of course this method is not generally applicable, but consider its utility in things which are much less subjective than judging the aesthetic value of a poem, like assessing answers to scientific questions, and it really comes into its own. Things that are subjective are best left alone, for such things don't have clear-cut answers in any case; what might be a good poem to you may not be to me.
Regarding the reward-hacking issue, it's not applicable to things which have clear-cut objective answers.
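To make the distinction concrete, a toy sketch: a reward is only well-defined when the task has a checkable answer, which is exactly why the subjective cases fall outside this method.

```python
def objective_reward(model_answer: str, ground_truth: str) -> float:
    """Reward for tasks with a clear-cut answer (maths, factual science)."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# Checkable, hence hard to reward-hack:
print(objective_reward("299792458 m/s", "299792458 m/s"))   # 1.0
# For a poem there is no ground_truth string to compare against at all.
```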
Bro you got people sweating under those bright ass set lights 😂
So, when's 2.0 Pro coming? 😊 Seems like you really don't want to talk about that.
There's no shortage of data in the actual physical universe. We're going to need robots with sophisticated sensors.
Until you're getting huge amounts of sensory data from domestic androids and such, I'm guessing the really significant improvements will come from assembling narrower models in the right way, in a modular architecture, with memory storage and management components, along with the domain specific modules.
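A toy sketch of the kind of modular assembly described above: a router dispatches to narrow domain modules, with a naive shared memory component. Everything here is illustrative structure, not any real system.

```python
from typing import Callable

memory: list[str] = []   # naive shared memory store

modules: dict[str, Callable[[str], str]] = {
    "maths":  lambda q: "answer from the narrow maths module",
    "coding": lambda q: "answer from the narrow coding module",
}

def route(query: str) -> str:
    # A real router would be a learned classifier; this heuristic is a stand-in.
    domain = "maths" if any(ch.isdigit() for ch in query) else "coding"
    memory.append(query)                  # persist context across calls
    return modules[domain](query)

print(route("What is 12 * 7?"))
```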
The equation to human interpretation will solve the next stage of AI evolution
Wait, he basically said we are at AGI (more or less). This guy doesn't hype stuff up. This is a big claim coming from him.
When doing reinforcement learning, the criteria used for assessment must be holistic.
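For example, a minimal sketch of a holistic score: several criteria are combined so the agent can't over-optimise any single one (the weights are illustrative):

```python
def holistic_reward(correctness: float, safety: float, style: float) -> float:
    """Combine several assessment criteria; each score is in [0, 1]."""
    weighted = {"correctness": (correctness, 0.6),
                "safety":      (safety,      0.3),
                "style":       (style,       0.1)}
    return sum(score * weight for score, weight in weighted.values())

print(holistic_reward(correctness=1.0, safety=0.5, style=0.0))  # 0.75
```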
What's so drastic? (Or as the Joker would say: why so drastic?)
She feels like an AI; it's kind of uncanny. I thought you were showing off your new AI speech...
Epic interview
The future of data will no doubt see any organisation that holds it become willing to sell it as a new form of commodity. Video data is a very broad topic, so let's assess this platform's video data to start with. YouTube's video content is highly edited, as is other produced video in the form of television and film, and this provides little consistency to be used in training. As a polar comparison, surveillance data from video recording systems lacks obvious contextual commentary from those within the footage. So raw surveillance footage may be more suitable for training returns than heavily edited YouTube content. YouTube does, however, have the advantage of participation by other people in the form of commentary, and this is both an assistance and not, depending on the commenters' willingness to contribute substance to the video.
The worry with synthetic data is that of non-novel examples leading to a uniform type of data, the opposite of really intuitive, unique examples. Whether model collapse occurs or not depends on what data is allowed to be regurgitated. As a child, playing a game called Chinese whispers (no racism intended), the message would often become corrupted very quickly. So novel layers of understanding will be required to circumvent any collapse possibilities.
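A toy version of that Chinese-whispers drift, far simpler than real model collapse but showing the mechanism: each "generation" is trained only on samples from the previous one, and the spread of the data decays.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=10)         # tiny "real" dataset, std ~ 1.0
for generation in range(20):
    mu, sigma = data.mean(), data.std()      # "fit a model" to current data
    data = rng.normal(mu, sigma, size=10)    # next generation: synthetic only
print(f"std after 20 generations: {data.std():.3f}")  # typically well below 1.0
```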
Which leads back to video use. What happens when an organisation has been sold data with people's visual images incorporated, even when anonymity was granted to the individuals? Within a very short time those individuals will be identified, because the real use of surveillance footage is to see novel or unique interactions. Is it even right to sell data with identifiable human biometric data entwined? Is this an infringement of the human right to a private life? The wise caveat is that surveillance data could be golden data if annotation and curation were encouraged or rewarded, by any included individual at a minimum, and even opened up to others. It would be far more useful to find out, for instance, why a scenario unfolded the way it did and the circumstances surrounding it. If a shop's surveillance system saw an upset child, perhaps there are methodical, compounding reasons for that situation that would be best served by explanation. Just this evening I had a disagreement with my child over whether she was allowed a toy, and being this close to Christmas the answer was no, which she protested. So deeper, meaningful expressions imply far-reaching outcomes, both for the quality of data and for those learning from and using the data in training or marketing campaigns. The question of sharing knowledge will be a hard-fought battle, and how long will non-consensual data use be acceptable without any idea of remedy in replacement?
I believe that, just as humans worked towards interactions on the Internet, we can now work towards creating a paradigm shift for data production by intent. The idea of a creative, meaning- or imagination-based currency system tied into a blockchain architecture would go some way to aiding machine learning whilst also providing provenance and a role for humans, after such sweeping attacks on employment are implemented and only reliance on some sanctioned social welfare exists as an alternative. Humans require purpose in their lives, and I am sure artificial intelligence and robots would like exponential gains in understanding; that understanding can be exhibited through human interactions, imaginations and expressions of meaning in life.
We are coming into an age where tools like Genie will facilitate the creation of digital domains that should be shared and also trained from, with rewarded and incentivised participation; this would be transformative, along with a social contract that rewards interactions rather than encouraging fear of privacy loss or infringement. Perhaps machine learning companies may not wish to pay people for participation in data production, but they will be willing to buy data from organisations, and this will never be as viable without consent.
When golden data production is possible and tied to a cryptographic proof of work, full provenance can be attained. Any set of blocks of data can be chosen to tailor the exact quality of the training set, and full traceability is possible. Examples of any combination of exhibitions of value can be produced, and any abnormalities can be easily removed. Whilst at the moment the Internet is available and can be scraped, once intelligent machines start to characterise each part of the Internet it makes absolute sense to sort the data into blocks, so that each sweep of the scrapers only adds to pre-existing blocks, or completely new blocks are created or forked from previous ones.
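A toy sketch of the provenance mechanism being described: data blocks chained by hashes, so any chosen training subset is fully traceable back through its lineage. This is an illustrative structure, not a real proof-of-work chain.

```python
import hashlib, json, time

def make_block(data: str, prev_hash: str) -> dict:
    block = {"data": data, "prev_hash": prev_hash, "timestamp": time.time()}
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()   # tamper-evident id
    return block

genesis = make_block("curated examples, batch 0", prev_hash="0" * 64)
fork = make_block("new annotations forked from batch 0", genesis["hash"])
print(fork["prev_hash"] == genesis["hash"])               # lineage is traceable
```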
The Internet was not designed for training machines, but when we design blocks of data purposefully for that reason, the outcomes will be more precise and accurate. We are reaching a position in time when examples can be produced with as little as a mobile phone, anywhere in the world, accessible to all. If you wish to understand, a diversity of thoughts is required; the more expansive the examples of perception, the greater the accuracy of suitable answers will become.
When building anything, the quality of the materials used is of high consideration. How we shape the material is of equal importance, and what we build may stand through the ages and prove the builder's talents victorious.
This was a very interesting talk. I learned some things.
No. We are seeing diminishing returns from scale because we are reaching advanced human level. These models can't learn patterns that aren't in the training data.
She is amazing
Oriol = Орёл = Eagle 🦅
23:30: Of course not. Models do NOT understand anything. Never did and never will. They pick up patterns in the videos or text captions, etc. The idea that they can suddenly start producing discoveries based on what they saw in these videos is laughable.
🔥❤️🔥
❤
wtf that's what I've been trying to do
Yea, I think they will charge a lot of money for the websites that just open automatically.
this guy is fkn great
It would be nice if Gemini was developed to provide what I want and not what Google wants me to want. That's annoying.
Why is this woman everywhere?
Because she is an excellent communicator for the topics like this
@@svenhoek "hip science woman"
She’s excellent
Professor Fry, respectfully, what the hell are you doing? Why are you cheerleading this global arms race? Alignment is unsolved! Your first question, the starting point, has to be safety. We must pause AI development, especially agent development, until International agreements are in place to ensure safety. If you disagree, please explain why.
(Hello, I am obviously not Professor Fry, but I am going to respond to your question anyway)
So, I understand the concern for AI safety. This technology has the potential to run away and destroy the world, a la some kind of Stargate nanobot situation. But for me, it is also a paradoxical scenario.
Bear with me --
If we don't *rapidly* alter our resource-use patterns on Earth (talking climate change here), we will destroy the world.
The pace of conventional politics, which is arguably a significant component of executive global human decision making, is not capable of making the change at the pace we need it. Left to this method alone, we will destroy the world.
AIs, and technology at large, but AIs especially because of their seductive promise of recursive improvement, are the first viable tool to actually address socioenvironmental issues with the speed and effectiveness required to discover and implement the sweeping changes needed to mitigate climate change, before we destroy the world.
But therein lies the rub --
AIs consume significant amounts of energy;
AIs might decide to kill off humans, deeming them a threat to the planet (and themselves);
AIs might never reach a solution to this socially universal issue at all.
All of these outcomes could also destroy the world.
The same way that nuclear fission can generate clean(ish) power while simultaneously having the potential for mass destruction, so too is AI a double-edged sword.
On the one hand, we might die during the training run of an AI that could solve climatic issues (which, as an aside, I feel are a shared root of all socioeconomic issues); on the other hand, we will die anyway if we don't try.
So we are brought back to Pascal's wager, in modern times.
That is why AI safety isn't that important.
I would be very careful of AI-bigoted commentary. 😅
@@svenhoek what do you mean?
I'm sure we still haven't gotten safe alignment in the nuclear arms race either. Welcome to this brutal world, buddy.
@@John-sd5li The nuclear weapons don’t operate themselves. Different issue
Please summarize video to under 20 minutes! Video too long; did not watch!
Your attention span needs work.
I can fix the reasoning problem for you. The reason your models are lacking in reasoning (I know they're pretty good, but they're not comparable with the human brain, which is much more capable at reasoning) is that you are giving your large language model text input; your whole foundation is a large language model, not a plain neural network. If you want human-level reasoning, you should build your visual and audio neural networks to work with just numbers, then send those output numbers to a final neural network. That final network shouldn't work with text; it should work with the exact numbers it is given and figure out what to do with them to get the desired output, designed via reinforcement learning: give pleasure for desired output and pain for undesired output.
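For what it's worth, here is a toy sketch of the pipeline being proposed: vision and audio encoders emit plain number vectors, a final network consumes only those numbers (no text), and a +1/-1 reinforcement signal ("pleasure"/"pain") drives a REINFORCE-style update. All shapes and the reward rule are made up for illustration.

```python
import torch
import torch.nn as nn

vision_encoder = nn.Linear(128, 32)   # stand-in for a real vision network
audio_encoder = nn.Linear(64, 32)     # stand-in for a real audio network
policy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

image, sound = torch.randn(1, 128), torch.randn(1, 64)
features = torch.cat([vision_encoder(image), audio_encoder(sound)], dim=-1)
dist = torch.distributions.Categorical(logits=policy(features))
action = dist.sample()
reward = 1.0 if action.item() == 2 else -1.0          # "pleasure" or "pain"
loss = (-dist.log_prob(action) * reward).mean()       # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```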