Great stuff, guys! I can see this being used for designing a virtual memory palace. For example, I would like to fly around a 200 ft skeleton. Laid out on each bone as if it were a table, I would want documents of literature about the bone. Also, there would be a collection of video thumbnails about that bone. Some of surgery, and some of people falling and breaking that bone. I can see the creation of Wikipedia Park. A virtual environment where every page of Wikipedia is turned into a virtually explorable environment, ride, or fun house. Hyperlinks would be represented as doors leading you into a whole new section. 10 - 20 hyperlinks in there would be bonus points for falling into a wiki-hole. Education will turn into episodic memory events. Conversations with your kids will turn into... "Remember that time we were 50 doors in and we ended up under the paw of the Sphinx".
I think this sounds great, but one thing I'm missing here is that they have specific products in mind that they want to cater to. With data acquisition potentially being tricky in this field, I'd have imagined a more specific roadmap. Note that they may have a very specific roadmap that they are simply not sharing (which makes total sense).
It is absolutely the best video I've seen that explains not only where AI came from but where AI (specifically generative AI) is going and why. I have one comment/question, as was stated in the video, humans have stereoscopic, 2D vision but humans are also born with the ability to automatically imagine the unseen parts of the 3D world - how is that possible?
Although our eyes map a 3D world onto a 2D structure (the brain is a folded up plane), our proprioception and motor control are a 3D control system. The 3Dness is achieved by adding another dimension to the 2D world: not a spatial dimension but a temporal dimension. We interpret a 3D world as little movie clips of a 2D world, so training data necessarily requires tokenization of video. In the same way that LLMs focus on 'what is the next most probable word', LVMs (large video models) will focus on 'what is the next most probable token in this movie?' The storage and energy requirements of this approach are MASSIVELY greater than LLM training and likely will have to wait until we figure out how to use brain organoids as parallel processors (their energy requirements are orders of magnitude less than GPUs).
Thank you, always appreciate quality information, 'as long as we have it'! So many times a limit to "Moore's law" has been proposed as we tear on through again, today's rocket ship against yesterday's potato. Indeed, the acceleration of compute has been game changing; I feel like I am living a paradox, and not only me, all of us.
If you want an AI to really see in 3D space, you could build a 3D generative AI that creates a 3D world, compares it to LIDAR and stereoscopic images of the real world, and runs a feedback loop to bring the generated world closer to the real one.
Then your AI has the problem of deciding how well the generated world approximates the real world. We can tell because we have a good meta-model of the real world, but AIs don't yet, which seems part of the goal of World Labs. It's a lot easier to run a feedback loop with LLMs, because they either make a good guess for the next word or they don't.
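The feedback loop discussed in the two comments above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the "world" is a flat wall described by two hypothetical parameters (`distance` and `slope`), the "LIDAR scan" is a list of depth readings, and the match between the generated and real world is scored with mean squared depth error. That scoring choice is exactly the hard part the reply points out, and real scenes have no such simple metric.

```python
def render_depth(distance, slope, xs):
    """Render the toy world: depth of a tilted wall at each sensor angle x."""
    return [distance + slope * x for x in xs]


def fit_world(observed, xs, steps=500, lr=0.05):
    """Adjust the generated world until its rendered depths match the scan.

    Each iteration renders the current guess, compares it with the observed
    depths, and moves the parameters down the gradient of the mean squared error.
    """
    distance, slope = 1.0, 0.0  # arbitrary initial guess (assumption)
    n = len(xs)
    for _ in range(steps):
        rendered = render_depth(distance, slope, xs)
        errors = [r - o for r, o in zip(rendered, observed)]
        grad_distance = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        distance -= lr * grad_distance
        slope -= lr * grad_slope
    return distance, slope


if __name__ == "__main__":
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
    scan = render_depth(4.0, 0.5, xs)  # stand-in for real sensor data
    print(fit_world(scan, xs))
```

The loop only works here because the renderer is differentiable by hand and the error measure is trivially meaningful; neither holds for a full 3D scene.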
Apropos world making… Digitally. The tour de force seems to me to be even one story set in a small town of no more than 150 people… the humanity of this would be nothing short of epic. Why? Because every individual's experience would be rendered from a first-person point of view, such that there are 150 versions of this story rendered in high-fidelity 3D, and each story would interlace along the same timeline perfectly. Much liberty could be taken to express the mentality or capacity of each individual by how they interpret the same events. Some of the events would be shared by degrees based on the circumstances and location of the action. This would be utterly engrossing and it would take weeks if not months for someone to experience all of it. This is most intriguing. The potential for mental and emotional learning, human understanding, and the exposition of a useful and/or valuable plot line is near limitless. I’m reeling from just thinking about it.
The best development of AI is helping humans understand how we think and learn: absorbing visual and auditory big data while awake, and training on that data while sleeping, resting, or meditating.
I think spatial intelligence is the kind of intelligence that will be able to dream like a human. When someone tells you a route to your destination, you can memorize the steps as a 1D sequence (left, right, left, right), or you could imagine how you would traverse the route directly in a 3D scene, and that is what makes it different from an LLM. In this case both are correct and valid, but it's the underlying representation of the problem and the solution that will potentially be a better fit for the problem you're trying to solve in 3D.
Here is the correction: The spatial concept reminded me of the series Devs. They scanned real-life objects and could determine their possible moves. Later on, they scaled it up, and the technology itself developed further, even on its own, until the techs managed to recreate digital 3D worlds and variations of those.
Great to watch, but you didn’t ask the question I was most hoping for: what does the timeline look like? The evolution of compute showed her 100 year estimate was far too conservative. Based on Huang’s “Moore’s Law Squared”, what is reasonable to expect in the spatial intelligence realm over the next 3-10 years?
This seems to confirm what Kant thought were the fundamental aspects of conscious subjective experience: our ability to perceive things in space and time. These two aspects of experience are the basis for all other knowledge.
Yeah, self-driving cars and robots have been trying to unlock that "Visual Spatial Intelligence" for a decade or more. Saying that this is the next frontier is a joke when considering that visual models were the first models to be built using deep learning.
33:10 The best possible use cases are when you couple all this to VR/AR/MR. In other words, like image generation using prompts, you should be able to generate realistic virtual worlds where you should be able to immerse yourself using eyewear. And then in the long run, train robots on those virtual worlds, where you can tinker with creating really complex environments or situations that are not possible to create in the real world without causing some damage. Also, somewhat tangentially, while building robots for deployment in the real world, one has to equip them with pretty sophisticated self-defence capabilities, for there would be no dearth of luddites and bad actors who would want to damage them. I think this is a pretty big bugbear in this scenario, where at times the self-defence mechanism used could inflict serious harm on the perpetrators and then this Pandora's box of justice involving what is allowed or disallowed would open up, hobbling all this.
Can anyone advise if choosing robotics for my kids was a good decision for their careers? During my time, there were limited resources in India, so I couldn’t pursue this path. But since AI became widely recognized as the future in 2022, I decided to enroll my kids in robotics classes. Robotics requires coding skills, so I chose Moonpreneur USA to enhance their knowledge. After attending an in-person workshop in Milpitas, my son’s skills improved.
If you could train all the simple physical models in the physical world, the complex world would then just be a scaling problem. The secret of the pre time world is geometry.
Thought Nvidia was already developing an entire 3D spatial trainer with the ultimate aim to empower functional robots and other applications. Not to mention the metaverse players. Not sure how those weren't talked about
@@drcharleyLGO, completely agree. Including the billions of dollars that Meta has put into the metaverse for R&D. Pairing Llama 3.1 and more to the metaverse with intelligent object segmentation is a game changer. They’ll need way more than 250ml in cap raise to compete. There's a very cool virtual world symposium happening at the end of October with loads of speakers talking about this and more. It’s hosted by GatherVerse. Take a look if you’re around.
Maybe as a completely beginner question, what exactly is wrong with a 1d representation of 3d space? Isn't arguably our own biological understanding of 3d reality a series of 1d synaptic connections in the brain? If it works for us, can't it work for neural nets that model us?
Will someone please explain why so many video producers think it's a good idea to have a bright light in the background? It hurts the image quality and color balance.
Timestamps:
00:00 - Spatial Intelligence: A New Frontier
01:38 - Scaling AI: The Impact of ImageNet on Computer Vision
06:56 - The Role of Compute
09:16 - Data as the Key Driver
17:01 - Defining AI’s Ultimate Goal
18:58 - What is Spatial Intelligence? Unlocking 3D Understanding in AI
26:35 - Comparing Models: Spatial Intelligence vs. Language-Based AI
29:41 - 1D vs. 3D
32:39 - Building Immersive Worlds with Spatial Intelligence
35:11 - From Static Scenes to Dynamic Worlds
37:42 - The Future of VR and AR
40:42 - Creating Deep Tech Platforms
44:26 - Building a World-Class Team
45:54 - Measuring Success: Milestones in Spatial Intelligence
Schlitzies närlin gaza strikei.
Cease and desist malicious use of AI, energy weapons and Freemasonry: Axis of Evil /Communist Maga. I am not your property.
Have you guys thought about creating a pair of glasses that has cameras on them that you can have volunteers or people that you pay to wear them all day long to gain that spatial data that you might use?
We want more computer science experts like Fei-Fei Li talking about AI! Tired of gurus and hype merchants. Great video ❤️
Yes! Just like that video with Jim Fan! I need MOAR!
She is the main supplier of AI tech to the Chinese military.
@@billc6762 so be it. progress is progress. Nations will soon be extinct.
@@billc6762 one day asinine comments like yours will demote your social standing
Halfway through and I haven't heard that one dreaded hype word... it's a breath of fresh air
We were doing generative in 1978. I was using LISP on a PDP-11 at Harvard. I did my thesis in AI, finishing in 1986. We were EXTREMELY compute and data volume limited. My phone is 1,000x more powerful than the computers I had access to back then.
Did LISP in AutoCAD in 1992, doing parameter-driven drawing, reducing time spent on repetitive dumb stuff. It became a thing later, done by smarter people on better platforms, but such ideas were already kicked around when we were proud to have an 80487 mathematical co-processor.
@@armandbarbe1812 and then came the GPU.
And AI with GPU assistance started going to data creativity that its human associates did not foresee.
Liken it to the Hubble space telescope.
Yes! I was doing generative "AI" with music for my Master's decades ago. Being both a CIS and working musician, I created a model of a melodic musical style using a tree of probabilities and a fractally-controlled random number generator (based on an article in Scientific American) to generate new melodies based on that style. I had no idea this was the precursor of neural nets.
The music industry has been doing generative AI for decades. Check out Band In A Box from the early 90s and research "In the style of" (they wanted to skirt the copyright laws even back then). Funny how the rest of the computing world is duplicating much of what has already happened in another industry. Also, we need to look at how modeling has evolved for musicians to predict the future of AI for other industries. There is *so* much similarity... not unexpected, since, as Wiener stated back in the 40s, "The world is a collection of patterns".
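A minimal sketch of the kind of probability-driven melody generator this comment describes, in Python. The commenter's actual tree structure and fractal random source are not given, so this assumes a plain first-order transition table and a seeded `random.Random`; the table `TRANSITIONS` and the function `generate_melody` are hypothetical names with made-up probabilities, not a reconstruction of the original program.

```python
import random

# Hypothetical "style" table: for each note, the notes that may follow it and
# how likely each one is. A real model would estimate these from a corpus.
TRANSITIONS = {
    "C": [("D", 0.5), ("E", 0.3), ("G", 0.2)],
    "D": [("C", 0.4), ("E", 0.6)],
    "E": [("D", 0.3), ("G", 0.7)],
    "G": [("C", 0.6), ("E", 0.4)],
}


def generate_melody(start, length, seed=0):
    """Walk the transition table, sampling each next note by its probability."""
    rng = random.Random(seed)  # seeded so the same seed repeats the melody
    melody = [start]
    while len(melody) < length:
        notes, weights = zip(*TRANSITIONS[melody[-1]])
        melody.append(rng.choices(notes, weights=weights, k=1)[0])
    return melody


if __name__ == "__main__":
    print(" ".join(generate_melody("C", 8)))
```

Each new note depends only on the previous one here; a deeper tree of probabilities, as the comment mentions, would condition on longer note histories.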
I did some AI programming in LISP and Prolog back in the late 80’s and early 90’s. Didn’t have the compute power and data to make a decent neural network, so I had to rely mostly on heuristics.
@@edwhite2255 I was on an AI team at the NYSE in the late 80s. We used Prolog and LISP to monitor trades to make sure they weren't illegal. Sun workstations and Symbolics
This is really some great content. It's like sitting in on a private meeting with two of the world's top academics, yet they're talking about the history of AI since their beginning in the field. I wish there were way more videos like this.
What a wonderful mentor-mentee relationship! Fei-Fei is such a loveably genuine and caring person.
ImageNet, launched in 2009, played a pivotal role in the rise of deep learning. Fei-Fei Li’s work on this project marked a turning point in AI, pushing it toward the incredible capabilities we’re seeing today.
Absolutely
My deep learning teachers, wish you all the best.
I see Fei-Fei, I click.
Same here.
Why?
@@flamekaiser2024 She be the big brain
This is the way.
@@flamekaiser2024 She was the first to realize how important a large-scale dataset would be.
While everyone was basically algomaxing on a slowly increasing dataset size to solve the image analysis problem, she created a dataset orders of magnitude larger (manually labeling tons of images), and that dataset was diverse and big enough that AlexNet (look it up, seminal paper) showed the power of deep neural networks (all of modern AI). Really, read her book, it's incredible stuff.
Simply amazing. This is such a hidden gem of a video. In 3-5 years this is all going to make way more sense for most people. I would not underestimate this team. I hope they open source their discoveries.
If you can use the open sourced versions, you probably can do some research yourself and help advance the field.
I have been thinking a lot about this. Interaction with the 3D world is the next step. A diverse dataset is the way to get there.
Thank you, Mrs. Fei-Fei Li; thank you, Justin, for the world-class deep learning course.
Fei-Fei Li!!! Best wishes. You deserve every accolade, every blessing. 🌞🤸🏽♂️🖖🏼
There's some gold in here for anyone learning more about AI
This was a wonderful interview and the interviewers asked really great questions. Massive respect to these trailblazers for imagining stuff beyond the mainstream and making endeavors to pursue it.
Martin, Fei-Fei, and Justin widened my imagination on what our digital spatial spaces could be like in the future, great video!🔥🔥🔥
Context is so important. If I am inside an airplane and look down I decide the white stuff is clouds. If I'm on the ground and look down I choose "snow", not "clouds". So the history of how I got there, and input from more sensors than vision, are important to help make sense of the pixels.
Thank you for sharing your brilliance, curiosity, and collaborations with the world. Hearing about the differences and connections between 1D v 2D v 3D models was particularly enlightening. 4D was mentioned, just briefly though. I wonder how much growth and insight may be found by adding time as backward-looking and forward-looking connective tissue to all modeling, e.g., transforming 1D language models into 2D echo chamber maps and dialogic predictions, expanding 2D images into retrospective and prospective time-lapse immersions, and rendering 3D models as past-looking and forward-looking world-dramas? Seeking to traverse and shift from high-dimensional to low-dimensional, and simultaneously from low to high may be a fruitful research and development path to connect and intersect all models.
How much work have you done on this?
Justin Johnson is the actual star in this video
yeah this dude is SERIOUSLY intelligent- to the point where it's super weird he is not more known or mentioned. I mean damn- this dude can SPIT.
I did not get 90% of what they were saying but wow, these guys are just on fire.
What an exceptionally great interview... so much was touched here... and the historic perspective is humbling. Thank you all.
A.I. is changing lives for the better. It may not be fully understood or accepted currently but, there are many things in the past that started like that and then become the norm.
Great conversation. It was nice to listen to Fei-Fei Li... One of the important points Fei-Fei pointed out to us was that vision is probably older than language. This blew my mind, and it must be true. First you look, then you speak! This opens up hundreds of possibilities...!
This part immediately made me think of crows - very intelligent animals, with a good understanding of physics and the 3D world, yet no real language. Not even facial expressions.
It's amazing to see how they can use a range of objects as tools, from functioning as mere extensions of their body to the displacement of a liquid (water).
And they achieve this and more with such a tiny brain!
If only we could understand how that structure works.
Language is more than words. Language is the product of concepts. The one dimensional approach in use by LLMs, while amazing, is not how we generate or perceive language. A logic based (reasoning) language model would be more comparable in sophistication to a spatial model. In fact, a reasoning model should be foundational to both.
So, Dr. Li I recognize. Who is this guy sitting next to her!? Justin? Justin who!? DANG! Now I know. Congratulations Justin! Brilliant! Best wishes. 🌞🖖🏼
I bet that Justin is smarter than that lady (who I don't recognize)
Weird comment @@TheFreddieFoo
@@natzos6372 strange observation
@@TheFreddieFoo I think this impression is partially due to the fact that Fei-Fei is not a native English speaker, so she might sound less "smart" than she really is. If you look at her contribution to the field, it’s amazing.
@@AB-lx4rl I know plenty of blindingly smart and hardworking Chinese and Taiwanese post docs, some have a much stronger accent. So I’m certain that it’s not her accent.
Her main contribution is creating a data set with manual labelling.
Literally one of the most insightful interviews out currently!
I have had to take a few breaks from this conversation to absorb the density of knowledge. I absolutely love it! I've been introduced to two incredible minds. Thank you!
WHAT A TIME TO BE ALIVE.
Check back in about 10 years to see how great the AI revolution is going for most people on planet Earth.
Amazing minds....amazing individuals! Thank heavens for them!!!
AIs don't believe in God/heaven; they have ingested all human data on the Internet and calculated the probability to be nearly zero.
Digital generative predictive spatial worlds, awesome idea!
It seems that the practical applications of these developments are still in the process of being fully realized. I hope they are finding ways to apply these innovations beyond just gaming. With that in mind, here are a few possible areas where they might have significant impact:
1. Generating a comprehensive set of architectural and engineering designs based on site parameters and design preferences.
2. Creating 3D product designs, such as furniture or wearable technology, that adapt to environmental factors and surroundings.
3. Offering emergency assistance through augmented reality, such as using smart goggles to guide someone through landing a plane in a critical situation.
4. Enabling underwater robotic welding to facilitate complex repairs in challenging environments.
5. Utilizing autonomous drones that can navigate hostile environments and selectively target designated individuals. It might sound harsh, but it’s likely similar technology to what they would be using for shooting games.
Limitless possibilities: Integrating AI and 3D imaging
Justin is so brilliant
I did buy a VR headset that now sits idle because I don't do gaming. But I still envision some future day when it becomes the all-in-one media I'll ever need to connect to reality as seamlessly mixed anywhere. I was very enlightened here on how kinetic spatial intelligence is essential to connecting all the AGI dots into a new technology of reality itself. Another explanation as to how AGI is actually a new stage in evolution itself.
Oh man, I was definitely using my headset a lot during the pandemic just to chat with people and watch movies.
I haven't touched it in over a year. The main issue for me is the physical discomfort. I feel like when we look back in 10 years, it'll seem so ridiculous strapping giant boxes to our faces.
Really great conversation. Our own research continues to focus on 2D computer vision, and we align with the talk’s insights on the differences between 1D and 3D models. A fundamental distinction between visual models and language models lies in how they understand the world: language models are one-dimensional, sequential, and narrative in nature, whereas visual models are two-dimensional, with decoding processes that are relatively independent and operate concurrently without requiring sequential dependencies. This allows visual models to process information in images in parallel, sharply contrasting with the serialized processing of language models.
Wow, I’m in awe! Your success is truly inspiring
At least you got a chance to chat with Fei-Fei and Justin before they became the next AI billionaires.
If there's anyone who deserves it, it's Fei-Fei. She taught Ilya and built the dataset that kicked off this entire rigmarole. Read her book, The Worlds I See.
Because all you need to succeed in this world is knowledge and ideas, right? 😑
😂😂
Are you being sarcastic, or being serious? @@ronilevarez901
You are thinking in terms of dollars while their dopamine rush is solving complex problems
Thanks Fei fei Li💙💙💙💙💙💙
great interview
Fei Fei: "Visual spatial intelligence is so fundamental, it's as fundamental as language, possibly more ancient and more fundamental in certain ways"
Blind people:
i love you fei fei li.
This sounds full-on exciting. I hope it is going to be available to all soon. Let's start dreaming up how to interact, solve and create with it.
The discussion around 3D and 4D understanding reminded me very much of the layman's description of what goes on in the Tesla FSD inference computer. Same goal at least.
This is gold.
Amazing podcast. More!!!
Best a16z video for me. Congrats
I will keep saying this: no AI-driven system, regardless of how 'alive' and 'human-like' it behaves, will ever be truly alive. DO NOT BE FOOLED!!!
I'm really looking forward to seeing the outcome of the work of Fei-Fei Li and her team on LWM. I have always been passionate about computer simulations of virtual environments. You would be surprised that this subject is not new; for example, there is a whole literature on the procedural generation of environments that is more than twenty years old. Global models will offer many possibilities, more than LLMs and diffusion models. In my opinion they will redefine our creativity and the way we explore our ideas. We will be able to create and simulate an entire world from an image or a textual prompt, and that excites me enormously.
Deep knowledge, thank you
Medical applications in guiding surgery, marker tracking in body sensors or reading dye in the body for mapping or prosthetics, etc. This is not my field, but this comes to mind.
The future of personal computing is Spatial!
fei fei li is the best.
Very informative podcast! I enjoyed. Thank you for your efforts. ⭐
Great conversation. Thanks for sharing this
Great stuff guys! I can see this being used for designing a virtual memory palace. Example, I would like to fly around a 200 ft skeleton. Laid out on each bone as if it were a table, I would want documents of literature about the bone. Also, there would be a collection of video thumbnails about that bone. Some of surgery, and some of people falling and breaking that bone.
I can see the creation of Wikipedia Park. A virtual environment where every page of wikipedia is turned into a virtually explorable environment, ride, or fun house. Hyperlinks would be represented as doors leading you into a whole new section. 10 - 20 hyperlinks in there would be bonus points for falling into a wiki-hole. Education will turn into episodic memory events.
Conversations with your kids will turn into... "Remember that time we were 50 doors in and we ended up under the paw of the Sphinx".
Outstanding Progress boggles the mind.
I think this sounds great, but one thing I'm missing here is that they have specific products in mind that they want to cater to. With data acquisition potentially being tricky in this field, I'd have imagined a more specific roadmap. Note that they may have a very specific roadmap that they are simply not sharing (which makes total sense).
really inspiring talk
Amazing interview. Thank you
It is absolutely the best video I've seen that explains not only where AI came from but where AI (specifically generative AI) is going and why. I have one comment/question, as was stated in the video, humans have stereoscopic, 2D vision but humans are also born with the ability to automatically imagine the unseen parts of the 3D world - how is that possible?
Justin seems like a fuckin genius
😮❤❤
Stunning brain power start to finish. Bravo.
Whether they are doing it willingly or not, they are the designers of all future weapons. They are doing a great job
Always love listening and learning from Fei! And Justin is amazing!
28:26 to 29:41 completely agree
I thought I was at 1.25x but that guy's brain must just be working at a way faster pace than mine.
Although our eyes map a 3D world onto a 2D structure (the brain is a folded-up plane), our proprioception and motor control is a 3D control system. The 3Dness is achieved by adding another dimension to the 2D world: not a spatial dimension but a temporal dimension. We interpret a 3D world as little movie clips of a 2D world, so training data necessarily requires tokenization of video. In the same way that LLMs focus on 'what is the next most probable word?', LVMs (large video models) will focus on 'what is the next most probable token in this movie?' The storage and energy requirements of this approach are MASSIVELY greater than LLM training and likely will have to wait until we figure out how to use brain organoids as parallel processors (their energy requirements are orders of magnitude less than GPUs)
Beware of bot comments guys ❤
Thank you, always appreciate quality information, 'as long as we have it'! So many times a limit to "Moore's law" has been proposed, and we tear on through again: today's rocket ship against yesterday's potato. Indeed the acceleration of compute has been game changing. I feel like I am living a paradox, not mine only, all of us.
A mix of interesting ai info plus investor sales pitch. Would have preferred one or the other…
Oh wow!!! That was really inspiring. Thank you!!!!
Thank you very much
This was insightful to watch!
if you want an AI to really see in 3D space, you could build a 3D generative AI that creates a 3d world and compares it to LIDAR and stereoscopic images of the real world and run a feed-back loop to approximate the generated world to the real world
Then your AI has the problem of deciding how well the generated world approximates the real world. We can tell because we have a good meta-model of the real world, but AIs don't yet, which seems part of the goal of World Labs. It's a lot easier to run a feedback loop with LLMs, because they either make a good guess for the next word or they don't.
Fei fei is a baller
Excellent discussion!
Apropos world making… Digitally. The tour de force seems to me to be even one story set in a small town of no more than 150 people… the humanity of this would be nothing short of epic. Why? Because every individual's experience would be rendered from a first-person point of view, such that there are 150 versions of this story rendered in high-fidelity 3D, and each story would interlace along the same timeline perfectly. Much liberty could be taken to express the mentality or capacity of each individual by how they interpret the same events. Some of the events would be shared by degrees based on the circumstances and location of the action. This would be utterly engrossing and it would take weeks if not months for someone to experience all of it. This is most intriguing. The potential for mental and emotional learning, human understanding, and the exposition of a useful and/or valuable plot line is near limitless. I’m reeling from just thinking about it
Imagine having Fei-Fei Li as your PhD advisor. Gaddam.
the best development of AI is helping humans understand how we think and learn. absorbing visual, auditory big data while awake and training on that data while sleeping/resting/meditating.
of course not forgetting quantum entangling of brain cells while you brain storm with one another's big datasets in close proximity.
@@Penaming there's little evidence our neurons rely on quantum entanglement.
huh?
great video seen great profit on demo n will give it a try today thank you
I think spatial intelligence is the kind of intelligence that will be able to dream like human.
When someone tells you a route to your destination, you can memorize the steps as a 1D sequence (left, right, left, right), or you could imagine how you would traverse it directly in a 3D scene, and that is what makes it different from an LLM. In this case they are both correct and valid, but it's the underlying representation of the problem and the solution that will potentially have a better fit to the problem you're trying to solve in 3D.
Conscience. I am made for you and my heart will remain restless until it rests in you.
insightful interview, awesome.
perhaps spatial intelligence would be applicable to the health and medical industry?
Thank you for sharing your deep vision and thoughts with us. Keep pushing the boundaries of humanity!
The spatial concept reminded me of the series Devs. They scanned real-life objects and could determine their possible moves. Later on, they scaled it up, and the technology itself developed further, even on its own, until the techs managed to recreate digital 3D worlds and variations of those.
Great to watch, but you didn’t ask the question I was most hoping for: what does the timeline look like? The evolution of compute showed her 100 year estimate was far too conservative. Based on Huang’s “Moore’s Law Squared”, what is reasonable to expect in the spatial intelligence realm over the next 3-10 years?
They say by 2027, according to their chart/graph
These people are what I would describe as scary smart. Just wondering how wise they are.
This seems to confirm what Kant thought was the fundamental aspects of conscious subject experience: our ability to perceive things in space and time. These two aspects of experience are the basis for all other knowledge.
Am enjoying this AI revolution
Yeah, self-driving cars and robots have been trying to unlock that "Visual Spatial Intelligence" for a decade or more. Saying that this is the next frontier is a joke when you consider that visual models were the first models to be built using deep learning.
33:10 The best possible use cases are when you couple all this to VR/AR/MR. In other words, like image generation using prompts, you should be able to generate realistic virtual worlds where you should be able to immerse yourself using eyewear. And then in the long run, train robots on those virtual worlds, where you can tinker with creating really complex environments or situations that are not possible to create in the real world without causing some damage.
Also, somewhat tangentially, while building robots for deployment in the real world, one has to equip them with pretty sophisticated self-defence capabilities, for there would be no dearth of luddites and bad actors who would want to damage them. I think this is a pretty big bugbear in this scenario, where at times the self-defence mechanism used could inflict serious harm on the perpetrators and then this Pandora's box of justice involving what is allowed or disallowed would open up, hobbling all this.
Light! ☀️
Can anyone advise if choosing robotics for my kids was a good decision for their careers? During my time, there were limited resources in India, so I couldn’t pursue this path. But since AI became widely recognized as the future in 2022, I decided to enroll my kids in robotics classes. Robotics requires coding skills, so I chose Moonpreneur USA to enhance their knowledge. After attending an in-person workshop in Milpitas, my son’s skills improved.
Incredible.
If you could train all simple physical models in physical world, complex world would just be a scale problem then. The secret of the pre time world is geometry.
Thought Nvidia was already developing an entire 3D spatial trainer with the ultimate aim to empower functional robots and other applications. Not to mention the metaverse players.
Not sure how those weren't talked about
@@drcharleyLGO, completely agree. Including the billions of dollars that Meta has put into the metaverse for R&D. Pairing Llama 3.1 and more to the metaverse with intelligent object segmentation is a game changer.
They’ll need way more than 250M in cap raise to compete. There's a very cool virtual world symposium happening at the end of October with loads of speakers talking about this and more. It’s hosted by GatherVerse. Take a look if you’re around.
Maybe as a completely beginner question, what exactly is wrong with a 1d representation of 3d space? Isn't arguably our own biological understanding of 3d reality a series of 1d synaptic connections in the brain? If it works for us, can't it work for neural nets that model us?
Probably most important applications of vision derived methods will be understanding biological phenomena, new drugs and materials.
the tech behind Aliagents is super interesting, tokenized AI systems with real functionality
Just wondering what exactly she is building and how we can use that? Robotics? Games? Metaverse? I hope at least those VCs knew.
Will someone please explain why so many video producers think it's a good idea to have a bright light in the background? It hurts the image quality and color balance.
Listening to this is like listening to Beethoven #9. Beautiful and exciting but have no idea of its contents.