Thank you, Dr. Cox! I learned the basics of neuroscience from your amazing online course "Fundamentals of Neuroscience". All three parts of the course are well constructed and very informative! I was surprised (and excited) to find that neuroscientists have switched careers to work in the AI industry. The cross-disciplinary nature of this field, especially its link to neuroscience, has always drawn me to learn more about it. You have set a great example for me, a researcher in finance who is struggling to turn toward the things I have a lifelong passion for. Thank you!
This is by far one of the best talks on Neurosymbolic AI (or, at least, certainly one of my favorites)! Thank you for posting!
I was looking for the Physics of AI part; in the intro he mentioned the power consumption of AI processing/training, but there was nothing about it in his talk or slides. Great series of videos, thanks MIT and all.
Have a look at this: ruclips.net/video/eZdOkDtYMoo/видео.html Not sure if it specifically answers your questions around the physics of AI, but it does cover some advances in specialized algorithms and hardware to tackle power consumption.
@20:39 "Does anyone notice anything fucked up about these chairs"
GigaChad
Amazing lecture, thanks MIT!
Doing RAG with LLMs is a form of neurosymbolic AI (since all you're doing is putting background knowledge into the model). I was mind-blown to learn about this.
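Here's roughly what I mean, as a minimal sketch. The toy keyword retriever, the docs list, and the call_llm stub are all made up for illustration; a real setup would use an embedding-based retriever and an actual LLM API.

# Minimal RAG sketch: retrieve background text, prepend it to the prompt, ask the model.
docs = [
    "Neurosymbolic AI combines neural networks with symbolic reasoning.",
    "RAG retrieves background documents and adds them to the prompt as context.",
    "CLEVR is a dataset of rendered 3D scenes used for visual reasoning.",
]

def retrieve(query, docs, k=2):
    # Toy retriever: rank documents by word overlap with the query.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def call_llm(prompt):
    # Stub standing in for any real LLM call.
    return f"(model answer conditioned on a {len(prompt)}-character prompt)"

query = "What does RAG add to the prompt?"
context = "\n".join(retrieve(query, docs))
prompt = f"Background knowledge:\n{context}\n\nQuestion: {query}\nAnswer:"
print(call_llm(prompt))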
Five years ago, at an IBM Watson services presentation meetup, I said that I had an algorithm that could do all this, and they answered that it was impossible. Now, five years later, they have researched something similar. Finally... Great job, IBM.
The problem with the ability to do neurosymbolic AI is the lack of dedicated encoding layers. Your eyes and optic nerves are dedicated feature-encoding layers. The brain doesn't need to "prove" to itself that the signals coming from the eyeball are correct inputs into the neural network, or spend a whole lot of time figuring out what is important, what is an error, or how to filter them out. It just takes the signal as is and tags it, in one pass, with no need for another. Green is green. Round is round. Furry is furry. Done.

Once those signals are stored in the cerebral cortex as new networks of "memorized" signal inputs, that is where the symbolic part comes into play, because now it is in memory, in the brain, and completely disconnected from the real world. The memories can be replayed at will (refired, retriggered, re-experienced, whatever) and linked in various ways. The brain is then purely focused on direct inference for determining what any combination of encoded features "means" or "represents". Part of this is the ability to remember shapes, colors, and patterns, and to use them as inputs into a direct-inference probability calculation to determine what they represent. So you can look at a piece of abstract art and see a cup and saucer, because on one level the brain is seeing the neurons fire for the shape of a cup and saucer, which is associated with the previous memory of a cup and saucer's shape.

This works because the encoded signals for the image of a cup and saucer have more fidelity and feature depth than a 2D image on a computer. There is the convex/concave shape of the surface based on light and shadow. There is the silhouette of the shapes. There are the textures of the fur, and the colors on the texture of the fur. Each of these becomes a primary feature set that can be evaluated independently as a first-order feature value, not strictly tied to any other "predefined" training state. So the brain sees signals for "fur" and "cup" and determines "furry cup".

Current AI systems aren't trained that way, because they are simply trained on a single tagged image of a cup and saucer as "cup and saucer". All the deep feature values are lost and compressed into a single statistical input for a training set, which cannot recreate the feature fidelity of what the brain gets from the eyes. Those lower-layer features are glossed over for the purposes of the higher-order statistical training values. So the AI only sees cup as cup, and all the feature layers below that are discarded in determining the encoding for "cup", because that is how the encoded raw pixels are tagged. "Furry cup" can never be found in this system, because "fur" is a lower-level feature set that is not tagged as first-order training data.
Interesting point!
"..Current AI systems aren't trained that way because they are simply trained on a single tagged image of a cup and saucer as cup and saucer. All the deep feature values are lost and compressed into a single statistical input for a training set which cannot recreate the feature fidelity of what the brain gets from the eyes. Those lower layer features are glossed over for the purposes of the high order statistical training values...."
If we have a sufficient number of images with "furry cups" in a given dataset, which also contains a sufficient number of "cups" and other "furry" objects, I wonder what the result would be if, instead of labeling them exclusively in one category (e.g., [0, 1, 0, 0, 0] for a one-hot encoding scheme over [obj1, furry-cup, obj3, obj4, obj5]), we labeled them with a certain degree of uncertainty. For example, given the labels [obj1, cup, obj3, furry, obj5] ("one-hot encoding" for the softmax output), the "furry cup" could be [0, 0.6, 0, 0.4, 0], the "cup" [0, 1.0, 0, 0, 0], and other "furry" objects [0, 0, 0, 1.0, 0]. (A toy sketch of this soft-labeling idea is below.)
Your argument reminded me of this paper (which I haven't read yet): "VirTex: Learning Visual Representations from Textual Annotations" (arxiv.org/abs/2006.06666), and now I am curious to read it.
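For concreteness, here is a toy sketch of the soft-label idea above (a made-up 5-class example in plain NumPy; the class names and numbers are illustrative, not from any paper):

import numpy as np

# Classes: [obj1, cup, obj3, furry, obj5]
soft_target = np.array([0.0, 0.6, 0.0, 0.4, 0.0])   # "furry cup" split across cup and furry
one_hot_cup = np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # plain "cup"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(target, logits):
    # Works for soft targets as well as one-hot ones: -sum(target * log(p)).
    return -np.sum(target * np.log(softmax(logits) + 1e-12))

logits = np.array([0.1, 2.0, -1.0, 1.5, 0.0])        # pretend model outputs for one image
print(cross_entropy(soft_target, logits))
print(cross_entropy(one_hot_cup, logits))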
@@viniciusarruda2515 " I wonder what would be the result if instead of labeling them exclusively in one category we label it in a certain degree of uncertainty" - is this like compositional learning ? i.e Learning the whole from learning the parts that its composed of.
@@nitind9786 I think so. At the end of training a network with this type of data, I would expect it to learn to extract features in a more compositional way. I do not know what experiment could be done to prove that such a model has learned compositional features. By the way, an intriguing paper on the topic: "Recognition-by-Components: A Theory of Human Image Understanding".
@@viniciusarruda2515 Thanks for replying. Compositional learning does make sense. However, don't you think compositional learning in a supervised regime would be a pretty hard task? Wouldn't it be extremely difficult to create the (annotated) training set?
Or can you think of unsupervised or reward-based (RL) ways to achieve compositional learning?
@@nitind9786 I don't know how we could achieve this via RL or without supervision. And sure, it seems to be a hard task to do via supervision and to create a dataset that elicits this behavior from the model. However, everything needs a starting point, and for me the first one is to check whether current object detectors (e.g., FasterRCNN, EfficientDet) have the capability of learning compositional features. In the near future I intend to create an artificial dataset to run this experiment. However, I still have no idea how to change/adapt the last linear layers (a rough sketch of one option is below). My intuition is that putting this inductive bias toward composition inside the object detectors may yield higher mAP.
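In case it helps, one hedged sketch of adapting the last layer: replace the single softmax class head with a multi-label attribute head (independent sigmoids plus BCE), so "cup" and "furry" can be predicted together. This is just an illustration in plain PyTorch with made-up sizes and attribute names, not how the FasterRCNN or EfficientDet heads are actually built:

import torch
import torch.nn as nn

feature_dim = 256                                   # assumed size of the per-ROI feature vector
attributes = ["cup", "saucer", "furry", "round", "green"]

head = nn.Linear(feature_dim, len(attributes))      # one logit per attribute
criterion = nn.BCEWithLogitsLoss()                  # independent sigmoid per attribute

roi_feature = torch.randn(1, feature_dim)           # fake ROI feature from a backbone
target = torch.tensor([[1., 0., 1., 0., 0.]])       # "furry cup": cup and furry are both 1

logits = head(roi_feature)
loss = criterion(logits, target)
print(torch.sigmoid(logits).detach(), loss.item())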
25:51 - For fun, I asked ChatGPT a few questions from this section. It reasoned well and got them mostly right (it messed up the color blocks a bit). Apparently ChatGPT doesn't use the technique discussed here (I asked it).
Alex - I know this video is 4 years old and intense research is being done in this field. What, if anything, would you suggest folks read about this, or which architectures, to stay abreast of the field (if it's still relevant)?
eagerly waiting
As of November 2024, the example used in this talk, "What's the shape of the red object?", can be easily solved by LLMs with vision capabilities. Do LLMs use neurosymbolic approaches, or was this way of thinking about the limitations of non-symbolic deep learning just flawed?
Mind. Blown.
Thanks
Very interesting approach!
Is there any book that talks about Neurosymbolic AI?
I think this arXiv paper should help! arxiv.org/pdf/1711.03902.pdf
"Neural-Symbolic Learning and Reasoning: A Survey and Interpretation" by T.R. Besold et al.
Can we get the slides? They are not at the link in the description here.
excellent!
WOW !
👏🏻👏🏻👏🏻
Why are people using Premiere? I don't see a real benefit to using it. Can somebody explain it to me?
So that people can set reminders and don't miss the video; since it is a series, people are eagerly waiting for the next one. Also, Premiere videos are highly likely to appear in RUclips recommendations.
Another advantage I can see is that people can discuss the video in real time.
Build hype and anticipation, and yes, set reminders. The idea is that it gives more initial views when the video is posted, which feeds the RUclips algorithm. I get it for movie or album releases, but I'm not sure about these types of educational videos. People who watch these kinds of videos are in the minority and would find and watch it (or not) either way.
Thanks everyone!!
31:38
The uptalk tho... :/
Well... lots of talking about how shitty deep learning is, but no time left for the actual relevant information.
2050 and beyond hua hua huaaaaa
Another academic with no idea what he's doing. But still doing it!
2050 and beyond hahahaha