It seems like tokenization would really mess with a model's ability to do some of these tests; I don't know how much the format of something like that would even be preserved. It also makes me wonder how much the lack of physical 3D movement data/training would impact some of these reasoning tasks. You can even notice in her own language about these concepts how much "spatial" reasoning is involved.
It seems like, to run one of these tests fairly, you would have to somehow homogenize the way every test taker receives it.
It brings to mind disabilities, you know? You wouldn't necessarily expect someone born not only blind, but completely unable to process visual data in any way we would recognize, to solve visual tasks, unless they generalize well from whatever mediums they do know.
I once saw a research 'paper' on decoding the actual nerve signals that are sent from the mammalian retina to the brain.
The "visual data" turned out to have already been pre-processed into something like six channels of what could be described as very stripped-down image data.
There was a "channel" of mostly just high contrast edges. There was a "channel" of cells that had recent luminosity changes.
There were some "channels" whose purpose was apparently still a matter of "WTF is this?"
So I would not be surprised if information flow in the human brain turns out to be way more "tokenized" than we assume.
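As a loose computational analogy to those channels (not the paper's actual decoding, just a sketch assuming plain NumPy and two grayscale frames), here is roughly what a "high-contrast edges" map and a "recent luminosity change" map could look like:

import numpy as np

def edge_channel(frame):
    # Crude gradient magnitude: large where contrast changes sharply.
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy)

def luminosity_change_channel(frame_t, frame_prev):
    # Large where brightness recently changed between frames.
    return np.abs(frame_t.astype(float) - frame_prev.astype(float))

rng = np.random.default_rng(0)
frame_prev = rng.random((64, 64))
frame_t = frame_prev.copy()
frame_t[20:30, 20:30] += 0.5  # a patch that just got brighter

edges = edge_channel(frame_t)
changes = luminosity_change_channel(frame_t, frame_prev)
print(edges.shape, changes.max())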
22:35 - How many of those tasks require knowledge of the exact spelling of a word? LLMs are only passed the encoded tokens, and may not be aware of spelling in a way that allows, e.g., acronym tasks.
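A quick sketch of that point (assuming the tiktoken package; cl100k_base is the encoding used by GPT-4-era models): the model receives integer token IDs for sub-word chunks, not individual letters.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "acronym"
ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace") for t in ids]
print(ids)     # a short list of integer IDs
print(pieces)  # the sub-word chunks the model actually sees, not letters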
Can you post the discussion?
They can't; it's logically unavailable.
In my opinion, understanding is actually pretty clear. In humans, a very useful skill in language acquisition is circumlocution, or referring to a concept without using its name. Now, for a text-only LLM it's possible that could be done by regurgitating training data directly (what are the odds that some common turns of phrase show up on Wikipedia? Or that a dictionary would find its way into a dataset?), but in a multimodal LLM, I think the ability to verify similar patterns of neuron activity for analogous inputs across modalities is pretty indicative of strong understanding and generalization. In other words, I think the strength of understanding can roughly be measured as the number of unique inputs that can trigger a given pattern of behavior in the FFN, or lead to the same or similar output in the language head.
Agreed. This is literally the basis for the formation of internal world models. To me, the world model is itself confirmation of a deep contextual understanding, apart from bottlenecks in that understanding (knowledge conflicts, hallucinations, etc.). This is an architectural and data issue, though. FYI, temporal self-attention makes a world of difference; the model needs native temporal embeddings.
EXCELLENT insights! Thank you so much for sharing!
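To the point above about analogous inputs producing similar internal activity: here is a toy sketch (assuming the Hugging Face transformers and torch packages and the small distilbert-base-uncased checkpoint, all illustrative choices) of checking whether a circumlocution lands closer to the named concept than to an unrelated control in a model's hidden states.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text):
    # Mean-pool the last hidden layer into one vector per sentence.
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

a = embed("a tool for driving nails into wood")  # circumlocution
b = embed("a hammer")                            # the named concept
c = embed("the quarterly earnings report")       # unrelated control

cos = torch.nn.functional.cosine_similarity
print("circumlocution vs. name:   ", cos(a, b, dim=0).item())
print("circumlocution vs. control:", cos(a, c, dim=0).item())

This is nowhere near a real measure of understanding, but it is the shape of the experiment described above: many distinct surface forms converging on similar internal representations.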
It would be interesting to see the difference between LLMs trained on non-fiction, realistic fiction, and fantasy.
"It's understanding, Jim, but not as we know it."
So why did her methodology fail here? We have to go back to basics, because you simply can't skip them:
1) Humans are repetition machines: they repeat and recombine their experiences. You can see it especially in the arts, where we call it "inspiration": humans take inspiration from their life experiences and recombine it.
2) AI is the same: models too are repetition machines that recombine experiences, but their experiences are different from humans'.
3) Hence, for a fair comparison, you may not test humans on subjects they have not experienced, and likewise for AI. However, her entire testing methodology was based on experiences that only humans have had.
Basically, she almost got it when she said that a child learns to wear socks under shoes from experience, but she did not narrow her tests to experiences common to AI and humans, which makes them a test of translation rather than of understanding.
25:06 I consider myself to be reasonably intelligent, but I am absolutely stumped by Problem No. 1. How are you supposed to evaluate the three blocks of letters below the alphabet? Are the two blocks on the first line supposed to serve as an example? Are you supposed to consider all three blocks together? Does the order of the blocks matter? I suspect that there is something implied here that I am missing.
Read the blocks left to right. Notice that the second block matches the alphabet with the jumbled letters. The second row is the test: match the similar sequence, replacing the 'l'.
fghij?
Nice try chat GPT
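For anyone still puzzled by the format: these are letter-string analogies in the style of Hofstadter and Mitchell's Copycat domain. The exact problem from the talk isn't reproduced here, but a toy version of the general shape (plain Python, assuming the classic "abc changes to abd" successor rule) looks like this:

def apply_successor_rule(source, result, target):
    # Toy solver: find the one position that changed from source to result,
    # then bump the letter at that position in target to its alphabetic successor.
    changed = [i for i, (s, r) in enumerate(zip(source, result)) if s != r]
    i = changed[0]
    out = list(target)
    out[i] = chr(ord(target[i]) + 1)
    return "".join(out)

# "abc changes to abd; what does ijk change to?"  ->  "ijl"
print(apply_successor_rule("abc", "abd", "ijk"))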
26:18 -- Those results are highly suspicious. I suspect there's a selection bias in the human scores. The workers don't want to lose their qualifications, and so won't attempt tasks that are too difficult. So there may be a drop-off in participation or submission if a human worker feels they may be in error on the more complex tasks. Furthermore, the human workers were given "room to think", where the prompting of the LLMs suggests they were not. I suspect allowing GPT-4 to use step-by-step reasoning would improve its score across the board, and dramatically more so if it's allowed to write a Python script to solve the problem.
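A hedged sketch of what "room to think" could look like on the model side (assuming the official openai Python package with the v1-style client, an OPENAI_API_KEY in the environment, and an illustrative model name; the task string is a placeholder):

from openai import OpenAI

client = OpenAI()
task = "..."  # one of the concept/letter-string tasks would go here

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Work through the problem step by step, then give your "
                    "final answer alone on the last line."},
        {"role": "user", "content": task},
    ],
)
print(resp.choices[0].message.content)

Whether this actually closes the gap on these tasks is an empirical question; the point is only that the comparison isn't apples to apples unless both sides get the same latitude.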
The tasks where LLMs fail are either useless ones (there isn't much training data for them) or ones based on vision capabilities (where these systems are still a lot worse).
lol! the difference in performance between the humans doing the test for free and the people being paid is about the same jump in performance you get from chatgpt when you just give it a prompt vs if you tell it "i'll give you $20 if you do a good job" xD
The AI's understanding of the world is different from a human being's.
Melanie Mitchell, ILY
Really good talk :)
LeCun is overrated in the AI community. 🤭
Nope
@J_Machine C'mon. 😂
@AlgoNudger You don't understand anything about AI 🤦‍♂️
@J_Machine Now you sound like a stochastic parrot. 🤭
@AlgoNudger If there is a stochastic parrot, that must be you 😁😁😁😁
Mitchell is great! Love her work. But LeCun's reckless, self-serving comments should not be elevated so high. It's like a TV news program hosting a flat-Earther to give both sides of the story.
There are people who do, and those who criticize