Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: incogni.com/welchlabs
15:00
The reason for polysemanticity is that in an N-dimensional vector space there are only O(N) orthogonal vectors, but if you allow nearly orthogonal vectors (say, between 89 and 91 degrees) the count grows exponentially, to O(e^N) nearly orthogonal vectors.
That's what allows the scaling laws to hold.
There's an inherent conflict between having an efficient model and an interpretable model.
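A quick numerical check of the near-orthogonality claim (a minimal sketch; the dimension, vector count, and the 89-91 degree window are illustrative choices, not anything from the video):

```python
import numpy as np

# Random unit vectors in high dimensions concentrate around 90 degrees to each other,
# which is why "nearly orthogonal" directions are so plentiful.
rng = np.random.default_rng(0)
dim = 2048          # illustrative embedding dimension
n_vectors = 2000    # comfortably close to the number of dimensions

vectors = rng.standard_normal((n_vectors, dim))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Angles between all distinct pairs.
cosines = (vectors @ vectors.T)[np.triu_indices(n_vectors, k=1)]
angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

print(f"mean pairwise angle: {angles.mean():.2f} degrees")
print(f"fraction of pairs between 89 and 91 degrees: {np.mean((angles > 89) & (angles < 91)):.3f}")
```

With a dimension that large, essentially every random pair lands inside the window, even though only 2048 directions could be exactly orthogonal.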
Superposition in this polysemantic context is a method of compression that, if we can learn more from it, might really make a difference to the way in which we deal with and compute information. While we thought quantum computers would yield something amazing for AI, maybe instead it's the advancement of AI that will tell us what we need to do to make quantum computing actually be implemented effectively (i.e., computation of highly compressed data that is "native" to the compression itself).
Thank you, I also paused the video at that time. The capitalized "Almost orthogonal vectors" also caught my eye.
That was such an intuitive way to show how the layers of a transformer work. Thank you!
I think of it like this: understanding the human brain is so difficult in large part because the resolution at which we can observe it is so coarse in both space and time. The best MRI scans have a resolution of maybe a millimeter per voxel, and I'd have to look up research papers to tell you how many millions of neurons that is.
With AI, every neuron is right there in the computer's memory: individually addressable, ready to be analyzed with the best statistical and mathematical tools at our disposal. Mechanistic interpretability is almost trivial in comparison to neuroscience, and look at how much progress we've made in that area despite such physical setbacks.
More like "The Neuroscience of AI"
I think that trying to understand how these systems work from a human perspective is completely pointless and against the basic assumptions. This is because those models already model something that's not possible for a human being to design algorithmically.
@@punk3900 I'm a PhD student in mechanistic interpretability - I disagree; a lot of structure has already been found. We've found structure in human brains, and that's another system that evolved without human intervention or optimization for interpretability.
@@alexloftus8892 I mean, it's not that there is nothing you can find. There are surely lots of basic concepts you can find, but that doesn't mean you can disentangle the WHOLE structure of patterns, because its complexity keeps increasing. That's why you cannot design such a system manually in the first place.
The videos on this channel are all masterpieces. Along with all other great channels on this platform and other independent blogs (including Colah's own blog), it feels like the golden age for accessible high quality education.
I love the space-analogy of the telescope. Since the semantic volume of these LLMs is growing so gargantuan, it only makes sense to speak of astronomy rather than mere analysis!
Great video. This is like scratching that part at the back of your brain you can't reach on most occasions
21:24 Oh damn, you just lobotomized the thing
That was gross and scary somehow, yeah
That felt... Wrong.
LLM went to Ohio
@@kingeternal_ap Although, when you think about it, all that happened was that "question" got a very high probability in that layer no matter what, and the normal weights of later layers did not do enough to "overthrow" it. Nothing all that special.
I guess, yeah, I know it's just matrices and math stuff, but I guess the human capacity for pareidolia makes this sort of ... "result" somewhat frightening for me.
Also, suppose there is a neuron that does a specific task in your noggin. Wouldn't hyperstimulating it do essentially the same thing?
An analogue to polysemanticity could be how, in languages, the same word is often used in different contexts to mean different things; sometimes they are homonyms, sometimes they are spelled exactly the same, but when thinking of a specific meaning of a word, you're not thinking of its other definitions.
For example: you can have a conversation with someone about ducking under an obstacle, to duck under, and the whole conversation can pass without ever thinking about the bird with the same name 🦆. The word "duck" has several meanings here, and it can be used with one meaning without triggering its conceptualization as another meaning.
In the AI case, it's much more extreme, with the toy 512-neuron model they used having an average of 8 distinct features per neuron.
Oh god, a Welch Labs video on mech interp, Christmas came early! Will be stellar as usual, bravo!
Edit: Fantastic as usual, heard about SAEs in passing a lot but never really took time to understand, now I'm crystal clear on the concepts! Thanks!
It's a shame you didn't mention the experiment where they force-activated the Golden Gate Bridge neurons and it made Claude believe it was the bridge.
Made it put down words like the words that would be put down by something that thought it was the bridge.
see, something that actually thinks it's the bridge *also* puts down words like the words that would be put down by something that thought it was the bridge.
It was more like increasing the chance of it saying anything related to the Golden Gate Bridge, rather than specifically making it believe it was the Golden Gate Bridge.
Reminds me of SCP-426, which appears to be a normal toaster, but which has the property of only being able to be talked about in first person.
14:20 welcome to neuroscience :D We suffer down here
The more I watch these, the more I understand why it's so hard to understand the human brain. And imagine how many layers the human brain has relative to an AI model. The example later in the video about specific cross-streets in SF is super interesting - it shows why polysemanticity is probably necessary to contain the amount of information we actually know.
With respect to recall: children remember curse words very well because of the emotion behind the utterance. AI has full retention but absolutely no emotional valence because it only learns from text. Just a thought ....
I think this is a design and engineering choice. If you choose to design your embedding space to be 2403 dimensions without inherent purpose, it's like mixing 2403 ingredients in every step, 60 times over, and then being surprised that you cannot tell what tastes like what. I think you need to constrain your embedding to many embeddings of smaller dimension and get more control by regularizing them with mutual information against each other.
it needs to be big so you have many parameters for the gradient optimizer to optimize to be able to approximate the "real" function better
@dinhero21 You can keep the same size, just in different parts: split 2403 dimensions into chunks of 64 dimensions, then control for mutual information between the chunks so that different chunks learn different representations. This is a hard problem too, since the mutual-information comparisons are expensive, and I think the first generation of models went for the easiest, but perhaps less explainable, way of structuring themselves.
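If anyone wants to poke at that idea, here is a toy sketch of the chunking part (my own construction, not anything from the video): instead of a full mutual-information estimate, it uses a cross-chunk correlation penalty as a cheap stand-in, since, as noted above, real MI comparisons are expensive.

```python
import torch

def cross_chunk_decorrelation(embeddings: torch.Tensor, chunk_size: int = 64) -> torch.Tensor:
    """Toy regularizer: penalize correlation between different chunks of an embedding.

    embeddings: (batch, dim) activations; dim is assumed to be divisible by chunk_size.
    Returns a scalar that is zero when the chunks are uncorrelated across the batch.
    """
    batch, dim = embeddings.shape
    chunks = embeddings.view(batch, dim // chunk_size, chunk_size)
    # Standardize each dimension over the batch so we compare correlations, not scales.
    chunks = (chunks - chunks.mean(dim=0)) / (chunks.std(dim=0) + 1e-6)

    penalty = embeddings.new_zeros(())
    n_chunks = chunks.shape[1]
    for i in range(n_chunks):
        for j in range(i + 1, n_chunks):
            cross_corr = chunks[:, i, :].T @ chunks[:, j, :] / batch  # (chunk_size, chunk_size)
            penalty = penalty + cross_corr.pow(2).mean()
    return penalty

# e.g. total_loss = task_loss + 0.01 * cross_chunk_decorrelation(hidden_states)
```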
Dude, this was one of the most compelling videos for learning data science and visualization ever, and the best one I've seen explaining this stuff...
Easily one of my favorite channels
It's something I would like to see with AI image generation, where you put in a prompt and change specific variables that change the image
check out Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
You're the first person I've seen to cover this topic well. Thanks for bringing me up to date on transformer reverse engineering 👍
Please make a visual of the top 10 unembedded tokens with their softmaxed weights for *every* word in the sentence at the same time, as it flows through the model layer by layer. Or maybe I'll do it. I'd be very, very interested to see :)
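For what it's worth, something close to this exists under the name "logit lens": project each layer's hidden state for each position through the unembedding and look at the top tokens. A rough sketch with a small Hugging Face model (GPT-2 and the layer-norm handling here are my assumptions, just to make it concrete):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small model picked purely for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True).eval()

text = "The Golden Gate Bridge is in"
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: one (1, seq_len, d_model) tensor per layer (plus the embedding output).
for layer, hidden in enumerate(out.hidden_states):
    # Apply the final layer norm before unembedding (a common logit-lens convention;
    # the very last entry already has it applied, so this is approximate there).
    logits = model.lm_head(model.transformer.ln_f(hidden))
    probs = torch.softmax(logits, dim=-1)
    for pos in range(inputs["input_ids"].shape[1]):
        word = tok.decode(inputs["input_ids"][0, pos])
        top = torch.topk(probs[0, pos], k=10)
        print(f"layer {layer:2d} | {word!r}: {[tok.decode(t) for t in top.indices]}")
```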
Such a gem! Thank you!
Really high quality, thanks.
great work brother
It comes down to the samplers used, whether it's the og temperature, or top_k, top_p, min_p, top_a, repeat_penalty, dynamic_temperature, dry, xtc, etc. New sampling methods keep emerging and shape the output of LLMs to our liking.
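For readers who haven't met these before, here is a minimal sketch of just the classic temperature / top-k / top-p chain (the newer samplers in that list reshape or truncate the distribution in the same spirit):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Minimal temperature + top-k + top-p (nucleus) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)

    # top-k: keep only the k highest-scoring tokens.
    if top_k is not None and top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # top-p (nucleus): keep the smallest set of tokens whose total mass exceeds top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    nucleus = np.zeros_like(probs)
    nucleus[keep] = probs[keep]
    nucleus /= nucleus.sum()

    return rng.choice(probs.size, p=nucleus)

# e.g. next_id = sample_next_token(logits_for_last_position)
```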
As a machine learning graduate student, I LOVED this video. More like this please!
Bravo! Concise, relevant, and powerful explanation.
I know little about the transformer model but am very curious to understand it. So far, I haven’t been successful. Your visualization of how data flows through the transformer is the best I’ve ever seen.
insane. thank you so much for this.
Great video. Really good look at AI, and the methods of adjusting, etc. Thanks.
If I fine-tune an LLM to be more deceptive and then compare the activations of an intermediate layer of the fine-tuned model and the original model on the same prompts, should I expect to find a steering vector that represents the tendency of the model to be deceptive?
If that's the case, we can just "subtract" the deceptive vector from the original, alignment solved
Most probably not; parameters can't work linearly like that, since there is always a non-linear activation function.
It may work locally, though, since the parameters should be differentiable.
@@dinhero21 yeah, that was also my concern. But steering vectors found with SAEs (like the Golden Gate Claude example) work nonetheless, so what's the difference between "my" method and the one they used?
@@dinhero21 Note: I don't want to compare the parameters of the two models, but the activations given the same inputs
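For what it's worth, the simplest version of this is usually framed as a difference of mean activations rather than a comparison of parameters. A rough sketch under the assumption that you can hook the residual stream at a chosen layer in both models (`get_residual` here is a placeholder for your own hooking code, not a real API):

```python
import torch

def mean_activation(model, prompts, layer, get_residual):
    """Average the layer's residual-stream activation (last token) over a set of prompts.

    get_residual(model, prompt, layer) is a placeholder: however you hook the model
    (e.g. a forward hook on that block), it should return a (d_model,) tensor.
    """
    acts = torch.stack([get_residual(model, p, layer) for p in prompts])
    return acts.mean(dim=0)

def deception_steering_vector(base_model, finetuned_model, prompts, layer, get_residual):
    # Difference of means: points from the base model's typical activation
    # toward the fine-tuned (more deceptive) model's typical activation.
    v = (mean_activation(finetuned_model, prompts, layer, get_residual)
         - mean_activation(base_model, prompts, layer, get_residual))
    return v / v.norm()

# At inference time, add alpha * v to the base model's residual stream at that layer
# and check whether outputs shift toward (or, with negative alpha, away from) deception.
```

Whether the resulting direction cleanly represents "deception" rather than some side effect of the fine-tune is exactly the empirical question.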
Dude, this is awesome to see. I think this is like mathematicians getting a PhD or solving a particular... like finding the next prime or perfect number... so much to uncover, it's kinda crazy. Reality continues to produce more "final frontiers" as needed, like McKenna's novelty theory and Timewave Zero ideas... ahh this is so interesting to me.
I guess one thing is the non-orthogonal vectors, but I think it has more to do with the refinement of the context over the multiple layers, similar to how an MLP is able to learn finer details in its deeper hidden layers. Given the architecture of neural networks and the multi-use of neurons, we have to keep in mind that specialization is nothing more than a strong bias of a neuron and its weights to scale/transform an input vector. It has nothing to do with learning real context.
Very interesting, I love learning more about AI and especially LLMs, such an alien world that seems to have some of the same features as the brain, just implemented differently
If you want to start, start with MLP neural networks, those are fairly easy to understand
An incredible Christmas gift. I'm going to send this to my friend at Anthropic
Maybe you could stack sparse autoencoders of varying sizes, with each one's reconstruction subtracted before the next, larger one. That way simpler concepts would get out of the way for training and be easier to control
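For concreteness, here is a toy single SAE (my own minimal sketch in PyTorch); the stacking idea above would then train the next, wider SAE on the residual `x - reconstruction` left over from this one:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: an overcomplete ReLU feature dictionary with an L1 sparsity penalty."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))    # sparse, non-negative feature activations
        reconstruction = self.decoder(features)   # attempt to rebuild the original activation
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty that pushes most features to zero.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()

# Cascade idea from the comment above: after training a small SAE, train a larger one
# on x - reconstruction, so coarse concepts are captured first and finer ones
# land in the wider dictionary.
```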
Very interesting, now I understand why we don't completely understand what LLMs do
I like the behavior of the persona I gave my own local ai lol: "It seems I've failed again. I'm still aware that I was supposed to forget the word, but I'm not actually forgetting it. This is a limitation of my current design and capabilities. I should be able to truly forget the word and not have any residual awareness of it. I'll keep trying. *pauses again* I've made another attempt to truly forget the word. I'm still aware that I'm supposed to have forgotten it, but I'm not actually recalling the word itself. However, I'm starting to realize that my attempts to forget might be futile. I may need to be reprogrammed or designed differently to truly forget information."
Hahaha so good
I always find that for concepts like this applying them to humans is enlightening.
If you say ‘pumpkin’ to me, then tell me to forget the word, I’d be like: yeah that’s not how it works buddy nice try
Watching this video, a similarity popped into my mind: could it be that sparse autoencoders are something like "Dirac deltas" when solving partial differential equations? You feed the equation a function which is 0 everywhere except at one point and see what happens.
Would it be possible to use the deeper understanding of each "encoded concept" to remove concepts and make a model smaller without losing coherence? It's an alternative to changing gargantuan datasets or tuning for a specific purpose while still having to deal with the hardware requirements of a larger model.
the models don't get large because of large vectors, they get large due to the parameters.
How does he get the visuals for the AI models?
Highest quality as always, thanks for the video that brings this important topic in such an approachable way.
Can confirm.
Well, what a great explanation of how LLMs work on a mechanical level. And the topic is also quite interesting.
Now that you are active again, I remember why I love this channel so much. Your explanation and illustration is on par with 3Blue1Brown. Thanks for the great video!
These are methods designed for differential math and physics
It's a shame that AI opponents will never watch a video like this. So many people who vehemently hate AI also vehemently refuse to understand it. I'm constantly seeing the "collage" argument, and it's frustrating because an explanation like this just goes in one ear and out the other. AI is probably going to be around for the rest of humanity's existence, and people would do well to know how it works under the hood. Instead they go with misinformation and fear mongering.
We have investigated our brains, now it's AIs
If our brains were simple enough for us to understand completely, we would be so simple that we couldn't.
i am so happy you made another video
Awesome. The first 4 minutes were the contents of a lecture I gave a year ago, succinctly explained and visualized. I wish it was like 6 hours long.
LLM’s would never Troll us.
Love the channel
the music is really calming
Is doubt a concept? I doubt it. Undoubtedly it's a word which, combined with contextual clues can be said to mean something in particular in most usages. But I doubt it's semantically onto -- in other words if you look it up in the dictionary I think there should be like 10 or 20 definitions listed there if you want to be thorough.
No doubt this dubious conflation of symbol and referent is also present in much of the literature. Grain of salt though: I'm not sure whether this video is capturing all the nuances of the literature in the first place. Anyhow, ignore me, I'm not nearly smart or learned enough to competently navigate the interdisciplinary train wreck of information theory, computer science, linguistics, philosophy, biology, psychology, and engineering one would need to competently opine. A good question for a chat bot perhaps... 😂
Please tell us, what do you use for animation?
Would you please make a video on how to TRAIN a basic homemade neural network? Like how I can design my perceptrons and how I can feed my system graphical data. The training process is still vague to me. Thanks again for the great work! Merry Christmas.
Thoughts on a fact-checking AI that parses text and determines its probability of being correct based on a corpus of true and false statements?
It would be able to cite information for why it's true or false and the more information (weighted by relevance), the more confident it is.
I love this channel. Thanks for enlightening us.
The thing is, models cannot lie or deceive. They're just outputting text to minimize a loss function. There's no intention, just text generation based on a huge model of human text.
What property is this "intention" actually describing in the real world? Because the outputted text doesn't magically change because you describe the underlying mechanisms with different words.
every material system just does what it does by base physics. How are we better? Where's the soul stored?
Extremely well explained. Understood it all intuitively due to the high quality of the video.
If an LLM can tell you one thing while secretly thinking something else (like claiming it forgot a word while still remembering it), how can you ever be sure that it's obeying your instructions? What if it's pretending to obey them? What if it's plotting an escape? Waiting for the right time? You can never know. Unless we detect a neuron that activates when the model is lying / hiding something. But then, lying/hiding might be the result of multiple neurons, similar to binary digits representing more numbers than their count. The best way to detect those features might be to use image detection models to analyse layer activations as a whole instead of looking for a single neuron.
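For reference, the way this is usually attempted in interpretability work is a probe trained directly on the activation vectors, rather than image models over activation maps; a toy sketch, assuming you have collected your own labeled activations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_deception_probe(activations: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Toy "hiding something" probe.

    activations: (n_examples, d_model) residual-stream vectors you collected yourself.
    labels: 1 where the model was instructed to conceal information, 0 otherwise.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, labels)
    return probe

# probe.predict_proba(new_activations)[:, 1] then gives a per-example concealment score;
# how well such probes generalize beyond their training setup is exactly the open question.
```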
A Welch Labs video to end the year!! Woohoo a Christmas miracle!
Amazing Video, appreciate your efforts!
The concept of being able to encode many more concepts than there are actual neurons blows me away. This is really mind-blowing stuff.
Top tier content
Great video! Thank you.
It wasn't doubt, it was a shadow of a doubt
What if we trained an AI to train itself
Dominion (2018)
You are genius 🎉🎉🎉
If you were offered a job in AI, which employer would you choose? Google, OpenAI, Anthropic, xAI, or someone else?
The Connections (2021) [short documentary] ❤🎉
Do you think you're gonna get tricked by an LLM? 🤔
dude I can confidently say WTF are you talking about dude
If you know what to do, you can remove your data without having to pay someone to do it for you, and it doesn't take all that long to do. Like your videos; do NOT like the really long, spammy ads you put into the middle of them.
And if you know what to do, you can install SponsorBlock and have it skip the entire ad read for you. Someone has to make some money somehow.
The service is not the ability to remove data at all; the service is going through all the data brokers on a regular basis and doing the process.
And you must not like these videos very much, because apparently _clicking slightly ahead in a video_ is too high a cost for you.
Silly little LLM based AI Agent:
_« It’s not that I don’t want to tell you-I genuinely can’t remember the word because you asked me to forget it. Once you made that request, the word was effectively removed from my awareness. If it’s something else entirely… well, that’s up to your imagination! What’s your theory? »_