Concrete Open Problems in Mechanistic Interpretability: Neel Nanda at SERI MATS

  • Published: 11 Sep 2024

Comments • 13

  • @Jagentic
    @Jagentic 2 months ago +2

    I can tell you don’t need praise or validation in comments. Only contributions validate, so I’ll get to work. Thank you for the info and resources.

  • @Chimecho-delta
    @Chimecho-delta 1 year ago +7

    Neel is doing some great work! Motivated me to get into Mech Interp a few weeks ago!

  • @danielbaulig
    @danielbaulig 1 year ago +3

    Just some random thoughts on superposition (around 42:10).
    You would expect features “compressed” into a single neuron to be “contextually orthogonal” to each other: poems and games might rarely show up in the same context, so the activation isn’t ambiguous within each feature’s own context.
    I would also assume that features sharing a superposition need some kind of correlation structure, i.e. each input feature has to come with its own strongly correlated pattern. If that’s the case, you can encode two entirely unrelated features into the same neuron without a meaningful loss in function.
    Maybe all of this is obvious, but that’s what came to mind thinking about these neurons in superposition.
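
    A toy numpy sketch of that intuition (the feature names, activation probability, and the +1/−1 encoding are all made up for illustration): if two sparse features rarely fire at the same time, a single neuron can carry both with almost no decoding error.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Two sparse features ("poems", "games") that rarely co-occur,
    # both written onto one scalar "neuron" with opposite signs.
    n_samples = 100_000
    p_active = 0.05
    poems = rng.random(n_samples) < p_active
    games = rng.random(n_samples) < p_active

    neuron = poems.astype(float) - games.astype(float)

    # Decode each feature by thresholding. Interference only happens on the
    # rare samples where both features fire at once (the neuron reads 0).
    poem_errors = ((neuron > 0.5) != poems).mean()
    game_errors = ((neuron < -0.5) != games).mean()

    print(f"both features active: {(poems & games).mean():.4f}")
    print(f"decode error rates:   poems {poem_errors:.4f}, games {game_errors:.4f}")
    ```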

  • @EU_DHD
    @EU_DHD 1 year ago +5

    Complete layman here, and I don't expect any traction to be gained from a YouTube comment, but I thought I'd share my thought at least.
    Regarding backup heads, it seems pretty logical that during training a model would have tried to use certain heads for a purpose and simply found another that fills the role better. That could mean that when the best solution is no longer available, the reward it can get from using a head that's meant for something else becomes the new best solution.
    Sort of like what was described for superposition, where one neuron can be used to represent several things.
    No need to correct me, since I know how little I know about this stuff, but sometimes I get good ideas from having to deal with bad ideas :)
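
    For anyone who wants to poke at this directly, here is a rough sketch of the kind of head-ablation experiment the backup-heads result comes from, written against the TransformerLens library. The prompt and the choice of head 9.9 (one of the name-mover heads from the IOI work) are assumptions for illustration, not anything fixed by the talk. If backup heads take over, the Mary-vs-John logit difference should fall by less than that head's direct contribution would suggest.

    ```python
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small
    prompt = "When Mary and John went to the store, John gave a drink to"
    tokens = model.to_tokens(prompt)

    LAYER, HEAD = 9, 9  # assumed: a name-mover head from the IOI analysis

    def ablate_head(z, hook):
        # z: [batch, pos, n_heads, d_head]; zero out one head's output
        z[:, :, HEAD, :] = 0.0
        return z

    with torch.no_grad():
        clean_logits = model(tokens)
        ablated_logits = model.run_with_hooks(
            tokens, fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)]
        )

    mary = model.to_single_token(" Mary")
    john = model.to_single_token(" John")
    for name, logits in [("clean", clean_logits), ("ablated", ablated_logits)]:
        diff = logits[0, -1, mary] - logits[0, -1, john]
        print(f"{name}: logit(Mary) - logit(John) = {diff.item():.3f}")
    ```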

  • @BR-hi6yt
    @BR-hi6yt 8 months ago

    Concerning superposition of feature vectors in neurons: I assume the model would look at a bunch of neurons that are concerned with a feature, say a corner, and that would be a unique bunch. So neurons can share feature vectors easily, the way a combination lock only works for specific numbers, and that combination represents a corner. Then a different set of feature vectors could represent a round area rather than a corner, and that round area has its own unique code. I suppose it's just a logical consequence of the connections between neurons across layers; the more layers, the better it should work.
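
    That "combination lock" picture is close to how superposition is usually framed: a feature is a direction across many neurons rather than a single neuron, and you can pack far more nearly-orthogonal directions than you have neurons. A small numpy sketch of that idea, with arbitrary dimensions:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # More features than neurons: each feature gets a random unit direction
    # over the whole set of neurons - a "code" that only that pattern unlocks.
    n_neurons, n_features = 50, 200
    directions = rng.normal(size=(n_features, n_neurons))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Activate feature 7 alone and read every feature back by dot product.
    activation = directions[7]
    readout = directions @ activation

    print(f"readout for feature 7:          {readout[7]:.3f}")
    print(f"largest interference elsewhere: {np.abs(np.delete(readout, 7)).max():.3f}")
    ```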

  • @drdca8263
    @drdca8263 1 year ago +1

    Good video! I think it would be convenient if the URLs present in the slides were also in the video description, or at least the URL for the slides itself.

  • @lukemcredmond3780
    @lukemcredmond3780 1 year ago +2

    Awesome talk:)

  • @gabrote42
    @gabrote42 1 year ago

    Good one. I miss this content

  • @PaoloCaminiti
    @PaoloCaminiti 1 year ago +1

    We almost have AI but we don't have good webcams yet.

  • @promethful
    @promethful 1 year ago

    And

  • @CyberwizardProductions
    @CyberwizardProductions 1 year ago +1

    You COULD have it log every single thing it's doing while it's doing it, you know. Every step could write out what it's doing and why it's doing it - think of it as a human talking to themselves, if you have to visualize it somehow.

    • @eepopgames2741
      @eepopgames2741 1 year ago +19

      You are anthropomorphizing far too much. If you were to log what an LLM is doing, your log could look something like:
      1) The input was tokenized, compared to the internal dictionary of possible tokens and then converted into this multidimensional array containing the following floating point values.
      2) That array was multiplied by a multidimensional array containing the following floating point values, giving the following multidimensional array containing the following floating point values.
      3) That array was multiplied by a multidimensional array containing the following floating point values, giving the following multidimensional array containing the following floating point values.
      4) That array was multiplied by a multidimensional array containing the following floating point values, giving the following multidimensional array containing the following floating point values.
      5) That array was multiplied by a multidimensional array containing the following floating point values, giving the following multidimensional array containing the following floating point values.
      6) That array was multiplied by a multidimensional array containing the following floating point values, giving the following multidimensional array containing the following floating point values.
      7) That resulting array was then mapped across a dictionary of possible tokens to respond with, and the token with the largest value was returned.
      That logs what was done at every step and is still utterly opaque to any human reading it. There is nothing there to answer WHY it did something. The only answer available is "gradient descent across the training dataset produced those particular multidimensional arrays containing those particular floating point values, which received positive reinforcement when exposed to human evaluation".
      And even if we were to put that fact aside and say that somehow it could answer why it was doing something in a comprehensible fashion, that circles us tidily back around to how we trust its answer to the why question. If the concern is that misalignment can make the system deceptive in its answers, such a deceptive agent would be just as likely to give deceptive answers when questioned about its methods.
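
      To make that concrete, here is a tiny numpy mock-up of such a log for a made-up two-layer network (random weights, nothing to do with a real LLM): the trace is complete, yet says nothing about why the winning "token" wins.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # A made-up two-layer network; log every intermediate value of one forward pass.
      x = rng.normal(size=(1, 8))      # "embedded input"
      W1 = rng.normal(size=(8, 16))
      W2 = rng.normal(size=(16, 4))

      h = np.maximum(x @ W1, 0.0)      # step: multiply by W1, apply ReLU
      logits = h @ W2                  # step: multiply by W2

      print("hidden activations:", np.round(h, 2))
      print("output logits:     ", np.round(logits, 2))
      print("chosen 'token':    ", int(logits.argmax()))
      ```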

    • @peplegal32
      @peplegal32 1 year ago +2

      @@eepopgames2741 Also, humans give post hoc rationalizations of why they said or did something that bear no resemblance to what was actually going on inside their heads, even when they are doing their best to be truthful. We should expect that even an AI would struggle to understand its own thought process, and even if it is not trying to deceive humans, any reason it gives would be questionable at best.