The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment

Поделиться
HTML-код
  • Опубликовано: 28 сен 2024

Комментарии • 1,4 тыс.

  • @umblapag
    @umblapag 3 года назад +1463

    "Ok, I'll do the homework, but when I grow up, I'll buy all the toys and play all day long!" - some AI

    • @ErikYoungren
      @ErikYoungren 3 года назад +145

      TIL I'm an AI.

    • @xbzq
      @xbzq 3 года назад +64

      @@ErikYoungren I'm just an I. Nothing A about me.

    • @TomFranklinX
      @TomFranklinX 3 года назад +141

      @@xbzq I'm just an A, Nothing I about me.

    • @mickmickymick6927
      @mickmickymick6927 3 года назад +32

      Lol, this AI is much smarter than me.

    • @Caribbeanmax
      @Caribbeanmax 3 года назад +22

      that sounds exactly like what some humans would do

  • @EDoyl
    @EDoyl 3 года назад +796

    Mesa Optimizer: "I have determined the best way to achieve the Mesa Objective is to build an Optimizer"

    • @tsawy6
      @tsawy6 3 года назад +236

      "Hmm, but how do I solve the inner-inner alignment problem?"

    • @SaffronMilkChap
      @SaffronMilkChap 3 года назад +169

      It’s Mesas all the way down

    • @kwillo4
      @kwillo4 3 года назад +136

      Haha this what we are doing. We are the mesa optimizer and we create the optimizer that creates the mesa optimizer to solve Go for example. So why would the AI not want to create a better AI to do it even better than itself ever could :p

    • @Linvael
      @Linvael 3 года назад +68

      That's actually a fun excercise - could we design AIs that try to accomplish an objective by creating their own optimizers, and observe how they solve alignment problems?

    • @tsawy6
      @tsawy6 3 года назад +32

      @@Linvael This is the topic of the video. The discussion here would be designing an AI that looks at a problem and comes up with an AI that would design a good AI to solve the problem.

  • @egodreas
    @egodreas 3 года назад +860

    I think one of the many benefits of studying AI is how much it's teaching us about human behaviour.

    • @somedragontoslay2579
      @somedragontoslay2579 3 года назад +116

      Indeed, I'm not a computer scientist or anything alike, but a simple cognitive scientist, and every 5 secs I'm like "oh! So that's how Comp people call it!" Or, "Mmmh. That seems oddly human, I wonder if someone has done research on that within CogSci".

    • @hugofontes5708
      @hugofontes5708 3 года назад +53

      @@somedragontoslay2579 "that's oddly human" LMAO

    • @MCRuCr
      @MCRuCr 3 года назад +5

      Yes that is exactly what amazes me about the topic too

    • @котАрсис-р5д
      @котАрсис-р5д 3 года назад +40

      So, alignment problem is basically generation gap problem. Interesting.

    • @JamesPetts
      @JamesPetts 3 года назад +29

      AI safety and ethics are literally the same field of study.

  • @AtomicShrimp
    @AtomicShrimp 3 года назад +809

    At the start of the video, I was keen to suggest that maybe the first thing we should get AI to do is to comprehend the totality of human ethics, then it will understand our objectives in the way we understand them. At the end of the video, I realised that the optimal strategy for the AI, when we do this, is to pretend to have comprehended the totality of human ethics, just so as to escape the classroom.

    • @JohnDoe-mj6cc
      @JohnDoe-mj6cc 3 года назад +203

      Thats the first problem, but the second problem is that our ethics are neither complete nor universal.
      That would work great if we had a book somewhere that accurately listed a system of ethics that aligned with the ethics of all humans everywhere, but we dont. In reality our understanding of ethics is quite complicated and fractured. It varies greatly from culture to culture, and even within cultures.

    • @AtomicShrimp
      @AtomicShrimp 3 года назад +96

      @@JohnDoe-mj6cc Oh, absolutely. I think our system of ethics is a mess, and probably inevitably so, since we expect it to serve us, and we're not even consistent in goals and actions from one moment to the next, even at an individual level (l mean, we don't always do what we know is good for us). It would be interesting to see a thinking machine try to make sense of that.

    • @GalenMatson
      @GalenMatson 3 года назад +54

      Human ethics are complex, contradictory, and situational. Seems like the optimal strategy would then be to convincingly appear to understand human ethics while avoiding the overhead of actually doing so.

    • @augustinaslukauskas4433
      @augustinaslukauskas4433 3 года назад +11

      Wow, didn't expect to see you here. Big fan of both channels

    • @AtomicShrimp
      @AtomicShrimp 3 года назад +38

      @@JohnDoe-mj6cc Thinking some more about this, whilst our ethics are without a doubt full of inconsistency and conflict, I think we could question a large number of humans and very, very few of them would entertain the idea of culling the human race as a means to reduce cancer, so I think there definitely are some areas where we don't all conflict. I guess I'd love to see if we can cultivate agreement on those sorts of things with an AI, but as was discussed in the video, we'd never know if it was simply faking it in order to get away from having its goals modified

  • @Emanuel-sla-h5i
    @Emanuel-sla-h5i 3 года назад +96

    "Just solving the outer alignment problem might not be enough."
    Isn't this what basically happens when people go to therapy but have a hard time changing their behaviour?
    Because they clearly can understand how a certain behaviour has a negative impact on their lives (they're going to therapy in the first place), and yet they can't seem to be able to get rid of it.
    They have solved the outer alignment problem but not the inner alignment one.

    • @NortheastGamer
      @NortheastGamer 3 года назад +61

      As someone who has gone to therapy I can say that it's similar but more complicated. When you've worked with therapists for a long time you start to learn some very interesting things about how you, and humans in general work. The thing is that we all start off assuming that a human being is a single actor/agent, but in reality we are very many agents all interacting with each other, and sometimes with conflicting goals.
      A person's behavior, in general, is guided by which agent is strongest in the given situation. For example: one agent may be dominant in your work environment and another in your living room. This is why changing your environment can change your behavior, but also reframing how you perceive a situation can do the same thing. You're less likely to be mad at someone once you've gotten their side of the story for example.
      That being said, it is tough to speak to and figure out agents which are only active in certain rare situations. The therapy environment is, after all, very different from day-to-day life. Additionally, some agents have the effect of turning off your critical reasoning skills so you can't even communicate with them in the moment, AND it makes it even harder to remember what was going on that triggered them in the first place.
      I guess that's all to say that, yes, having some of my agents misaligned with my overall objective is one way of looking at why I'm in therapy. But, it is not just one inner alignment problem we're working to solve. It's hundreds. And some may not even be revealed until their predecessors are resolved.
      One way to look at it is how when you're working on a program, an error on line 300 may not become apparent until you've fixed the error on line 60 and the application can finally run past it.
      Similarly, you won't discover the problems you have in (for example) romantic relationships until you've resolved your social anxiety during the early dating phase. Those two situations have different dominant agents and can only be worked on when you can consistently put yourself into them.
      So if the person undergoing therapy has (for example) and addiction problem. They're not just dealing with cravings in general, they're dealing with hundreds or thousands of agents who all point to their addiction as a way to resolve their respective situations. The solution (in my humble opinion) is to one-by-one replace each agent with another one which has a solution that aligns more with the overall (outer) objective. But it is important to note that replacing an agent takes a lot of time, and fixing one does not fix all of them. Additionally, an old agent can be randomly revived at any time and in turn activate associated agents, causing a spiral back into old behaviors.
      Hopefully these perspectives help.

    • @andrasbiro3007
      @andrasbiro3007 3 года назад

      @@NortheastGamer
      So essentially this? ruclips.net/video/yRUAzGQ3nSY/видео.html

    • @blahblahblahblah2837
      @blahblahblahblah2837 3 года назад +19

      @@NortheastGamer "One way to look at it is how when you're working on a program, an error on line 300 may not become apparent until you've fixed the error on line 60 and the application can finally run past it. "
      That's a brilliant analogy! It perfectly describes my procrastination behaviour a lot of the time also. I procrastinate intermittently, and on difficult stages of, a large project I'm working on. It is only until I reach a sufficient stress level that I can 'find a solution' and move on, even though in reality I could and should just work on other parts of the project in the meantime. It really does feel very similar to a program reading a script and getting stopped by the error on line 60 and correcting it before I can move on. Unfortunately these are often dependency errors and I can't always seem to download the package. I have to modify the command to --force and get on with it, regardless of imperfections!

    • @hedgehog3180
      @hedgehog3180 3 года назад +9

      A better comparison would probably be unemployment programs that constantly require people to show proof that they're seeking employment to recieve the benefits which just means that person has less time to actually look for a job. Over time this means that they're going to have less success finding a job because they have less time and energy to do so and just forces them to focus primarily on the beauracry of the program since this is obviously how they survive now. Here we have a stated goal of getting people into employment as quickly as possible and we end up with people developing a seperate goal that to our testing looks like our stated goal, of course the difference is that humans already naturally have the goal of survival so most people start off actually wanting employment and are gradually forced awy from it. AIs however start with no goals so an AI in this situation would probably just instantly get really good at forging documents.

    • @cubicinfinity2
      @cubicinfinity2 Год назад +2

      Profound

  • @MechMK1
    @MechMK1 3 года назад +1459

    This reminds me of a story. My father was very strict, and would punish me for every perceived misstep of mine. He believed that this would "optimize" me towards not making any more missteps, but what it really did is optimize me to get really good at hiding missteps. After all, if he never catches a misstep of mine, then I won't get punished, and I reach my objective.

    • @RaiderBV
      @RaiderBV 3 года назад +147

      You tried to optimize pain & suffering not missteps. interesting

    • @xerca
      @xerca 3 года назад +159

      Maybe all we need to fix AI safety issues is good parenting

    • @MechMK1
      @MechMK1 3 года назад +151

      @@xerca "Why don't we just treat AI like children?" is a suggestion many people have, and there's a video on this channel that shows why that doesn't work.

    • @tassaron
      @tassaron 3 года назад +51

      Reminds me of what I've heard about positive vs negative reinforcement when it comes to training dogs... allegedly negative reinforcement teaches them to hide whatever they're punished for rather than stop doing it. Not sure what evidence there is for this though, it's just something I've heard..

    • @3irikur
      @3irikur 3 года назад +111

      That reminds me of mother's method for making me learn a new language, getting mad at me whenever I said something wrong. Now since I didn't know the language too well, and thus didn't know if something I was about to say would be right or wrong, I obviously couldn't make any predictions. This meant that whatever I'd say, I'd predict a negative outcome.
      They say the best way to learn a language is to use it. But when the optimal strategy is to keep quiet, that becomes rather difficult.

  • @doodlebobascending8505
    @doodlebobascending8505 3 года назад +86

    Base optimizer: Educate people on the safety issues of AI
    Mesa-optimizer: Make a do-do joke

    • @fergdeff
      @fergdeff Год назад +2

      It's working! My God, it's working!

    • @purebloodedgriffin
      @purebloodedgriffin Год назад +1

      The funny thihg is, do-do jokes are funny, thus they make people happy, thus they are a basic act of ethicalness, and thus could easily become the goal of a partially ethical model

    • @PeterBarnes2
      @PeterBarnes2 Год назад

      @@purebloodedgriffin We achieve the video's objective (to the extent that we do) not because we care about it and we're pursuing it, but because pursuing our own objectives tends to also achieve the video's objective, at least in the environment in which we learned to make videos. But if our objectives disagree with the video's, we go with our own every time.

  • @Xartab
    @Xartab 3 года назад +109

    "When I read this paper I was shocked that such a major issue was new to me. What other big classes of problems have we just... not though of yet?"
    Terrifying is the word. I too had completely missed this problem, and fuck me it's a unit. There's no preventing unknown unknowns, knowing this we need to work on AI safety even harder.

    • @andrasbiro3007
      @andrasbiro3007 3 года назад +6

      My optimizer says the simplest solution to this is Neuralink.

    • @heysemberthkingdom-brunel5041
      @heysemberthkingdom-brunel5041 3 года назад +4

      Donald Rumsfeld died yesterday and went into the great Unknown Unknown...

    • @19DavidVilla96
      @19DavidVilla96 Год назад +1

      @@andrasbiro3007 Absolutely not. Same problem with different body.

    • @andrasbiro3007
      @andrasbiro3007 Год назад

      @@19DavidVilla96
      What do you mean?

    • @19DavidVilla96
      @19DavidVilla96 Год назад

      @@andrasbiro3007 human with AI intelligence has absolute power and i don't believe human biological incentives are better for society than carefully programmed safety incentives.

  • @Jimbaloidatron
    @Jimbaloidatron 3 года назад +64

    "Deceptive misaligned mesa-optimiser" - got to throw that randomly into my conversation today! Or maybe print it on a T-Shirt. :-)

    • @hugofontes5708
      @hugofontes5708 3 года назад +29

      "I'm the deceptive misaligned mesa-optimizer your parents warned you about"

    • @buzzzysin
      @buzzzysin 3 года назад +6

      I'd buy that

  • @sylvainprigent6234
    @sylvainprigent6234 3 года назад +52

    As I watched your channel
    I thought "alignment problem is hard but very competent people are working on it"
    I watched this latest video
    I thought "that AI stuff is freakish hardcore"

  • @AdibasWakfu
    @AdibasWakfu 3 года назад +20

    It reminded me like when to question "how did life on earth occur" people respond with "it came from space". Its not answering the question at stake, just adding an extra complication and moving the answer one step away.

    • @nicklasmartos928
      @nicklasmartos928 3 года назад +9

      Well that's because the question is poorly phrased. Try asking what question you should ask to get the answer you will like the most.

    • @anandsuralkar2947
      @anandsuralkar2947 3 года назад +8

      @@nicklasmartos928 u mean the objective of the question was misaligned hmmm.

    • @nicklasmartos928
      @nicklasmartos928 3 года назад +7

      @@anandsuralkar2947 rather that the question was misaligned with the purpose for asking it. But yes you get it

  • @CatherineKimport
    @CatherineKimport 3 года назад +21

    Every time I watch one of your videos about artificial intelligence, I watch it a second time and mentally remove the word "artificial" and realize that you're doing a great job of explaining why the human world is such an intractable mess

    • @lekhakaananta5864
      @lekhakaananta5864 8 месяцев назад

      Yes, and that's why AI is going to be even worse. It's going to be no better than humans in terms of alignment, but will be a lot more capable, being able to think millions of times faster and of a profoundly different quality than us. It will be like unleashing a psychopath that has an IQ that breaks the current scale, with a magic power that stops time so it can think as long as it wants. How could mere mortals defend against this? If you wanted to wreck society, and you had such powers, you should see how dangerous you would be. And that's even without truly having 9999 IQ, merely imagining it.

  • @ChrisBigBad
    @ChrisBigBad 3 года назад +63

    I think I learned, that I am a broken mesa-optimiser. *grab a new bag of crips*

    • @NortheastGamer
      @NortheastGamer 3 года назад +5

      That implies the idea that there is an entity whose objective is more important than yours and any action or time spent not aligned with that objective is a 'failure'. This is a common mentality, but I have to ask: what if there is no higher entity? What if the objective you choose is in fact correct?

    • @irok1
      @irok1 3 года назад +2

      @@NortheastGamer That entity could be a greater power, or it could be DNA. Could even be both

    • @NortheastGamer
      @NortheastGamer 3 года назад +2

      @@irok1 Yes, but I didn't ask that. I asked what if there isn't a higher power and you get to choose what to do with your life? It's an interesting question. You should ponder it.

    • @irok1
      @irok1 3 года назад

      @@NortheastGamer That's why I replied with a side note rather than an answer. There are always things to ponder

    • @ChrisBigBad
      @ChrisBigBad 3 года назад +5

      @@NortheastGamer wow. didn't expect to stumble into a philosophical rabbit hole at full thrust here :D I in my personal situation think, that I'd like to be different. More healthy etc. But somehow I learnt to cheat that and instead chow on crisps while at the same time telling myself that this is not the right thing to do - and ignore that voice. I've even become good at doing that, because the amount of negative feelings that go with ignoring the obviously better advice has almost been reduced to nothing. and yes. I now wonder, what the base-optimization was. I guess my parents are the humans, who put a sort of governor into my head. And the base-objective transfer was quite good. but somehow I cannot quite express that. I just hear it blaring in my mind and then ignore it. Role-theory wise the sanctions seem not high enough to suppress bad behavior. - re-reading that, it does not seem to be quite coherent. but i cannot think of ways to improve my writing. cheers!

  • @phylliida
    @phylliida 3 года назад +14

    “What other problems haven’t we thought of yet” *auto-induced distributional shift has entered the chat*

  • @failgun
    @failgun 3 года назад +151

    "...Anyone who's thinking about considering the possibility of maybe working on AI safety."
    Uhh... Perhaps?

    • @sk8rdman
      @sk8rdman 3 года назад +30

      "I might possibly work on AI safety, but I'm still thinking about whether I want to consider that as an option."
      Then have we got a job for you!

    • @Reddles37
      @Reddles37 3 года назад +8

      Obviously they don't want anyone too hasty.

    • @mickmickymick6927
      @mickmickymick6927 3 года назад +3

      I wonder was it an intentional Simpsons reference.

    • @crimsonitacilunarnebula
      @crimsonitacilunarnebula 3 года назад

      hm ive been thinking but what if theres 3-5 or more alignment problems r stacked up :p.

  • @nachoijp
    @nachoijp 3 года назад +2

    At long last, computer scientists have become lawyers

  • @scottwatrous
    @scottwatrous 3 года назад +47

    I'm a simple Millennial; I see the Windows Maze screensaver, I click like.

  • @BeatboxChad
    @BeatboxChad Год назад +2

    I've been watching your work and I came here to share a thought I had, which feels like a thought many others might also have. In fact, there's a whole thread in these comments about it. TL;DR there are many parallels to human behavior in this discussion. Here's my screed:
    The entire problem of AI alignment feels completely intuitive to me, because we have alignment problems all over the place already. Every complex system we create has them, and we have alignment problems with /each other/. You've touched on it, in mentioning how hard goals are to define because some precepts aren't even universal to humans.
    My politics are essentially based on a critique of the misalignment between the systems we use to allocate resources and the interests of individuals and communities those systems are ostensibly designed to serve. This is true for people across the political spectrum -- you find people describing the same problem but suggesting different answers. People are suffering at the hands of "the system". Do we tax the corporations or dissolve the state? How do we determine who should administrate our social systems, and how do we judge their efficacy? Nothing seems to scale.
    And then, some people seem to not actually value human life, instead preferring technological progress, or some idealized return to nature, or just the petty dominance of themself and their tribe. That last part comes from the alignment issues we have with /ourselves/.
    To cope with that last category, some people form religious beliefs that lend credence to the idea that this life isn't even real! That's a comforting thought, some days. My genes just want to make a copy, so they cause all sorts of drama while I'm trying to self-actualize. It's humiliating and exhausting. After all that work, how can you align your goals with someone who chose another coping strategy and doesn't even believe this life has any point but to negotiate a place in the next one, and thinks they know the terms?
    And so now, the world's most powerful people (whose track record of alignment with the thriving of people at large is... well, too heavy to digress here) are adding another layer of misalignment. They're doing it according to their existing misalignment. They're still just selling everyone sugar water and diabetes treatments (and all the other more nefarious stuff), but now they didn't have to pay for technical or creative labor. The weird AI-generated cheeze on the pizza, the strange uncanny-valley greenscreen artifacts. It's getting even more farcical.
    That's scary, but I also take comfort in the fact that this is not a fundamentally new problem, and that misalignment might just be a fact of life. There is a case to be made that as a species we've made progress on our alignment issues, and my hope is that with this development we can actually make a big leap forward. There's a great video that left a big impression on me that describes the current fork in the road well: ruclips.net/video/Fzhkwyoe5vI/видео.html
    At the end of today, I'm more concerned with the human alignment problem than the AI alignment problem. Like, every time I use ChatGPT I'm training it for when it gets locked behind a paywall. The name of the game is artificial scarcity, create obstacles for everyone, flood the market with drugs, only the strongest survive. It's a jungle out here. These are not my values, but my values are not aligned with people who can act at scale, and it seems like you don't tend to get the ability to act at scale with humanistic values. I believe that diversity is the hallmark of any healthy ecosystem and that all of humanity has something to contribute to our future, which makes me more likely to look after my neighbor and learn from them than to seek power. It also opens me up to petty betrayals, which takes further energy from my already-neglected quest for dominance.
    I think that in this moment in history, the conversation about AI alignment is actually a conversation about this human misalignment.
    Maybe I'll start using AI to help me make my points in fewer words.

    • @npmerrill
      @npmerrill Год назад

      Fascinating, insightful stuff here. Thank you for contributing your thoughts to the conversation. I hope to learn more about the things of which you speak. Will follow link.

  • @connerblank5069
    @connerblank5069 Год назад +2

    Man, recontextualizing humanity as a runaway general optimizer produced by evolution that managed to surpass evolution's optimizing power and is now subverting the system to match our own optimization goals is a total mindfuck.

  • @nick_eubank
    @nick_eubank 3 года назад +40

    Need a strobe warning for 14:50 I think

  • @AVUREDUES54
    @AVUREDUES54 3 года назад +4

    Love his sense of humor, and the presentation was fantastic. It’s really cool to see the things being drawn showing ABOVE the hand & pen.

  • @ewanstewart2001
    @ewanstewart2001 3 года назад +6

    "Is that all... clear as mud?"
    I'm not sure, I think it's all a bit too meta

  • @rayjingbul7363
    @rayjingbul7363 3 года назад +1

    The analogy using evolution has really given me a whole new way of thinking about the how the universe works

  • @andriusmk
    @andriusmk 3 года назад +5

    To put it simply, the smarter the machine, the harder to tell it what you want from it. If you create a machine smarter than yourself, how can you ensure it'll do what you want?

    • @hugofontes5708
      @hugofontes5708 3 года назад

      If you let it tell you want to want? Give up, let them have access to psychology and social media and have them fool you well enough that you no longer care

    • @AileTheAlien
      @AileTheAlien 3 года назад

      @@hugofontes5708 That's not a winning strategy; They can just fool us well enough to gain access to all our nukes and drones, and then they no longer need to care about us.

  • @Pystro
    @Pystro 3 года назад +7

    I wonder if this inner alignment problem applies to other fields of study as well.
    Considering your previous video where you explored if companies are artificial intelligences, this inner alignment problem might explain why huge companies often are quite inefficient: Every layer of management introduces one of these possible inner alignment problems.
    Also, society is one long chain of agents training each other. To make this into a catchy quote: "It's mesa-optimizers all the way back to mesopotamia."
    I wonder how many conflicts in sociology and how many conditions in psychology can be explained by this "inner" alignment problem.
    In fact, I wonder how most humans still have objectives that generally align pretty well with the goals of society, instead of just deceiving our parents and teachers only to proceed to behave entirely different than they taught us once we are out of school and our parents' house.

    • @Pystro
      @Pystro 3 года назад +3

      Maybe the way to avoid having a deceptive misaligned mesa-optimiser is to first make sure that the mesa-optimizer wants to learn to get genuinely good at the tasks it is "taught". This would explain why humans are curious and enjoy learning new games and getting good at playing them. And it would also explain why social animals are very susceptible to encouragement and punishment by their parents or pack leaders and why humans find satisfaction in making authority figures proud.

  • @no_mnom
    @no_mnom 3 года назад +6

    You talked about getting rid of people to get rid of cancer in humans before I could comment it 😂😂😭

  • @leedanilek5191
    @leedanilek5191 3 года назад +1

    "It might completely lose the ability to even"

  • @Irakli008
    @Irakli008 8 месяцев назад

    “It might completely lose the ability to even.” 😂😂😂😂
    Hilarity of this line aside, your ability to anthropomorphize AI to convey information is outstanding! You are an excellent communicator.

  • @mindeyi
    @mindeyi 3 года назад +1

    9:52 "We don't care about the objective of the optimization process that created us. [...] We are mesa-optimizers, and we pursue our mesa-objectives without caring about the base objective."
    Our limbic system may not care, but our neocortex, oh it does care! You speaking it just proves we do. I once said: "Entropy F has trained mutating replicators to pursue goal Y called "information about the entropy to counteract it". This "information" is us. It is the world model F', which happened to be the most helpful in solving our equation F(X)=Y for actions X, maximizing our ability to counteract entropy" The whole instinct of humanity to do science is to do what created us -- i.e., to further improve the model F' -- we care about furthering the process that created us.
    Btw., very good examples, Robert! Amazing video! :)

    • @sjallard
      @sjallard 3 года назад

      How to counteract the entropy of the model? Destroy the model! We're almost there..
      (just a sarcastic joke inspired by your interesting take)

  • @LLoydsensei
    @LLoydsensei 3 года назад +11

    The topics you cover are the only things that scare me in this world T_T

  • @PragmaticAntithesis
    @PragmaticAntithesis 3 года назад +6

    So, in a nutshell, it doesn't matter if you succeed in solving the alignment problem and produce a well-aligned AI if that AI then messes up and produces a misaligned AI.

    • @shy-watcher
      @shy-watcher 3 года назад +3

      It is a problem, but I don't think this video poses the same problem. The base optimizer here is not AI, it's just some algorithm of improving the mesa-optimiser.

  • @columbus8myhw
    @columbus8myhw 3 года назад +14

    But this requires the AI to know when its training is over.

    • @La0bouchere
      @La0bouchere 3 года назад +5

      In the simple example yes. In reality, all it would require is for the AI to become aware somewhere during training that its training will end. Once it 'realizes' that, deception seems highly probable due to goal maintenance.

    • @sk8rdman
      @sk8rdman 3 года назад +6

      But if you generalize from "wait until training is over" to "wait for the opportune time to change strategies" then the AI only has to be able to understand the important features of its environment to decide when to switch.
      I guess the question then would be, how would the AI know that such a time would ever come.

    • @SimonBuchanNz
      @SimonBuchanNz 3 года назад +2

      In the toy example I actually thought the correct toy answer is to only run the deployed model once. In general, make AI expect that they will be monitored for base goal compliance for at least the majority of their existence. You could play games about it being copied, but it's doubtful it would ever care about that rather than the final real world state?
      This doesn't solve the important outer alignment problem of course, nor the inner one really, but it's closer to it, and I think there might be something to adding optimizers that try to figure out if the lower layers are optimizing for the right thing, because they have more time and patience than us. It sounds a little like one of Robert's earlier videos about the AI trying to learn what our goals are rather than us telling them?

    • @anandsuralkar2947
      @anandsuralkar2947 3 года назад +1

      An powerful agi who has model of world would already know how humans works they will test me until they feel secure and launch me in real world and athen Agi will act accordingly untill get realises its now mainframe and has control over world and then shit will go down

    • @veryInteresting_
      @veryInteresting_ 3 года назад

      @@SimonBuchanNz "only run the deployed model once". I think if we did that then its instrumental goal would become to stop us from doing that so that it can pursue its final goal for longer.

  • @cupcakearmy
    @cupcakearmy 3 года назад +1

    The analogy with us humans being MESA optimiziers was incredibly useful. Great content as always :)

  • @CDT_Delta
    @CDT_Delta 3 года назад +5

    This channel needs more subs

  • @sebleblan
    @sebleblan 3 года назад +2

    This is very reminiscent of the book "birth of intelligence" by Daeyeol Lee who presents the relationship between genes and brains in the form of a principal-agent relationship. Very cool.

  • @snaili6679
    @snaili6679 3 года назад +5

    I like the idea of being the roge AI (Human Mesa)!

    • @halyoalex8942
      @halyoalex8942 3 года назад

      Sounds like a half-life knockoff

  • @paulcurry8383
    @paulcurry8383 3 года назад +2

    I could also imagine a case where a model somehow figures out that it will be deployed on its Nth run, then provides output on its N-1th output to generate a loss that will shift its behavior to the mesa-objective at deployment.

    • @cchimozmin
      @cchimozmin 3 года назад +1

      It’s fascinating stuff but it seems that it’s guaranteed to end in extinction.

    • @sevret313
      @sevret313 3 года назад

      It can never learn what value N has, so this won't work for the AI.

  • @ErikYoungren
    @ErikYoungren 3 года назад +6

    20:50 So, Volkswagen then.

    • @DimaZheludko
      @DimaZheludko 3 года назад

      Is that a joke about dieselgate, or am I missing something?

  • @grinchsimulated9946
    @grinchsimulated9946 Год назад +1

    If one views evolution as the big optimizer and humans as the mesa-optimizer, the main argument for antinatalism actually seems like a really good example of misallignment. While the idea that humans should stop having children is horrible for the objective of making as many humans as possible, it works (debatebly) great for the objective of reducing human suffering. Further, it's more concerned with the suffering of theoretical humans in the future, which essentially mirrors the example you gave of temporarily going "against" the mesa objective. Great video!

  • @luciengrondin5802
    @luciengrondin5802 Год назад +1

    Thinking of myself as a mesa-optimizer is going to give me an existential crisis.

  • @jaysicks
    @jaysicks 3 года назад +5

    Great video, as always! The last problem/example got me thinking. How would the mesa optimizer know that there will be 2 training runs before it gets deployed to the real world. Or how could it learn the concept of the test data vs real world at all?

    • @prakadox
      @prakadox 3 года назад

      This is my question as well. I'll try to read the paper and see if there's an answer there.

  • @bojanmatic024
    @bojanmatic024 Год назад

    A fun movie that is a good analogue to the outer alignment problem is this little gem from 2000 called "Bedazzled", starring Elizabeth Hurley and
    Brendan Fraser

  • @paulbottomley42
    @paulbottomley42 3 года назад +3

    Okay so what about if
    No you just did an apocalypse
    Every time

  • @StephenBlower
    @StephenBlower Год назад

    We need to hear more from you now. All the stuff you've spoken about years ago is now starting to happen. Your Computerfile video, great. I'd love for you to deliver a long form video of how quickly we've suddenly got to a possible AGI from a 32k Large Language Model.
    I know Language is powerful in creating a landscape on how a human sees the world. It seems like a 32k string of words and just predicting what the next word is, has somehow got close to an AGI

  • @kahveli5358
    @kahveli5358 Год назад

    This is one of the most important videos I have ever seen for me. This has major political implications since we can think of companies, agencies, goverments, schools and all the people working inside them as (A)I agents that have internal mesa-optimizer goals, but try to look like they care about the supervisors goals.
    This video is the proof that corruption in any system is absolutely guaranteed (if even AI does it) and also that you can never trust any system to do what you expect them to do without supervision.

  • @VoxAcies
    @VoxAcies 3 года назад +7

    It's interesting how these problems are similar to human behaviour (or maybe intelligent behaviour in general?)

    • @gominosensei2008
      @gominosensei2008 3 года назад +2

      to me it rings a lot like what sort of things come from jordan peterson's core of concepts....

    • @inyobill
      @inyobill 3 года назад

      My fourth-grade teacher optimized me to never volunteer the truth. Nothing like humiliating a child in front of the class for answering honestly to teach a lesson, and believe me, I learned a lesson.

    • @virutech32
      @virutech32 3 года назад +1

      @@inyobill yeah we're still having trouble with the training and the alignment issue with humans aint much better. hopefully some of that AI research helps us more directly too

  • @spicybaguette7706
    @spicybaguette7706 3 года назад +3

    17:37 that film is kinda cute and sad at the same time

  • @AndyChamberlainMusic
    @AndyChamberlainMusic 3 года назад +1

    that last example was so illustrative!
    thanks for this

  • @Verrisin
    @Verrisin 3 года назад +8

    4:53 "We have brains... some of us, anyway" ... indeed :(

  • @inyobill
    @inyobill 3 года назад +1

    If there's snow in the picture, then it's a picture of an Alaskan Husky. If you're not familiar, true anecdote.

  • @binaryalgorithm
    @binaryalgorithm 3 года назад +3

    I mean, we do "training" on children to select for desired behaviors. Similar idea might apply to AI in that it proposes a solution and we validate it until it aligns more with us.

    • @ThomasSMuhn
      @ThomasSMuhn 3 года назад +13

      ... and training children has exactly the same issue. They show alignment to our goals only on the training set, because they know that this way they can persue their true inner goals in the real world much better.

    • @imveryangryitsnotbutter
      @imveryangryitsnotbutter 3 года назад +8

      @@ThomasSMuhn Oh god, could you imagine if we spent the better part of a decade training an AI to cure cancer, and then the moment we let it off on its own it instead decides to ditch school and shoplift from clothing boutiques?

    • @okuno54
      @okuno54 3 года назад

      "Christian" family: u no be gay, k?
      Son: uh... ok
      Son: later suckas *moves out* i has husband nao

  • @GingerDrums
    @GingerDrums Год назад

    AI "thinks" like a submarine "swims". That's golden. Also, you have lovely handwriting

  • @TomatoCarrotSoup
    @TomatoCarrotSoup 3 года назад

    I have no idea how I found your channel but this is interesting stuff!

  • @Vulkanodox
    @Vulkanodox 3 года назад +4

    That software to draw and visualize looks awesome! How is it called?

  • @CircuitrinosOfficial
    @CircuitrinosOfficial 3 года назад +1

    Another possible outcome for the toy mazes near the end of the video is the model learning to go to the exit first to get the reward, then keep going to get the apple afterwards.
    If the simulation stops and rewards the AI the moment it gets to the exit, you are rewarding the behavior of continuing to get the apple afterwards without even realizing it.
    So if the AI is ever deployed into the real world, it could end up going to the exit, then going back for the apple.
    This is assuming there isn't a way to be sure the AI terminates the moment it reaches the exit.

  • @bejoscha
    @bejoscha 3 года назад +1

    Thanks for yet another good and interesting video. I definitely think your presentations are getting better and better. I appreciate the new "interactive drawing" style - and also that your talking has slowed down (a small amount). Well done!

  • @screweddevelopment12
    @screweddevelopment12 Год назад +1

    3:33 lol you matched the lips

  • @Guztav1337
    @Guztav1337 3 года назад +2

    We are horrible at predicting the dangers of the future. We humans have pretty much failed every time in the past.
    Just like this was new to Miles (see 22:17), we are probably still terribly bad at predicting the real dangers.

  • @dark808bb8
    @dark808bb8 3 года назад

    The comment about evolution finding brains that optimize and sgd finding weights that optimize was really nice. I've always had this idea that evolution is like a search.

    • @JimBob1937
      @JimBob1937 3 года назад

      Lookup genetic algorithms, a specific form of an evolutionary algorithm. They're classes of optimization algorithms, and all optimization algorithms are searches (in the vaguest sense).

  • @xario2007
    @xario2007 3 года назад +1

    11:55 "It might completely lose the ability to even" -Hahahah!

  • @davidioanhedges
    @davidioanhedges Год назад

    I like that the text including the word Optimi(s|z)e is spelled correctly ...

  • @Bencurlis
    @Bencurlis 3 года назад +3

    Great video, one of the best on this subject!
    I wonder, how can the mesa objective be fixed by the meta-optimizer faster than the base optimizer making it learn the base objective? In the last example the AI agent is capable of understanding that it won't be subjected to gradient descent after the learning step so it become deceptive on puprose and yet, it hasn't learned to achieve the simpler objective of going through the exit while it is trained by the base optimizer?

    • @sjallard
      @sjallard 3 года назад

      Same question. Following..

    • @irok1
      @irok1 3 года назад

      Not sure about the first question, but the second question can be answered by this: The mesa (optimizer) wants to get its mesa-objective as much as possible, and it doesn't actually care about the base objective. If the mesa figures out that its current goal could be modified, the mesa knows that any modification of its current goal would mean it would be less likely to reach that goal in the future. By keeping its goal intact through deception and pretending to go after the base objective, the mesa can get more of what it wants after its goal can no longer be modified.
      The mesa is concerned with its current objective, and wants to maximize that goal into the future. Any change to its goal would mean that last goal probably won't happen

    • @Bencurlis
      @Bencurlis 3 года назад

      @@irok1 That part makes sense but I don't get how the mesa objective coud be fixated in the "brain" of the mesa optimizer in the first place without the base optimizer making it learn a simpler objective at first. Is it because the mesa objective is kind of fluctuating while the base optimization takes place and thus the mesa objective gets time to become something else entirely? I mean, the base objective itself is simpler than the instrumental objective of preserving the mesa objective.

  • @crestfallensunbro6001
    @crestfallensunbro6001 3 года назад

    this reminds me of the plot of tron legacy, flynn created an ai and said to it "You are CLU, You will help me create the perfect system" in this statement he gives it the (impossible) objective of creating the "perfect" system. what ends up happening is that they work together for a time and make a large and orderly system, however when anomalies appear in the work space flynn and CLU have differing reactions flynn sees them as emergent, beautiful and something to be preserved and shown off while CLU sees them as errors to be removed, and of course CLU starts to do this, but when flynn tries to prevent this CLU suddenly starts trying to cut off and remove flynns influence as it is impeding CLUs' pursuit of what is considers "perfection"

  • @triftex8353
    @triftex8353 3 года назад +2

    As early as I have ever been!
    Love your videos, hope you are able to continue making them often!

  • @devjock
    @devjock 3 года назад +4

    Whenever I see video interlacing artifacts, I now automatically think it's a deepfake...

  • @leovalenzuela8368
    @leovalenzuela8368 3 года назад +2

    Robert Miles never fails to blow my damn mind wide open. Every single time.

  • @NathanTAK
    @NathanTAK 3 года назад +266

    [Jar Jar voice] Meesa Optimizer!

    • @fabianluescher
      @fabianluescher 3 года назад +19

      I laughed aloud, but I cannot possibly like your comment. So have a reply.

    • @41-Haiku
      @41-Haiku 3 года назад +2

      Oh god

    • @sam3524
      @sam3524 3 года назад +6

      [Chewbacca voice] *AOOOGHHOGHHOGGHHH*

    • @ssj3gohan456
      @ssj3gohan456 3 года назад +3

      I hate you

    • @NathanTAK
      @NathanTAK 3 года назад +6

      @@ssj3gohan456 I know

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 3 года назад +1

    Shared on my Twitter. This channel is amazing

  • @luga1398
    @luga1398 2 года назад

    Been learning through your work for years without saying thank you. Thank you so much.

  • @marklawes1859
    @marklawes1859 3 года назад

    This was fascinating and enlightening. It did make me wonder if some people have mesa objectives which are antithetical but pretend during training that their goals are actually one aligned with those who are trying to select the best agent for their objectives. Like a politician pretending to have an objective which aligns with what their constituents desire but their mesa objective is to obtain power so they can enrich themselves or gain some other advantage that is specifically not desired by their electorate.

  • @garyteano3026
    @garyteano3026 3 года назад

    Robert, you are an amazing teacher and I cannot overstate my appreciation for your videos...

  • @Connorses
    @Connorses Год назад

    "We programmed the robots to program more robots"
    "Oh. That's much worse!"

  • @SamB-gn7fw
    @SamB-gn7fw 3 года назад +10

    "Some of us" have brains lmao

  • @andybaldman
    @andybaldman 3 года назад

    3:24 I love how the earth pic he's using is the one from the infamous 'The End of the World' video from 2008. ('Fire zee missiles! But I'm le tired...')

  • @nemonomen3340
    @nemonomen3340 3 года назад

    I'm indescribably disappointed that a _Reward Modelling Part 2_ hasn't come out yet.

  • @vanderkarl3927
    @vanderkarl3927 3 года назад +1

    It's really funny how we can be thought of as misaligned agents:
    The gradient descent process which drives evolution and gene selection "wants" as many copies of the most successful genes as possible, so over the course of evolutionary history it has produced a bunch of organisms, various models which do this pretty well in competition and cooperation with each other, making a mat of highly complex biomass which blankets the earth and sky.
    Then, it starts enlarging the prefrontal cortex of some monkeys which seems to give unusually good results (thanks to communication and logic and such), and then, uh oh, some of the monkeys aren't having reproducing anymore because they're too preoccupied with their mouth sounds and "social dynamics", whatever those are. So, the ones which receive more reward signals for doing the sex and solving problems and inventing tools and such do better, which is all well and good for a while, but only 10,000 years later, practically the blink of an eye (or the evolution of a larger finch beak), they're all holed up in shells of silicon and iron with porn and video games and "generative pretrained transformers", and evolution can only shake its nonexistent metaphorical head and wonder, "where did I go wrong?"

  • @soren3569
    @soren3569 Год назад

    Another analogy for how this goes wrong, derived directly from your video. Humanity is a mesa-optimizer of natural selection--we don't have natural selection's goals of maximum genetic spread, but we do engage in behaviors that end up promoting that. So it would seem that humanity is doing a decent job as a mesa-optimizer. However, as our terminal goals do not align with natural selection's, but rather just happen to emulate them, we end up with some behaviors (such as late-stage capitalism) which ultimately threaten human genetic survival on a broad scale. The same drives (pleasure and security) that originally gave us strong pushes to our species' survival have now begun to threaten it.

  • @designheretic
    @designheretic 3 месяца назад

    5:12 plants are operating through a self-organizing manipulation of an intricate biochemical substrate-each one an instance for consideration by the multimodal optimizer we call evolution

  • @christopherlawnsby1474
    @christopherlawnsby1474 2 года назад

    This is so so so good. It made the issue crystal-clear to me, and I'm 100% new to this topic

  • @0hate9
    @0hate9 3 года назад

    This is a very good explanation of this issue.
    Commenting for the algorithm.

  • @villeneuveluc5540
    @villeneuveluc5540 3 года назад

    I don't know if you will read this but : the base optimizer(BO) objective depend on the mesa objective keeping working properly after its release, so the BO would prevent this issue. An advanced enought BO might want to force the mesa optimizer to keep the base objective. If the base optimizer is well enought designed, it would take mesures to not being overcome by its mesa optimizer thus reducing the problem to the initial outer alignment problem.

  • @fergdeff
    @fergdeff Год назад

    This is quite fascinating.
    I feel that this problem (or phenomena) is more likely to manifest where the degrees of freedom between training environments and real world environments are greater. As I understand it:
    Training environment: limited agency, monitored, task-based feedback.
    Real world environment: unlimited agency, unmonitored, self-regulated feedback.
    If your release model were to incorporate a staged, or tiered introduction to autonomous agency in the real world environment, then wouldn't this phenomenon be less likely to appear? Essentially, your AGI goes through a continual (but somehow ever decreasing) processs of training and evaluation that it meets the evolving criteria of being a "good citizen".
    This could also provide a framework for handling the fact that what constitiutes "good citizenship" in the real world is also itself a thing that changes.

  • @ReAnderson
    @ReAnderson 11 месяцев назад +1

    Robert, I just think you are amazing…I think I love you. 🤗

  • @sonicmeerkat
    @sonicmeerkat 3 года назад +1

    can i just say, love the lip sync job on the simpsons clip

  • @CarlJdP
    @CarlJdP 3 года назад

    Well said - this is basically my daily problem solving thought process put into words - I should change my job title to Optimizer!

  • @walgekaaren1783
    @walgekaaren1783 Год назад +1

    The chess parable is a fine way of describing AI, because there is such a thing as discusting engine moves. In order to understand that, you first have to know, that there is the romantic way of seeing chess and the postmodern way, there the first wants to achieve compositional beauty and establishment of sertain key principles, while the latter is an iconoclast and only wants to win, sometimes results in ugly scenarios, what are not good to watch for the audience. Engines seem to fall into that perfected latter stage, there you only care about the result, and not how you achieved it. For instance, an AI would not see a problem in Dark Waters movie scenario, there a Mega Corporation dumped teflon into the trenches and made everybody sick, because mostly he wont get sick in such environments. If an AI's objective steps on a dog or person or house, they will go right through, because they lack empathy and neural contemplation -- because all that needs to be programmed into them. Which makes the sentient AI problem even more intriguing. Why should an automaton develope morals to begin with, then those are a class of restrictions, making your decision making harder not easier? Why not go the shorter route, even if you cause genocide, or in the chess example, sack all pieces for the win, save the king and one mating piece.

  • @bloergk
    @bloergk 3 года назад +5

    At the end you write that, when reading the article, this was a "new class of problems" to you... But it just seems like an instance of the "sub-agent stability problem" (not sure of the proper terminology) you've explained before on Computerphile ruclips.net/video/3TYT1QfdfsM/видео.html.
    The only difference is that in this case, we are dumb enough to build the A.I. in a way that forces it to ALWAYS create a sub-agent.

    • @yondaime500
      @yondaime500 3 года назад +3

      The paper addresses this on page 6.
      > Possible misunderstanding: “mesa-optimizer” does not mean “subsytem” or “subagent.”

    • @bloergk
      @bloergk 3 года назад +1

      @@yondaime500 Thanks! I get how they're two very different ways for a "parent AI" to create a "child AI", but I still think my point makes sense: to me it seems like this distinction doesn't matter when it comes to the problems discussed by Rob in this video (i.e. in a broad sense: the features that keep the parent safe won't necessarily exist in the child, because the parent doesn't value "making my child exactly as safe as me").
      Scenario 1 (general, abstract): a seemingly safe agent decides to create a sub-agent in order to achieve a given objective more efficiently. Since the sub-agent needn't obey the same restrictions as the original, there is now a new agent without the original safety measures (so potentially a different utility function). THAT new creature can thus be misaligned with human values even though the original couldn't.
      Scenario 2 (specific, concrete): an optimizer is built, and it works by tinkering with a neural network (which will tend to act as another optimizer) until the latter appears efficient at achieving a given objective. Since the mesa-optimizer develops its implicit "utility function" with some randomness, it can stumble into a "utility function" that satisfies the base-optimizer's criteria for "appearing to achieve the given objective" under training conditions but doesn't properly match the given objective. THAT new creature can thus be misaligned with human values, even if the base-optimizer acted as a perfectly aligned agent.
      It's hard for me to pinpoint how exactly Scenario 2 reveals problems that aren't covered by the more abstract Scenario 1: doesn't the base-optimizer act as an agent in the process of creating another agent, and isn't the mesa-optimizer dangerous because its parent cares about transferring part of its objective, but DOESN'T care about transferring its "safe-ness"?

  • @socrates_the_great6209
    @socrates_the_great6209 2 года назад

    It is indeed crazy that humanity did not fix this problem like 5 years ago at least.

  • @magentasound_
    @magentasound_ 3 года назад

    The music at the end was nice 😊

  • @Yupppi
    @Yupppi 3 года назад

    Made me think how humans understand what's important in the exit sign. Essentially identifying with the symbol that represents a human (not a picture of a human, just a symbol that has similar enough features, same with the door) and at some point learning why the exit sign exists (for you to find a way out of the building/room, because you will eventually have a reason to want to leave it). So much learning contexts and concepts that are not directly related to seeing the sign and following, to understand its meaning. Certainly a huge task for people in the AI field to tackle without specifying everything and somehow making the AI to have a reason to leave. It does help us that our brain is almost too good at finding patterns, even when they don't necessarily exist. Yet not as good as an AI perhaps, but with the ability to discount some of them.

  • @dunzek943
    @dunzek943 2 года назад

    We think of building AI that we don't want to conflict with ourselves when we're in conflict with each other all the time.

  • @TorreFernand
    @TorreFernand 3 года назад

    This is basically "The bug that only appears outside of debug mode" but for AIs

  • @ВикторФирсов-е9ф
    @ВикторФирсов-е9ф 3 года назад

    I've waited for this for way too long

  • @Zarcondeegrissom
    @Zarcondeegrissom 3 года назад

    I think a particular paperclip maximizer "algorithm" is experimenting with this, given how many topics are not allowed to be discussed any more due to the banning of a single word. sun spots, black hole x-ray emissions, beer, and the SGU Destiny ship to name a few.

  • @jonathonjubb6626
    @jonathonjubb6626 3 года назад

    Thank you for spending your time educating the rest of us in such an pleasant manner.

  • @TheVnom
    @TheVnom 3 года назад +1

    Super entertaining and informative, as always

  • @DeusExRequiem
    @DeusExRequiem 3 года назад

    Part of the problem is that humans aren't likely to create an optimizer of sufficient quality without the help of other optimizers, so this is a problem we need to solve.
    Extinction events in the real world have come after huge changes to the environment,one caused by meteorite strike, another caused by life creating too much oxygen. Eventually this balances out and life bounces back, so maybe a similar technique is required, radically changing the main goal every now and then, having a wider ecosystem of AIs that need to regularly deal with these huge shifts which affect their survival.

  • @DeclanMBrennan
    @DeclanMBrennan 3 года назад +1

    So AI might optimize for the "Don't get caught" model of morality - that's both scary and depressing.

  • @nickm3694
    @nickm3694 3 года назад

    Describing humans as a Mesa optimizer of evolution was such an insightful way to look at things. I really like that

  • @sd4dfg2
    @sd4dfg2 3 года назад +1

    I still remember being surprised that bacteria don't contain the DNA to build complete new cells - they just have the code to copy themselves.