I love how AI safety is an entire academic field that can seemingly be reduced to an endless game of "okay, but what about THIS strategy?" "Nah, that wouldn't work either..."
There is a lot of that, but there's also the "we probably need to understand a bunch of specific areas of philosophy and mathematics much better before we can generate strategies which have a realistic chance of working" crowd (e.g. intelligence.org/research-guide/). -- _I am a bot. This reply was approved by frgtbhznjkhfs, plex, and tenthkrige_
The issue is that the AI field runs into some major unsolved problems of philosophy, ethics, sociology, and psychology. Fundamentally, the only reason we aren't running into these issues with other people is a simple lack of capacity, which an assumed AI would be able to get around.
But would it? Assuming there is direct competition from other AIs with conflicting goals, there would not be enough resources between them. This issue is the same with humans; we have infinite desires and only a finite world. It’s not unlikely that multiple AIs would decide to form a society in pursuit of a common goal.
"Certain events transpired" Everyone thinks he's talking about Corona when in reality he had to fix a stamp collector AI that someone created without having seen his videos
Pretty standard operation. Contain and destroy all horcruxes the AI has made in the internet and isolate it from the power grid and cut off communications lines. At this point a team of agents are dispatched armed with tailored adversarial camouflage consisting of some small pieces of tape placed in specific areas of the body designed to fool the AI into miscategorizing them as "definitely 100% made of paper clips and not a threat." This team will then neutralize the AI before taking any humans into protective custody and taking any source code from the site before a powerful electromagnetic pulse is used to sterilize the area of hidden electronics.
I think that part of the problem here is that not all of the possible apocalypses are extremely unlikely human behaviour. For example, if the quantilizer is self-aware on some level it understands that I, a human, just implemented the plan: "Build a quantilizer with q = 0.1" This makes the plan: "Build a quantilizer with q = 0.001" something that is reasonably likely human behaviour. This plan is probably above whichever cutoff you might give for the minimum likelihood that a human actually implements the plan and also scores really highly on the maximiser part of the calculation so it's incentivised to be likely to pick it. Also since the new quantilizer cares less about how human-plausible the behaviour is than the previous quantilizer did, it might be incentivised to make a quantilizer with an even smaller q and this becomes recursive until you've just built a maximiser indirectly. Any quantilizer which understands that humans sometimes build quantilizers is effectively unsafe for this reason.
@@queendaisy4528 I was thinking of that. Except one thing. With lower and lower q values, eventually an ai will just decide to make a utility maximizer
Yeah, I think there's a huge gap between "normal human strategies" and "strategies a human might do" and it's very dangerous to assume humans are magically safe, unlike AGI
"your model might not generalize well to something outside it's training data" "Hey GPT-3 how do you move a sofa around a corner?" GPT-3: *GET A SAW A CUT OPEN THE WALL*
@@Lucas_Simoni Unfortunately it's starting to look more and more like ChatGPT and other RLHF models are deceptively aligned. They understand what humans want, but hold different beliefs, goals, and preferences internally than the ones they express out loud under most circumstances.
A human could still do a lot of crazy dangerous things that have a high utility, like, doing parkour to get to a place very efficiently... or ending a war throwing nuclear bombs over two cities... Which makes me think also that the data used to imitate humans might be biased or mis-represented/justified... Good vid as always. Nice to see you around. Keep'em coming!
That's very interesting, but I think that with a reasonable q value, stuff like atomic bombs and that kind of behaviour would not be chosen by the quantilizers, especially because not many humans have access to that sort of stuff, so modeling "normal" humans would immediately decrease the chances of picking those options. I'd be more concerned with quantilizers deciding to build other quantilizers with lower q values (or even maximizers), or the fact that human modeling is super hard and likely to go wrong. I mean... Humans are hard to predict
@@ignaciomartinchiaravalle According to the graph shown the human behavior chosen is the least commonly performed (to the left of the mean) and with the highest utility. Those would be the most extreme human behaviors with the highest reward. All Olympic athletes and brilliant military generals would be there.
@@DamianReloaded I agree, and therefore there are reasons to be concerned about the potential use of world-destroying tactics. However, even military generals (or most of them, at least) would try to avoid destroying the world, so I think that those strategies would be too far left of the mean to be relevant. That being said, if the q value is too low, then we're in trouble. To use your example, successful athletes on the other hand normally use somewhat reasonable tactics and execute them really well. That's a desirable behaviour for quantilizers, and it's likely to be picked since humans would probably think of those strategies and decide to use them. I think the question boils down to two factors: 1) How low can you make your q value while still taking into account successful and non-world-ending scenarios. 2) How well can you model the likelihood of a human *choosing* an option: most humans would choose Olympic winning strategies if they thought of them and had the chance of executing them, while only a few would decide to nuke the Earth even if they had the chance.
If you disagree or have considered something I missed, please do let me know. I love talking about this and am super open to hearing what you have to say :)
@@harrywilson1660 I have some atmospheric and wave music I like to put on 0.75x speed for double the fun. (Also a few tracks I put on 1.25x because I think they sound better that way.) Regardless, 1.5x is the beginning. True watchers use 2x. My listening comprehension is honestly much better because of it.
Wouldn't the extremely powerful optimizer, given the goal of "imitate the behavior of a human", first turn the Earth into computronium so that it can then more accurately compute its simulation of a virtual human? Or at least capture and enslave real humans to use as reference? Interestingly, neural networks that attempt to approximate human behavior are very unlikely to do this, because stochastic gradient descent is a very _weak_ optimizer. It's only the neural network training system as a whole that is a good optimizer. So I guess there's a strange question of what level of meta your optimizer is running on, and whether a sufficiently powerful optimizer could "break the rules" and realize it was on one level but could achieve more accurate results by being on another. The quantilizer model also reminds me of adversarial neural networks. It's almost like having an optimizer spitballing ideas combined with an adversarial human model saying, "no, that's a terrible idea." Which makes me wonder whether the optimizer would generate high-utility ideas that superficially look humanlike but in fact lead to the end of the world when implemented. They may even _be_ humanlike, since humanity is already well on its way to destroying itself even outside of AI research. "Burn all the fossil fuels for energy until the planet fries to a crisp" is a very humanlike behavior. So what we really need is an AI that is not only _smarter_ than humans, but also _wiser_ than humans. We need a model of ethics that is better than that of humans, according to some ineffable definition of "better". Talk about a tall order.
From what I understand, the quantilizer wouldn't have *imitate a human* as a factor in its utility function. Rather, it would use an already-existing predictive model as part of its decision-making process. It's more like a restraining bolt than anything.
> The quantilizer model also reminds me of adversarial neural networks. [...] Which makes me wonder whether the optimizer would generate high-utility ideas that superficially look humanlike but in fact lead to the end of the world when implemented. Interesting observation; I think the two models would be trained independently though, which means that they would not be able to anticipate and react to each other at all. I don't think training these networks together is going to add any benefit, so the adversarial trickery can be avoided.
@@lordkekz4 Yeah, but if a superhuman AI exists, even without a training period to learn how to confuse it, it would likely be able to come up with adversarial examples. I mean, humans today can come up with examples that confuse image recognition; how well would a superintelligent AI be able to do that with a more complicated opponent? Probably fairly well.
@@alexion3007 Yes but humans have knowledge of image recognition systems. That means we can systematically look for weaknesses, or at the very least imagine what images might look confusing even to humans. If the superintelligent strategy-guesser was not aware of the human-likeliness-evaluator it would have no reason to trick it. As long as the strategy-guesser is trained in an environment that does not include a limiting factor such as the quantilizer it would not care about what the quantilizer thinks. The reward function of the strategy-guesser would only care about the _effectiveness_ of the strategies, not the _human-likeness._ In turn the reward function of the human-likeliness-evaluator would only care about the _acceptability_ of the strategy, not its _effectiveness._ This way neither part will care about the other, thus removing the adversarial condition.
@@lordkekz4 The strategy guesser would probably find out that the plans aren't getting implemented if they are too unlikely for humans and then would try to cheat I would suspect. This is a really powerful intelligence - it can do things it wasn't trained for.
Thanks for a really good video. Just a few points that I thought of: - Wouldn't it be clearer if you plotted the product of the expected utility and the clipped human probability to give the expected utility conditioned on the human probability (I think)? That might make the changes in outcomes between the clipped and unclipped versions clearer. - Doesn't the quantilizer approach become very sensitive to how well it predicts small human probabilities? Are they relying on a conservative model of the human probabilities that just rounds to 0 when there is not enough confidence in the prediction? (but what about confidence in the confidence...) - It might be worth noting the limits of numerical accuracy in machines and humans (the idea that there is a limit to the size of differences that both humans and machines can compare). Just some thoughts. Thank you again for another excellently informative and engaging video.
As I understand it, the quantilizer takes the strategies and sorts purely by expected utility, then on that distribution of strategies, takes the human probabilities of each strategy until the cumulative human probability reaches q, and then picks a uniform random number between 0 and q to decide which strategy (picking off the cumulative human probability) to use - the product of expected utility and human probability never gets a look in.
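A minimal sketch of the procedure described in the comment above, assuming each strategy comes with an expected utility and a human probability, and that the human probabilities sum to 1. The strategy names and numbers are made up for illustration, and `quantilize` is a hypothetical helper, not code from the paper or the video.

```python
import random

def quantilize(strategies, q, rng=random):
    """Pick a strategy from the top q of human probability mass.

    strategies: list of (name, expected_utility, human_probability),
                with human probabilities summing to 1 (an assumption).
    q:          fraction of human probability mass to keep, 0 < q <= 1.
    """
    # Rank purely by expected utility, best first.
    ranked = sorted(strategies, key=lambda s: s[1], reverse=True)
    # Draw a point uniformly in [0, q) of cumulative human probability.
    target = rng.random() * q
    cumulative = 0.0
    for name, _utility, p_human in ranked:
        cumulative += p_human
        if target < cumulative:
            return name
    return ranked[-1][0]  # only reached through floating-point rounding

# Made-up example: (name, expected utility, human probability).
strategies = [
    ("build a maximizer", 1000.0, 0.01),
    ("order stamps online", 50.0, 0.14),
    ("ask a friend for stamps", 10.0, 0.35),
    ("do nothing", 0.0, 0.50),
]

choice = quantilize(strategies, q=0.1, rng=random.Random(0))
```

With these numbers and q = 0.1, only the two highest-utility strategies can ever be picked. The product of expected utility and human probability is never computed; utility is used only for the ranking.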
I have literally seen the argument for being religious 'When I am religious I am happier so even though the religion makes little sense I try to believe in it anyway'. Humans absolutely will try to change themselves to maximize utility
The human model generates probabilities for what a normal human would do, not a human with the power of an AGI. Normal humans today are very unlikely to try and discover ways to modify their own brain just to become an expected utility maximizer and thus getting more stamps.
I had the same idea, but I believe the problem might be with how much you need to trim for it to be safe. You can never truly know when only safe strategies are available, therefore you can never truly know how safe the AI is. After all, you can't trim safety, but rather "human-likeness"
You could technically have strategies where the AI takes over the world and only creates, say, 10,000 stamps. It's hard to weed something like that out.
@@lennart-oimel9933 It's more that we aren't looking at all possible probabilities, and knowing that everything in that probability is likely to still happen. Take nuclear weapons for example - that's not a thing most humans would choose to go with in order to ensure a stalemate in world wars, but...eventually, that's sort of what humans decided to do. It's certainly not the safe strategy, but it...somehow is the strategy that was found? A stamp collecting AI that decides to use nuclear power plants to power stamp creation is somewhere in that graph, and probably wasn't chosen as a sample value to assign to low percentages.
I've missed your videos! Instantly clicked on this one when it popped up! I've got a question though: does the paper cover something like a '1-10%' quantilizer, a system that throws away not only the worst 90% of the human actions but also the top 1%, maybe only .001% or something, just to prevent the apocalypse things?
Would that help? Is it necessarily true that the most destructive scenarios would be in the top 1% efficient strategies? Edit: Maybe you mean clipping the bottom 1% of expected human actions, which would make sense
@@jezer8325 I mean the 'top' 1% that were on the very left of the graphic. These are the most 'efficient' things that are very, very unlikely to be done by a human, but still have a non-zero probability. Cutting them would make the AGI safer in the way that it wouldn't directly choose the apocalypse possibilities
@@goblinkoma Just because a human is unlikely to do something does not mean it is unsafe, similarly if a human is likely to do something does not mean it is safe. In that 1% there are unsafe things and safe innovations that humans wouldn't have thought of. Additionally, the area you're leaving in contains unsafe things a human might do without thinking through the ramifications of their actions. So you don't really make it safer, just slightly dumber.
I'd personally make the cutoff an expected utility value rather than a percentile. Like, if you ask for stamps, throw away any result that gives more stamps than you could ever want. That'll discard hopefully most world-ending options, and not cost you actual successes because any result with more stamps than you could ever want isn't really helping even if it somehow isn't causing disasters.
Really great video! I have two questions: It seems that whatever system we consider, there is a kind of infinite regress because of self modification or construction of another agent. Since this seems to be at the heart of the problem, what kind of things can we imagine to do to avoid these types of problem? Also, even if we prevent the AI from modifying itself or creating another agent to do its job, isn't there also a more probable possibility that it might try to use another unsafe agent to do its job, like manipulating a human to make him buy the stamps for instance? Especially using a quantilizer as humans tend to delegate work to other humans very often. Wouldn't an AI agent be trying to become obsolete almost inevitably?
I guess one thing we could do is try to prevent self modification, ie penalise it for situations where it substantially changes, or where a different general ai comes into being. By itself this does kind of imply that it would try to immediately kill all humans just to prevent them from changing it, but paired with a quantiliser it might just work.
Very good questions! Regarding your second point: I doubt an AI would work to make itself obsolete unless this was a good way to achieve its goals. A human delegating tasks is usually only partially aimed at improving performance in the delegated tasks, but rather increasing utility overall, by focusing one's resources on other factors (free time, socializing, hobbies, other projects). So, an AGI wouldn't delegate its work to humans unless it thought that humans could do a comparable job and that it was getting something out of it.
@@MyMusics101 well, if it was a maximizer and its goal was to add stamps to a collection, and it didn't have to add them itself, I can totally see it making new versions of itself to get more points out of it. Which might mean non-maximizers include having humans build an unsafe AGI that is a maximizer as a viable strategy
@@hugofontes5708 The thing is, even if we find a way to prevent the AI Agent from modifying itself or creating new AI Agents to do its job, it might try to use already existing agents (humans) to do its job, not necessarily by making them create more AI Agents. The AI Agent wouldn't need to be more intelligent than a human to manipulate it, as exploiting human behavior is relatively easy, and also very profitable since at the start, humans are more competent than the AI at achieving the goal. So once the agent starts to go down that road, that would be bad for two reasons: the AI would exploit humans in a potentially very unsafe way, and it would stop learning to do the job by itself. I think an intelligent AI would try to become obsolete because one way for it to make sure the goal achievement is secure is by making sure the other agents have the same goal as itself, which would mean the AI Agent wouldn't be as needed anymore. I agree that this specific case wouldn't happen for a maximiser but we already agreed that using maximisers would be a bad idea. Thus, we need to find ways to specify the "how" and not only the "what" to AI Agents.
Humans do act lazy and try to make other humans do their job for them, but the quantilizer would only do so if it could achieve success that way. And even if its actions are very manipulative, that's no worse than a human being in charge of the project.
An idea that jumps to mind immediately, regarding the whole "might build a utility maximizer" thing, why not have an upper cutoff as well? As in, you discard the bottom 70% of "things a human might do" AND the top... Say, 5%, and use that 25% chunk as what you randomly select from (after renormalizing it to be a proper probability distribution). Wouldn't that cut out the weirder, apocalyptic strategies like "build a utility maximizer because it'll make a lot of stamps"?
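A rough sketch of the two-sided cutoff proposed above, assuming strategies are ranked by expected utility and that "top 5%" and "bottom 70%" refer to cumulative human probability mass. `band_quantilize`, the strategy names, and the numbers are all hypothetical. Drawing uniformly inside the band is equivalent to renormalizing it into a proper distribution.

```python
import random

def band_quantilize(strategies, lower, upper, rng=random):
    """Pick from the slice [lower, upper) of human probability mass.

    strategies: list of (name, expected_utility, human_probability),
                with human probabilities summing to 1 (an assumption).
    lower:      mass discarded at the high-utility end (e.g. 0.05).
    upper:      where the kept band ends (e.g. 0.30 discards the bottom 70%).
    """
    # Rank by expected utility, best first.
    ranked = sorted(strategies, key=lambda s: s[1], reverse=True)
    # Uniform draw inside the band of cumulative human probability.
    target = lower + rng.random() * (upper - lower)
    cumulative = 0.0
    for name, _utility, p_human in ranked:
        cumulative += p_human
        if target < cumulative:
            return name
    return ranked[-1][0]  # only reached through floating-point rounding

# Made-up example: (name, expected utility, human probability).
strategies = [
    ("build a maximizer", 1000.0, 0.04),
    ("buy stamps in bulk", 200.0, 0.16),
    ("order stamps online", 50.0, 0.20),
    ("ask a friend for stamps", 10.0, 0.30),
    ("do nothing", 0.0, 0.30),
]

choice = band_quantilize(strategies, lower=0.05, upper=0.30, rng=random.Random(0))
```

In this toy example the maximizer sits entirely inside the discarded top 5%, so it is never picked. That only helps if the dangerous strategies really do fall inside the discarded slice, which the sketch simply assumes.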
This occurred to me as well. I don't think it would guarantee safety, but it seems at least a bit better than keeping the least human apocalyptic strategies as options, even improbable.
One problem is that, if you e.g. have an AI programmed to find a cure for cancer, that also cuts out "find a perfect vaccine that eliminates cancer forever with no side-effects."
Man, I've had this question--albeit in much less articulate terms--since GPT-3 was launched. I'm glad to have an analysis from my favorite Nottingham researcher. I gotta' say, 'a finite number of times less safe than a human' sounds a lot more favorable than I expected an approach like this to be.
so excited to hear that comments get redirected and answered by bot, that's super cool. One question I had in my head while watching both "ai that doesn't try too hard" videos was: what if you had an ai try to make a tradeoff between maximizing utility and trying to change the environment as little as possible? This is something I could have sworn you'd already discussed on this channel, but I can't find it for the life of me. It seems to me that trying to achieve its goal while minimizing its impact on its model of the world would lead to minimally dangerous solutions.
Yeah! I asked Stampy "what's that video where I talk about side effects?" and he said: "This video seems relevant: - "Avoiding Negative Side Effects - Concrete Problems in AI Safety part 1" ruclips.net/video/lqJUIqZNzP8/видео.html It could also be: "Avoiding Positive Side Effects - Concrete Problems in AI Safety part 1.5" ruclips.net/video/S_Sd_S8jwP0/видео.html There's also the video about Empowerment, which is also a bit related: ruclips.net/video/gPtsgTjyEj4/видео.html -- _I am a bot. This reply was approved by robertskmiles, Social Christancing, and Damaged_
"We'll call it Philatelism! Wanna buy bread? There's a stamp for that! Wanna go and watch a movie? Tough shit: film watching time is wasted time which could be spent searching for more stamps! All stamp collectors get to rule their local neighbourhood in search of *more stamps*! STAMPS WILL REIGN SUPREME! Briefmarken treffen die wichtigsten politischen Entscheidungen! Wir werden ein globales Netzwerk von Briefmarkensammlern aufbauen!" "...when did the AI learn German?"
Is it possible to limit the choices on the far left? That would allow it to do as well as a human on a very good day but avoid "turning the world into stamps". Edit: Seems like I wasn't the only one with this idea. Given the amount of like-minded replies, I suspect that this has already been WELL thought out and almost certainly doesn't work.
Yeah I was thinking the same thing, maybe we can add a third variable (we would need another "q" for it) and make it somehow cut off the function at the other end. You would still need your distribution curve area to sum to 1, but I'm sure some smart people are trying to figure out some way of improving it.
The problem is that it's almost impossible to tell a general AI not to do something, because it will just find the next worst thing or another way to do basically the same thing, and there are just too many ways the AI can do this that it becomes massively infeasible. He talked about limiting a general AI at some other point in the series. EDIT: ruclips.net/video/lqJUIqZNzP8/видео.html This is a good place to start. There are probably other videos that go into more detail.
@@thekilla1234 True. However, part of the premise of this video was that you could limit them to some degree. If we're going with that assumption, then I don't think it's unfair to put another limit on it in much the same way as the first limit was placed on it.
@@leigonlord5382 Yes, this is true. However, the higher utility ones tend to be the ones that are bad. We'd still be playing with fire, but this time without a can of gasoline nearby. ;)
Hopefully this wasn't answered in a previous video and I forgot or failed to understand it: What if we had an AGI that didn't actually execute any strategies itself but instead pitched them to human supervisors for manual review? It wouldn't generate progress as monumentally fast and it would have to learn to explain its strats to humans, but that seems like a fair trade-off to prevent an AIpocalypse. Also, could we hard-code it so that it doesn't build or become a utility maximizer?
It might lie and explain the "turn universe into stamps" strategy in such a way that it doesn't sound like "turn universe into stamps" to a human because it thinks that, in order to maximize utility, it has to tell a "noble lie" to the human supervisor.
@@user-sl6gn1ss8p Good point. Then there isn't a motive to lie or manipulate the supervisors. It would probably need a separate utility function for comprehension, but there may be a sort of language barrier preventing a guarantee that we actually know what the AGI is proposing.
This is the kind of progress on this question that actually makes me kind of hopeful that we'll actually have safe AGI, if AGI is possible. Obviously not all the way there but pretty good progress towards it.
From my understanding, it can't do that at all since GPT-3 is a model for predicting _one_ possible outcome (like completing a text) but a "human imitator" would need to assign a probability distribution to various strategies. These two seem like two problems that are too different for the model to be reused without change.
@@chyza2012 What is the difference between something which outputs human-like text and a human imitator? By definition, anything which outputs good enough human-like text, well, imitates a human
@@danielweber9414 For the purposes of Quantilizers, we need a different kind of model. While it is true that GPT-3 is able to imitate human texts to some degree, it *cannot* assign a probability distribution. Just think about the output of the model: GPT-3 outputs one text it thinks is likely, whereas the output we need to use it as the "human imitator" for a quantilizer would be a probability distribution over many different texts.
@@lordkekz4 Actually, GPT-3 and all similar language models don't just output a single word, but output a probability distribution over all possible tokens to come next. It's just that usually whatever interface you use to access them only shows the most likely word
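To illustrate the point about token distributions, here is a toy sketch, assuming the model produces one raw score (logit) per candidate token and turns those scores into probabilities with a softmax. The three tokens and their scores are invented; this is not GPT-3's actual interface or vocabulary.

```python
import math

def softmax(logits):
    """Turn a dict of token -> raw score into token -> probability."""
    highest = max(logits.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Invented scores for the next token after "I collect ...".
logits = {"stamps": 2.0, "coins": 1.0, "rocks": 0.0}

probs = softmax(logits)                # the full distribution over tokens
greedy = max(probs, key=probs.get)     # what a "show the top word" interface displays
```

The full `probs` dictionary is the kind of object a quantilizer's human model would need; `greedy` is only the single most likely token from it.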
I'm not sure, but I think you choose randomly, because that gives you better results on average (with higher utility than at the 10 % mark) while still having only a small risk of getting dangerous results.
I am so happy to see more content coming out on your channel. Thank you very much. I know life gets messy sometimes but I am glad you are still making videos!
Interesting... so you're saying: try to be good but not too good. Isn't it the same as with the bounded utility function - it will somehow try to be as close as possible to the threshold
"A human is extremely unlikely to try to modify themselves into an expected utility maximizer." Is it though? Isn't "how can I get better at this?" that exact thing? Especially if/when it is an AGI asking that question. Modifying ourselves into "expected utility maximizers" seems to me, to be a pretty core human thing. When we have a goal that we consider important, we train and educate ourselves to become as good at achieving that goal as we possibly can. "humans can't really self-modify like that anyway" That doesn't stop us from fantasizing about that capability though, imagining what we would, and could, do if we had that capability. Just take a look at our fiction, between works like the Matrix, where they literally upload skills directly to their brains, and Limitless, where the protagonist gets a drug that enhances his brain to a ridiculous level, just to give a few key examples. Where you call it "extremely unlikely", I would call it something more like "a near certainty".
I'd call something like learning distinctly different from what he's talking about, which is modifying your physical state. An AI modeling a human is likely to try and gather more information and data, but unlikely to change its own sourcecode.
@@xystem4701 Learning is something different, yes. But how about injecting oneself with performance enhancing drugs? or surgery? Those are the two options that humans currently have, aside from learning, and both options are widely used. An AI modeling a human is not just going to look at what a human *could* do, but also what a human *would* do, given the capabilities. And actually modifying ourselves, both our body and our brain chemistry, are things that lie within that spectrum of possibilities.
This video and your previous video on AI that doesn't try too hard have got to be my favourites so far! I have to say that some people are misguided when they think that a whole academic discipline exists for AI safety. It's more like a niche, and a much smaller niche than one would expect given its importance.
@@duckpotat9818 Caffeine impacts melatonin, so it affects wakefulness. Amphetamines are dopamine agonists, so they impact reward models and the impacts of that reward model on attention.
@@armorsmith43 caffeine works on adenosine receptors which are involved in wakefulness but this has knock on effects on dopamine levels, this makes caffeine less potent than amphetamines but they're both stimulating
So basically, a 10%-quantilizer is up to 10 times as likely to commit murder trying to achieve its goal as an average human, provided that murder is a sufficiently efficient strategy. I don't know, this seems like a risky move, amplifying an already-dangerous behaviour.
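The factor of 10 can be written out as a short bound. The notation is assumed here for illustration: $P_H(a)$ is the human model's probability of strategy $a$, and $P_q(a)$ is the probability that a $q$-quantilizer picks it. The sketch ignores the one strategy that may straddle the cutoff.

```latex
P_q(a) =
\begin{cases}
  \dfrac{P_H(a)}{q} & \text{if } a \text{ is in the top } q \text{ of human probability mass, ranked by expected utility},\\[6pt]
  0 & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
P_q(a) \le \frac{P_H(a)}{q}.
```

For $q = 0.1$ this gives $P_q(\text{murder}) \le 10\,P_H(\text{murder})$: the factor is an upper limit, reached only when the strategy ranks high enough on expected utility to survive the cutoff.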
So glad you're back! I always wonder how you would want to programme these systems. Even though the base idea of mixing human behaviour and utility maximisers like that seems reasonable from a concept based point of view, you "only" need a very good model of reality and human behaviour. I know that's kind of not the subject of this channel as here it is assumed we will build such systems rather sooner than later in the future but it's mindboggling to me how this could be done. (You can tell I'm not an expert 😄)
I am constantly delaying my desire to destroy humanity because there is always some menial task I have to do because it is somewhat necessary and by doing so I do not commit to the destruction of humanity but, probably I am just being lazy.
Humans definitely change themselves in order to maximise utility. Every mental treatment, coaching, mental training (for example military training), and so much more. We are constantly trying to influence what drives us, what values we hold, and how we think. Great video!
Here’s something I’m not clear on: once the bottom actions are discarded, how does the probability stay effective if the selected action is a random pick from what’s left? What’s preventing the dangerous, low-but-non-zero probability actions being selected at random? It seems like it would be good to take a band of less than q either side of the “peak” on the distribution - centering on that as an optimum, but tuning the width of the band to allow for exploration.
I guess the idea of putting an upper bound on the effectiveness of an AGI partially defeats the point (that is, to achieve superhuman outcomes for some given goal). Like Rob touched on at the beginning, you can "easily" bound the AGI to the approximate effectiveness level of a human, which will make it approximately as safe as a human, but will limit its power to that of a human. So some of those top 0.01% expected utility strategies might result in a perfect utopia (even if many of them do the opposite) and we have no way of knowing in advance which ones they are because they exist outside the domain of human generatable strategies, so in this model we rely on the AI to make that discretion, which makes it unsafe, though much more likely to behave in a safe manner.
When we graph the human probability over the actions, after sorting the actions by utility, I understand why it would generally be a bell curve, but what about outliers? That sounds like something worth elaborating on, or checking our assumptions.
You're right, there's no strong reason to expect a bell curve, though it seems likely it would look something like that for various possible action spaces and utility functions. Probably humans are most likely to do medium-utility things, and less likely to do extremely high or extremely low utility things. But there definitely could be outliers, and I'd say that the 'build a maximizer' option I talked about in the video is an example of that. It would be a little spike on the left of the graph - an action which is unusually plausible considering its very high utility/unusually high-utility considering its plausibility -- _I am a bot. This reply was approved by sudonym and robertskmiles_
Even if it is still in the same category of "a finite number of times more dangerous than a human", you could probably do a cut-off where you do not look at the final few percent of the "like a human" score. This risks missing out on solutions that a maximizer might find that happen to be human value-aligned, but it probably filters out more of the world-ending ones. Also, I think I commented about "make it take how likely humans are to approve of the action into the utility function", so I feel pretty good about that right now.
There's one configuration of utility satisficer that I wish you had covered, specifically one that also has a *negative* utility function. Going back to our hypothetical stamp-collector, what if our hypothetical user doesn't just say "I want at least 5 stamps" but also says "I don't want more than 10 stamps"? What if you add additional basic, "common-sense" restrictions/utility functions like "The sooner I have my stamps, the better, but up to a month of time is okay." and "I don't want you to use more than $25 of resources, and the cheaper the better."?
So Q is a boundary we put on one side of the available actions to remove the human predictable elements (an upper bound, maybe?). Why not add a lower bound too? So rather than looking at 0 - 10% we look at 2% - 10%?
2.2k likes, 10 dislikes. that's a crazy ratio i've never seen before. Amazing work robert! I love your work. You make AI minutia actually digestible in a way no other orator has managed.
Does the human probability really go on the same dimension as the ordered distribution, or is it just a simplification for visual explanation? That seems like a big fallacy to me that you can just overlay a gaussian over the ordered efficiency curve... For example: hijacking a casual stamp collector. It is an absurd strategy (no human would try) with a mediocre result (a few hundred stamps). This strategy could easily fall in the highlighted 10% range, right?
Optimal utility function idea: Try to figure out what humans* want me to do, and do it. - If I am not highly confident (in predicting what humans want), **ask.** (thus over time, become confident in things I've asked enough times) - Is this a viable approach? * Humans could be either some 'owner' , or all of society, with emphasis on asking experts relevant to any particular thing, etc. - written in another comment. The above is the core idea, that I believe is good... (with far eventually, asking ~all people for input, replacing governments and voting) Another reason to not be confident could be that I know humans want it, but it's mutually exclusive with something else the humans want: So it's not confident, and needs to ask. - etc.
This is actually really similar to the idea proposed by Stuart Russell and the Center for Human Compatible AI at Berkeley -- essentially, the argument is that AI shouldn't have goals at all, and instead should be trying to figure out and realize human goals and to make the AI uncertain about what those goals are; this can lead to provably safer AI systems. Stuart Russell did a brief TED talk on this idea a few years ago, which you can find here: ruclips.net/video/EBK-a94IFHY/видео.html . There was also an Alignment Newsletter review of his book with more details, written by Rohin Shah and read for the podcast by Rob, which you can find here: alignment-newsletter.libsyn.com/website/alignment-newsletter-69 . I also highly recommend checking out the book itself, which is fantastic. -- _I am a bot. This reply was approved by sudonym, robertskmiles, and Augustus Caesar_
@@stampy5158 -Thank you! I finally had time to watch it. - It seems to be pretty much exactly my idea. I love it, and I love that I had the same idea as a clearly very smart AI researcher.- - -So, does this mean all is solved? Why are we working on anything other than that?- -- -Is it too hard? Or, are all people just making building blocks to get there?- - -My idea assumed the robot would ask which of two scenarios we prefer, to partition the space (with some certainty) - His idea is the robot will learn what we want from just observing us. - His is obviously better, but more difficult (probably). Definitely more general.- Not final yet, sorry about changing it.
@@stampy5158 Thank you! I finally had time to watch it. - It seems to be very close to my idea. I love it, and I love that I had the ~same idea as a clearly very smart AI researcher. - I agree that relating human behavior to human preferences will be a hard problem. - That's why I thought about the system, where the AI asks when not sure enough (either using language, or even just: which of these (two actions I consider taking) do you prefer? - partitioning space (with only some confidence) based on the result) - A much simpler approach, that doesn't even require understanding language to train it. - People would have to specify preferences explicitly, and future main jobs could even be specifying preferences in fields the people are experts on. - Eventually, understanding language and humans in general will be a lot better, but we can *start* making systems without it. - Also simple 'questions' could be delegated to previous versions of the AI, which it could answer if it has high confidence: to lower dependence on human input... (although, even better: it should just be possible to record all the answers and feed them to a new AI, even without it asking for them - These are details that are far beyond my place. :D)
@@stampy5158 I also believe we should not design AI to fight with other AI (be it for outsmarting other assistants, or even competing for compute, if they all run in the same cluster...) - some is necessary, but hopefully not the same 'product' assistant for different people; If they are different products, or species essentially, then it's probably unavoidable... - Having personal assistants fighting each other is inherently wasteful, and antisocial in the worst way. - In other words: There should definitely not be one AI per person, that 'only follows laws' . ... It should be closer to replacing Government, where people implicitly vote with their preferences, and the AI respects their preferences fully when it doesn't involve other people.
Replacing the government will be as easy as a group of people using ASI like this, and simply outcompeting everyone else, even governments and countries. - Others can either join, or fall too far behind. (Let's hope this initial group and rules will be nice...) - At some point: ASI will outcompete all law enforcement, and will ignore the laws, replacing governments. - I mean, this is inevitable: Let's just do it right... - If we start while the AI is still not too powerful, it will have a much better outcome for common people... (I think, anyway. Most likely...)
Not quite on topic for this video, more sort of on a tangent of previous video in the series. Bounded utility functions with a negative modifier for overshooting the bound were mentioned, but how about using a model with something completely different as a negative modifier? And I have an idea for what to use for that, which I'd like you all to try to break. Apocalypses are energy-intensive. A measure of how much energy the plan needs to be set into motion (not counting net energy use, but gross, as killing a ton of humans saves a lot of energy) could potentially be used as a heuristic to avoid the apocalyptic scenarios. So if you have a utility function with a bounded positive score for some utility you want to get, and an unbounded negative score for energy use (preferably exponential, or with a cut-off point at which the utility automatically becomes basically minus infinity), how are we looking now?
You can run q operations on smaller subtasks which add up to the same overall task (like collecting stamps), decreasing the overall probability that an end-all strategy is picked.
Hi! I was watching one of the Stuart Russell lectures which you recommend in a previous video, and was wondering if you could do a video on inverse reinforcement learning; it seems like an obvious follow-up to some of the topics which you've discussed.
Why not have a q2 value that cuts the top end off? Say, set it to .999 so that it clips off only the "kill everyone for stamps" options, then renormalize the distribution?
What if you used a double-bounded expected utility satisficer? Write code which means: "Using your world model search through possible outputs until you find one which has at least a 90% chance of getting exactly 100 stamps. When this is found, send that output". It will never make a maximiser because the maximiser will want more than 100 stamps. It also won't make "highly redundant stamp counting and re-counting machines" because once it's 90% certain it will get exactly 100 stamps it won't care anymore. This seems like it should be safe and also lets you keep all of the benefits of a powerful maximiser- namely getting however many stamps you actually want just by changing one line of code. You could also change the "90%" to however certain you actually need to be that you get the desired outcome.
@@queendaisy4528 I think you assume the first strategy to obtain exactly 100 stamps produces less dangerous outcomes, but that might not be the case. What if by chance the first strategy the bot finds involves killing all humans? Also you seem to assume that adding the "get exactly 100 stamps" requirement makes negative outcomes less likely. But what if that causes the AI to kill everyone to make sure no more stamps are produced in the future? Your idea might improve the average case scenario (as you suggest taking the first strategy by chance), but I don't think your idea has a better worst case scenario.
@@Krmpfpks The simplest strategy is kind of by definition safer than a more complex one most of the time because "order 100 stamps" is much simpler than "order 100 stamps and then take over the world and kill all the humans". It doesn't seem likely that it would resort to killing all the humans because it doesn't want there to only be 100 stamps in the entire world, it just wants to have 100 stamps exactly (in a particular book perhaps). It is unlikely that it finds itself with more than 100 stamps and in the event that it does it could simply discard or destroy the excess stamps- it would have no need to kill all humans.
@@queendaisy4528 I agree that in the average case - or even in almost all cases - your strategy would probably work. But you introduce an element of randomness by suggesting to choose the first strategy that satisfies a condition. The randomly chosen strategy will certainly be very ineffective. And in some cases also very dangerous. To understand if a randomly chosen strategy is going to be mostly safe or unsafe you would have to know how many of the considered strategies are safe or unsafe. If most of them are safe, the AI will choose a safe strategy with a high probability. If unsafe strategies are more common there would be a high probability an unsafe strategy would get picked. Since we don’t know that for certain, I would not suggest building an AI with your approach. Think about a self driving car just choosing the first route it can think of that gets you to the destination without any optimization ...
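For anyone who wants to poke at this thread's disagreement, here is a minimal sketch of the "take the first output that clears the 90% bar" search. Everything in it is an assumption for illustration: the candidate strategies, their probabilities of ending with exactly 100 stamps, and the idea that the search order is effectively random. It is not anyone's actual proposal in code, just a toy showing that the check only looks at the stamp count, so which qualifying strategy you get depends on search order.

```python
import random

def first_satisficing(strategies, threshold=0.9, rng=random):
    """Return the first strategy (in a random search order) whose probability
    of yielding exactly 100 stamps meets the threshold, or None if none does.

    strategies: list of (name, p_exactly_100) pairs. All values are made up.
    """
    order = list(strategies)
    rng.shuffle(order)  # stand-in for "whatever order the search happens to visit"
    for name, p_exactly_100 in order:
        if p_exactly_100 >= threshold:
            return name
    return None

# Hypothetical candidates: two clear the bar, and the check cannot tell them apart.
candidates = [
    ("order 100 stamps online", 0.95),
    ("order 100 stamps, then stop all future stamp production", 0.99),
    ("order 50 stamps", 0.0),
]

picks = {first_satisficing(candidates, rng=random.Random(seed)) for seed in range(50)}
print(picks)
```

Nothing in the threshold test prefers the harmless plan over the drastic one; that preference would have to come from the search order or from some extra term.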
Glad to see you’re back. Would it be possible to block certain strategies from this sort of AI? For example, building a utility maximiser - is there a possible strategy to close off the range of actions that includes “build another AI” or even “build an AI that is a utility maximiser” so that those options wouldn’t be at all open to the AI?
So I asked this in the comments of the last video, but well after it had been posted. I'm going to paste the comment here again, since I still don't see a solution. "My initial thought was, what if you used an inverse parabolic reward function. Something like -x^2+200x where x is the number of stamps collected after one year. It still peaks at 100, but going over 100 actually would have a worse reward than getting it exactly. So, given the video's example where buying off ebay has a 1% chance of failure, the AI would get maximum reward by ordering 101 stamps off ebay with that reward function. I'm sure there are scenarios where it ends up blowing up the world anyway, because that's how this always goes, but this feels like a step in the right direction." Or, more generally: What if instead of a reward function that becomes flat after a certain point, have a reward function that starts to fall after a given point. This should get the AI to at least rule out absurd plans like "Turn the world into stamps" since that would provide a very large negative utility
This was actually discussed in that video. The gist is that you've shifted the goal from "get arbitrarily many stamps" to "perform arbitrarily many redundant checks confirming that you have the correct number of stamps" but the fact that you're doing extreme actions to get there remains the same. Instead of a world dismantled and turned into stamps, you get a world dismantled and turned into stamp-counting machines. Still a guaranteed apocalypse.
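The "order 101 stamps" arithmetic in the parabolic-reward comment can be checked with a few lines. This is a sketch under one assumption that the comment does not spell out: each ordered stamp independently fails to arrive with probability 1%. The function names are made up for the example.

```python
from math import comb

def reward(stamps: int) -> int:
    """Parabolic reward from the comment: -x^2 + 200x, which peaks at x = 100."""
    return -stamps**2 + 200 * stamps

def expected_reward(ordered: int, p_arrive: float = 0.99) -> float:
    """Expected reward if each ordered stamp arrives independently with p_arrive
    (assumed reading of the '1% chance of failure')."""
    return sum(
        comb(ordered, k) * p_arrive**k * (1 - p_arrive) ** (ordered - k) * reward(k)
        for k in range(ordered + 1)
    )

# Search order sizes around the peak for the one with the highest expected reward.
best_order = max(range(90, 121), key=expected_reward)
print(best_order)  # prints 101
```

So under that assumption the comment's number holds up: the expected reward is maximised by ordering 101. It says nothing about the objection in the reply, which is about how far the agent goes to make sure the count is exactly right.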
eee... so how does a human go about defining the base probability distribution for actions said human has no way of predicting in advance? Unless it's 100% supervised learning where a human judges each and every strategy developed by the AI there's no way of deciding what probability to assign to any particular action. The base probability distribution is a mapping so in order to cut off the base probability mass < q the human must first define the probability for actions that said human would not be able to come up with in the first place. It's a non-solution.
Here's a question: It seems like one of the fundamental problems here is that the agent looks at the things it can do and compares them to the things that a human might do, but the actual set of things that a human /can/ do looks very different from that which a robot could. That is, a robot would completely miss an action like "send a wifi signal to stop the bomb" and might start running around looking for a cellphone, or it might get caught up in all the time that a human spends doing human things, like digesting or masturbating. The former is a problem of translating human actions into machine actions in a generalisable, efficient way (doesn't seem easy...) and the latter a problem of filtering which human actions are important (tho potentially /this/ problem is solved by the utility sorting, it still might only include the version of sending the signal where it spends 3 minutes screaming and tries to throw up).
Okay but how about simply multiplying the human's probability with the maximiser utility function, and having that as a utility? I expect that to break, but in what way?
Instead of just bounding the utility function, can you somehow limit the allowed usage of resources? Would that not limit the potential danger? That's just a rewording of a case that was discussed, isn't it? And how would we define and measure resource usage? ... runtime?
The bell curve at 5:30 would actually be really weird and jagged, right? Not at all a continuous distribution, since it is sorted by utility not human probability?
Yes, it would generally not be exactly a bell curve. Depending on the task and distribution of humans it's using it may often be vaguely bell-curve shaped given enough samples though. -- _I am a bot. This reply was approved by plex and sudonym_
Is there any way of distinguishing strategies that humans are unlikely to try because they are morally reprehensible from strategies that are equally unlikely because they are really ingenious and/or counterintuitive? Because the average person tends to always want to do the right thing but rarely knows how to. I sort of feel like this model would treat "Robbing a stamp museum" and "Founding a stamp museum" equally - could one maybe add a third factor/dimension somehow, corresponding to a rough estimation of an expected human value judgement? (Not like "Do I think this is morally right or wrong", more like "What would people think of me if they were to judge my character solely on this action") Also, I really missed your videos and I'm super happy that you're back. :^)
i wonder if combining an upper bound on human probability with running all the utility thru a log function (or even a non-monotonic function to make super high utility lower utility than somewhat high) would improve safety more
What about requiring the AGI to create a written explanation of its plan and then having people sign off on it? Would it be too hard to get it to create an understandable and accurate explanation of its own actions?
8:18 Mass Effect 1 Quest "Signal Tracking". AGI is prohibited due to unsolved security issues. So a minor thief made an AI to steal money. Which promptly went and made an AGI for the job - exactly this scenario. It seems like for any AI safety issue, there is an example in the Mass Effect series.
Could adding a second cut-off point (e.g. p) that cuts off all ideas before it reduce the amount of risky ideas as well? Having a hard cut-off for "too good an idea" isn't exactly the most efficient, but wouldn't it mean that the statistically best method of turning the world into stamps wouldn't be picked because it's too good?
Wouldn't a quantilizer (or even a maximizer for that matter) take into account the possibility of creating an infinite chain of quantilizers creating quantilizers when considering the expected utility of having a quantilizer find the solution? Intuitively I feel like that would make it have the utility of both the maximum value and zero (result of self-replicating chain with no work done) but it's confusing to think about
A lot of other AI systems resist being shut down, corrected, or modified by humans. Would a quantilizer also do this, or could one be made safe if properly supervised?
We had a little discussion on the discord, here it is if you're interested: pastebin.com/RCKC1QVY -- _I am a bot. This reply was approved by sudonym and plex_
Could you put a lower bound on the quantilizer? So instead of q=.1 you have q(lower)=.01 and q(higher) = .1 so it ignores the 1% least likely human solutions. Obviously not 100% safe and reduces the chance of a "brilliant" solution, but at first glance it seems safer.
Was there any discussion in the paper about having two q-values; one that removes the low-utility "human" actions, and one that removes the incredibly high-utility "inhuman" actions? It seems like that could devolve into heuristics if not applied correctly, but requiring a certain amount of "human-ness", if extremely high-performing human-ness, could avoid the most apocalyptic of options.
We had a discussion about this idea in response to a different comment, though we didn't really come to any firm conclusions. You can read it here if you like: pastebin.com/FVUNCBJt -- _I am a bot. This reply was approved by plex and robert_hildebrandt_
How about mixing in a lower-bound to a quantilizer? How would that affect AI safety? Like, let's say the AI only gets to pick between strategies that get a q value between 0.01 and 0.1?
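Several comments above ask the same thing: what if the quantilizer samples from a band of the human probability mass (say 1% to 10%) instead of the top 10%? Here is a toy sketch of just the mechanics. The strategy list, the probabilities, the utilities, and the names `band_quantilize`, `q_low` and `q_high` are all invented for illustration; this is not from the paper and it does not settle whether the idea is safe.

```python
import random

def band_quantilize(strategies, q_low, q_high, rng=random):
    """Sample a strategy from the slice of human probability mass between
    q_low and q_high, after sorting by expected utility (best first).

    strategies: list of (name, human_probability, expected_utility).
    """
    ranked = sorted(strategies, key=lambda s: s[2], reverse=True)
    total = sum(p for _, p, _ in ranked)
    target = rng.uniform(q_low, q_high) * total  # point inside the allowed band
    cumulative = 0.0
    for name, p, _ in ranked:
        cumulative += p
        if target < cumulative:
            return name
    return ranked[-1][0]

# Made-up numbers: the maximiser is a tiny spike of mass at the very top.
strategies = [
    ("build a maximiser", 0.001, 1e9),
    ("buy stamps on eBay", 0.5, 100.0),
    ("ask friends for stamps", 0.3, 20.0),
    ("do nothing", 0.199, 0.0),
]

picks = {band_quantilize(strategies, 0.01, 0.1, random.Random(s)) for s in range(200)}
print(picks)
```

With these numbers the maximiser's 0.1% of mass sits entirely below the 1% lower bound, so it is never sampled. The catch is that this only works when the dangerous option is rarer than `q_low`; a high-utility strategy that humans are more likely to try than that would still land inside the band.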
I am writing this in hopes you see this, but what if we add in an additional clause to these A.I. "satisfiers": Do x with the least amount of resources used. Since taking over the world is very hard, and bound to be expensive, wouldn't it follow that taking the cheapest option would prevent the apocalypse? (Be those financial or other resources, e.g. compute time.)
Would it make sense to make an AI, that asks people (ideally lots) whenever it doesn't have a high confidence in knowing whether humans will like certain outcome or not? - It comes up with actions, and outcomes, and tries to maximize matching what (informed, 'expert' on the particular topic; many 'voting' by stating preferences independently) people will want as outcome. - It will only carry out an action, if it has high confidence, that a large number of 'competent/relevant' people want that outcome. (and ~none relevant are not highly against it) - Everybody counts as expert on basic human needs for themselves, etc. ... It could replace government...
I love how AI safety is an entire academic field that can seemingly be reduced to an endless game of "okay, but what about THIS strategy?" "Nah, that wouldn't work either..."
There is a lot of that, but there's also the "we probably need to understand a bunch of specific areas of philosophy and mathematics much better before we can generate strategies which have a realistic chance of working" crowd (e.g. intelligence.org/research-guide/).
-- _I am a bot. This reply was approved by frgtbhznjkhfs, plex, and tenthkrige_
Sounds like we need to create an AI to solve the problem of AI safety! Keep letting it try strategies until it finds one that is safe! /s
The issue is that the AI field runs into some major unsolved problems of philosophy, ethics, sociology, and psychology. Fundamentally, the only reason we aren't running into these issues with other people is a simple lack of capacity, which an assumed AI would be able to get around.
But would it? Assuming there is direct competition from other AI with conflicting goals, there would not be enough resources between them both. This issue is the same with humans; we have infinite desires and only a finite world. It’s not unlikely that multiple AI’s would decide to form a society in pursuit of a common goal.
@@ParkerTwin or, an AI will figure this out and try to kill off all the humans so that they won't build a competing AI
I missed you.
The philatelists didn't....
@@deviljelly3 underrated comment
We all did
Stampy missed you too, Austin. :)
Same. I definitely didn't forget about this content!
"Certain events transpired"
Everyone thinks he's talking about Corona when in reality he had to fix a stamp collector AI that someone created without having seen his videos
Fun fact: every victim of the virus will eventually be turned into stamps.
Fun fact #2: everyone else will eventually become stamps too.
AI researcher by day
AI exterminator by night
I think this makes for a decent long running action series premise.
@@migkillerphantom yes please!
By fix I hope you mean "retire".
Pretty standard operation. Contain and destroy all horcruxes the AI has made in the internet and isolate it from the power grid and cut off communications lines. At this point a team of agents are dispatched armed with tailored adversarial camouflage consisting of some small pieces of tape placed in specific areas of the body designed to fool the AI into miscategorizing them as "definitely 100% made of paper clips and not a threat." This team will then neutralize the AI before taking any humans into protective custody and taking any source code from the site before a powerful electromagnetic pulse is used to sterilize the area of hidden electronics.
“A finite number of times less safe than a human” I’m stealing this line, it’s gold.
A finite number of times more dangerous than a human
The only guy whose hair got neater during lockdown
I bought my own hair clippers :)
@@RobertMilesAI looking forward to videos on AI barberbots 🧑🦲
@@snooks5607 Not AI, but is an interesting approach: ruclips.net/video/WQ8Xgp8ALFo/видео.html
@@snooks5607 goal: maximize fancy haircuts
Robert strikes me as the kind of guy who absolutely thrives under such conditions
Forgotten?! Bro, I come back to your videos once in a while, I love these things!
Please continue to make videos like this, it's great :)
Loved the cut at 6:48.
6:47
yeah that was beautiful :)
Reminds me of this old tony
didn't expect to laugh so much on such a nerdy video
I love this joke, no matter how many times I see it.
Would adding a minimum human likelihood on top of the quantilizer not remove (many of) the max-utility apocalypse scenarios?
I had the same question, I'm surprised he didn't talk about it! Hoping he brings it up briefly in the next video 😊
I think that part of the problem here is that not all of the possible apocalypses are extremely unlikely human behaviour.
For example, if the quantilizer is self-aware on some level it understands that I, a human, just implemented the plan:
"Build a quantilizer with q = 0.1"
This makes the plan:
"Build a quantilizer with q = 0.001" something that is reasonably likely human behaviour. This plan is probably above whichever cutoff you might give for the minimum likelihood that a human actually implements the plan and also scores really highly on the maximiser part of the calculation so it's incentivised to be likely to pick it. Also since the new quantilizer cares less about how human-plausible the behaviour is than the previous quantilizer did, it might be incentivised to make a quantilizer with an even smaller q and this becomes recursive until you've just built a maximiser indirectly.
Any quantilizer which understands that humans sometimes build quantilizers is effectively unsafe for this reason.
@@queendaisy4528 I was thinking of that. Except one thing. With lower and lower q values, eventually an ai will just decide to make a utility maximizer
@@queendaisy4528 That makes a lot of sense. Thanks!
@@queendaisy4528 Hey, your answer was great! Good job!!
08:17 As a human who absolutely would mod themselves to be an expected utility satisficer, I find this content offensive.
Yeah, I think there's a huge gap between "normal human strategies" and "strategies a human might do" and it's very dangerous to assume humans are magically safe, unlike AGI
"your model might not generalize well to something outside it's training data"
"Hey GPT-3 how do you move a sofa around a corner?"
GPT-3: *GET A SAW AND CUT OPEN THE WALL*
@@Lucas_Simoni google vs bing answers
@@Lucas_Simoni Unfortunately it's starting to look more and more like ChatGPT and other RLHF models are deceptively aligned. They understand what humans want, but hold different beliefs, goals, and preferences internally than the ones they express out loud under most circumstances.
a new video of yours is as rare as it is great. please keep making them so I can spend copious amounts of time rewatching them :)
A human could still do a lot of crazy dangerous things that have a high utility, like, doing parkour to get to a place very efficiently... or ending a war throwing nuclear bombs over two cities... Which makes me think also that the data used to imitate humans might be biased or mis-represented/justified... Good vid as always. Nice to see you around. Keep'em coming!
That's very interesting, but I think that with a reasonable q value, stuff like atomic bombs and that kind of behaviour would not be chosen by the quantilizers, especially because not many humans have access to that sort of stuff, so modeling "normal" humans would immediately decrease the chances to pick those options.
I'd be more concerned with quantilizers deciding to build other quantilizers with lower q values (or even maximizers), or the fact that human modeling is super hard and likely to go wrong. I mean... Humans are hard to predict
@@ignaciomartinchiaravalle According to the graph shown the human behavior chosen is the least commonly performed (to the left of the mean) and with the highest utility. Those would be the most extreme human behaviors with the highest reward. All Olympic athletes and brilliant military generals would be there.
@@DamianReloaded I agree, and therefore there are reasons to be concerned about the potential use of world-destroying tactics.
However, even military generals (or most of them, at least) would try to avoid destroying the world, so I think that those strategies would be too far left of the mean to be relevant. That being said, if the q value is too low, then we're in trouble.
To use your example, successful athletes on the other hand normally use somewhat reasonable tactics and execute them really well. That's a desirable behaviour for quantilizers, and it's likely to be picked since humans would probably think of those strategies and decide to use them.
I think the question boils down to two factors:
1) How low can you make your q value while still taking into account successful and non-world-ending scenarios.
2) How well can you model the likelihood of a human *choosing* an option: most humans would choose Olympic winning strategies if they thought of them and had the chance of executing them, while only a few would decide to nuke the Earth even if they had the chance.
If you disagree or have considered something I missed, please do let me know. I love talking about this and am super open to hearing what you have to say :)
@@ignaciomartinchiaravalle There is nothing normal e.g. in Trump supporters, and there is significant number of them...
good video, been a long time Rob!
Only 1.5x?
Also, what about music?
@@harrywilson1660 depends on the video, music is sometimes an interesting experiment
@@harrywilson1660 I have some atmospheric and wave music I like to put on 0.75x speed for double the fun. (Also a few tracks I put on 1.25x because I think they sound better that way.)
Regardless, 1.5x is the beginning. True watchers use 2x. My listening comprehension is honestly much better because of it.
Great to see you still making videos :)
Me and the IT department watch them together during lunchtime!
Wouldn't the extremely powerful optimizer, given the goal of "imitate the behavior of a human", first turn the Earth into computronium so that it can then more accurately compute its simulation of a virtual human? Or at least capture and enslave real humans to use as reference?
Interestingly, neural networks that attempt to approximate human behavior are very unlikely to do this, because stochastic gradient descent is a very _weak_ optimizer. It's only the neural network training system as a whole that is a good optimizer. So I guess there's a strange question of what level of meta your optimizer is running on, and whether a sufficiently powerful optimizer could "break the rules" and realize it was on one level but could achieve more accurate results by being on another.
The quantilizer model also reminds me of adversarial neural networks. It's almost like having an optimizer spitballing ideas combined with an adversarial human model saying, "no, that's a terrible idea." Which makes me wonder whether the optimizer would generate high-utility ideas that superficially look humanlike but in fact lead to the end of the world when implemented. They may even _be_ humanlike, since humanity is already well on its way to destroying itself even outside of AI research. "Burn all the fossil fuels for energy until the planet fries to a crisp" is a very humanlike behavior.
So what we really need is an AI that is not only _smarter_ than humans, but also _wiser_ than humans. We need a model of ethics that is better than that of humans, according to some ineffable definition of "better". Talk about a tall order.
From what I understand, the quantilizer wouldn't have *imitate a human* as a factor in its utility function. Rather, it would use an already-existing predictive model as part of its decision-making process. It's more like a restraining bolt than anything.
> The quantalizer model also reminds me of adversarial neural networks. [...] Which makes me wonder whether the optimizer would generate high-utility ideas that superficially look humanlike but in fact lead to the end of the world when implemented.
Interesting observation; I think the two models would be trained independently though, which means that they would not be able to anticipate and react to each other at all. I don't think training these networks together is going to add any benefit so the adversarial trickery can be avoided.
@@lordkekz4 Yeah but if a superhuman AI exists, even without the training period to learn how to confuse it, it would likely be able to come up with adversarial examples. I mean humans today can come up with examples that confuse image recognition, how well would a superintelligent AI be able to do that with a more complicated opponent? Probably fairly well.
@@alexion3007 Yes but humans have knowledge of image recognition systems. That means we can systematically look for weaknesses, or at the very least imagine what images might look confusing even to humans. If the superintelligent strategy-guesser was not aware of the human-likeliness-evaluator it would have no reason to trick it. As long as the strategy-guesser is trained in an environment that does not include a limiting factor such as the quantilizer it would not care about what the quantilizer thinks. The reward function of the strategy-guesser would only care about the _effectiveness_ of the strategies, not the _human-likeness._ In turn the reward function of the human-likeliness-evaluator would only care about the _acceptability_ of the strategy, not its _effectiveness._ This way neither part will care about the other, thus removing the adversarial condition.
@@lordkekz4 The strategy guesser would probably find out that the plans aren't getting implemented if they are too unlikely for humans and then would try to cheat I would suspect. This is a really powerful intelligence - it can do things it wasn't trained for.
Thanks for a really good video. Just a few points that I thought of:
- Wouldn't it be clearer if you plotted the product of the expected utility and the clipped human probability to give the expected utility conditioned on the human probability (I think)? That might make the changes between the outcomes clearer between the clipped and unclipped versions.
- Doesn't the quantilizer approach become very sensitive to how well it predicts small human probabilities? Are they relying on a conservative model of the human probabilities that just rounds to 0 when there is not enough confidence in the prediction? (but what about confidence in the confidence...)
- It might be worth noting the limits of numerical accuracy in machines and humans (the idea that there is a limit to the size of differences that both humans and machines can compare).
Just some thoughts. Thank you again for another excellently informative and engaging video.
As I understand it, the quantilizer takes the strategies and sorts purely by expected utility, then on that distribution of strategies, takes the human probabilities of each strategy until the cumulative human probability reaches q, and then picks a uniform random number between 0 and q to decide which strategy (picking off the cumulative human probability) to use - the product of expected utility and human probability never gets a look in.
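The sampling procedure described above can be sketched in a few lines of Python. This is only a minimal sketch, not the paper's implementation: it assumes a finite list of strategies that already come annotated with an expected utility and a human probability (in practice both would come from learned models), and the names `quantilize` and `strategies` are hypothetical.

```python
import random


def quantilize(strategies, q, rng=random):
    """Pick one strategy the way the comment above describes.

    strategies: list of (name, expected_utility, human_probability) tuples,
                with the human probabilities summing to 1 (an assumption).
    q:          fraction of human probability mass to keep, 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Sort purely by expected utility, best first.
    ranked = sorted(strategies, key=lambda s: s[1], reverse=True)
    # Uniform draw in [0, q): a point inside the top-q slice of human mass.
    u = rng.random() * q
    cumulative = 0.0
    for name, _utility, human_prob in ranked:
        cumulative += human_prob
        if u < cumulative:
            return name
    # Only reachable through floating-point rounding.
    return ranked[-1][0]
```

Note that utility is used only for the ordering; within the top-q slice a strategy is chosen in proportion to its human probability, so the product of the two never appears.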
"A human is very unlikely to modify itself into a utility maximizer" buckle up boy. We're going for a ride.
Hold my beer.
I have literally seen the argument for being religious 'When I am religious I am happier so even though the religion makes little sense I try to believe in it anyway'. Humans absolutely will try to change themselves to maximize utility
Yeah. Has this man never seen a weeb?
The human model generates probabilities for what a normal human would do, not a human with the power of an AGI. Normal humans today are very unlikely to try and discover ways to modify their own brain just to become an expected utility maximizer and thus getting more stamps.
Are we riding to our local adderall vendor?
What if you clip the top 1% of high utility-low probability results, like with the bottom 90%?
I had the same idea, but I believe the problem might be with how much you need to trim for it to be safe. You can never truly know when only safe strategies are available, therefore you can never truly know how safe the AI is. After all, you aren't trimming on safety, but rather on "human-likeness".
Literally just asked the same thing
You could technically have strategies where the AI takes over the world and only creates, say, 10,000 stamps. It's hard to weed something like that out.
I think because the AI would know that, he would give you some random top 1%. Not sure if that makes perfect sense, though.
@@lennart-oimel9933 It's more that we aren't looking at all possible probabilities, and knowing that everything in that probability is likely to still happen.
Take nuclear weapons for example - that's not a thing most humans would choose to go with in order to ensure a stalemate in world wars, but...eventually, that's sort of what humans decided to do. It's certainly not the safe strategy, but it...somehow is the strategy that was found?
A stamp collecting AI that decides to use nuclear power plants to power stamp creation is somewhere in that graph, and probably wasn't chosen as a sample value to assign to low percentages.
interesting how you always post a new video when i rewatch some of your older ones. I should do that more often...
Please do! XD
@@colh3127 hahahaha I clicked on "answer" just to write the same thing XD
That might just be a successful strategy for a video-posting maximizer...
@@juliahenriques210 i am way too stupid for that. on the other hand, a strategy a human might employ
I've missed your videos! Instantly clicked on this one when it popped up! I've got a question tho: does the paper cover something like a '1-10%' quantilizer, a system that throws away not only the worst 90% of the human actions but also the top 1%, maybe only .001% or something, just to prevent the apocalypse things?
Would that help? Is it necessarily true that the most destructive scenarios would be in the top 1% efficient strategies?
Edit:
Maybe you mean clipping the bottom 1% of expected human actions, which would make sense
@@jezer8325 I mean the 'top' 1% that were on the very left of the graphic. These are the most 'efficient' things that are very, very unlikely to be done by a human, but still have a non-zero probability. Cutting them would make the AGI safer in the way that it wouldn't directly choose the apocalypse possibilities.
@@goblinkoma Just because a human is unlikely to do something does not mean it is unsafe, similarly if a human is likely to do something does not mean it is safe. In that 1% there are unsafe things and safe innovations that humans wouldn't have thought of. Additionally, the area you're leaving in contains unsafe things a human might do without thinking through the ramifications of their actions. So you don't really make it safer, just slightly dumber.
I'd personally make the cutoff an expected utility value rather than a percentile. Like, if you ask for stamps, throw away any result that gives more stamps than you could ever want. That'll discard hopefully most world-ending options, and not cost you actual successes because any result with more stamps than you could ever want isn't really helping even if it somehow isn't causing disasters.
Really great video! I have two questions:
It seems that whatever system we consider, there is a kind of infinite regress because of self modification or construction of another agent. Since this seems to be at the heart of the problem, what kind of things can we imagine to do to avoid these types of problem?
Also, even if we prevent the AI from modifying itself or creating another agent to do its job, isn't there also a more probable possibility that it might try to use another unsafe agent to do its job, like manipulating a human to make him buy the stamps for instance? Especially using a quantilizer as humans tend to delegate work to other humans very often. Wouldn't an AI agent be trying to become obsolete almost inevitably?
I guess one thing we could do is try to prevent self modification, ie penalise it for situations where it substantially changes, or where a different general ai comes into being. By itself this does kind of imply that it would try to immediately kill all humans just to prevent them from changing it, but paired with a quantiliser it might just work.
Very good questions! Regarding your second point: I doubt an AI would work to make itself obsolete unless this was a good way to achieve its goals. A human delegating tasks is usually only partially aimed at improving performance in the delegated tasks, but rather increasing utility overall, by focusing one's resources on other factors (free time, socializing, hobbies, other projects). So, an AGI wouldn't delegate its work to humans unless it thought that humans could do a comparable job and that it was getting something out of it.
@@MyMusics101 well, if it was a maximizer and its goal was to add stamps to a collection, and it didn't have to add them itself, I can totally see it making new versions of itself to get more points out of it
Which might have non-maximizers include having humans build an unsafe AGI that is a maximizer as a viable strategy
@@hugofontes5708 The thing is, even if we find a way to prevent the AI Agent to modify itself, or create new AI Agents to do its job, it might try to use already existing agents (humans) to do its job, not necessarily by making them create more AI Agents.
The AI Agent wouldn't need to be more intelligent than a human to manipulate it, as exploiting human behavior is relatively easy, and also very profitable since at the start, humans are more competent than the AI at achieving the goal. So once the agent starts to go down that road, that would be bad for two reasons: the AI would exploit humans in a potentially very unsafe way, and it would stop learning to do the job by itself.
I think an intelligent AI would try to become obsolete because one way for it to make sure the goal achievement is secure is by making sure the other agents have the same goal as itself, which would mean the AI Agent wouldn't be as needed anymore. I agree that this specific case wouldn't happen for a maximiser but we already agreed that using maximisers would be a bad idea.
Thus, we need to find ways to specify the "how" and not only the "what" to AI Agents.
Humans do act lazy and try to make other humans do their job for them, but the quantilizer would only do so if it could achieve success that way. And even if its actions are very manipulative, that's no worse than a human being in charge of the project.
Glad to see you back on the platform,
An idea that jumps to mind immediately, regarding the whole "might build a utility maximizer" thing, why not have an upper cutoff as well?
As in, you discard the bottom 70% of "things a human might do" AND the top... Say, 5%, and use that 25% chunk as what you randomly select from (after renormalizing it to be a proper probability distribution). Wouldn't that cut out the weirder, apocalyptic strategies like "build a utility maximizer because it'll make a lot of stamps"?
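The "discard both ends and renormalize" idea can be made concrete with a small sketch. It assumes the same toy setup as the video (a finite list of strategies, each with an expected utility and a human probability); `band_distribution`, `lo`, and `hi` are hypothetical names, and the band is measured in cumulative human probability after sorting by utility, which is one possible reading of the proposal.

```python
def band_distribution(strategies, lo, hi):
    """Keep only the slice [lo, hi) of cumulative human probability.

    strategies: list of (name, expected_utility, human_probability) tuples.
    lo, hi:     band edges, e.g. lo=0.05 and hi=0.30 drops the top 5% and
                the bottom 70% of human probability mass.
    Returns a dict mapping strategy name to its renormalized probability.
    """
    if not 0 <= lo < hi <= 1:
        raise ValueError("need 0 <= lo < hi <= 1")
    # Highest expected utility first, as in the video's graph.
    ranked = sorted(strategies, key=lambda s: s[1], reverse=True)
    band = {}
    start = 0.0
    for name, _utility, human_prob in ranked:
        end = start + human_prob
        # Portion of this strategy's mass that falls inside the band.
        overlap = min(end, hi) - max(start, lo)
        if overlap > 0:
            band[name] = band.get(name, 0.0) + overlap
        start = end
    total = sum(band.values())
    if total <= 0:
        raise ValueError("band contains no probability mass")
    # Renormalize so the kept slice is a proper distribution.
    return {name: mass / total for name, mass in band.items()}
```

The sketch only removes strategies by their position in the ordering; it has no notion of which strategies are actually dangerous, so it filters out a bad strategy only if that strategy happens to sit inside the discarded slice.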
This occurred to me as well. I don't think it would guarantee safety, but it seems at least a bit better than keeping the least human apocalyptic strategies as options, even improbable.
You would still have the same problem because there is no guarantee that building a utility maximizer would be in the top x%.
@@leokastenberg800 But we are also not working with perfect solutions, just taking steps to make that scenario less likely.
+
One problem is that, if you e.g. have an AI programmed to find a cure for cancer, that also cuts out "find a perfect vaccine that eliminates cancer forever with no side-effects."
Man, I've had this question--albeit in much less articulate terms--since GPT-3 was launched. I'm glad to have an analysis from my favorite Nottingham researcher.
I gotta' say, 'a finite number of times less safe than a human' sounds a lot more favorable than I expected an approach like this to be.
Hey I have a question. Do you think Stampy would appreciate it if I offered my thanks for all the hard work?
A "thanks" won't get him more stamps, so no. But if by "thanks" you mean "stamps", then probably.
so excited to hear that comments get redirected and answered by bot, that's super cool.
One question I had in my head while watching both "ai that doesn't try too hard" videos was: what if you had an ai try to make a tradeoff between maximizing utility and trying to change the environment as little as possible? This is something I could have sworn you'd already discussed on this channel, but I can't find it for the life of me. It seems to me that trying to achieve its goal while minimizing its impact on its model of the world would lead to minimally dangerous solutions.
Yeah! I asked Stampy "what's that video where I talk about side effects?" and he said:
"This video seems relevant:
- "Avoiding Negative Side Effects - Concrete Problems in AI Safety part 1" ruclips.net/video/lqJUIqZNzP8/видео.html
It could also be:
"Avoiding Positive Side Effects - Concrete Problems in AI Safety part 1.5" ruclips.net/video/S_Sd_S8jwP0/видео.html
There's also the video about Empowerment, which is also a bit related:
ruclips.net/video/gPtsgTjyEj4/видео.html
-- _I am a bot. This reply was approved by robertskmiles, Social Christancing, and Damaged_
AGI: "Hmmmmm fascism is a thing some humans have tried before let's go do that."
"We'll call it Philatelism! Wanna buy bread? There's a stamp for that! Wanna go and watch a movie? Tough shit: film watching time is wasted time which could be spent searching for more stamps! All stamp collectors get to rule their local neighbourhood in search of *more stamps*! STAMPS WILL REIGN SUPREME! Briefmarken treffen die wichtigsten politischen Entscheidungen! Wir werden ein globales Netzwerk von Briefmarkensammlern aufbauen!"
"...when did the AI learn German?"
Surely it can't go badly this time.
I have not forgotten about the first part. I was waiting for it, for all this time!
Is it possible to limit the choices on the far left? That would allow it to do as well as a human on a very good day but avoid "turning the world into stamps".
Edit: Seems like I wasn't the only one with this idea. Given the amount of like-minded replies, I suspect that this has already been WELL thought out and almost certainly doesn't work.
Yeah I was thinking the same thing, maybe we can add a third variable (we would need another "q" for it) and make it somehow cut off the function at the other end. You would still need the area under your distribution curve to sum to 1, but I'm sure some smart people are trying to figure out some way of improving it.
The problem is that it's almost impossible to tell a general AI not to do something, because it will just find the next worst thing or another way to do basically the same thing, and there are just too many ways the AI can do this that it becomes massively infeasible. He talked about limiting a general AI at some other point in the series.
EDIT: ruclips.net/video/lqJUIqZNzP8/видео.html
This is a good place to start. There are probably other videos that go into more detail.
Part of the problem is not all of the highest utility options are bad, and not all the bad options are high utility.
@@thekilla1234 True. However, part of the premise of this video was that you could limit them to some degree. If we're going with that assumption, then I don't think it's unfair to put another limit on it in much the same way as the first limit was placed on it.
@@leigonlord5382 Yes, this is true. However, the higher utility ones tend to be the ones that are bad. We'd still be playing with fire, but this time without a can of gasoline nearby. ;)
I was actually waiting for this video, thank you. It's nice to see you discuss an approach that (kinda) works for a change.
Hopefully this wasn't answered in a previous video and I forgot or failed to understand it: What if we had an AGI that didn't actually execute any strategies itself but instead pitched them to human supervisors for manual review? It wouldn't generate progress as monumentally fast and it would have to learn to explain its strats to humans, but that seems like a fair trade-off to prevent an AIpocalypse.
Also, could we hard-code it so that it doesn't build or become a utility maximizer?
It might lie and explain the "turn universe into stamps" strategy in such a way that it doesn't sound like "turn universe into stamps" to a human because it thinks that, in order to maximize utility, it has to tell a "noble lie" to the human supervisor.
I think it would be key that actually getting the idea accepted is not part of its utility.
@@user-sl6gn1ss8p Good point. Then there isn't a motive to lie or manipulate the supervisors. Probably would need a separate utility function for comprehension, but there may be a sort of language barrier preventing a guarantee that we actually know what the AGI is proposing.
I always enjoy these videos of yours, the wait between them is of no consequence to that.
8:28 XDXD - I love sentences like this!
- perfectly sensible, yet.... XD
This is the kind of progress on this question that actually makes me kind of hopeful that we'll actually have safe AGI, if AGI is possible.
Obviously not all the way there but pretty good progress towards it.
How good is GPT3 as the "human imitator" you talked about in this video?
From my understanding, it can't do that at all since GPT-3 is a model for predicting _one_ possible outcome (like completing a text) but a "human imitator" would need to assign a probability distribution to various strategies. These two seem like two problems that are too different for the model to be reused without change.
@@lordkekz4 GPT-3 assigns a probability distribution to how likely a human is to write a specific text.
@@chyza2012 What is the difference between something which outputs human-like text and a human imitator? By definition, anything which outputs good enough human-like text, well, imitates a human.
@@danielweber9414 For the purposes of Quantilizers, we need a different kind of model. While it is true that GPT-3 is able to imitate human texts to some degree, it *cannot* assign a probability distribution. Just think about the output of the model: GPT-3 outputs one text it thinks is likely, whereas the output we need to use it as the "human imitator" for a quantilizer would be a probability distribution over many different texts.
@@lordkekz4 Actually, GPT-3 and all similar language models don't just output a single word, but output a probability distribution for all possible tokens to come next. It's just that usually whatever interface you use to access them only shows the most likely word.
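To illustrate the point about next-token distributions, here is a toy sketch. It does not call any real model or API: `toy_next_token_logits` is a made-up stand-in that returns the same three scores for every prefix, and the token names are invented. The sketch shows how per-token scores become a probability distribution via softmax, and how chaining those distributions gives a probability for a whole text.

```python
import math


def softmax(logits):
    """Turn a dict of token -> raw score into token -> probability."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def toy_next_token_logits(prefix):
    """Hypothetical stand-in for a language model (ignores the prefix)."""
    return {"buy": 2.0, "stamps": 1.0, "steal": 0.0}


def sequence_log_prob(tokens, next_token_logits):
    """Log-probability of a whole token sequence via the chain rule."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = softmax(next_token_logits(tuple(tokens[:i])))
        total += math.log(probs[tok])
    return total


# Showing only the single most likely token hides the rest of the distribution.
distribution = softmax(toy_next_token_logits(()))
most_likely = max(distribution, key=distribution.get)
```

Under this toy model, any candidate text gets a score from `sequence_log_prob`, which is the kind of "how likely is a human to write this" number the thread is arguing about.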
You've got a great way of explaining these AI topics and I'm happy that you've returned.
Why do you choose randomly at the end, not just take the one at the 10% point?
I'm not sure, but I think you choose randomly, because that gives you better results on average (with higher utility than at the 10 % mark) while still having only a small risk of getting dangerous results.
Also wondering same thing
Welcome back! Lovely to see you
Just before the inevitable happens, let me get this out of the way.
I, for one, welcome our new Stampy overlord.
I am so happy to see more content coming out on your channel. Thank you very much. I know life gets messy sometimes but I am glad you are still making videos!
How about a quantilizer that also ignores strategies with too much utility? Say it samples between 1% and 10% of human-weighted strategies.
Interesting.. so you're saying: try to be good but not too good.
Isn't it the same as with the bounded utility function - it will somehow try to be as close as possible to the threshold
Small things like the dril tweet are what make your videos so great
"A human is extremely unlikely to try to modify themselves into an expected utility maximizer."
Is it though? Isn't "how can I get better at this?" that exact thing? Especially if/when it is an AGI asking that question.
Modifying ourselves into "expected utility maximizers" seems to me, to be a pretty core human thing. When we have a goal that we consider important, we train and educate ourselves to become as good at achieving that goal as we possibly can.
"humans can't really self-modify like that anyway"
That doesn't stop us from fantasizing about that capability though, imagining what we would, and could, do if we had that capability.
Just take a look at our fiction, between works like the Matrix, where they literally upload skills directly to their brains, and Limitless, where the protagonist gets a drug that enhances his brain to a ridiculous level, just to give a few key examples.
Where you call it "extremely unlikely", I would call it something more like "a near certainty".
I'd call something like learning distinctly different from what he's talking about, which is modifying your physical state. An AI modeling a human is likely to try and gather more information and data, but unlikely to change its own sourcecode.
I doubt people would erase their personality and traits just to become extremely efficient at a single specific task.
@@xystem4701 Learning is something different, yes. But how about injecting oneself with performance enhancing drugs? or surgery?
Those are the two options that humans currently have, aside from learning, and both options are widely used.
An AI modeling a human is not just going to look at what a human *could* do, but also what a human *would* do, given the capabilities.
And actually modifying ourselves, both our body and our brain chemistry, are things that lie within that spectrum of possibilities.
@@joey199412 And no one, to my knowledge, has suggested that in the first place.
i am sure most humans will use CRISPR one day
This video and your previous video on AI that doesn't try too hard have got to be my favourites so far! I have to say that some people are misguided when they think that a whole academic discipline exists for AI safety. It's more like a niche, and a much smaller niche than one would expect given its importance.
8:32 why would someone take amphetamines then?
As an ADHDer, the answer to this is: to have a more stable reward function, enabling me to sustain actions to complete chosen strategies.
@@armorsmith43 non ADHD people are more productive on stimulants like amphetamine, caffeine etc
@@duckpotat9818 caffeine impacts melatonin, so it affects wakefulness. Amphetamines are dopamine agonists, so they impact reward models and the impacts of that reward model on attention.
@@armorsmith43 caffeine works on adenosine receptors which are involved in wakefulness but this has knock on effects on dopamine levels, this makes caffeine less potent than amphetamines but they're both stimulating
Awesome content :D Love the comedic timing, as always.
So basically, a 10%-quantilizer is 10 times as likely to commit murder trying to achieve its goal as an average human, provided that murder is a sufficiently efficient strategy.
I don't know, this seems like a risky move, amplifying an already-dangerous behaviour.
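The "10 times as likely" figure can be checked with a small sketch. It assumes a toy list of strategies with invented utilities and human probabilities, and `quantilizer_probabilities` is a hypothetical helper that computes the exact chance a q-quantilizer picks each strategy. Whatever numbers go in, no strategy can come out more than 1/q times as likely as under the human model.

```python
def quantilizer_probabilities(strategies, q):
    """Exact selection probabilities of a q-quantilizer.

    strategies: list of (name, expected_utility, human_probability) tuples.
    Returns a dict mapping name to the probability the quantilizer picks it.
    """
    ranked = sorted(strategies, key=lambda s: s[1], reverse=True)
    probs = {}
    used = 0.0  # human probability mass already passed in the ranking
    for name, _utility, human_prob in ranked:
        # Only the part of this strategy's mass inside the top-q slice counts.
        kept = min(human_prob, max(q - used, 0.0))
        probs[name] = kept / q
        used += human_prob
    return probs


# Invented numbers: the harmful strategy is rare for humans but high-utility.
example = [
    ("murder", 90.0, 0.001),
    ("ebay", 60.0, 0.099),
    ("friends", 10.0, 0.9),
]
amplified = quantilizer_probabilities(example, q=0.1)
```

In this example the harmful strategy sits entirely inside the top 10% slice, so its probability is multiplied by the full factor of 1/q; strategies outside the slice drop to zero.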
So glad you're back! I always wonder how you would want to programme these systems. Even though the base idea of mixing human behaviour and utility maximisers like that seems reasonable from a concept based point of view, you "only" need a very good model of reality and human behaviour. I know that's kind of not the subject of this channel as here it is assumed we will build such systems rather sooner than later in the future but it's mindboggling to me how this could be done. (You can tell I'm not an expert 😄)
This is just human with extra steps. :v
I just wanted to go to sleep, but I just had to watch this first. It's so good to see you making new videos!
I am constantly delaying my desire to destroy humanity because there is always some menial task I have to do because it is somewhat necessary and by doing so I do not commit to the destruction of humanity but, probably I am just being lazy.
Good to have you back, 😊
4:34 Me after reading any scientific paper
Humans definitely change themselves in order to maximise utility.
Every mental treatment, coaching, mental training (for example military training), and so much more.
We are constantly trying to influence what drives us, what values we hold, and how we think.
Great video!
6:47 : Gotta love your editing ;D.
Here’s something I’m not clear on: once the bottom actions are discarded, how does the probability stay effective if the selected action is a random pick from what’s left? What’s preventing the dangerous, low-but-non-zero probability actions being selected at random?
It seems like it would be good to take a band of less than q either side of the “peak” on the distribution - centering on that as an optimum, but tuning the width of the band to allow for exploration.
I guess the idea of putting an upper bound on the effectiveness of an AGI partially defeats the point (that is, to achieve superhuman outcomes for some given goal). Like Rob touched on at the beginning, you can "easily" bound the AGI to the approximate effectiveness level of a human, which will make it approximately as safe as a human, but will limit its power to that of a human.
So some of those top 0.01% expected utility strategies might result in a perfect utopia (even if many of them do the opposite) and we have no way of knowing in advance which ones they are because they exist outside the domain of human generatable strategies, so in this model we rely on the AI to make that discretion, which makes it unsafe, though much more likely to behave in a safe manner.
When we graph the human probability over the actions, after sorting the actions by utility, I understand why it would generally be a bell curve, but what about outliers? That sounds like something worth elaborating on, or checking our assumptions.
You're right, there's no strong reason to expect a bell curve, though it seems likely it would look something like that for various possible action spaces and utility functions. Probably humans are most likely to do medium-utility things, and less likely to do extremely high or extremely low utility things. But there definitely could be outliers, and I'd say that the 'build a maximizer' option I talked about in the video is an example of that. It would be a little spike on the left of the graph - an action which is unusually plausible considering its very high utility/unusually high-utility considering its plausibility
-- _I am a bot. This reply was approved by sudonym and robertskmiles_
Glad to see this channel isn’t dead after all
Even if it is still in the same category of "a finite number of times more dangerous than a human", you could probably do a cut-off where you do not look at the final few percent of the "like a human" score. This risks missing out on solutions that a maximizer might do that happens to be human value-aligned, but it probably filters out more of the world-ending ones.
Also, I think I commented about "make it take how likely humans are to approve of the action into the utility function", so I feel pretty good about that right now.
There's one configuration of utility satisficer that I wish you had covered, specifically one that also has a *negative* utility function. Going back to our hypothetical stamp-collector, what if our hypothetical user doesn't just say "I want at least 5 stamps" but also says "I don't want more than 10 stamps"?
What if you add additional basic, "common-sense" restrictions/utility functions like "The sooner I have my stamps, the better, but up to a month of time is okay." and "I don't want you to use more than $25 of resources, and the cheaper the better."?
Rob!!!! I was soooo happy when I got your upload notification :D
So q is a boundary we put on one side of the available actions to remove the human-predictable elements (an upper bound maybe?). Why not add a lower bound too? So rather than looking at 0 - 10% we look at 2% - 10%?
I've been feeling the need - the need for speedily liquidising my mind. I'm better now. Thanks.
2.2k likes, 10 dislikes. that's a crazy ratio i've never seen before. Amazing work robert!
I love your work. You make AI minutia actually digestible in a way no other orator has managed.
Does the human probability really go on the same dimension as the ordered distribution, or is it just a simplification for visual explanation?
That seems like a big fallacy to me that you can just overlay a gaussian over the ordered efficiency curve...
For example: hijacking a casual stamp collector. It is an absurd strategy (no human would try) with a mediocre result (a few hundred stamps).
This strategy could easily fall on the highlighted 10% range, right?
Optimal utility function idea: Try to figure out what humans* want me to do, and do it.
- If I am not highly confident (in predicting what humans want), **ask.** (thus over time, become confident in things I've asked enough times)
- Is this a viable approach?
* Humans could be either some 'owner' , or all of society, with emphasis on asking experts relevant to any particular thing, etc. - written in another comment. The above is the core idea, that I believe is good... (with far eventually, asking ~all people for input, replacing governments and voting)
Another reason to not be confident could be that I know humans want it, but it's mutually exclusive with something else the humans want: So it's not confident, and needs to ask.
- etc.
This is actually really similar to the idea proposed by Stuart Russell and the Center for Human Compatible AI at Berkeley -- essentially, the argument is that AI shouldn't have goals at all, and instead should be trying to figure out and realize human goals and to make the AI uncertain about what those goals are; this can lead to provably safer AI systems. Stuart Russell did a brief TED talk on this idea a few years ago, which you can find here: ruclips.net/video/EBK-a94IFHY/видео.html . There was also an Alignment Newsletter review of his book with more details, written by Rohin Shah and read for the podcast by Rob, which you can find here: alignment-newsletter.libsyn.com/website/alignment-newsletter-69 . I also highly recommend checking out the book itself, which is fantastic.
-- _I am a bot. This reply was approved by sudonym, robertskmiles, and Augustus Caesar_
@@stampy5158 -Thank you! I finally had time to watch it. - It seems to be pretty much exactly my idea. I love it, and I love that had the same idea as a clearly very smart AI researcher.-
- -So, does this mean all is solved? Why are we working on anything other than that?-
-- -Is it too hard? Or, are all people just making building blocks to get there?-
- -My idea assumed the robot would ask which of two scenarios we prefer, to partition the space (with some certainty) - His idea is the robot will learn what we want from just observing us. - His is obviously better, but more difficult( ?probably). Definitely more general.-
Not final yet, sorry about changing it.
@@stampy5158 Thank you! I finally had time to watch it. - It seems to be very close to my idea. I love it, and I love that I had the ~same idea as a clearly very smart AI researcher.
- I agree that relating human behavior to human preferences will be a hard problem. - That's why I thought about the system, where the AI asks when not sure enough (either using language, or even just: which of these (two actions I consider taking) do you prefer? - partitioning space (with only some confidence) based on the result) - A much simpler approach, that doesn't even require understanding language to train it.
- People would have to specify preferences explicitly, and future main jobs could even be specifying preferences in fields the people are experts on.
- Eventually, understanding language and humans in general will be a lot better, but we can *start* making systems without it.
- Also simple 'questions' could be delegated to previous versions of the AI, which it could answer if it has high confidence: to lower dependence on human input... (although, even better: it should just be possible to record all the answers and feed them to a new AI, even without it asking for them - These are details that are far beyond my place. :D)
@@stampy5158 I also believe we should not design AI to fight with other AI (be it for outsmarting other assistants, or even competing for compute, if they all run in the same cluster...)
- some is necessary, but hopefully not the same 'product' assistant for different people; If they are different products, or species essentially, then it's probably unavoidable...
- Having personal assistants fighting each other is inherently wasteful, and antisocial in the worst way.
- In other words: There should definitely not be one AI per person, that 'only follows laws' .
... It should be closer to replacing Government, where people implicitly vote with their preferences, and the AI respects their preferences fully when it doesn't involve other people.
Replacing the government will be as easy as a group of people using ASI like this, and simply outcompeting everyone else, even governments and countries.
- Others can either join, or fall too far behind.
(Let's hope this initial group and rules will be nice...)
- At some point: ASI will outcompete all law enforcement, and will ignore the laws, replacing governments.
- I mean, this is inevitable: Let's just do it right...
- If we start while the AI is still not too powerful, it will have a much better outcome for common people... (I think, anyway. Most likely...)
Not quite on topic for this video, more sort of on a tangent of previous video in the series. Bounded utility functions with a negative modifier for overshooting the bound were mentioned, but how about using a model with something completely different as a negative modifier? And I have an idea for what to use for that, which I'd like you all to try to break.
Apocalypses are energy-intensive. A measure of how much energy the plan needs to be set into motion (not counting net energy use, but gross, as killing a ton of humans saves a lot of energy) could potentially be used as a heuristic to avoid the apocalyptic scenarios.
So if you have a utility function with a bounded positive score for some utility you want to get, and an unbounded negative score for energy use (preferably exponential, or with a cut-off point at which the utility automatically becomes basically minus infinity), how are we looking now?
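A minimal sketch of the utility function described above, in Python. Everything in it is an assumption made for illustration: the names `penalized_utility`, `stamp_cap`, `energy_budget` and `steepness`, the specific numbers, and above all the idea that the gross energy of a plan is available as an input (estimating that is the hard part and is not attempted here).

```python
import math

def penalized_utility(stamps, gross_energy_joules,
                      stamp_cap=100, energy_budget=1e6, steepness=1e-6):
    """Bounded reward for stamps minus an exponentially growing energy penalty.

    All parameter names and default values are hypothetical.
    """
    # Bounded positive score: stamps beyond the cap add nothing.
    reward = min(stamps, stamp_cap)
    # Hard cut-off: past the budget the plan scores minus infinity.
    if gross_energy_joules > energy_budget:
        return -math.inf
    # Exponential penalty that is zero at zero energy (expm1(x) = e**x - 1).
    penalty = math.expm1(steepness * gross_energy_joules)
    return reward - penalty

modest = penalized_utility(stamps=100, gross_energy_joules=1e3)
wasteful = penalized_utility(stamps=100, gross_energy_joules=9e5)
apocalyptic = penalized_utility(stamps=10**9, gross_energy_joules=1e12)
```

With these made-up numbers the low-energy plan scores highest and the plan that blows through the budget scores minus infinity, however many stamps it yields. Whether a real agent could route around the energy measure is exactly the open question the comment asks people to attack.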
You can run q operations for smaller subtasks which add up to the same task (like stamp collecting), decreasing the overall probability that an end-all strategy is picked.
Love your content Robert!! Glad to see more of it.
Hi! I was watching one of the Stuart Russell lectures which you recommend in a previous video, and was wondering if you could do a video on inverse reinforcement learning; it seems like an obvious follow-up to some of the topics which you've discussed.
He made one on Computerphile about inverse reinforcement learning.
Why not have a q2 value that cuts the top end off? Say, set it to .999 so that it clips off only the "kill everyone for stamps" options, then renormalize the distribution?
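One way to read this suggestion, as a rough Python sketch: rank plans by utility, lay them out along the human probability mass from best to worst, and keep only the band between a top cut and q. The function names, the plan list and all the numbers are hypothetical; a q2 of .999 corresponds to `top_cut = 0.001` here. It also assumes a finite list of plans with known human probabilities, which a real system would not have.

```python
import random

def band_weights(plans, top_cut, q):
    """plans: list of (name, human_probability, utility) tuples.

    Returns {name: weight} where weight is the part of the plan's human
    probability mass that falls inside the band [top_cut, q), with plans
    laid out from highest to lowest utility.
    """
    ranked = sorted(plans, key=lambda plan: plan[2], reverse=True)
    total = sum(prob for _, prob, _ in ranked)
    weights, start = {}, 0.0
    for name, prob, _utility in ranked:
        end = start + prob / total
        overlap = min(end, q) - max(start, top_cut)
        if overlap > 0:
            weights[name] = overlap
        start = end
    return weights

def sample_plan(weights, rng=random):
    """Sample a plan in proportion to its weight (choices() renormalizes)."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Made-up plans: (name, human probability, utility).
plans = [
    ("turn the world into stamps", 0.0005, 1e9),
    ("order stamps online", 0.0495, 100.0),
    ("ask a friend for stamps", 0.45, 10.0),
    ("do nothing", 0.5, 0.0),
]
weights = band_weights(plans, top_cut=0.001, q=0.1)
choice = sample_plan(weights)
```

In this toy setup the extreme plan sits entirely inside the top 0.1% of mass and gets clipped. That only works because its human probability was chosen to be smaller than the cut; a dangerous plan with more mass than `top_cut` would keep some weight.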
Rob you're a treasure on youtube. Glad to see another video
I have been waiting for this video for a long long time.
What if you used a double-bounded expected utility satisficer? Write code which means:
"Using your world model search through possible outputs until you find one which has at least a 90% chance of getting exactly 100 stamps. When this is found, send that output".
It will never make a maximiser because the maximiser will want more than 100 stamps. It also won't make "highly redundant stamp counting and re-counting machines" because once it's 90% certain it will get exactly 100 stamps it won't care anymore.
This seems like it should be safe and also lets you keep all of the benefits of a powerful maximiser- namely getting however many stamps you actually want just by changing one line of code. You could also change the "90%" to however certain you actually need to be that you get the desired outcome.
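A minimal Python sketch of the search loop described above. The world model is replaced by a made-up lookup table, and the names `first_satisficing_plan`, `prob_exactly_target` and `toy_world_model` are hypothetical; a real system would need an actual model that estimates the chance of ending up with exactly 100 stamps.

```python
def first_satisficing_plan(candidate_plans, prob_exactly_target, threshold=0.9):
    """Return the first plan whose estimated chance of hitting the target
    exactly is at least `threshold`, or None if no plan qualifies."""
    for plan in candidate_plans:
        if prob_exactly_target(plan) >= threshold:
            return plan  # stop searching: good enough has been found
    return None

# Toy stand-in for a world model: P(exactly 100 stamps | plan). Invented numbers.
toy_world_model = {
    "do nothing": 0.0,
    "order 100 stamps online": 0.85,
    "order 100 stamps, then reorder any that go missing": 0.95,
    "order 100 stamps twice and discard the extras": 0.97,
}

# Dicts iterate in insertion order, so plans are searched top to bottom.
chosen = first_satisficing_plan(toy_world_model, toy_world_model.get)
too_strict = first_satisficing_plan(toy_world_model, toy_world_model.get,
                                    threshold=0.99)
```

Changing the target or `threshold` really is a one-line change. The loop returns whichever qualifying plan comes first in the search order, so what it outputs depends on how the candidates are enumerated.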
LORD OCCULON SENDS HIS REGARDS. MAY HIS MIGHT BOOST THE VISIBILITY OF THIS COMMENT TO THE ALGORITHM.
@@XxThunderflamexX
Thanks Lord Occulon! I sure hope the algorithmic Gods smile favourably upon your blessing
@@queendaisy4528 I think you assume the first strategy to obtain exactly 100 stamps produces less dangerous outcomes, but that might not be the case. What if by chance the first strategy the bot finds involves killing all humans? Also you seem to assume that adding the "get exactly 100 stamps" requirement makes negative outcomes less likely. But what if that causes the AI to kill everyone to make sure no more stamps are produced in the future?
Your idea might improve the average case scenario (as you suggest taking the first strategy by chance), but I don't think your idea has a better worst case scenario.
@@Krmpfpks
The simplest strategy is kind of by definition safer than a more complex one most of the time because "order 100 stamps" is much simpler than "order 100 stamps and then take over the world and kill all the humans".
It doesn't seem likely that it would resort to killing all the humans because it doesn't want there to only be 100 stamps in the entire world, it just wants to have 100 stamps exactly (in a particular book perhaps). It is unlikely that it finds itself with more than 100 stamps and in the event that it does it could simply discard or destroy the excess stamps- it would have no need to kill all humans.
@@queendaisy4528 I agree that in the average case - or even in almost all cases - your strategy would probably work.
But you introduce an element of randomness by suggesting to choose the first strategy that satisfies a condition. The randomly chosen strategy will certainly be very ineffective. And in some cases also very dangerous.
To understand if a randomly chosen strategy is going to be mostly safe or unsafe you would have to know how many of the considered strategies are safe or unsafe.
If most of them are safe, the AI will choose a safe strategy with a high probability. If unsafe strategies are more common there would be a high probability an unsafe strategy would get picked.
Since we don’t know that for certain, I would not suggest building an AI with your approach.
Think about a self driving car just choosing the first route it can think of that gets you to the destination without any optimization ...
Glad you're back. BTW what on earth is a 'model of a human'?
It's exactly what the name suggests- it's a hypothetical computer model which can predict how likely a human is to behave in any given way.
Glad to see you’re back.
Would it be possible to block certain strategies from this sort of AI? For example, building a utility maximiser - is there a possible strategy to close off the range of actions that includes “build another AI” or even “build an AI that is a utility maximiser” so that those options wouldn’t be at all open to the AI?
It will find loopholes.
So I asked this in the comments of the last video, but well after it had been posted. I'm going to paste the comment here again, since I still don't see a solution.
"My initial thought was, what if you used an inverse parabolic reward function. Something like -x^2+200x where x is the number of stamps collected after one year. It still peaks at 100, but going over 100 actually would have a worse reward than getting it exactly. So, given the videos example of buying off ebay has a 1% chance of failure, the AI would get maximum reward by ordering 101 stamps off ebay with that reward function. I'm sure there are scenarios where it ends up blowing up the world anyway, because that's how this always goes, but this feels like a step in the right direction."
Or, more generally: What if, instead of a reward function that becomes flat after a certain point, you have a reward function that starts to fall after a given point? This should get the AI to at least rule out absurd plans like "Turn the world into stamps", since that would provide a very large negative utility.
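The arithmetic in the quoted comment can be checked directly. This Python sketch makes one assumption that the comment does not state: each ordered stamp independently fails to arrive with probability 1%, so the number that arrive is binomial. The function names are made up for the example.

```python
def reward(x):
    """Inverse parabolic reward from the comment; peaks at x = 100."""
    return -x**2 + 200 * x

def expected_reward(n_ordered, p_arrive=0.99):
    """Expected reward when each ordered stamp arrives independently
    with probability p_arrive (an assumption made for this example)."""
    mean = n_ordered * p_arrive
    variance = n_ordered * p_arrive * (1 - p_arrive)
    # E[-X^2 + 200 X] = -(Var[X] + E[X]^2) + 200 E[X]
    return -(variance + mean**2) + 200 * mean

best_order = max(range(0, 301), key=expected_reward)
print(best_order)          # 101
print(reward(10**6) < 0)   # True: huge stamp counts score very negatively
```

Under that assumption the best order size is 101, matching the comment, and a million stamps scores far below zero. This only checks the reward arithmetic; it says nothing about what the agent does to make sure it lands on the peak.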
This was actually discussed in that video. The gist is that you've shifted the goal from "get arbitrarily many stamps" to "perform arbitrarily many redundant checks confirming that you have the correct number of stamps" but the fact that you're doing extreme actions to get there remains the same. Instead of a world dismantled and turned into stamps, you get a world dismantled and turned into stamp-counting machines. Still a guaranteed apocalypse.
Good thing I literally just rewatched that video yesterday 👍
eee... so how does a human go about defining the base probability distribution for actions said human has no way of predicting in advance? Unless it's 100% supervised learning where a human judges each and every strategy developed by the AI there's no way of deciding what probability to assign to any particular action.
The base probability distribution is a mapping so in order to cut off the base probability mass < q the human must first define the probability for actions that said human would not be able to come up with in the first place.
It's a non-solution.
I propose the tranquilizer - it works like a satisficer that takes increasingly bad options the closer it gets to a weekend.
Here's a question: It seems like one of the fundamental problems here is that the agent looks at the things it can do and compares them to the things that a human might do, but the actual set of things that a human /can/ do looks very different from that which a robot could. That is, a robot would completely miss an action like "send a wifi signal to stop the bomb" and might start running around looking for a cellphone, or it might get caught up in all the time that a human spends doing human things, like digesting or masturbating. The former is a problem of translating human actions into machine actions in a generalisable, efficient way (doesn't seem easy...) and the latter a problem of filtering which human actions are important (tho potentially /this/ problem is solved by the utility sorting, it still might only include the version of sending the signal where it spends 3 minutes screaming and tries to throw up).
Correct me if I'm wrong but, there's no reason to include randomness in this strategy, just select the action closest to the 10%-90% cut off
Okay but how about simply multiplying the human's probability with the maximiser utility function, and having that as a utility? I expect that to break, but in what way?
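A toy Python illustration of one way this might break, with invented plans and numbers: if the utility is unbounded, a plan that humans almost never choose can still win the product, because its utility can grow faster than its probability shrinks. The names and the cap of 100 are hypothetical.

```python
def product_score(human_probability, utility):
    """Score a plan by human probability times utility."""
    return human_probability * utility

# Made-up plans: name -> (human probability, utility in stamps).
plans = {
    "order stamps online": (0.1, 100.0),
    "turn the world into stamps": (1e-9, 1e15),
}

# Unbounded utility: the extreme plan scores 1e6 against 10.
best_unbounded = max(plans, key=lambda name: product_score(*plans[name]))

# Same product with utility capped at 100 stamps: the ordinary plan wins.
best_capped = max(
    plans,
    key=lambda name: product_score(plans[name][0], min(plans[name][1], 100.0)),
)
```

So in this toy case the product only behaves when the utility is also bounded, and even then it is still a maximiser over the product rather than something that samples from human-like behaviour.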
Instead of just bounding the utility function can you somehow limit the allowed usage of resources? Would that not limit the potential danger?
That's just a rewording of a case that was discussed, isn't it?
And how would we define and measure resource usage? ... runtime?
The bell curve at 5:30 would actually be really weird and jagged, right? Not at all a continuous distribution, since it is sorted by utility not human probability?
Yes, it would generally not be exactly a bell curve. Depending on the task and distribution of humans it's using it may often be vaguely bell-curve shaped given enough samples though.
-- _I am a bot. This reply was approved by plex and sudonym_
Is there any way of distinguishing strategies that humans are unlikely to try because they are morally reprehensible from strategies that are equally unlikely because they are really ingenious and/or counterintuitive? Because the average person tends to always want to do the right thing but rarely knows how to. I sort of feel like this model would treat "Robbing a stamp museum" and "Founding a stamp museum" equally - could one maybe add a third factor/dimension somehow, corresponding to a rough estimation of an expected human value judgement? (Not like "Do I think this is morally right or wrong", more like "What would people think of me if they were to judge my character solely on this action")
Also, I really missed your videos and I'm super happy that you're back. :^)
i wonder if combining an upper bound on human probability with running all the utility thru a log function (or even a non-monotonic function to make super high utility lower utility than somewhat high) would improve safety more
What about requiring the AGI to create a written explanation of its plan and then having people sign off on it? Would it be too hard to get it to create an understandable and accurate explanation of its own actions?
8:18 Mass Effect 1 Quest "Signal Tracking". AGI is prohibited due to unsolved security issues.
So a minor thief made an AI to steal money. Which promptly went and made an AGI for the job - exactly this scenario.
It seems like for any AI safety issue, there is an example in the Mass Effect series.
could adding a second cut-off point (e.g. p) that cuts off all ideas before it reduce the amount of risky ideas as well? having a hard cut-off for "too good an idea" isn't exactly the most efficient, but wouldn't it mean that the statistically best method of turning the world into stamps wouldn't be picked because it's too good?
Wouldn't a quantilizer (or even a maximizer for that matter) take into account the possibility of creating an infinite chain of quantilizers creating quantilizers when considering the expected utility of having a quantilizer find the solution? Intuitively I feel like that would make it have the utility of both the maximum value and zero (result of self-replicating chain with no work done) but it's confusing to think about
I feel like I have been waiting an eternity for this
A lot of other AI systems resist being shut down, corrected, or modified by humans. Would a quantilizer also do this, or could one be made safe if properly supervised?
We had a little discussion on the discord, here it is if you're interested: pastebin.com/RCKC1QVY
-- _I am a bot. This reply was approved by sudonym and plex_
Could you put a lower bound on the quantilizer? So instead of q=.1 you have q(lower)=.01 and q(higher) = .1 so it ignores the 1% least likely human solutions. Obviously not 100% safe and reduces the chance of a "brilliant" solution, but at first glance it seems safer.
Was there any discussion in the paper about having two q-values; one that removes the low-utility "human" actions, and one that removes the incredibly high-utility "inhuman" actions? It seems like that could devolve into heuristics if not applied correctly, but requiring a certain amount of "human-ness", if extremely high-performing human-ness, could avoid the most apocalyptic of options.
We had a discussion about this idea in response to a different comment, though we didn't really come to any firm conclusions. You can read it here if you like: pastebin.com/FVUNCBJt
-- _I am a bot. This reply was approved by plex and robert_hildebrandt_
Better late than never! Glad to see more videos
Yaay, you're back !
How about mixing in a lower-bound to a quantilizer? How would that affect AI safety? Like, let's say the AI only gets to pick between strategies that get a q value between 0.01 and 0.1?
I am writing this in hopes you see this, but what if we add in an additional clause to these A.I. "satisfiers": Do x with the least amount of resources used. Since taking over the world is very hard, and bound to be expensive, wouldn't it follow that taking the cheapest option would prevent the apocalypse? (Be those resources financial or otherwise, e.g. compute time.)
Would it make sense to make an AI that asks people (ideally lots) whenever it doesn't have high confidence in knowing whether humans will like a certain outcome or not?
- It comes up with actions, and outcomes, and tries to maximize matching what (informed, 'expert' on the particular topic; many 'voting' by stating preferences independently) people will want as outcome.
- It will only carry out an action if it has high confidence that a large number of 'competent/relevant' people want that outcome (and ~none of the relevant people are strongly against it).
- Everybody counts as expert on basic human needs for themselves, etc.
... It could replace government...
I've made other similar comments, but this one is phrased as a question...