DeepMind x UCL RL Lecture Series - Introduction to Reinforcement Learning [1/13]

  • Published: 3 Feb 2025

Comments • 116

  • @hasuchObe
    @hasuchObe 3 years ago +537

    A full lesson on reinforcement learning from a Deep Mind researcher. For Free! What a time to be alive.

    • @mawkuri5496
      @mawkuri5496 3 years ago +42

      lol.. 2 minute papers

    • @321conquer
      @321conquer 3 years ago +1

      You might be dead and it is your AI clone that is typing this... Hope this helps ®

    • @alejandrorodriguezdomingue174
      @alejandrorodriguezdomingue174 3 years ago +4

      Yes, apart from sharing knowledge (don't get me wrong), they are also targeting a marketplace and teaching people so that they will use their future products.

    • @masternobody1896
      @masternobody1896 3 years ago +3

      yes

    • @jakelionlight3936
      @jakelionlight3936 3 years ago +1

      @@mawkuri5496 lol

  • @JamesNorthrup
    @JamesNorthrup 2 years ago +33

    TOC in-anger
    0:00 class details
    5:00 intro-ish
    6:50 Turing
    8:20 define goals of class
    9:00 what is RL?
    12:14 interaction loop
    17:20 reward
    25:49 Atari game example
    28:18 formalization
    29:40 reward
    30:10 the return
    34:00 policy, actions denoted with Q
    35:00 goto lectures 3,4,5,6
    43:00 Markov is maybe not the most important property
    44:00 partial observability
    46:10 the update function
    53:48 Policy -> mapping of agent state to action.
    54:20 stochastic policy
    56:00 discount factor magnify local reward proximity (or not)
    59:00 pi does not mean 3.14; it means a probability distribution. The Bellman Equation is so named here
    1:02:00 optional Model
    1:04:00 model projects next state+reward, or possibly any state and any reward, because reasons
    1:07:00 Agent Categories
    1:10:00 common terminology

  • @chevalier5691
    @chevalier5691 3 years ago +18

    Thanks for the amazing lecture! Honestly, I prefer this online format to an in-person lecture: not only are the audio and presentation clearer, but the concepts are also explained more thoroughly, without any interruptions from students.

    • @luisleal4169
      @luisleal4169 1 year ago +3

      And also you can go back to sections you missed or didn't fully understand.

  • @cennywenner516
    @cennywenner516 1 year ago +5

    For those who may view this lecture, note that I think it is a bit non-standard in reinforcement learning to denote the state S_t as the "agent's state". It usually refers to the environment's state. This is important for other literature. The closest thing for the agent's state is perhaps "the belief state" b_t. Both are relevant depending on what is being done, and some of the formalization might not work when the two are mixed. Notably, most of the environments that are dealt with are Markovian in the (possibly hidden) environment state but not in the observations or even what the agent may derive about the state, which also means most of the time it is insufficient to condition on only "S_t=s" the way it is defined here, rather than the full history H_t.
    Considering how a lot of the RL formalism is with regard to the state that is often not fully observable to the agent, maybe this approach is useful.
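
    One compact way to write the distinction being drawn here, reusing the history H_t and the state-update function u from the lecture (the "env"/"agent" superscripts are added only for disambiguation and are not the lecture's notation):

        H_t = O_0, A_0, O_1, A_1, \ldots, A_{t-1}, O_t
        S^{agent}_{t+1} = u(S^{agent}_t, A_t, O_{t+1})
        \Pr(S^{env}_{t+1} \mid S^{env}_t, A_t) = \Pr(S^{env}_{t+1} \mid H_t, A_t)

    That is, the (possibly hidden) environment state is Markov by construction, while the agent state is whatever the update function builds from observations; conditioning on S_t = s can only replace conditioning on H_t when that constructed state happens to be Markov too.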

  • @prantikdeb3937
    @prantikdeb3937 3 years ago +38

    Thank you for releasing those awesome tutorials 🙏

  • @abanoubyounan9331
    @abanoubyounan9331 1 year ago +1

    Thank you, DeepMind, for sharing these resources publicly.

  • @loelie01
    @loelie01 3 years ago +7

    Great course, thank you for sharing Hado! Particularly enjoyed the clear explanation of Markov Decision Processes and how they relate to Reinforcement Learning.

  • @Adri209001
    @Adri209001 3 years ago +9

    Thanks so much for this. We love you from Africa.

  • @DilekCelik
    @DilekCelik 1 year ago +2

    Some diamond lectures from top researchers are public. Amazing. Take advantage, guys. You will not get lectures of this quality from most universities.

  • @alisheheryar1770
    @alisheheryar1770 3 years ago +3

    The type of learning in which your AI agent learns/tunes itself by interacting with its environment is called reinforcement learning. It has more generalization power than a neural network alone and is better able to handle unforeseen situations that were not considered when designing such a system.

  • @TexasBUSHMAN
    @TexasBUSHMAN 3 years ago +3

    Great video! Thank you! 💪🏾

  • @Sol_Survivor-d4s
    @Sol_Survivor-d4s 1 month ago

    Unbelievable, thank you for this!!

  • @0Tsutsumi0
    @0Tsutsumi0 1 year ago

    "Any goal can be formalized as the outcome of maximizing a cumulative reward." A broader question would be "Can all possible goals be transformed into a math formula?" It starts getting trickier whenever you deal with subjective human concepts such as love.

  • @billykotsos4642
    @billykotsos4642 3 years ago +8

    The man, the myth, the legend.
    OMG! I'm in!

  • @charliestinson8088
    @charliestinson8088 1 year ago +1

    At 59:06, does the Bellman equation only apply to MDPs? If the value depends on earlier states, I don't see how we can write it in terms of only v_pi(S_{t+1}).
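
    For reference, the Bellman equation being discussed can be written as follows (one standard form, assuming the Markov property so that conditioning on S_t = s suffices):

        v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
                 = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]

    If the dynamics depended on earlier states, the expectation would have to condition on the full history H_t instead, and the one-step recursion in terms of only v_\pi(S_{t+1}) would not hold in general, which is exactly why the equation is stated for Markov states.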

  • @matthewfeeley6226
    @matthewfeeley6226 3 years ago

    Thank you very much for this lesson and for taking the time to deliver the content.

  • @Fordance100
    @Fordance100 3 years ago +2

    Amazing introduction to RL.

  • @tamimyousefi
    @tamimyousefi 3 years ago

    15:45
    Goal: Prosper in all societies.
    Env.: A world comprised of two societies, killers and pacifists.
    These two groups despise the actions of the other. You will find reward from one and penalty from the other for any given action.

    • @gkirgizov_ai
      @gkirgizov_ai 3 years ago

      just kill all the pacifists and the goal becomes trivial

    • @MikeLambert1
      @MikeLambert1 2 years ago

      I think you're still maximizing a reward in your scenario; it's just that the reward is not static, and is instead a function of your state (i.e., which society you are physically in).

  • @chadmcintire4128
    @chadmcintire4128 3 years ago +4

    Why the downvote on free education? Thanks. I am comparing this to CS 285 from Berkeley; so far it has been good, with a different focus.

  • @kejianshi299
    @kejianshi299 3 years ago +3

    Thanks so much for this lesson!

  • @umarsaboor6881
    @umarsaboor6881 1 year ago +1

    amazing

  • @sh4ny1
    @sh4ny1 5 months ago

    45:20
    If we have a differentiable physics simulator as an environment, from which we can feed the gradient information to the agent, would that be considered a fully observable environment?

  • @goutamgarai6632
    @goutamgarai6632 3 years ago +3

    thanks DeepMind

  • @kiet-onlook
    @kiet-onlook 3 years ago +12

    Does anyone know how this course compares to the 2015 or 2018 courses offered by Deepmind and UCL? I’m looking to start with one but not sure which one to take.

    • @meer-cz1rs
      @meer-cz1rs 13 days ago

      Hi, did you find an answer to that? I'm just starting as well and wondering the same.

  • @robertocordovacastillo3035
    @robertocordovacastillo3035 3 years ago +1

    That is awesome! Thank you from Ecuador.

  • @randalllionelkharkrang4047
    @randalllionelkharkrang4047 2 years ago +9

    Please, can you link the assignments for this course, for non-UCL students?

  • @bhoomeendra
    @bhoomeendra 1 year ago

    37:28 What is meant by "prediction"? Is it different from the actions?

  • @yulinchao7837
    @yulinchao7837 2 years ago

    15:45 Let's say my goal is to live forever and I can take 1 pill per day, which guarantees my survival the next day. If I don't take it, I die. How do I formalize the goal by cumulative rewards? My goal would be getting infinite rewards. However, the outcomes of me taking the pill or not on some day in the future are both infinite. In other words, I can't distinguish whether I live forever from maximizing the cumulative reward. Does this count as a success in breaking the hypothesis?
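
    One way the lecture's formalism keeps such comparisons well defined is the discount factor introduced later on. As a rough sketch (assuming a reward of +1 for each day survived, which is not specified in the video), with a discount 0 < \gamma < 1 the two behaviours give different returns:

        G_t^{\text{always take pill}} = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma},
        \qquad
        G_t^{\text{die after } T \text{ days}} = \sum_{k=0}^{T-1} \gamma^k = \frac{1 - \gamma^T}{1 - \gamma} < \frac{1}{1 - \gamma}

    So maximizing the discounted cumulative reward does distinguish living forever from dying eventually; it is only the undiscounted infinite sums that become incomparable.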

  • @sumanthnandamuri2168
    @sumanthnandamuri2168 3 years ago +7

    @DeepMind Can you share assignments?

  • @patrickliu7179
    @patrickliu7179 2 years ago

    16:22
    For a task that fails due to trying to maximize a cumulative reward, would casino games that have turns of independent probability, such as roulette, break the model? This depends on the reward accumulation period extending beyond one turn, resulting in a misapplication of the model. While it is more of a human error than a machine error, it's a common human misconception about the game and thus liable to be programmed that way.
    Another example may be games with black swan events, where the reward accumulation period is too short to have witnessed a black swan event.

  • @malcolm7436
    @malcolm7436 1 year ago

    If your goal is to win the lottery, you incur a weekly debt for each attempt, and the chance is the same with no guarantee of achieving the goal. If the reward is your profit over time, then the cumulative reward could even be negative and decreasing with each attempt.

  • @mohammadhaadiakhter2869
    @mohammadhaadiakhter2869 10 months ago

    At 1:05:49, how did we approximate the policy?

  • @bobaktadjalli
    @bobaktadjalli 2 years ago

    Hi, at 59:50 I couldn't understand the meaning of argument "a" under "max". It would be appreciated if anyone could explain this to me.

    • @SmartMihir
      @SmartMihir 2 years ago

      I think the regular value function gives the value of a state when we pick actions by following pi.
      The optimal value function, however, picks the action such that the value is maximal (for all further time steps).
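
      In symbols, that is roughly the difference between

          v_\pi(s) = \mathbb{E}\left[ G_t \mid S_t = s, \text{ actions sampled from } \pi \right]
          \quad\text{and}\quad
          v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a \right]

      so the "a" under the max simply ranges over the available actions: the optimal value assumes that at every step we pick whichever action yields the highest expected return, instead of drawing actions from a fixed policy pi.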

  • @rudigerhoffman3541
    @rudigerhoffman3541 2 years ago

    After around 30:00 it was said that we can't hope to always optimize the return itself, and therefore we need to optimize the expected return. Why? Is this because we don't know the return yet and can only calculate an expected return based on some inference made from known returns? Or is it only because of the need for discounted returns in possibly infinite Markov decision processes? If so, why wouldn't it work in finite MDPs?

    • @MikeLambert1
      @MikeLambert1 2 years ago

      My attempt at an answer, on the same journey as you: if you have built/trained/learned a model, it is merely an approximation of the actual environment behavior (based on how we've seen the world evolve thus far). If there are any unknowns (i.e., you don't know what other players will do, you don't know what you will see when you look behind the door, etc.), then you need to optimize E[R], based on our model's best understanding of what our action will do. Optimizing E[R] will still push us to open the doors because we believe there _might_ be gold behind them. But if we open a particular door without any gold, it doesn't help R (in fact, I believe it lowers R, because any gold we find is now "one additional step" out in the future), even though it maximized E[R].

  • @adwaitnaik4003
    @adwaitnaik4003 3 years ago +1

    Thanks for this course.

  • @hadjseddikyousfi00
    @hadjseddikyousfi00 2 months ago

    Thank you!

  • @ianhailey
    @ianhailey 3 years ago +6

    Are the code and simulation environments for these examples available somewhere?

  • @gokublack4832
    @gokublack4832 3 years ago +2

    At 49:40 how about just storing the number of steps the agent has taken? Would that make it Markov?

    • @thedingodile5699
      @thedingodile5699 3 years ago

      No, you would still be able to stand in the two places he highlighted with the same number of steps taken, so you can't tell the difference. Whereas if you knew the entire history of the states you visited, you would be able to tell the difference.

    • @gokublack4832
      @gokublack4832 3 years ago

      @@thedingodile5699 Maze games like this usually have an initial state (i.e., position in the grid) where the game starts, so I'm not sure why, if you stored the number of steps taken, you wouldn't be able to tell the difference. You'd just look at the steps taken and notice that although the two observations are the same, they are very far away from each other and are likely different. I'd agree if the game could start anywhere on the grid, but that's usually not the case.

    • @thedingodile5699
      @thedingodile5699 3 years ago

      @@gokublack4832 Even if you start in the same place, you can most likely reach the two squares at the same time step (unless there is something like you can only reach this state in an even number of steps, or something like that).

    • @gokublack4832
      @gokublack4832 3 years ago

      ​@@thedingodile5699 True, yeah I guess it's theoretically possible to construct a maze that starts at the same place, but then comes to a fork in the road later where the mazes are identical on both sides except only one contains the reward. In that case, counting steps wouldn't help you distinguish between two observations on either side... 🤔tricky problem

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +4

      @@gokublack4832 It's a great question. In some cases adding something as simple as counting steps could make the state Markovian, in other cases it wouldn't. But even if this does help disentangle things (and make the resulting inputs Markov), adding such information to the state would also result in there being more states, which could make it harder to learn accurate predictions or good behaviour. In general, this is a tricky problem: we want the state to be informative, but also for it to be easy to generalise from past situations to new ones. If each situation is represented completely separately, the latter can be hard.
      In later lectures we go more into these kinds of questions, including how to use deep learning and neural networks to learn good representations, which hopefully can result in a good mixture between expressiveness on the one hand, and ease of generalisation on the other.
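
      A toy sketch of the trade-off described above (not from the lecture; the function names are made up for illustration):

          # Two ways of building an agent state from a partial maze observation.
          def agent_state_obs_only(obs):
              # Aliased: two far-apart cells whose local surroundings look identical
              # map to the same state, so this state need not be Markov.
              return obs

          def agent_state_with_step_count(obs, t):
              # A step counter sometimes disambiguates aliased cells, but every
              # (obs, t) pair is now a distinct state, so there are many more states
              # and experience generalises less readily between them.
              return (obs, t)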

  • @lqpchen
    @lqpchen 3 years ago +7

    Thank you! Are there any assignment PDF files?

  • @swazza9999
    @swazza9999 3 years ago +4

    Thanks Hado, this has been very well explained. I've been through similar lectures/ intro papers before but here I learned more of the finer points / subtleties of the RL formalism - things that a teacher might take for granted and not mention explicitly.
    Question: 1:03:23 anyone know why the second expression is an expectation value and the first is a probability distribution? Typo or a clue to something much more meaningful?

    • @TheArrowShooter
      @TheArrowShooter 3 years ago

      Given that for a pair (s, a) there is one "true" reward signal in the model to be learnt, the expected value should suffice. I.e., if you modelled this with a distribution, it would in the limit be a Dirac delta function at value r. In the alternative case where there are two (or more) possible reward values for a state-action pair, a probability distribution that you sample from could make more sense.
      You can ask yourself whether it even makes sense to have multiple possible rewards for an (s, a) pair. I think it could be useful to model your reward function as a distribution when your observed state is only a subset of the environment, for example. E.g., assume you can't sense whether it is raining or not, and this determines whether the reward of your (s, a) pairs is 5 or 10. Modelling the reward as an expected value (which would be 7.5, given that it rains 50 percent of the time) would ignore some subtleties of your model here, I suppose.
      I'm no RL specialist, so don't take my word for it!

    • @swazza9999
      @swazza9999 3 years ago

      @@TheArrowShooter Hmm, is it really right that there is one "true" reward signal for a given pair (s, a)? If a robot makes a step in a direction, it may or may not slip on a rock, so despite the action and state being determined a priori, the consequences can vary.
      I was thinking about this more and I realised the first expression is asking about a state, which is a member of a set of states, so it makes sense to ask for the probability that the next state is s'. But in the second expression we are dealing with a scalar variable, so it makes more sense to ask for an expectation value. But don't take my word for it :)

    • @TheArrowShooter
      @TheArrowShooter 3 years ago +1

      @@swazza9999 I agree that there are multiple possible reward signals for a given state action pair. I tend to work with deterministic environments (no slipping, ending up in different states, ..), hence our misunderstanding :)!
      My main point was that you could model it as a probability distribution as well. The resulting learnt model would be more faithful to the underlying "true" model as it could return rewards by sampling (i.e. 5 or 10 in my example).

    • @willrazen
      @willrazen 3 years ago +1

      It's a design choice; you could choose whatever formulation is suitable for your problem. For example, if you have a small and finite set of possible states, you can build/learn a table with all state transition probabilities, i.e. the transition matrix. As mentioned in the same slide, you could also use a generative model, instead of working with probabilities directly.
      In Sutton & Barto 2018 they say:
      "In the first edition we used special notations, P_{ss'}^a and R_{ss'}^a, for the transition probabilities and expected rewards. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations, which is sufficient for dynamic programming but not for reinforcement learning. Another weakness is the excess of subscripts and superscripts. In this edition we use the explicit notation of p(s',r | s,a) for the joint probability for the next state and reward given the current state and action."
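
      For completeness, the two quantities being compared in the original question can be recovered from that joint distribution by marginalising, which also shows why one is naturally a probability and the other an expectation:

          p(s' \mid s, a) = \sum_r p(s', r \mid s, a)
          \qquad
          r(s, a) = \mathbb{E}[ R_{t+1} \mid S_t = s, A_t = a ] = \sum_r r \sum_{s'} p(s', r \mid s, a)

      The next state ranges over a set of states, so it gets a full distribution; the reward is a scalar, so it can be summarised by its expected value (or, as above, by the full joint p(s', r | s, a) if you want to keep the reward distribution too).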

    • @Cinephile..
      @Cinephile.. 3 years ago

      Hi, I want to learn data science, machine learning, and AI.
      I am unable to find the right approach and study material. There are numerous courses as well, but I am still struggling to find the right one.

  • @JuanMoreno-tj9xh
    @JuanMoreno-tj9xh 3 years ago +5

    "Any goal can be formalized as the outcome of maximizing a cumulative reward." What about the goal being to know if a program will halt?

    • @judewells1
      @judewells1 3 years ago +1

      My goal is to find a counter example that disproves the reward hypothesis.

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +13

      Great question, Juan! I would say you can still represent this goal with a reward. E.g., give +1 reward when you know the program will halt.
      So in this case the problem perhaps isn't so much to formulate the goal. Rather, the problem is that we cannot find a policy that optimises it. This is, obviously, a very important question, but it's a different one.
      One could argue that the halting problem gives an example that some problems can have well-formalised goals, but still do not allow us to find feasible solutions (in at least some cases, or in finite time). This itself doesn't invalidate the reward hypothesis. In fact, this example remains pertinent if you try to formalise this goal in any other way, right?
      Of course, there is an interesting question of which kinds of goals we can or cannot hope to achieve in practice, with concrete algorithms. We go into that a bit in subsequent lectures, for instance talking about when optimal policies can be guaranteed to be found, discussing concrete algorithms that can find these, and discussing the required conditions for these algorithms to succeed.

    • @nocomments_s
      @nocomments_s 3 years ago +1

      @@hadovanhasselt7357 thank you very much for such an elaborate answer!

    • @JuanMoreno-tj9xh
      @JuanMoreno-tj9xh 3 years ago +1

      @@hadovanhasselt7357 True. I didn't think about it that way. I just thought that if you couldn't find a framework, a reward to give to your agent, such that you could solve your problem by finding the right policy, then you could say that the reward hypothesis was false, since there is no way to get around it.
      But you are right, it's a different question. But then it's still a hypothesis. Thanks for your time. :)

    • @JuanMoreno-tj9xh
      @JuanMoreno-tj9xh 3 years ago +2

      @@judewells1 Nice one!

  • @AyushSingh-vj6he
    @AyushSingh-vj6he 3 years ago

    Thanks, I am marking 49:21

  • @neerajdokania300
    @neerajdokania300 2 months ago

    Hi, can anyone please explain why the optimal value v_star does not depend on the policy and only depends on the states and actions?

  • @AtrejuTauschinsky
    @AtrejuTauschinsky 3 years ago

    I'm a bit confused by models... In particular, value functions map states to rewards, but so do (some) models -- what's the difference? You seem to have the same equation (S -> R) for both on the slide visible at 1:16:30

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +6

      Some models indeed use explicit reward models, that try to learn the expected *immediate reward* following a state or action. Typically, a separate transition model is also learnt, that predicts the next *state*.
      So a reward model maps a state to a number, but the semantics of that number is not the same as the semantics of what we call a *value*. Values, in reinforcement learning, are defined as the expected sum of future rewards, rather than just the immediate subsequent reward.
      So while a reward model and a value function have the same functional form (they both map a state to a number), the meaning of that number is different.
      Hope that helps!
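
      Written out (a sketch of the standard definitions, not a quote from the slides), the two maps have the same type but different meanings:

          \text{reward model:}\quad \hat{r}(s) \approx \mathbb{E}[ R_{t+1} \mid S_t = s ]
          \text{value function:}\quad v_\pi(s) = \mathbb{E}_\pi[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s ]

      Both map a state to a real number, but the value sums discounted rewards over the whole future, while the reward model only predicts the immediate next reward.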

  • @anhtientran3158
    @anhtientran3158 3 years ago

    Thank you for your informative lecture

  • @cuylerbrehaut9813
    @cuylerbrehaut9813 1 year ago +1

    Suppose the reward hypothesis is true. Then the goal “keep this goal’s reward function beneath its maximum” has a corresponding reward function (rendering the goal itself meaningful) whose maximization is equivalent to the achievement of the goal. If the reward function were maximized, the goal would be achieved, but then the function must not be maximized. This is a contradiction. Therefore the reward function cannot be maximized. Therefore the goal is always achieved, and therefore the reward function is always maximized. This is a contradiction. Therefore the reward hypothesis is false.

    • @cuylerbrehaut9813
      @cuylerbrehaut9813 1 year ago +1

      This assumes that the goal described exists. But any counter-example would require such an assumption. To decide if this argument disproves the reward hypothesis, we would need some formal way of figuring out which goals exist and which don’t.

  • @abdul6974
    @abdul6974 3 years ago +1

    Is there any practical course in Python on RL, to apply the theory of RL?

  • @boriskabak
    @boriskabak 2 years ago

    Where can we see a coding introduction on how to code reinforcement learning models?

  • @theminesweeper1
    @theminesweeper1 2 years ago

    Is the reward hypothesis generally regarded as true among computer scientists and other smart people?

  • @iasonliagkas124
    @iasonliagkas124 1 month ago

    Is it only 13 lectures? That seems like so little for a university course.

  • @anshitbansal215
    @anshitbansal215 3 years ago

    "If we are observing the full environment then we do not need to worry about keeping the history of previous actions." Why would this be the case? What, then, would the agent learn from?

  • @robinkhlng8728
    @robinkhlng8728 3 years ago

    Could you further explain what v(S_{t+1}) formally is?
    Because v(s) is defined with a lowercase s as input. From what you said, I would say it is SUM_{s'} [ p(s' | S_t = s) * v(s') ], i.e. the expected value over all possible states s' for S_{t+1}.

    • @a4anandr
      @a4anandr 3 years ago

      That seems right to me. Probably, it is conditioned on the policy \pi as well.
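
      Spelled out under that reading, the term inside the Bellman equation is the expectation of a random variable (the value of whichever state comes next), with the action dependence made explicit:

          \mathbb{E}_\pi\left[ v_\pi(S_{t+1}) \mid S_t = s \right] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')

      which matches the sum in the question once the policy and the transition probabilities are both included.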

  • @matiassandacz9145
    @matiassandacz9145 2 years ago

    Does anyone know where can I find assignments for this course? Thank you in advance!

  • @may8049
    @may8049 3 years ago

    When will we be able to download AlphaGo and play with it?

  • @comradestinger
    @comradestinger 3 years ago +1

    ow right in the inbox

  • @rakshithv5073
    @rakshithv5073 3 years ago

    Why do we need to maximize the expectation of the return?
    What will happen if I maximize the return alone, without the expectation?

    • @ckhalifa_
      @ckhalifa_ 3 years ago

      expectation of return (reward actually) includes the relevant discount factor for each future reward.

  • @mabbasiazad
    @mabbasiazad 3 years ago

    Can we have access to the assignments?

  • @Saurabhsingh-cl7px
    @Saurabhsingh-cl7px 3 years ago

    So do I have to watch DeepMind's previous years' videos on RL to understand this?

    • @los4776
      @los4776 3 years ago +1

      No, it would not be a requirement.

  • @chanpreetsingh007
    @chanpreetsingh007 2 years ago

    Could you please share assignments?

  • @philippededeken4881
    @philippededeken4881 2 years ago

    Lovely

  • @zuphr1n
    @zuphr1n 3 years ago +4

    I miss David Silver.

  • @extendedclips
    @extendedclips 3 years ago +1

    ✨👏🏽

  • @mattsmith6509
    @mattsmith6509 3 years ago +2

    Can it tell us why people bought toilet paper in the pandemic?

  • @madhurivuddaraju3123
    @madhurivuddaraju3123 3 years ago +1

    Pro tip: always switch off the vacuum cleaner when recording lectures.

    • @spectator5144
      @spectator5144 2 years ago

      he is most probably not using an Apple M1 computer

  • @AineOfficial
    @AineOfficial 3 years ago +1

    Day 1 of asking him when AlphaZero is coming back to chess again.

  • @WhishingRaven
    @WhishingRaven 3 years ago +1

    This comes up again.

  • @garrymaemahinay3046
    @garrymaemahinay3046 3 years ago

    I have a solution, but I need a team.

  • @jonathansum9084
    @jonathansum9084 3 years ago

    Many great people have said that RL has been replaced by DL.
    If so, I think we should focus more on newer topics like perceive.IO. I think they are much more important and practical than this history.
    I hope you do not mind what I said.

    • @felipemaldonado8028
      @felipemaldonado8028 3 years ago

      Do you mind providing evidence about those "many great people"?

  • @MsJoChannel
    @MsJoChannel 3 years ago

    slides like it was 1995 :)