Policies and Value Functions - Good Actions for a Reinforcement Learning Agent

  • Published: 29 Nov 2024

Comments • 94

  • @deeplizard
    @deeplizard  6 years ago +13

    Check out the corresponding blog and other resources for this video at:
    deeplizard.com/learn/video/eMxOGwbdqKY

  • @DanielWeikert
    @DanielWeikert 6 years ago +49

    I have said it before but I feel obliged to say it again with each new video. Your videos are awesome. I really like the explanations starting from scratch and then continuously building up. Really helpful and highly appreciated

    • @deeplizard
      @deeplizard  6 years ago +2

      We really appreciate that, Daniel! Always happy to see your comments :)

  • @sashamuller9743
    @sashamuller9743 4 years ago +7

    I love the little snippets of real-life reinforcement learning at the end of the videos; it keeps me inspired to continue!

  • @iqbalagistany
    @iqbalagistany 2 years ago +7

    "If you can't tell it simply, you don't understand it enough"
    It is the simplest explanation I have found on YouTube.
    Thanks a lot

  • @sahanakaweerarathna9398
    @sahanakaweerarathna9398 6 years ago +23

    Hey deeplizard, we need a series about RNNs too. Please choose it as your next series.

  • @arjunkashyap8896
    @arjunkashyap8896 4 years ago +4

    Your channel is a gem. I have watched so many tutorials on ML, but yours is the one I fully understand.
    I'm gonna comment on every video I watch on your channel.
    Thanks a lot deeplizard, you're shaping future ML engineers.

  • @jy2592
    @jy2592 3 years ago +1

    This channel deserves more viewers for sure.

  • @pepe6666
    @pepe6666 5 years ago +7

    I love how this video is so calm & soothing at the end, then BAM WACKY RUNNING CRAZY STICK MAN

  • @adityanjsg99
    @adityanjsg99 1 year ago +1

    I learnt about RL from this series, which I couldn't do from the MIT and Stanford OpenCourseWare.

  • @happyduck70
    @happyduck70 3 years ago +8

    I love the series, really clear explanations. Although it would be nice to have examples in between; it still feels very abstract to me, but if it were visualized during the explanation it would land better, in my opinion. But super thanks for making this series on such a complex subject!

  • @slowonskor
    @slowonskor 4 years ago +11

    Regarding: V(s) and Q(s,a)
    I want to point out what wasn't clear to me and what needs to be emphasized in my opinion to understand the whole stuff:
    Vπ(s) expresses the expected value of following policy π forever when the agent starts following it from state s.
    Qπ(s,a) expresses the expected value of first taking action a from state s and then following policy π forever.
    The main difference, then, is that the Q-value lets you play out a hypothetical: potentially taking a different action in the first time step than what the policy might prescribe, and then following the policy from the state the agent winds up in.
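    In the notation the video uses, that distinction can be written compactly as follows (a sketch consistent with the on-screen formulas, where G_t is the return from time t):

```latex
v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
\qquad
q_\pi(s,a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s,\ A_t = a \,\right]
```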

    • @yuhanyao2857
      @yuhanyao2857 4 years ago

      Hi! RL newbie here. I love your explanation. The "following policy pi forever" part is especially enlightening. But I am not sure what it means. Rather, I don't think I ever quite understand the difference between action and policy in an intuitive way. I would guess policy pi, or the probabilities, changes after every time step, but does following policy pi forever mean the probability never changes? Or does following policy pi mean always choosing the same action? I would love to hear more of your insight :)

    • @sender1496
      @sender1496 1 year ago +1

      @@yuhanyao2857 From what I understand, the policy gives a complete probability distribution from each state. You can think of it as first plugging in the state of the system, then receiving a probability distribution of actions. So what happens is this: after your model has taken a step, you get a new state, but if you plug this state into the policy function, then you get a probability distribution of the different possible actions. Just sample an action from this distribution and let your model take that action. This gives a new state, which you then plug into the policy again. In this sense, the policy completely describes how your model should act going forward. It doesn't matter that the state changes.
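      Here is a minimal runnable sketch of what "following policy pi" means in practice, along the lines of the reply above (the states, actions, probabilities, and transition rule are invented purely for illustration):

```python
import random

# A made-up stochastic policy: for each state, a probability distribution over actions.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    # Plug the state into the policy, get a distribution over actions, sample one action.
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# "Following pi forever" just means repeating this at every time step,
# whatever state the agent lands in.
state = "s0"
for t in range(5):
    action = sample_action(state)
    state = "s1" if action == "right" else "s0"  # toy transition dynamics
    print(t, action, state)
```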

    • @yuhanyao2857
      @yuhanyao2857 1 year ago +1

      @@sender1496 Wow can't believe it's been 2 years since I posted this question. Thank you for your explanation!

  • @hadiphonix3352
    @hadiphonix3352 4 years ago

    By a huge distance, you make the best tutorials in this field. Thanks a lot.

  • @joaopedrofelixamorim2534
    @joaopedrofelixamorim2534 3 years ago +1

    This channel saved my life! Many thanks!

  • @techpy5730
    @techpy5730 4 years ago +2

    This series is golden! Thank you so much for creating it!

  • @tingnews7273
    @tingnews7273 6 years ago +3

    I watched the video and read the post.
    What I learned:
    1. Policy and value functions are two different things.
    2. Value functions come in two forms. One tells the agent how good a state is; the other tells the agent how good an action is in a certain state.
    3. A policy is the probability of the agent choosing a certain action in a certain state.
    4. The two value functions just confused me at first. After I read the post it became clear. In short: one is for the state, one is for the action taken in that state.
    5. What is the Q-function: the action-state pair value function. I finally get it... thanks to the state value function.

    • @deeplizard
      @deeplizard  6 years ago

      I'm happy to hear that the blog post is acting as a supplemental learning tool to the video!

  • @rahulnawale8310
    @rahulnawale8310 1 year ago +1

    Very helpful videos. Very well explained in simple words. Thank you for creating such videos.

  • @hello3141
    @hello3141 6 years ago +3

    Really like your discussions: you get right to the heart of the matter. By the way, great video production and graphics! The cyclist is nifty!

  • @rashikakhandelwal702
    @rashikakhandelwal702 3 years ago +1

    Thank you for this series. I am so grateful to you. It's simply awesome! :)

  • @light-qn2jb
    @light-qn2jb 11 months ago +1

    I was done trying to learn this topic till I saw this series. The simplification, the structure... wow.

  • @jamesbotwina8744
    @jamesbotwina8744 6 years ago +2

    Very nice summary! I’m taking the Udacity DRL course and this video helped me understand the distinction between value function components

    • @deeplizard
      @deeplizard  6 years ago

      Glad to hear that, James! Thanks for letting me know!

  • @tinyentropy
    @tinyentropy 5 years ago +4

    Maybe I missed it, but I think you haven't described what it means "to follow a policy". In other words, how do you make use of the probability distribution over actions in any given state? You could sample from it to determine your next action. But doing so, how does it relate to the optimal value criterion, since you won't be able to reach the global optimum that way?

  • @Bjarkediedrage
    @Bjarkediedrage 4 years ago +6

    I wish someone would explain all of this in a non-formal way, not using any notation, with rudimentary math and visualizations. I had the same experience when people tried to explain backpropagation and the chain rule. I now understand both concepts and how to implement them, but damn it was hard. I had to reverse engineer a simple neural network and really look at the code and step through it in order to understand what it actually does and how it works. It's frustrating that it's faster to learn it that way versus listening to people trying to explain it in an abstract way. Why do we have to introduce so many terms and abstractions you have to remember? In the end, what the algorithm does is all multiplication, addition, and data manipulation. My programmer brain is just wired differently. I'm sure that once I understand all of this, I can throw away 90% of the terms and math introduced here and understand how it works intuitively without remembering what MDP stands for... Same thing with NNs: I'm now inventing my own neurons, playing with different ways of making them recurrent, giving them memory and crazy features, and I know very little about math and its notation.

  • @sumeetdeshpande4825
    @sumeetdeshpande4825 8 months ago +1

    Nice editing!! Lovely!!

  • @hugaexpl0it
    @hugaexpl0it 1 year ago +1

    Are you using those videos at the end to give us rewards for completing each video?
    If so, that's pretty meta and impressive.

  • @ramasamyseenivasagan4174
    @ramasamyseenivasagan4174 4 years ago +1

    Excellent work!!! Well defined concepts...

  • @fatemehoseini7614
    @fatemehoseini7614 7 months ago

    Thanks for the great explanation, but it would be good to discuss the policy's probability distribution in more detail; I just did not understand that concept. The rest was great 🙏

  • @rominashams7280
    @rominashams7280 2 months ago +1

    These videos are perfect

  • @kareemmohamed4064
    @kareemmohamed4064 3 years ago +1

    Really Great, Thank You

  • @PenAndSpecs007
    @PenAndSpecs007 1 year ago +1

    Amazing content!

  • @kaierliang
    @kaierliang 4 years ago +1

    you are a lifesaver

  • @evolvingit
    @evolvingit 5 years ago +2

    Super!!Going awesome!!!

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 6 years ago +2

    Awesome video! Really clear!

  • @prathampandey9898
    @prathampandey9898 2 years ago

    What is "E" in the formula and what does "|" symbol i.e. bar means in the formula?
    I would suggest to provide appropriate meaning of symbols below each formula.

  • @GauravSharma-ui4yd
    @GauravSharma-ui4yd 4 years ago +1

    Great job and an awesome explanation. Can you please give the link to the DeepMind video snippet at the end?

    • @deeplizard
      @deeplizard  4 years ago

      ruclips.net/video/t1A3NTttvBA/видео.html

  • @justinunion7586
    @justinunion7586 5 years ago +1

    "If an agent follows policy 'pi' at time t, then pi(a|s) is the probability that At = a if St = s. This means that, at time t, under policy 'pi', the probability of taking action a in state s is 'pi'(a|s)"

  • @xentox5016
    @xentox5016 3 years ago

    At 4:00, what exactly is G_t? Is it the value of the state? And what does E_pi[ ] return?

  • @priyambasu5529
    @priyambasu5529 4 years ago +2

    {
      "question": "What is the difference between a state value function and an action value function?",
      "choices": [
        "State value function tells us the correct state whereas action value function tells us the correct action to be taken in the state, both for a particular policy pi.",
        "State value function tells us the policy whereas action value function tells us the action and state.",
        "Both are the same.",
        "State value function tells us the state as well as the action whereas action value function tells us only the action for any state."
      ],
      "answer": "State value function tells us the correct state whereas action value function tells us the correct action to be taken in the state, both for a particular policy pi.",
      "creator": "whopriyam",
      "creationDate": "2020-04-09T11:32:01.613Z"
    }

    • @deeplizard
      @deeplizard  4 years ago

      Thanks, priyam! I changed the wording just a bit, but I've now just added your question to deeplizard.com/learn/video/eMxOGwbdqKY :)

  • @ProfessionalTycoons
    @ProfessionalTycoons 5 years ago +2

    thank you for this!

  • @mariaioannatzortzi
    @mariaioannatzortzi 4 years ago +1

    {
      "question": "With respect to what is the value function defined?",
      "choices": [
        "Value function is defined with all of the choices.",
        "Value function is defined with respect to the expected return.",
        "Value function is defined with respect to specific ways of acting.",
        "Value function is defined with respect to a policy."
      ],
      "answer": "Value function is defined with all of the choices.",
      "creator": "marianna tzortzi",
      "creationDate": "2020-11-18T14:57:58.033Z"
    }

    • @deeplizard
      @deeplizard  4 years ago +1

      Thanks, Marianna! Just added your question to deeplizard.com/learn/video/eMxOGwbdqKY :)

    • @JJJ-ee5dc
      @JJJ-ee5dc 1 year ago

      @@deeplizard I am in a situation where I need to learn deep Q-learning in just 3 to 4 days. I have a deep learning background but don't know anything about RL. And your videos gave me so much information and intuition.

  • @dmitriys4279
    @dmitriys4279 5 years ago +4

    What is E in the value function formulas?

    • @deeplizard
      @deeplizard  5 years ago +3

      Expected value

    • @SLR_96
      @SLR_96 4 years ago

      @@deeplizard So it's the expected value of the expected return right?

  • @lingchen8849
    @lingchen8849 3 years ago

    Thanks for your video. I watched it twice but still don't understand why there should be a big "E" for expected value at 4:41. I am confused because other references I found don't have that.

  • @sametozabaci4633
    @sametozabaci4633 5 years ago

    Thanks for the explanatory content.
    It's unnecessary, but you may want to change it: on the blog page, under the action-value function topic, the word 'following' is written twice in the second line of the first paragraph.

    • @deeplizard
      @deeplizard  5 years ago +2

      Thanks, samet! Appreciate you spotting this. I will get this fix out in the next website update!

  • @carchang4843
    @carchang4843 1 year ago

    What does the big E mean at 4:04, since the expected return is Gt?

  • @Hyuna11112
    @Hyuna11112 2 years ago

    Are there any reinforcement learning problems that are not solved with Markov decision processes?

  • @mateusbalotin7247
    @mateusbalotin7247 3 years ago

    Thank you!

  • @matthewchung74
    @matthewchung74 4 years ago +2

    What is the term E at 5:30 in the video? I can't find an explanation.

    • @deeplizard
      @deeplizard  4 years ago +3

      E used in this way means "expected value." In our specific case, the value we're referring to is the return, so we're looking at the expected return.
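      As a quick numeric illustration (numbers invented): if, following the policy from state s, the return comes out as 10 half the time and 2 the other half, then

```latex
\mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right] = 0.5 \cdot 10 + 0.5 \cdot 2 = 6
```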

  • @wtfhej
    @wtfhej 5 years ago +1

    Thanks!

  • @pepe6666
    @pepe6666 5 years ago +1

    I am somewhat confused about rewards though. Say we're doing something complicated like playing an Atari game; how do you program in rewards? Winning and losing a game is kinda all there is, right?

    • @paulgarcia2887
      @paulgarcia2887 5 years ago

      If you are doing something like Atari breakout then every time the agent hits a block it can get a +1 reward. It doesn't necessarily need to beat the level until it gets a reward.

    • @pepe6666
      @pepe6666 5 years ago +1

      @@paulgarcia2887 The answer I was looking for was that when the agent gets its +100 reward for winning, it stores the state before the winning state, what to do there, and the potential reward for that state, which creates a new state with new data, and it propagates backwards that way. Cheers though, this was a difficult thing for me to learn.
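      What this reply is getting at is roughly how value-based methods push a terminal reward backwards through the states that led to it. Below is a rough sketch of a one-step update of that flavor (a Q-learning-style rule, which is not something this particular video defines; every name and number here is illustrative):

```python
# Q maps (state, action) pairs to estimated values; unseen pairs default to 0.
Q = {}
alpha, gamma = 0.1, 0.9  # learning rate and discount factor (illustrative values)

def update(state, action, reward, next_state, next_actions):
    # Value of the best action available from the next state (0 if terminal/unseen).
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = Q.get((state, action), 0.0)
    # Nudge the estimate toward: immediate reward + discounted value of what follows.
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# A +100 reward on reaching the winning state raises the value of the state-action
# pair that led there; on later episodes that raised value in turn raises the pairs
# that led to *it*; that is the "propagates backwards" effect described above.
update("one_step_before_win", "finishing_move", reward=100, next_state="win", next_actions=[])
print(Q)
```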

    • @43_damodarbanaulikar71
      @43_damodarbanaulikar71 5 years ago +2

      You should check out Siraj Raval; he has an explanation as well as a practical video where he uses the StarCraft game.

    • @pepe6666
      @pepe6666 5 years ago

      @@43_damodarbanaulikar71 cool man i will. cheers.

  • @hanserj169
    @hanserj169 5 years ago +1

    I am a total beginner! Could someone please explain why we need both a state-value function and an action-value function? It seems to me that the latter is enough, since it can map the state, the action, and the reward for that particular pair. Could I just pick one of them?

    • @deeplizard
      @deeplizard  5 years ago

      The state-value function doesn't account for a given action, while the action-value function does. Going forward, we stick mostly with the action-value function.

    • @hanserj169
      @hanserj169 5 years ago +1

      @@deeplizard thanks! Great content!

  • @asdfasdfuhf
    @asdfasdfuhf 4 years ago +1

    3:55 I found it confusing/weird that *the expected return starting from s* is equal to *the discounted return starting from s*, since the two quantities obviously aren't always equal.
    Is this an error? Or perhaps just a way of saying that $v_\pi (s)$ is equal to either *expected return starting from s* or *discounted return starting from s*?

    • @deeplizard
      @deeplizard  4 years ago +6

      Hey Sebastian - Maybe it would've been clearer if I had said "expected discounted return" instead of just "expected return." From episode 3 onward, as long as we don't explicitly state otherwise, "return" means "discounted return." So, the value of state s under policy pi is equivalent to the expected discounted return from starting at state s and following pi. Hope that helps!
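      For reference, "expected discounted return" written out in the series' notation (consistent with the earlier episode on returns):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
\qquad
v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
```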

  • @tallwaters9708
    @tallwaters9708 5 years ago

    Thanks for the videos. What do you mean by "following policy pi thereafter"? If I am in a state and take an action with a given policy in that state, and then transition to a new state, am I still following the same policy in that state? Shouldn't the policy change in each state? This really confuses me; sorry if my question is not clear. Are you saying that when calculating the value functions, we always use the same policy in each state?
    Thanks again,

    • @tallwaters9708
      @tallwaters9708 5 years ago +1

      Wait, I think I get it: the policy never changes during a value calculation. The policy is essentially the Q-table, and sums up how an agent will act in an environment. This stays the same while the values are calculated...
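      One concrete way to read "the policy is essentially the Q-table" is the common construction of a greedy policy from a Q-table; here is a minimal sketch with invented states, actions, and values (a standard construction, not something defined in this particular video):

```python
# A made-up Q-table: the estimated value of each action in each state.
Q = {
    "s0": {"left": 1.2, "right": 3.4},
    "s1": {"left": 0.7, "right": 0.1},
}

def greedy_policy(state):
    # The policy read off the table: always pick the highest-valued action.
    return max(Q[state], key=Q[state].get)

print(greedy_policy("s0"))  # -> right
print(greedy_policy("s1"))  # -> left
```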

  • @ogedaykhan9909
    @ogedaykhan9909 5 years ago +1

    awesome

  • @ravishankar2180
    @ravishankar2180 6 years ago +1

    I thought a policy was a "collection of all the actions taken together in all the states in a complete lifetime to achieve some overall reward", and that policies were "too many lifetimes with too many different overall rewards". The optimal policy, I thought, was "the policy among all policies with the highest reward".
    I didn't know that a policy was only about "a single state with different action choices".

    • @deeplizard
      @deeplizard  6 years ago +3

      Hey ravi - Yes, a policy is a function that maps each state in the state space to the probabilities of taking each possible action.

    • @ravishankar2180
      @ravishankar2180 6 years ago +1

      thank you miss deep-lizard

  • @haneulkim4902
    @haneulkim4902 3 years ago

    Amazing tutorial, I appreciate it very much. Would it be possible to lower the lizard sound in the intro? It seems louder than your voice and therefore hurts my ears... :(

    • @deeplizard
      @deeplizard  3 years ago

      Thanks, Haneul! The intro has been modified in later episodes.

  • @louerleseigneur4532
    @louerleseigneur4532 4 years ago

    Thanks

  • @debayanganguly838
    @debayanganguly838 5 years ago

    The videos which I watched are awesome... but most of the videos are not loading and keep buffering... I have no idea why, because all other videos on YouTube are working well.

    • @deeplizard
      @deeplizard  5 years ago

      Thanks, Debayan. Aside from an issue with internet connection/speed, I'm not sure what else could cause the issue. The videos are all uploaded in standard 1080p HD quality. You could try lowering the quality to 720p to see if that helps load the videos. Also, you can use the corresponding blogs to the videos as well:
      deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv

    • @debayanganguly838
      @debayanganguly838 5 years ago

      I know this channel has very good quality content, but my issue is only with your channel's videos; all other channels and videos are working and loading well... Is there a copy of this video anywhere else other than YouTube?

    • @debayanganguly838
      @debayanganguly838 5 years ago

      This issue also occurred on some of your videos in the Deep Learning / Neural Networks series... not all, only some videos are having problems... It may be due to some YouTube policies; there's no issue with your 1080p-quality videos.

  • @abhishek_raghav
    @abhishek_raghav 1 year ago

    Great, only missing examples, especially about returns, policies, and values.

  • @raminbakhtiyari5429
    @raminbakhtiyari5429 3 years ago

    just traffic.

  • @aminezitoun3725
    @aminezitoun3725 6 years ago +1

    this is scary dude... xD

    • @deeplizard
      @deeplizard  6 years ago

      Haha in what way?

    • @aminezitoun3725
      @aminezitoun3725 6 years ago

      @@deeplizard I like the series and all (still waiting for a new vid xD), but knowing what AI can do, and the fact that I just saw a video about how 4 AIs killed 29 humans in Japan, is just scary tbh xD

  • @Sickkkkiddddd
    @Sickkkkiddddd 1 year ago

    You gotta use examples, man. You gotta use examples. A bunch of notation without practical examples to tie it to makes folks tune out. That's the only way to intuitively understand the ideas. This is the fourth video and you've barely touched on any 'game'. Aren't games the 'example' applications for RL? These concepts aren't difficult. They just need to be explained by teachers who are passionate about the student experience and want them to learn. All the dedication on the student's part is absolutely worthless if the teacher or material isn't coming across. I have downloaded and deleted a bunch of textbooks because they were all filled with abstract nonsense that left me frustrated with migraines. Please find a way to break this stuff down with game or real-world examples in a language a dummy can understand. I'd be very pissed if I paid for your course and got a bunch of abstract notation thrown at my face. PLEASE PROVIDE EXAMPLES. PLEASE!!!
    Save the crap notation for later and provide examples. You started with a raccoon agent in the first video. Where did he go? Why isn't he utilised for subsequent videos? Sigh