ChatGPT and Reinforcement Learning

  • Published: 19 Oct 2024
  • ChatGPT + Reinforcement Learning. We also talk about the method by which ChatGPT learns to be so factual and non-toxic: Proximal Policy Optimization.
    ABOUT ME
    ⭕ Subscribe: www.youtube.co...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajh...
    👔 LinkedIn: / ajay-halthor-477974bb
    ChatGPT Playlist of all other videos: • ChatGPT
    Transformer Neural Networks: • Transformer Neural Net...
    RESOURCES
    [1] Human feedback used in training ChatGPT: arxiv.org/pdf/...
    [2] Proximal Policy Optimization (PPO): arxiv.org/pdf/...
    [3] Reinforcement Learning explained: karpathy.github...
    [4] Example showing how the policy backward step is only called when an episode (a sequence of actions that forms a complete response in ChatGPT) is complete: gist.github.co...
    [5] Dictionary of Terms in Reinforcement Learning: towardsdatasci...
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.ne...
    📕 Calculus: imp.i384100.ne...
    📕 Statistics for Data Science: imp.i384100.ne...
    📕 Bayesian Statistics: imp.i384100.ne...
    📕 Linear Algebra: imp.i384100.ne...
    📕 Probability: imp.i384100.ne...
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.ne...
    📕 Python for Everybody: imp.i384100.ne...
    📕 MLOps Course: imp.i384100.ne...
    📕 Natural Language Processing (NLP): imp.i384100.ne...
    📕 Machine Learning in Production: imp.i384100.ne...
    📕 Data Science Specialization: imp.i384100.ne...
    📕 Tensorflow: imp.i384100.ne...

Comments • 43

  • @profdrkatharinazweig
    @profdrkatharinazweig A year ago +3

    Thanks - that is a great video series of yours. Splendid to get an insight into these intricate mechanisms.

    • @CodeEmporium
      @CodeEmporium  A year ago +2

      Thanks so much for the donation and so glad you liked these videos and their deep dives! I’ll be making more videos starting with just more GPT next week :)

  • @SanskarSoni-zd3td
    @SanskarSoni-zd3td A year ago

    By far the best and most easily explained video on ChatGPT!!

  • @marton8769
    @marton8769 A year ago

    Amazing video! Good structure, nice pace, and good effort to make it easier to understand.

  • @MadiJunkie
    @MadiJunkie A year ago +3

    Still a little complex for me to understand, especially with all of the loss functions, but still a lot better than other videos I found about this topic. Thanks a lot!

    • @CodeEmporium
      @CodeEmporium  A year ago

      Sorry about that. This video was intended to be a deep dive into the objective function since I already have a simple overview of ChatGPT in my video “ChatGPT - Explained!”

  • @wadergu
    @wadergu A year ago

    Without watching this video, I would not have taken the time to understand the algorithmic implementation of the PPO model. Thank you!

  • @madushaninimeshika3910
    @madushaninimeshika3910 3 months ago

    Great explanation 😍

  • @grownupgaming
    @grownupgaming A year ago +1

    11:54 How do you get the "new" parameters (and hence the ratio) when you haven't yet calculated how much to adjust the old parameters by (the full equation), since that adjustment depends on these new parameters?
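
A way to see why the "new" parameters exist before the update is finished: in PPO, the rollout is collected with the current weights, whose log-probabilities are then frozen as the "old" policy; those same weights (now the "new", trainable ones) are updated over several small steps, and the ratio is recomputed at every step. Below is a minimal, hedged sketch in PyTorch with a toy policy — it is not the video's or OpenAI's actual code, and all sizes and names are illustrative.

```python
# Minimal sketch of the PPO update loop (illustrative only, not OpenAI's code).
# Key point: log-probs under the *old* policy are computed once and frozen when
# the rollout is collected; the "new" parameters are simply the current trainable
# weights, which change over several optimizer steps, and the ratio
# r_t(theta) = pi_new(a_t|s_t) / pi_old(a_t|s_t) is recomputed at each step.
import torch

n_actions, n_steps, eps = 8, 16, 0.2
policy = torch.nn.Linear(4, n_actions)           # toy policy network
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- Phase 1: collect a rollout with the current ("old") policy ---
states = torch.randn(n_steps, 4)
with torch.no_grad():
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    old_log_probs = dist.log_prob(actions)       # frozen: these never change
advantages = torch.randn(n_steps)                # placeholder advantage estimates

# --- Phase 2: several gradient steps with the "new" (current) parameters ---
for _ in range(4):
    new_dist = torch.distributions.Categorical(logits=policy(states))
    new_log_probs = new_dist.log_prob(actions)   # recomputed after every update
    ratio = torch.exp(new_log_probs - old_log_probs)      # r_t(theta)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    objective = torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    (-objective).backward()                      # ascend the objective
    optimizer.step()
```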

  • @grownupgaming
    @grownupgaming A year ago

    5:51 Does the reward model itself need to take in one word at a time like GPT? Since it is "just like GPT", it feels like it should. But since we already know the entire output, it seems like we should actually design a reward model that can take it all in at once?
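
For what it's worth, reward models in RLHF setups are usually built exactly that way: a GPT-style network reads the entire prompt + response in one forward pass and outputs a single scalar score rather than a next-token distribution. The sketch below only illustrates that idea; the architecture, layer sizes, and names are made up and are not the model described in the video.

```python
# Hedged sketch of a sequence-level reward model: a transformer reads the entire
# (prompt + response) token sequence at once and emits one scalar score, instead
# of producing a next-token distribution word by word. Illustrative sizes only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=50000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score_head = nn.Linear(d_model, 1)       # scalar reward head

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))       # contextualize the whole sequence
        return self.score_head(h[:, -1]).squeeze(-1)  # one scalar per sequence

# Usage: score a full prompt + response in a single forward pass.
rm = RewardModel()
tokens = torch.randint(0, 50000, (1, 32))             # stand-in for tokenized text
print(rm(tokens))                                      # a single reward value
```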

  • @Thrashmetalman
    @Thrashmetalman A year ago

    So is this PPO model fine-tuned online, or is the fine-tuning done offline? That part isn't clear to me.

  • @orsimhon133
    @orsimhon133 A year ago

    Thanks for your explanation!
    I want to note that your explanation of r_t(theta) is not entirely correct. It's not the ratio of rewards under the new parameters to rewards under the old parameters. Instead, r_t(theta) represents the ratio between the probability of taking action a (a token) from state s (the output string so far) under the new parameters compared to the old parameters.

    • @aytackas4977
      @aytackas4977 9 months ago

      Yes, I was going to write this. For those trying to understand the PPO objective: r_t(theta) is the probability ratio of a specific action under the current (new) policy compared to the old policy. The reward function does enter, but through the advantage function: A_t = Q_t (rewards) - V_t (baseline estimate).
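
To make the distinction above concrete, here is a tiny hedged sketch of the clipped PPO term with made-up numbers: r_t(theta) compares probabilities of the same token under the new and old parameters, and the reward only enters through the advantage A_t.

```python
# Sketch of the quantities discussed above (illustrative, made-up numbers).
# r_t(theta) compares *probabilities of the same token* under new vs. old policy;
# the reward only appears inside the advantage A_t = Q_t - V_t.
import torch

prob_new = torch.tensor([0.30])   # probability of the emitted token under new params
prob_old = torch.tensor([0.25])   # probability of the same token under old params
advantage = torch.tensor([1.7])   # A_t = Q_t (return) - V_t (value baseline)
eps = 0.2

ratio = prob_new / prob_old                          # r_t(theta) = 1.2
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)       # kept inside [0.8, 1.2]
l_clip = torch.min(ratio * advantage, clipped * advantage)
print(ratio.item(), l_clip.item())                   # ~1.2, ~2.04
```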

  • @魏泽坤
    @魏泽坤 A year ago

    How do you define the advantage function A_t in the loss function?

  • @chuanliangjiang7390
    @chuanliangjiang7390 A year ago

    How do you define the advantage function A_t in the loss function? In reinforcement learning class, A_t = Q(a,s) - V(s): the action-value function minus the state-value function, which indicates how good action a is compared with the baseline. But I am not quite sure how to define A_t in this specific scenario.
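
The video doesn't pin this down, so the following is only a guess based on standard RLHF practice: the reward model scores the finished response, that scalar is treated as a reward on the final token, and A_t is formed as reward-to-go minus a learned value baseline V(s_t) from a value head (real PPO implementations often use GAE instead of this plain version). All numbers below are made up.

```python
# Hedged sketch: one common way to form A_t when the reward model scores only the
# *finished* response. The scalar reward lands on the last token; each token's
# advantage is then "reward-to-go minus a learned value baseline V(s_t)".
# Standard RL practice, not a detail confirmed in the video.
import torch

seq_len, gamma = 5, 1.0
rm_score = 2.3                                    # reward model's score for the response
rewards = torch.zeros(seq_len)
rewards[-1] = rm_score                            # reward arrives only at the end
values = torch.tensor([0.5, 0.7, 0.9, 1.2, 1.8])  # V(s_t) from a value head (made up)

# Reward-to-go: Q_t estimated as the discounted sum of future rewards.
returns = torch.zeros(seq_len)
running = 0.0
for t in reversed(range(seq_len)):
    running = rewards[t] + gamma * running
    returns[t] = running

advantages = returns - values                     # A_t = Q_t - V(s_t)
print(returns.tolist())                           # ≈ [2.3, 2.3, 2.3, 2.3, 2.3]
print(advantages.tolist())                        # ≈ [1.8, 1.6, 1.4, 1.1, 0.5]
```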

  • @creativeuser9086
    @creativeuser9086 A year ago

    It's quite weird that we would do gradient ascent instead of taking abs(1/value) and doing good old gradient descent.

  • @josephpareti9156
    @josephpareti9156 A year ago

    I don't see the Kullback-Leibler divergence in your loss function: does the clip(epsilon) do the same job?

    • @CodeEmporium
      @CodeEmporium  A year ago +1

      Yep exactly. This is apparently a newer technique that the PPO process employs (more info in the description where I reference the policy gradients paper)

    • @josephpareti9156
      @josephpareti9156 A year ago

      @@CodeEmporium Yes, but 'Training language models to follow instructions with human feedback' speaks about KL
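
Both points can be true at once: PPO's clipping keeps the updated policy close to the rollout policy, while the InstructGPT paper (reference [1] above) additionally subtracts a per-token KL penalty against the supervised fine-tuned (SFT) model inside the reward. Below is a hedged sketch of that per-token penalty with made-up numbers; it follows common open-source RLHF implementations rather than any code released by OpenAI.

```python
# Hedged sketch of the KL-penalized reward described in "Training language models
# to follow instructions with human feedback" (reference [1] above): on top of
# PPO's clipping, each token's reward is reduced by a per-token KL term that keeps
# the RL policy close to the SFT model. Numbers are made up.
import torch

beta = 0.02                                          # KL penalty coefficient
rm_score = 1.9                                       # reward model score for the response
logp_rl  = torch.tensor([-2.1, -0.8, -1.5, -0.3])    # log pi_RL(token | context)
logp_sft = torch.tensor([-2.4, -0.9, -1.1, -0.3])    # log pi_SFT(token | context)

per_token_kl = logp_rl - logp_sft                    # log-ratio, estimates KL per token
rewards = -beta * per_token_kl                       # KL penalty applied at every token
rewards[-1] += rm_score                              # RM score added at the final token
print(rewards)                                       # penalized per-token rewards for PPO
```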

  • @younginnovatorscenterofint8986

    Excellent work. Just to be clear, what kind of interview questions can you expect for NLP / ChatGPT API roles? I don't know what to prepare for, thanks.

    • @CodeEmporium
      @CodeEmporium  A year ago +1

      I haven't really been through interviews that quizzed me heavily on this (I am a machine learning engineer, not technically a researcher). That said, I would pay attention to core concepts for building language models in general (language model objective, word representations, how would attention/reinforcement learning help).

    • @younginnovatorscenterofint8986
      @younginnovatorscenterofint8986 A year ago

      @@CodeEmporium thank you

  • @josephpareti9156
    @josephpareti9156 A year ago

    I read in a press report that ChatGPT, prompted with the same question a couple of months apart, initially gave the wrong answer, but on the second attempt it answered correctly. Unless the model was retrained in the meantime, I do not understand how this is possible. The press report said it is due to the many users who query the system, yet none of Steps 1, 2, or 3 is capable of doing that?

    • @gorgolyt
      @gorgolyt A year ago

      The press report is wrong. They have improved the model a few times; it could have been that. But also, the output has some inherent randomness, as described in this video.
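
A small illustration of that inherent randomness, under the usual assumption that the model samples from a temperature-scaled softmax over next tokens rather than always picking the most likely one (the logits here are made up):

```python
# Hedged illustration of sampling randomness: the same prompt can yield different
# completions because tokens are drawn from a probability distribution, not argmax.
import torch

logits = torch.tensor([2.0, 1.5, 0.3])          # scores for three candidate tokens
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)

torch.manual_seed(0)
samples = [torch.multinomial(probs, 1).item() for _ in range(5)]
print(probs)     # ≈ [0.60, 0.32, 0.07]
print(samples)   # different runs (or seeds) can pick different tokens
```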

  • @remusmocanu3390
    @remusmocanu3390 A year ago

    Great video! Thank you for all the useful info! I learned quite a bit!
    What app do you use to write down your notes?

    • @CodeEmporium
      @CodeEmporium  A year ago

      Thank you! For just personal notes, I use Notion.

  • @minhajulhoque2113
    @minhajulhoque2113 A year ago

    Great video!

  • @rombohnallavan1861
    @rombohnallavan1861 A year ago

    Nice explanation

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w A year ago

    Great stuff!

  • @paull923
    @paull923 A year ago

    Great, thx!

  • @NeoShameMan
    @NeoShameMan A year ago

    There is a huge blind spot, though: they didn't share how they manage the working-memory model. I call it working memory because no paper actually talks about it in detail. These models generally have a limited token input based on the model's input size, which is too small to follow a long chat log and remember things, so how does ChatGPT do it?
    Well, the first time a working-memory architecture was used was AI Dungeon. What it did was simply put each round of user input and AI answer in a list, then drop the older rounds. This led to a rambly AI that forgets old details and can have huge mood swings in the middle of a conversation, as semantic drift happens and data is lost. The network itself is basically a fancy function: the state is the input, so if you change the input, the state changes.
    The way ChatGPT handles long chats is simply by having an attention network go through the chat log based on the latest user input, rank the data that is relevant, summarize it with an LLM, then concatenate the summary to the user input as a context state; this is a form of lossy compression. This is why ChatGPT has global knowledge but sometimes fails at local knowledge. The input is a working-memory data structure made of an initialization prompt + the summary of the chat log relevant to the user input + the user input itself. Working-memory innovation is key for this tech to progress.
    The other issue ChatGPT has is overfitting vs. overgeneralization. Overfitting is generally seen as bad in AI, but in this case we need selective overfitting, such that some data is memorized rather than generalized. For example, it often hallucinates functions that don't exist in certain programming languages; that's overgeneralizing. It understands that programming is composable, but some parts are exceptions that need to be memorized: Game Boy ASM doesn't have a div instruction, shader.create(String shader) doesn't exist in Unity3D, and similarly some books or characters don't exist but it makes them up by blending existing data. This is where overfitting on knowledge should be applied.

    • @ryanhewitt9902
      @ryanhewitt9902 A year ago

      Very insightful comment! I have been trying to figure out their working-memory approach for some time, based on behaviour that seemed impossible given the input limit, as you mentioned. I assumed it had something to do with periodic parsing, and a concatenated summary would explain why it seems to paraphrase details from earlier in the conversation, which leads to drift as the details are "reinterpreted". It's as if it is playing a game of telephone with itself.
      Do you have a source for this attention mechanism you described, or are you guessing at the implementation?

    • @NeoShameMan
      @NeoShameMan A year ago

      @@ryanhewitt9902 It's a guess, because I have been thinking about it since AI Dungeon... but if you ask ChatGPT for help by describing the working-memory problem, it gives you a similar answer 😅

    • @CodeEmporium
      @CodeEmporium  A year ago

      Very interesting. I hadn't thought about this from the "chatbot perspective", so there must be some retention of previous responses (which I didn't mention in the video). I think your guess about a "workable memory" sounds plausible. Thanks for commenting :)

    • @NeoShameMan
      @NeoShameMan A year ago

      @@CodeEmporium "Working" memory is computer-programming and ML lingo; "workable" sounds weird lol. Apparently in language models it's called the "context". I come from video games, and AI there uses the old lingo lol. Thanks for your video btw

    • @luvsuneja
      @luvsuneja A year ago

      @@CodeEmporium I had a meeting with AWS recently. They mentioned it uses some kind of text summarization model.
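
To be clear, the mechanism discussed in this thread is the commenter's guess, not something OpenAI has documented. Purely as an illustration of the pattern they describe (rank past turns by relevance to the new input, summarize them, and prepend the summary as context), here is a toy sketch in which every function is hypothetical:

```python
# Toy sketch of the "working memory" pattern guessed at in the thread above:
# rank past chat turns by relevance to the new user input, summarize the best
# ones, and prepend that summary so the prompt fits in a fixed context window.
# This is NOT a documented ChatGPT mechanism; every function here is hypothetical.

def relevance(turn: str, query: str) -> int:
    # Crude stand-in for an attention / embedding similarity score: word overlap.
    return len(set(turn.lower().split()) & set(query.lower().split()))

def summarize(turns: list[str], max_chars: int = 200) -> str:
    # Stand-in for an LLM summarizer: truncate concatenated turns (lossy compression).
    return " ".join(turns)[:max_chars]

def build_prompt(system_prompt: str, chat_log: list[str], user_input: str,
                 k: int = 3) -> str:
    top_turns = sorted(chat_log, key=lambda t: relevance(t, user_input), reverse=True)[:k]
    memory = summarize(top_turns)
    return f"{system_prompt}\n[memory] {memory}\n[user] {user_input}"

# Usage with a made-up chat log:
log = ["My dog is named Biscuit.", "I live in Toronto.", "I prefer Python over C++."]
print(build_prompt("You are a helpful assistant.", log, "What was my dog's name?"))
```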

  • @jonathanlatouche7013
    @jonathanlatouche7013 A year ago

    Breakfast habit pf