Improving LLM accuracy with Monte Carlo Tree Search

  • Published: 28 Jun 2024
  • ➡️ ADVANCED-inference Repo (incl. notebooks in this vid.): trelis.com/enterprise-server-...
    VIDEO RESOURCES:
    - Slides: docs.google.com/presentation/...
    - Paper: arxiv.org/abs/2406.07394 (Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B)
    - Code from Paper: github.com/trotsky1997/MathBl...
    OTHER TRELIS LINKS:
    ➡️ Trelis Newsletter: blog.Trelis.com
    ➡️ Trelis Resources and Support: Trelis.com/About
    TIMESTAMPS:
    0:00 Large Language Models Make Things Up!
    0:42 Boosting Llama 3 8B performance to GPT-4 (only on certain benchmarks!)
    3:13 How prompting affects accuracy
    4:58 How Monte Carlo tree search works
    7:49 Balancing exploitation with exploration
    10:18 Jupyter Notebook Code
    26:59 Testing Monte Carlo Tree Search on a simple example
    29:16 Boosting Performance on Maths problems
    31:48 Limitations on Monte Carlo Performance Boosts
    32:58 Resources
  • Science

Comments • 50

  • @KopikoArepo
    @KopikoArepo 8 days ago +1

    Beautiful. Just like us. The more we fail, the better. Explore vs. Exploit. I love humanity. ❤

  • @miladmirmoghtadaei5038
    @miladmirmoghtadaei5038 2 days ago

    Thanks man. Great intro to MCTS. What I'm curious about is why we do a random selection among the first generation rather than rating them and selecting the best answer from the root.

    • @TrelisResearch
      @TrelisResearch  1 day ago

      That may also work.
      You do want your initial seeds to be fairly random so that the tree searches a wide space. If you start from just one rated answer, all subsequent derivatives will be somewhat similar, which limits scope.

  • @andrew_moffat
    @andrew_moffat 6 days ago

    yeahhhh perfect explanation, thank you bro

  • @MasamuneX
    @MasamuneX 6 days ago +1

    A potential improvement is to have a dynamic child-node count based on the rating. The weight defining exploration vs. exploitation could be dynamically set too, maybe even by the LLM filling in more than just a score.
    Also, the backprop of the ratings is cool, but there could be some decay so that nodes way up the tree don't get super locked in if you're doing a tree that is 8 layers deep.

    • @TrelisResearch
      @TrelisResearch  5 days ago +1

      Good thoughts.
      Yeah probably using the rating/assessment is smart because right now that info isn't incorporated into generating responses (it's only used for UCT).
      UCT basically decays away from exploration as more experiments are run and focuses more on exploitation. This broadly makes sense, but yes, possibly this should be tuned.
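      For concreteness, here is a minimal sketch of the standard UCT score being discussed (the constant c = 1.41 and the variable names are illustrative, not taken from the paper's code). The exploration bonus shrinks as a node accumulates visits, which is the decay described above:

```python
import math

def uct(total_reward: float, visits: int, parent_visits: int, c: float = 1.41) -> float:
    """Standard UCT score: mean reward plus an exploration bonus.

    The bonus sqrt(ln(parent_visits) / visits) shrinks as a node is
    visited more often, so selection gradually shifts from exploring
    new branches to exploiting high-reward ones.
    """
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Same mean reward (0.5), but the exploration bonus decays with visits:
early = uct(total_reward=5, visits=10, parent_visits=20)
late = uct(total_reward=50, visits=100, parent_visits=200)
print(early > late)  # True: the less-visited node wins purely via exploration
```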

  • @nashvillebrandon
    @nashvillebrandon 8 days ago +3

    You're the exact person I was hoping would make a video on this after I read that paper. Could this technique be enhanced even further with retrieval?

  • @_paixi
    @_paixi 4 days ago

    Fascinating paper and excellent demonstration. Using this, Llama3-8B can answer some difficult math and coding problems that the top open-source models fail with a direct answer. The first thing I noticed was it games the rating response by pretending to run unit tests that pass. Adding to the critique prompt that it was a written test and the answerer had no access to a computer to run tests fixed that, and it has started solving some easy ARC-AGI tasks I couldn't get proprietary models to solve.

  • @tonyppe
    @tonyppe 5 days ago +1

    subscribed, very interesting. good work on explaining it :)

  • @free_thinker4958
    @free_thinker4958 4 days ago

    Thanks a lot man 😎👏❤️ We would like you to devote a future video to the CLIN paper on building self-improving language agents.

  • @waneyvin
    @waneyvin 8 days ago +2

    great job, it's literally manual reinforcement learning!🤣🤣🤣

  • @kunalsuri8316
    @kunalsuri8316 8 days ago +4

    Holy moly!! I was just reading today about how MCTS can be used to improve LLMs. Are you reading minds now?

    • @marilynlucas5128
      @marilynlucas5128 6 days ago

      😂 I’ll tell you. The YouTube algorithm is very spooky. It can almost read your mind. I call it God’s mind.

  • @user-cc2lp9tz7r
    @user-cc2lp9tz7r 5 days ago +2

    Isn't this the Q-Star algorithm we've been dreaming of?

  • @unclecode
    @unclecode 4 days ago

    Fascinating! This morning, I posted on X about MCTS and this paper, and later, YouTube showed me your video. Such a great coincidence. I found the coefficient C in the UCT formula for balancing exploration and exploitation really interesting. I experimented with different settings and even made it random, like temperature. The results are intriguing; I might share the repo and a video soon.
    I wonder what would happen if we built a neural network like MoE but with this MCTS structure and trained it. Would it train while searching and reasoning? Could it generate a model far better at reasoning? What do you think? Anyway, kudos to you-you're right on track and well updated as usual.

    • @TrelisResearch
      @TrelisResearch  3 days ago

      Yeah that’s interesting regarding training. I’d have to think more deeply.
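      The coefficient experiment described in the parent comment (resampling C like a temperature) could be sketched roughly as below; the sampling range is an assumption for illustration, not taken from the commenter's code:

```python
import math
import random

def uct_random_c(total_reward: float, visits: int, parent_visits: int,
                 c_range: tuple = (0.5, 2.0)) -> float:
    """UCT score with the exploration coefficient resampled on each call,
    analogous to a temperature: higher draws favour exploration."""
    if visits == 0:
        return float("inf")
    c = random.uniform(*c_range)  # assumed range; would need tuning empirically
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)
```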

  • @KarlLew
    @KarlLew 5 days ago

    Me with Tarot cards till I get the answer I like. But seriously, MCTS seems like a formal way to structure an extended interaction with a user. MCTS feels a lot like what I do when I use Google AI Search, as I barrage it with a cloud of different prompts when searching for a particular piece of knowledge for which I may not know the conventional terminology. In other words, the intermediate answers provide information for prompt refinement. For example, I once started with “nitrogen in soil” and ended up with “soil nitrification”, which was the prompt that gave the knowledge I sought. Thanks for the vid!

  • @aissabakhil1696
    @aissabakhil1696 2 days ago

    Thank you. Can you make an explanation of GGUF quantization and how to convert a custom multimodal model to GGUF?

    • @TrelisResearch
      @TrelisResearch  2 days ago +1

      ooh, multi-modal I haven't looked at, but check out the quantization video from last year in the fine-tuning playlist from Trelis

    • @aissabakhil1696
      @aissabakhil1696 1 day ago

      Thank you

  • @Saurabh5228
    @Saurabh5228 8 days ago +2

    How does it differ from Tree of Thoughts prompting?

    • @TrelisResearch
      @TrelisResearch  8 days ago +7

      They are related ideas! Tree of Thought typically involves a more deterministic approach to where to go next in the tree, whereas Monte Carlo is based on probabilistic evaluation (with back-propagation, in this case, of rewards to all parent nodes). It's a bit confusing to draw clear distinctions between the two here because, if temperature is 1 and you are trying multiple samples, there is a probabilistic element to Tree of Thought and to the approach taken here.
      More on Tree of Thought here: www.promptingguide.ai/techniques/tot
      If I had to boil it down, two key elements here are:
      1. Backpropagating rewards from each evaluation through the parent nodes (which has the benefit of adding probabilistic information on the strength of the parent nodes)
      2. Using UCT, which is a specific approach for balancing exploration versus exploitation.
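      A minimal sketch of element 1, reward backpropagation (class and function names are illustrative, not from the paper's repo): each evaluation of a leaf answer updates the visit count and accumulated reward of every ancestor, which is the probabilistic information about parent-node strength mentioned above.

```python
class Node:
    """One candidate answer in the search tree."""
    def __init__(self, answer: str, parent=None):
        self.answer = answer
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def backpropagate(node: "Node", reward: float) -> None:
    """Propagate an evaluation's reward up through every parent,
    so each ancestor accumulates statistics about its subtree."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

# A rating given to a leaf also strengthens (or weakens) its ancestors:
root = Node("draft answer")
child = Node("refined answer", parent=root)
root.children.append(child)
backpropagate(child, reward=0.8)
print(root.visits, root.total_reward)  # 1 0.8
```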

  • @r.s.e.9846
    @r.s.e.9846 2 days ago

    Thanks! How could we improve this with a compiler, search or some form of symbolic reasoning?

    • @TrelisResearch
      @TrelisResearch  1 day ago

      Yup, will see if I can get a video on that live soon

  • @ravenecho2410
    @ravenecho2410 5 days ago

    You mean the early chess algos?

  • @scscyou
    @scscyou 6 days ago

    How do we integrate this as part of our AI client, for example when running local server with a web-based UI? Are there any complete, packaged solutions?

    • @TrelisResearch
      @TrelisResearch  5 days ago +2

      Unless you're using Groq (and even then), this is going to be slow for low-latency applications. You would need to wrap all of this as its own server and then hit that endpoint; definitely a bit more work.

    • @scscyou
      @scscyou 5 days ago

      @TrelisResearch Of course, I'm assuming use cases where we want accuracy. For example, to batch multiple questions and then get back in an hour to see the best responses to all of them. Preferably with an Agents workflow, where the LLM can talk to itself to iterate over a solution (like code), and with the ability to invoke external tools (compilation, browser, calculator...). An Ollama server (containerized for security & compatibility) could be the best starting point, but what we need is a user-friendly way to use everything at once, like a module. Using Jupyter notebooks to run a custom Python code fragment is an impossibly steep curve for many of us who need to use such AI for practical purposes (e.g. I don't really work in Python, but I could definitely invoke a local API algorithmically).

  • @pooascyrous5722
    @pooascyrous5722 5 days ago

    Any idea how we can use this idea in finance and trading decisions?

    • @TrelisResearch
      @TrelisResearch  5 days ago

      Well yeah you can use it as a way to prompt an LLM to make a prediction and refine that prediction. I'll see if I can make a quick vid on that at some point.

  • @nathandfox
    @nathandfox 4 days ago

    Why didn't the authors try using MCTS + GPT-4 to see if it can improve even at that level?

    • @TrelisResearch
      @TrelisResearch  1 day ago

      That's a good suggestion and I don't know why. Perhaps they just wanted to get the paper published and Llama 3 showed what they needed. Also, there's the question of what to compare MCTS + GPT-4 against: the benchmarks tested are saturated (true, they could have tried something like ARC).

  • @ghrasko
    @ghrasko 3 days ago

    Hi, at 2m13s you say that the essence of the Monte Carlo method is that it programmatically changes the prompts (instead of doing it manually). As far as I can see, it is not doing that at all: it is NOT refining the prompt. Am I misunderstanding something?

    • @TrelisResearch
      @TrelisResearch  3 days ago

      It just depends on how you define the prompt. I mean the prompt as the full input to the LLM. This gets updated because part of the full prompt is the draft answer. You’re correct that the instruction set is pre-defined.

  • @avwie132
    @avwie132 6 days ago

    In other words: keep juggling until you get proper results

    • @TrelisResearch
      @TrelisResearch  5 days ago +1

      Yeah, kind of, for better or worse. Although Monte Carlo is (assuming a good evaluator) much better than random juggling.

  • @pensiveintrovert4318
    @pensiveintrovert4318 8 days ago +6

    You may get a better answer. You can't possibly know if the answer is the best answer. Don't lie to yourself. Even identifying the better answer is not easy unless others are obviously wrong.

    • @TrelisResearch
      @TrelisResearch  8 days ago +6

      Yeah, absolutely agreed, and I tried to make that clear in the vid. But I accept it if it wasn’t clear enough!

    • @KCM25NJL
      @KCM25NJL 6 days ago +1

      While this is technically true, it does in fact make for at least some semblance of hierarchical self-reflection on a longer timeframe than doing simple X-shot CoT prompting, which is a step closer to system 2 thinking than we have. Of course, while this implementation is rudimentary and has its noted cons, I'm almost certain it's a step in the right direction. I'd personally like to see this used as a method of generating synthetic data in a way that gives at least a statistical improvement (via fine-tuning) in prompt or prompt-chain answers from smaller LLMs.

    • @TrelisResearch
      @TrelisResearch  5 days ago

      @KCM25NJL I suppose techniques like SPIN show that this is possible, and probably MCTS is a Pareto improvement over that.
      All of that said, my gut feel is that, for pre-training, language models are perhaps more useful for filtering data (see FineWeb-Edu) than for generating synthetic data, unless you are using a powerful model to train a smaller model.

    • @KCM25NJL
      @KCM25NJL 5 days ago

      @TrelisResearch Yes indeed, frontier-to-smaller-model top-down refinement was my thinking. If Llama 3 8B with MCTS can achieve similar math scores to GPT-4o, the biggest highlight for me is not one of capability but of efficiency.

    • @goodtothinkwith
      @goodtothinkwith 4 days ago

      That’s true for people too. We can’t be disappointed about not getting absolute certainty.

  • @ShanyGolan
    @ShanyGolan 2 days ago

    The idea has been done before, so it's not new.

    • @TrelisResearch
      @TrelisResearch  2 days ago

      True that Monte Carlo is not new. I'm unsure whether the paper (applying it this way to LLMs) is new or not (I didn't feel I gave a strong opinion on that in the vid). If you have a link to a previous paper doing what this paper does, could you post it here? Cheers

  • @jamesbrown6591
    @jamesbrown6591 4 days ago

    Not here for advice, just a sucker for slides with whimsical shaky font