How to OPTIMIZE your prompts for better Reasoning!

  • Published: 9 Feb 2025

Comments • 45

  • @abyssuserigo
    @abyssuserigo 1 month ago +11

    Love your videos man, this space is so noisy but your uploads are pure quality.

  • @choiswimmer
    @choiswimmer 1 month ago +13

    I'd love a comparison between DSPy and this! We should start evaluating frameworks.

    • @samwitteveenai
      @samwitteveenai 1 month ago +1

      Just posted this above so reposting here
      This one certainly is a lot easier to use. Both TextGrad and DSPy really focus on treating it as if it were a deep learning network etc. For me, this one is probably better for people who don't have that kind of experience. I've been meaning to make an in-depth video about DSPy for a while, and just never gotten around to it, I do like it as a project.

    • @samwitteveenai
      @samwitteveenai 1 month ago

      I totally agree. It would be good to start doing some kind of evaluation by comparing all of these kinds of things to each other.

  • @thunkin-ai
    @thunkin-ai 1 month ago +3

    I’ve been thinking about this concept for a while; very glad there’s a concept out in the wild

  • @keclv
    @keclv 1 month ago +1

    Cool! I've just tried the example problems from the Colab notebook using my Llama3.2 3b with a simple prompt "Please solve the following problem:" All the results were correct, with nice concise reasoning steps. To justify the amount and cost of optimization I would like to see some counter examples showing what value this approach actually adds over the baseline.

    • @samwitteveenai
      @samwitteveenai 1 month ago +1

      I'm thinking of sticking LiteLLM support in there, and making it so that you could try it on any model, including local models, et cetera. That would make it much more cost-effective to try things out.

  • @mohamedkarim-p7j
    @mohamedkarim-p7j 27 days ago +1

    Thanks for sharing.

  • @jaggyjut
    @jaggyjut 19 days ago

    Great topic.

  • @sitedev
    @sitedev 1 month ago +2

    I can see how something like this would be useful in a RAG pipeline where, as new documents are added, an LLM instance could create a base dataset representing the content of the document (or the entire knowledge base) and then use that with PW to create extensive prompts that are subsequently used to evaluate chunk/retrieval performance.

  • @tuliochiodi8642
    @tuliochiodi8642 1 month ago

    Thank you, Sam Witteveen, for this insightful video on PromptWizard! I really appreciate the effort you put into explaining its features.
    I had a question regarding my personal use case: I need to tune prompts that carry context variables retrieved during real-time execution. Can this framework handle such scenarios out of the box, or would it require additional customization? If so, how straightforward do you think it would be to implement this feature?
    Thanks in advance for your attention and for sharing your expertise!

    • @samwitteveenai
      @samwitteveenai 26 days ago

      Yes, you could do this. It's all about how you make your evals and set those up.

  • @PavelSTL
    @PavelSTL 29 days ago +1

    Hi Sam, awesome breakdown as usual. You're the best! (No, I'm not prompting you.) But a somewhat unrelated question: I'm using "higher" LLMs to help me write system prompts for lower LLMs (which in some of my circumstances is the only option I have), but when I ask these higher LLMs whether it's a good practice, they actually advise that it's OK and sometimes even better to ask the "lower" LLM to write its own system prompt for the tasks you want, the primary reason being that it knows its own peculiarities better than any other LLM. Now I'm having a hard time believing that. Granted, by "lower" I don't mean GPT-3 or lower, but the rather capable Gemini Pro 1.5, but still... Any thoughts?

    • @samwitteveenai
      @samwitteveenai 26 days ago

      My guess is it's really going to depend on the model and the use case. Really, the simplest answer is try both and see which works best.

  • @ergosumdre
    @ergosumdre 1 month ago

    Nice video!

  • @d_b_
    @d_b_ 1 month ago +2

    Thank you for the insightful video. Could you elaborate on whether token usage is a concern for users when employing this framework? Also, would you say it’s only effective for specific tasks or use cases, rather than being broadly applicable? How feasible would it be for someone to develop a similar prompt optimization tool independently?

    • @samwitteveenai
      @samwitteveenai 1 month ago +2

      The cost wasn't super high; it was $13.70. You can certainly use it for your own tasks; it would work better for things where there is a clear correct answer. You could develop something like this yourself, but this is totally open, so you can leverage it.

    • @johang1293
      @johang1293 1 month ago +1

      Just making a simple tool to optimize your prompt to a lvl 3 or lvl 4 prompt will go a long way. Once you start to use lvl 3 and lvl 4 prompts you will really see what the LLM is capable of.

  • @ScottLahteine
    @ScottLahteine 1 month ago +2

    This tool looks like it could be extremely helpful for my needs. I have to extract documentation from a couple of source code files, organize it all into sections, and output a YAML representation. It’s hard to know where to begin to get an LLM to accomplish this task. It needs to be decomposed into stages, and the LLM needs to have a peek at the documentation that it’s generating so it can improve phrasing, do language translation, etc., and within a small context window so the LLM doesn’t choke. I already have a parser written the old-fashioned way that can extract individual items from the sources into a generic JSON or YAML format, so I just need that final step of leveraging the LLM. I can write a TOOL to grab any requested part of the JSON and test it within an environment like Open WebUI, but my brain is fuzzy on how to make a reliable system that can be reused. Any tool that helps refine the prompts, whether for the whole process or to build agents, is most welcome! I’m sure this is a common enough problem that there are already some services popping up. But this seems like a good learning opportunity.

  • @amandamate9117
    @amandamate9117 1 month ago +1

    Obvious question: is this better than just fine-tuning a model, and if so, why?

    • @HassanAllaham
      @HassanAllaham 1 month ago

      I believe that this framework costs less computation power than finetuning. Also, it can be used with the already-existing GGUF format that works well on consumer-grade hardware. It is like fine-tuning the system prompt to be suitable for a specific LLM and a specific domain of tasks (represented by a dataset) instead of LLM finetuning, which needs a high computation power (Powerful GPU).
      Also, LLM finetuning makes the LLM optimized only for a specific domain of tasks, while using this framework makes it possible to use the same general-purpose LLM for multi-domains of tasks by only generating the optimized system prompts.
      I believe finetuned LLM will provide better performance for specific task types especially where the right answers are not exact ones (I mean where there are many right answers for the same question)

  • @sepsi77
    @sepsi77 1 month ago +4

    How much did this cost in the end? Seems like this could use a lot of tokens.

    • @amandamate9117
      @amandamate9117 1 month ago +3

      a fortune

    • @thenoblerot
      @thenoblerot 1 month ago +3

      Enterprise and research don't seem to care about token counts, especially not if they're investing in RL datasets or will be saving tokens in deployment. 🔥💰 I clenched when Sam said he ran it for 20 minutes lol

    • @samwitteveenai
      @samwitteveenai 1 month ago +19

      Just checked: it wasn't that bad, it was $13.70. I feel this is pretty reasonable if it is an important prompt etc. The dataset was not small either. I was thinking of showing this with DeepSeekV3 or Gemini if there is interest, which could make it much cheaper.

    • @sepsi77
      @sepsi77 1 month ago +1

      @ that’s very reasonable, thanks for the info.

    • @kaverianuranjana9787
      @kaverianuranjana9787 26 days ago

      @samwitteveenai A comparison of cost-effective models with such methods would definitely be useful.

  • @RickySupriyadi
    @RickySupriyadi 1 month ago +3

    Most of the time, models under 8B get quite confused by prompts that are too long. Am I the only one who has that problem?

    • @kaverianuranjana9787
      @kaverianuranjana9787 27 days ago +1

      We've found that 8B models are quite uninformative and yield unstable results under minor changes, compared to 70B. This was for Llama models. Even switching to an 11B T5 model was better (so you don't need to jump to really big models).

  • @micbab-vg2mu
    @micbab-vg2mu 1 month ago

    Thanks - very useful :)

  • @puremajik
    @puremajik 1 month ago +3

    Compare PromptWizard to TextGrad and DSPy.

    • @samwitteveenai
      @samwitteveenai 1 month ago

      This one certainly is a lot easier to use. Both TextGrad and DSPy really focus on treating it as if it were a deep learning network etc. For me, this one is probably better for people who don't have that kind of experience. I've been meaning to make an in-depth video about DSPy for a while, and just never gotten around to it, I do like it as a project.

    • @samwitteveenai
      @samwitteveenai 1 month ago

      Just curious - do you have any particular use cases for DSPy or TextGrad?

  • @samuelcombey
    @samuelcombey 1 month ago

    Great job!✌ Colab link not working

    • @samwitteveenai
      @samwitteveenai 1 month ago +1

      I just checked it; it should be OK. drp.li/58ni6

  • @cariyaputta
    @cariyaputta 1 month ago

    The iterative optimization part is essentially a genetic algorithm.

    • @samwitteveenai
      @samwitteveenai 1 month ago +1

      Yes, in that way it's quite similar to Promptbreeder from DeepMind.
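
A minimal sketch of the mutate, score, and select loop that the genetic-algorithm comparison refers to. This is not PromptWizard's or Promptbreeder's actual code: `evolve_prompt`, `toy_mutate`, `toy_score`, and `HINTS` are hypothetical names, and the toy mutation and scoring functions stand in for what would in practice be an LLM rewriting the prompt and accuracy measured on a labelled dataset.

```python
import random
from typing import Callable, List

# Hypothetical instruction fragments a mutation step might append.
HINTS: List[str] = [
    "Think step by step.",
    "Show your working.",
    "State the final answer on its own line.",
]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Toy mutation: append one randomly chosen hint (stand-in for an LLM rewrite)."""
    return prompt + " " + rng.choice(HINTS)


def toy_score(prompt: str) -> float:
    """Toy fitness: one point per distinct hint present, minus a small length penalty.
    A real setup would score the prompt by running it over labelled examples."""
    return sum(h in prompt for h in HINTS) - 0.001 * len(prompt)


def evolve_prompt(
    seed_prompt: str,
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    generations: int = 5,
    population_size: int = 6,
    keep: int = 2,
    rng_seed: int = 0,
) -> str:
    """Return the best-scoring prompt found after a few rounds of mutation and selection."""
    rng = random.Random(rng_seed)
    population = [seed_prompt]
    for _ in range(generations):
        # Mutation: derive new candidates from the current survivors.
        children = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        # De-duplicate while preserving order so ties break deterministically.
        candidates = list(dict.fromkeys(population + children))
        # Selection: survivors compete with their children, so the best score never drops.
        population = sorted(candidates, key=score, reverse=True)[:keep]
    return population[0]


seed = "Solve the problem."
best = evolve_prompt(seed, toy_mutate, toy_score)
print(best)
```

Because the survivors stay in the candidate pool each round, the best score can only stay the same or improve; the cost in a real run comes from the scoring step, since every candidate has to be evaluated against the dataset.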

  • @deadbeafc001
    @deadbeafc001 1 month ago +2

    seems like DSPy

    • @samwitteveenai
      @samwitteveenai 1 month ago +1

      Indeed, it is like DSPy, but it has some of the cool ideas from DeepMind's Promptbreeder and is easier to use, I would say.

  • @ctSQD
    @ctSQD 17 days ago

    If you guys read the paper for yourselves, you'll notice it doesn't even show any examples of the human prompt and how it actually improved it. With this being the main purpose, you'd think you would have this.

  • @hqcart1
    @hqcart1 1 month ago +2

    The problem is that this method does not work if you do not know the answer to the problem, or if the answer might not be precise like this math. None of the real-world apps I've seen can use this.

    • @samwitteveenai
      @samwitteveenai 1 month ago

      Yes, this is a really good point: it works best for things where there is a clear and correct answer. Using it for things like creative writing etc. is probably not going to be as good. That said, I still think it can be used in certain kinds of real-world apps.
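
To illustrate why a clear correct answer matters, here is a minimal sketch of exact-match scoring, the kind of signal an optimizer needs in order to rank candidate prompts. The names `normalize`, `exact_match_accuracy`, `dataset`, and `canned` are hypothetical and not taken from PromptWizard; the canned answers stand in for real model outputs.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    answer_fn: Callable[[str], str],
    dataset: List[Tuple[str, str]],
) -> float:
    """Fraction of questions whose normalized answer equals the normalized gold answer."""
    if not dataset:
        raise ValueError("dataset must contain at least one (question, gold_answer) pair")
    hits = sum(normalize(answer_fn(q)) == normalize(gold) for q, gold in dataset)
    return hits / len(dataset)


# Toy labelled data: each question has exactly one correct answer.
dataset = [("2 + 2", "4"), ("3 * 5", "15"), ("10 - 7", "3")]

# Canned outputs standing in for a model; the last one is deliberately wrong.
canned = {"2 + 2": " 4 ", "3 * 5": "15", "10 - 7": "2"}

accuracy = exact_match_accuracy(lambda q: canned[q], dataset)
print(f"{accuracy:.2f}")
```

For open-ended tasks such as creative writing there is no single gold string to compare against, so a scorer like this has nothing to measure and would need to be replaced by some other evaluation.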