Structured generation hurts LLM reasoning performance (Paper Explainer)

  • Published: 17 Nov 2024

Comments • 7

  • @ThePuddu2 3 months ago +2

    I would be curious to see a variation of NL-to-Format tested in a single generation instead of two consecutive ones. That is: reply to this (thinking step by step, plus all other instructions), then include a JSON version of the response in the following format: { ... }.
    From what I'm seeing, it seems to improve overall reasoning quality while keeping JSON for parsing in industrial applications. It would be nice to have it formally tested and properly benchmarked.
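A rough sketch of the single-generation variant described in the comment above, assuming the model's reply arrives as a plain string with the JSON object placed after the free-form reasoning. The function names `build_prompt` and `extract_last_json`, the prompt wording, and the sample reply are all hypothetical, chosen only for illustration; nothing here comes from the paper or the video, and no model API is called.

```python
import json


def build_prompt(question: str, schema_hint: str) -> str:
    """Ask for free-form reasoning first, then a JSON copy of the answer.

    The wording is an assumption, not a tested prompt.
    """
    return (
        f"{question}\n\n"
        "Think step by step and explain your reasoning in plain text.\n"
        "Then include a JSON version of the response "
        f"in the following format: {schema_hint}"
    )


def extract_last_json(text: str):
    """Return the last top-level JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    result = None
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return result
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON here (e.g. braces used in prose); keep scanning.
            pos = start + 1
            continue
        result = obj
        pos = end


# Hypothetical model reply: reasoning in natural language, JSON at the end.
reply = 'The set {a, b} has two elements, so the count is 2.\n{"answer": 2}'
parsed = extract_last_json(reply)
```

Because the parser only looks for the last valid object, the reasoning text stays unconstrained while downstream code still gets a machine-readable answer. Whether this actually preserves reasoning quality is the open question the comment raises.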

  • @bastabey2652 3 months ago +4

    My understanding is that Gemini 1.5 Pro and GPT-4o/4 were specially trained on constrained structured output (CSO). Furthermore, Gemini Flash doesn't support JSON mode, and when I tested OpenAI structured output with ChatGPT 3.5, it didn't work. I haven't tested Claude's JSON support enough to comment. So the paper's results don't apply to the latest state-of-the-art CSO models like GPT-4o and Gemini 1.5 Pro. I agree with the paper's conclusion in the case of the low-end models; it's a result that anyone who has built an AI application with these models has witnessed.
    Thank you for the informative video.

    • @elvissaravia 3 months ago

      Thanks for sharing your experience. You definitely need to monitor performance constantly for this specifically. The benchmarks are also not representative of all real-world tasks, and they mention that in the discussion section.

  • @sfilkin 3 months ago +3

    You choose papers well.

    • @elvissaravia 3 months ago

      I try to choose based on audience interest. It's hard to choose sometimes with so many papers coming out every day.

  • @gr8tbigtreehugger 3 months ago +1

    Many thanks!

  • @k0b0yash1 1 month ago

    Of course the JSON-restricted prompt performed worse, as it removed chain-of-thought.