ChatGPT-4o is now the best LLM in Chatbot Arena! (Tested)

  • Published: 17 Nov 2024

Comments • 30

  • @orbedus2542
    @orbedus2542 3 months ago +7

    Accurately counting letters is architecturally impossible; the best the model can do is guess or estimate, and a correct guess does not make a model better or worse. Tokenization means the model cannot "see" any specific letters. It's like asking a blind person how many fingers you are holding up. But people seem not to know how current LLM architecture functions, so they are easily fooled by fitted responses.
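
The tokenization point can be illustrated with a toy subword tokenizer. The vocabulary below is made up for illustration (real BPE vocabularies are learned and much larger); the point is that once a word is mapped to token IDs, the letter-level structure is no longer directly visible to the model.

```python
# Toy greedy longest-match subword tokenizer with a hypothetical
# vocabulary (NOT any real model's): it shows that the model receives
# token IDs, not individual letters.
VOCAB = {"straw": 101, "berry": 102, "st": 103, "raw": 104,
         "b": 105, "e": 106, "r": 107, "y": 108}

def tokenize(word):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token for {word[i:]!r}")
    return tokens

pieces = tokenize("strawberry")
print(pieces)                      # ['straw', 'berry']
print([VOCAB[p] for p in pieces])  # [101, 102] — what the model "sees"
```

With this (hypothetical) vocabulary, "strawberry" arrives as two opaque IDs, so counting its letters requires the model to have memorized the spelling of each token rather than reading it off.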

    • @elvissaravia
      @elvissaravia  3 months ago +1

      Fair. I get the architectural limitations and how tokenisation also influences results. How do you think LLMs should address these types of tasks and other issues, or do you think it's simply not feasible with current architectures?

    • @MAGAMINDMITCH
      @MAGAMINDMITCH 3 months ago

      This is actually an extremely simple problem: the model just has to produce an output first in a scratchpad or working memory, then review that first draft before giving its response to you, kind of like how humans say "uhh" and then talk.
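
The draft-then-review idea above can be written as a generic two-pass wrapper. `ask_model` here is a hypothetical stand-in for any LLM call, not a real API:

```python
def draft_then_review(ask_model, question):
    """Two-pass 'scratchpad' pattern: produce a working draft first,
    then have the model review that draft before answering.
    ask_model: any callable prompt -> text (hypothetical stand-in)."""
    draft = ask_model(
        f"Scratchpad: work through this step by step.\nQuestion: {question}"
    )
    final = ask_model(
        f"Question: {question}\nDraft reasoning:\n{draft}\n"
        "Check the draft for mistakes, then state only the final answer."
    )
    return final
```

Whether one extra review pass is enough in practice is an open question; the pattern simply gives the model a chance to read its own output before committing to an answer.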

    • @RyluRocky
      @RyluRocky 3 months ago +2

      Impossible, no; much more difficult, yes. A blind person can't see, but they can feel the number of fingers. If you have two or more different tokenizations of the same word, you can figure it out. Practical? Not super, unless you find a better implementation, but not impossible.

    • @VinMan-ql1yu
      @VinMan-ql1yu 3 months ago

      Guys, individual letters ARE tokens. You give it a word; it should have learnt somewhere which individual letters it is made of...

    • @elvissaravia
      @elvissaravia  3 months ago

      @@VinMan-ql1yu It could learn it for sure, but there is a different problem I try to highlight in the video: tokenization depends on context, and the understanding of those tokens may differ depending on that tokenization, the context, and the model's internal representations.

  • @cbgaming08
    @cbgaming08 2 months ago +2

    Does Claude 3.5 Sonnet still hold the crown?

    • @elvissaravia
      @elvissaravia  2 months ago +1

      For my use cases, yes! Mostly code generation and reasoning, along with some vision capabilities.

  • @ritvikrastogi4912
    @ritvikrastogi4912 3 months ago +7

    Isn't the strawberry problem related to tokenization? How could this be solved?

    • @elvissaravia
      @elvissaravia  3 months ago

      I believe it is. I mention it later in the video.

    • @ritvikrastogi4912
      @ritvikrastogi4912 3 months ago

      @@elvissaravia What could be the possible fix? Preference optimization?

    • @petargolubovic5300
      @petargolubovic5300 3 months ago

      @@ritvikrastogi4912 A potential solution is the model being aware of its architecture and using tools, or simply spelling words out every time it gets asked a question like that. If you ask it to spell out the word, it gets it right every time.
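
The spell-it-out trick works because spelling turns the task into a letter-by-letter procedure. As plain code, that same procedure is trivial:

```python
def count_letter(word, letter):
    # Spell the word out into individual letters first...
    spelled = list(word)  # e.g. ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
    # ...then tally matches one letter at a time.
    return sum(1 for ch in spelled if ch == letter)

print(count_letter("strawberry", "r"))  # 3
```

When a model spells a word first, it is effectively forcing its output into this letter-at-a-time representation, where counting becomes easy.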

    • @MAGAMINDMITCH
      @MAGAMINDMITCH 3 months ago

      @@ritvikrastogi4912 It's very simple: they start with a scratchpad, something you don't get back from the API, like a working memory, and then the model is capable of reviewing that and getting the correct answer.

    • @elvissaravia
      @elvissaravia  3 months ago

      @@ritvikrastogi4912 Hard to tell without actually running a robust set of experiments. I think eventually it will be fixed, either through brute-force preference optimization or maybe architectural novelties. Overall, I think this is an interesting area of research, in addition to understanding other quantitative tasks.

  • @Cine95
    @Cine95 3 months ago +2

    Weird, tomorrow they are going to announce gpt 4o-large

    • @elvissaravia
      @elvissaravia  3 months ago

      Is that confirmed or rumoured?

    • @Cine95
      @Cine95 3 months ago

      @@elvissaravia It is confirmed by that strawberry account; he said it's going to happen on Thursday.

    • @MehulPatelLXC
      @MehulPatelLXC 3 months ago

      @@Cine95 Where can I find the account you're referring to?

    • @elawchess
      @elawchess 3 months ago

      That's still a rumour, that's not "confirmation".

    • @Aziz0938
      @Aziz0938 3 months ago

      Don't trust him @@Cine95

  • @bastabey2652
    @bastabey2652 3 months ago

    How many r are there in the sentence "how many r are there in the word strawberry"?
    The answer is being hacked across all LLM vendors; gpt-4o answered correctly when the prompt is phrased:
    how many r are there in the sentence "how many r are there in the word strawberry"?
    perform step by step reasoning leading to the final answer
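
For reference, the ground truth a step-by-step answer should arrive at can be tallied mechanically, counting lowercase "r" per word of the quoted sentence:

```python
sentence = "how many r are there in the word strawberry"
# Per-word tally, the way a step-by-step answer would lay it out.
per_word = {w: w.count("r") for w in sentence.split()}
print(per_word)                # 'r': 1, 'are': 1, 'there': 1, 'word': 1, 'strawberry': 3
print(sum(per_word.values()))  # 7
```

The step-by-step phrasing nudges the model toward exactly this word-by-word decomposition instead of a single pattern-matched guess.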

  • @gerkim62
    @gerkim62 3 months ago

    I assume you are not a coding engineer, because what kind of coder wants to see comments in code 😅

    • @elvissaravia
      @elvissaravia  3 months ago

      Haha, I am, and I do think commenting is important in large codebases. It depends on what kind of code you are referring to and what it is used for.

    • @gerkim62
      @gerkim62 3 months ago +1

      @@elvissaravia I get it, but I think overly commented code is not good. When ChatGPT came out initially, it used to add too many comments to the code.

    • @elvissaravia
      @elvissaravia  3 months ago

      @@gerkim62 Agreed, overcommenting is a problem.

  • @bastabey2652
    @bastabey2652 3 months ago +4

    Only the latest gpt-4o and sonnet-3.5 answer the complex 5-candles riddle correctly with the following system prompt:
    You are a logic and reasoning expert. Reason step by step leading to the final answer.