A much better LLM Leaderboard!!!

  • Published: 27 Nov 2023
  • 🏆 This leaderboard is based on the following three benchmarks.
    Chatbot Arena - a crowdsourced, randomized battle platform. We use 100K+ user votes to compute Elo ratings.
    MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
    MMLU (5-shot) - a test to measure a model's multitask accuracy on 57 tasks.
    🔗 Links 🔗
    Chatbot Arena Leaderboard from LMSYS - huggingface.co/spaces/lmsys/c...
    Arena Leaderboard Elo Ranking Method - colab.research.google.com/dri...
    Play at the Arena - chat.lmsys.org/?arena
    Intro Sound from Honest Trailers- • Honest Trailers - Inte...
    ❤️ If you want to support the channel ❤️
    Support here:
    Patreon - / 1littlecoder
    Ko-Fi - ko-fi.com/1littlecoder
    🧭 Follow me on 🧭
    Twitter - / 1littlecoder
    Linkedin - / amrrs
  • Science
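For context on how Arena-style ratings work: each user vote is a head-to-head "battle" between two anonymous models, and ratings are updated from the outcomes. Below is a minimal sketch of the standard online Elo update from pairwise votes; it is an illustration only, not LMSYS's exact computation (the linked Colab notebook describes their actual method).

```python
def expected_score(r_a, r_b):
    # Probability that model A beats model B under the Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=4):
    # score_a: 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    # k controls how much a single vote moves the ratings.
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - e_a))
    return r_a_new, r_b_new

# Replay a batch of crowdsourced votes (model_a, model_b, score_a)
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]
for a, b, s in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s)
```

Note the update is zero-sum: whatever rating the winner gains, the loser loses, which is why vote volume (100K+ here) matters for stable rankings.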

Comments • 40

  • @1littlecoder
    @1littlecoder  8 months ago +3

    Check out this LLM ranking for RAG - ruclips.net/video/Ce0OKpMhvXw/видео.html

  • @nukesean
    @nukesean 8 months ago +6

    wtf does that dentist Tesla joke even mean? In what world is that better than the one that actually makes sense?

  • @ByteBop911
    @ByteBop911 8 months ago +5

    AlphaGo also used Elo...

  • @user-wv8gf3ft1f
    @user-wv8gf3ft1f 8 months ago +5

    Keep up the good work! I also love it when you have the LLMs make Elon Musk videos every time 🤣

    • @1littlecoder
      @1littlecoder  8 months ago +2

      Thank you, glad to know you appreciate the humor :)

  • @thamaraikkannanks232
    @thamaraikkannanks232 8 months ago +3

    How does Claude 1 beat Claude 2?

  • @dmitrisochlioukov5003
    @dmitrisochlioukov5003 8 months ago +1

    Instead of a human verifier, would it be possible to have model X decide which answer is best, thus producing a synthetic Elo?
    Also, more scenarios are necessary, especially more variety.
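    A synthetic-Elo setup like the one this comment describes, swapping the human voter for a judge model, could be sketched as below. Everything here is hypothetical: `judge_preference` is a stand-in for an API call to a grader LLM (replies below note the project already uses GPT-4 this way); here it just flips a coin so the sketch runs.

```python
import random

def judge_preference(prompt, answer_a, answer_b):
    # Hypothetical stand-in for a judge-LLM API call that grades two
    # anonymous answers to the same prompt. Returns "A", "B", or "tie".
    # A real implementation would query a grader model (e.g. GPT-4) here.
    return random.choice(["A", "B", "tie"])

def synthetic_battles(models, prompts):
    # Pair models at random and tally judge verdicts as synthetic votes;
    # these win/loss records could then feed an Elo computation.
    wins = {m: 0 for m in models}
    for prompt in prompts:
        a, b = random.sample(models, 2)
        verdict = judge_preference(prompt, f"{a} answer", f"{b} answer")
        if verdict == "A":
            wins[a] += 1
        elif verdict == "B":
            wins[b] += 1
    return wins
```

    The known caveat with this design is judge bias: the grader model tends to favor outputs that resemble its own style, which is one reason crowdsourced human votes remain the reference signal.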

    • @KevinKreger
      @KevinKreger 8 months ago +2

      Ah, RLAIF meets leader board 🙂

    • @dmitrisochlioukov5003
      @dmitrisochlioukov5003 8 months ago +2

      @@KevinKreger lol I found out the paper/project already uses GPT-4 as a verifier 🥲

  • @MarceloLimaXP
    @MarceloLimaXP 8 months ago +2

    Even though it's a 7B, the Zephyr is in a great position =)

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 8 months ago +1

    This is a good development. I have been concerned about benchmark data contaminating some of these models' training data.
    Also, it's a real treat seeing OpenChat punching way above its weight. That's my baby there; they grow up so fast. I just wish they would release the quantized version.

    • @1littlecoder
      @1littlecoder  8 months ago

      Baby, meaning? Did you take part in creating it?

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 8 months ago

      @@1littlecoder In a small way. I think some of the conversations it was trained on were mine.

  • @sukhpalsukh3511
    @sukhpalsukh3511 8 months ago +2

    So we get free GPT-4 access 😮

  • @KevinKreger
    @KevinKreger 8 months ago +2

    Jokes are an excellent way to determine a lot of human-level skills in a single prompt. I'm not sure who you would choose other than Elon Musk that your viewership would know. Yes, we have gotten a lot of bad/tedious Elon Musk jokes, because the LLMs are usually bad/tedious at humor. That's the point. Jokes are generic and fast for the side-by-side. Coding duels need a lot more work, and we can't presume what everyone is interested in.

    • @1littlecoder
      @1littlecoder  8 months ago +1

      I stick to the same joke because it gives me perspective. A lot of people find it useless, but it's something I got used to. Like you said, I value human humor as something brilliant, so I want to see how the LLM side of things does it, especially given it's Elon Musk.

  • @brandon1902
    @brandon1902 8 months ago +3

    Openchat 3.5 is a 7b Mistral. It's not 3.5b.
    And I would like to add that I did thorough testing of most of the LLMs mentioned (e.g. science, code, pop culture, logic, math, creativity...) and the rankings make absolutely no sense. GPT4 is VASTLY superior to the others. While Openchat, Zephyr beta and several others are great for their size, they are far less knowledgeable and "intelligent" than GPT4, Claude, GPT3.5 and Llama 70b.
    Something is fundamentally wrong with this leaderboard. It might be because it only attracts a certain kind of person asking a certain kind of question, and then said people don't take the time to verify the accuracy of the responses.

    • @KevinKreger
      @KevinKreger 8 months ago

      Hi Brandon! Give us more info if you can. I think smaller models (less knowledge/training data) that are better at logic/reasoning and use a retriever are the future, BTW

    • @brandon1902
      @brandon1902 8 months ago +1

      @@KevinKreger Sure. For one thing math is an emergent property of LLMs that is directly proportionate to size (all else being equal). This is best shown by GSM8k which is grade school level math word problems. Even large "dumb" LLMs like Falcon 180b do far better than the best 7b Mistral or 13b Llama 2 at math.
      To a lesser degree same goes for coding. The larger the model (all else being equal) the more likely the code will be without errors (e.g. all variables declared) and execute successfully. Small LLMs only become good at code when they're fed massive amounts of code, which makes them perform worse at everything else.
      Knowledge is also directly proportionate to size. For example, progressively larger Llamas (7, 13, 35 & 70) recall a progressively larger percentage of song lyrics. And extremely large models like GPT4 commonly recall every word in 100s of songs correctly.
      Many smaller LLMs like Xwin perform great at these user rated side by side tests. However, that's because they're trained to tell the user what he/she wants to hear based on information gleaned from the prompt with the transformers. However, if you're a specialist/genius, or take the time to research the replies, it becomes very apparent that GPT4s responses are vastly superior, even the ones rated lower by typical human users. In short, the responses look and sound good, but are often subtle hallucinations or empty words (like those of a snake oil salesman).
      There are only 2 major performance features that don't scale very well with LLM size. The most notable one is falsehoods (e.g. TruthfulQA). Even GPT4 can be fooled into thinking the earth is flat or whatever brainless garbage is widely discussed across the internet. However, GPT4 cheats by forcing alignment to avoid common misconceptions, yet the vast majority can still be outputted just as easily by 1800b GPT4 as by a 13b Llama 2, hence their comparable TruthfulQA scores.
      Lastly, the second major performance feature that doesn't scale very well is, as you mentioned, logic/reason. It still generally scales: GPT4 can solve riddles and logic problems better than 7b Mistral. Part of the confusion is that GPT4 is HEAVILY aligned while many Mistrals are not, which means you can't reason with GPT4. It basically refuses to reason because that can come into conflict with alignment (e.g. not sharing how to steal a car or make a racist joke).
      Small unaligned Mistrals like Dolphin and Zephyr Beta can often reason better than GPT4. Not because they're better at it. They're actually worse at it. But without alignment they can respond to user prompts, or make it through several reasoning steps, without getting re-routed by alignment. For example, I've had a large LLM start talking about a physical phenomenon, but when it came to an expanding bubble it started warning about the bends and always listening to your scuba instructor, which had nothing to do with what I was asking about. Simply put, it's the heavy alignment of GPT4 and other larger models that makes them less "intelligent" than some smaller LLMs, particularly Mistrals. But in reality, without the alignment, GPT4 would be far more "intelligent" and logical.

    • @KevinKreger
      @KevinKreger 8 months ago

      Great response! I've read about the scaling laws. Adding a retrieval LLM improves the corpus of smaller models trained to reason. It's a type of scaling of available content. And this is interesting: one of the scientists at DeepMind trained a small transformer model to do modulo addition with no errors. I think it was a two-layer toy model. It first memorized the results, then it evolved neurons to solve the operation, and ended up with a DFT with sine lookup, with the attention heads doing some of the work.

  • @a3onz3ro31
    @a3onz3ro31 8 months ago +1

    OpenChat is a Mistral-based model and it's 7b, I guess

  • @testales
    @testales 8 months ago

    There are a lot of great models still missing from the Chatbot Arena, though, such as OpenHermes 2.5, which is one of the best 7b models currently out.

    • @1littlecoder
      @1littlecoder  8 months ago

      It was added a couple of days ago

  • @DrWrapperband
    @DrWrapperband 8 months ago +4

    Just a point - Elloy Musky didn't create Tesla, so the joke isn't funny. Tesla Motors was created in July 2003 by Martin Eberhard and Marc Tarpenning.

    • @Sinsholian
      @Sinsholian 8 months ago +1

      Yeah… the Musk "jokes" are really tiresome. Use something else.

  • @alx8439
    @alx8439 8 months ago +1

    Also, there are a bunch of other LLM leaderboards worth mentioning apart from this one. If you decide to cover this topic, do some further research.

    • @alexkool3511
      @alexkool3511 8 months ago

      Which ones?

    • @KevinKreger
      @KevinKreger 8 months ago

      Alx, FYI, I think if you send him the links to the other leaderboards he will be responsive.

  • @alx8439
    @alx8439 8 months ago

    This crowdsourced arena is only worth noting if it:
    1) reveals the number of human evaluations
    2) verifies real human participation
    3) dims out older results
    And I see none of that there

    • @Utoko
      @Utoko 8 months ago

      1. The number of votes is all in the dataset. 2. Not sure what you mean by that; everyone can use Chatbot Arena and start voting. 3. Why? It is only an issue in an Elo system when they change a model and keep the old data, like the new GPT4-Turbo version now, but each one gets a new entry.

  • @bonadio60
    @bonadio60 8 months ago +1

    GPT-4-Turbo rating better than GPT-4 is odd; I have used both and GPT-4 is much better.

    • @KevinKreger
      @KevinKreger 8 months ago

      Better because it is a fraction of the price and more people are using it, hence the bump up over GPT-4

  • @finn_the_dog
    @finn_the_dog 8 months ago +3

    Please, quit the Elon Musk joke prompt. There has been enough Musk spam all over the internet for too long; love him or hate him, it's boring now to hear about him. It has become like hearing about a random celebrity. The joke prompt is quite pointless given it is so subjective; I can't think of many other prompts that would return so little information about the capabilities of an LLM

    • @KevinKreger
      @KevinKreger 8 months ago

      Geez Finn. I totally disagree. Joke making is difficult. It's a great test for an LLM and an emergent capability. But go on and be bored of Elon. I am not an Elon fan either 🙂

  • @marcosbenigno3077
    @marcosbenigno3077 8 months ago

    Thanks 1LC

  • @mathematicalninja2756
    @mathematicalninja2756 8 months ago +1

    uukuguy/mistral-six-in-one-7b 6 bit quantized is intelligent as shit