Everything WRONG with LLM Benchmarks (ft. MMLU)!!!

  • Published: 9 Feb 2024
  • 🔗 Links 🔗
    When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
    arxiv.org/pdf/2402.01781.pdf
    ❤️ If you want to support the channel ❤️
    Support here:
    Patreon - / 1littlecoder
    Ko-Fi - ko-fi.com/1littlecoder
    🧭 Follow me on 🧭
    Twitter - / 1littlecoder
    Linkedin - / amrrs
  • Science

Comments • 32

  • @Kutsushita_yukino
    @Kutsushita_yukino 5 months ago +6

    It's like the educational system: getting a good grade at school doesn't mean you're smart, it just means you had the motivation and worked hard enough.

    • @1littlecoder
      @1littlecoder 5 months ago +3

      Haha, that's a nice analogy!

    • @sankyuubigan
      @sankyuubigan 5 months ago

      The student just listened and did as he was told. He is obedient: he did well because he did what he was told, and did it the way he was told.

  • @MichealScott24
    @MichealScott24 5 months ago

    ❤ Lovely, thanks for the comprehensive, detailed, and clear explanation!

  • @henkhbit5748
    @henkhbit5748 5 months ago +1

    Thanks for showing that LLM benchmarks can be manipulated. Informative paper. 👍

  • @ilianos
    @ilianos 5 months ago +2

    Very informative, as usual!

  • @_paixi
    @_paixi 5 months ago +1

    When a measure becomes a target, it ceases to be a good measure. I use my own benchmarks to automatically evaluate models on my use cases because people are just trying to get high ranks on leaderboards now. I'm definitely going to update them to include these perturbations.

    • @sankyuubigan
      @sankyuubigan 5 months ago

      How do you do your own testing? What tools do you use? Where do you get datasets? Can you make a video on this topic?
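
A minimal sketch of what such a personal eval loop could look like, assuming a hypothetical query_model(prompt) wrapper around whatever API or local model you use; the sample question, prompt format, and placeholder reply are illustrative, not taken from the paper or the video:

```python
import random

def query_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with your own API call or local inference.
    return "A"

# Your own benchmark: questions that matter for your use case.
my_benchmark = [
    {
        "question": "Which HTTP status code means 'Not Found'?",
        "options": ["200", "301", "404", "500"],
        "answer": "404",
    },
    # ... add the cases you actually care about
]

def evaluate(benchmark, shuffle_options=True, seed=0):
    rng = random.Random(seed)
    correct = 0
    for item in benchmark:
        options = list(item["options"])
        if shuffle_options:
            rng.shuffle(options)  # perturb option order so position cannot be memorized
        labels = "ABCD"[: len(options)]
        prompt = (
            item["question"]
            + "\n"
            + "\n".join(f"{label}. {opt}" for label, opt in zip(labels, options))
            + "\nAnswer with a single letter."
        )
        reply = query_model(prompt).strip().upper()
        gold_label = labels[options.index(item["answer"])]
        correct += int(reply.startswith(gold_label))
    return correct / len(benchmark)

print(evaluate(my_benchmark))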

  • @8eck
    @8eck 5 months ago +2

    Exactly, I was also thinking about this. Basically, what most top open-source models are doing is tuning weights to produce better benchmark scores. They are overfitting to the benchmarks.

    • @blisphul8084
      @blisphul8084 5 months ago

      The only real way to know is to try the models yourself. I'd say Mixtral and Qwen are excellent models, and I haven't even looked at the benchmarks. Every model is different and has strengths and weaknesses, which isn't going to be captured by a single benchmark number. The way I see it: ChatGPT for knowledge, open source for creativity. Qwen 72B does seem smarter than Mixtral, but it's a little more expensive to run, though still cheaper than ChatGPT Turbo.

  • @alx8439
    @alx8439 5 months ago +1

    Thanks, mate, for the video. It perfectly proves the point that there's still a lot of incompetence and ignorance among the people who made these benchmarks. The good news is that there are bright minds who discovered all this. I hope the community will come up with a newer and better way to benchmark LLMs, one that shuffles the answers, uses some unusual symbols, etc., to give us more trustworthy scores.

    • @sankyuubigan
      @sankyuubigan 5 months ago

      We need to create an algorithm that shuffles the dataset.
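
A sketch of what such a shuffle could look like for one MMLU-style item: the options are permuted and the gold index is remapped, so the answer content stays the same while its position changes. The item schema here is an assumption, not the exact MMLU format:

```python
import random

def shuffle_item(item: dict, rng: random.Random) -> dict:
    """Permute the answer options of one MCQ item and remap the gold index."""
    order = list(range(len(item["options"])))
    rng.shuffle(order)
    return {
        "question": item["question"],
        "options": [item["options"][i] for i in order],
        # new position of the option that used to sit at the gold index
        "answer_idx": order.index(item["answer_idx"]),
    }

rng = random.Random(42)
item = {"question": "2 + 2 = ?", "options": ["3", "4", "5", "22"], "answer_idx": 1}
print(shuffle_item(item, rng))  # same content, different option order
```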

  • @VaibhavPatil-rx7pc
    @VaibhavPatil-rx7pc 5 months ago +1

    Excellent information, bro!

  • @machinepola6246
    @machinepola6246 5 months ago

    I think Phi-2 is affected only in the MCQ scenario. Being largely trained on a textbook corpus may lead to this failure, since we always see A, B, C, D options in textbooks. I want to see more scenarios than just MCQ before Phi-2's leaderboard position changes.

  • @unclecode
    @unclecode 5 months ago

    This paper might suggest that an LLM's "emergent" intelligence is basically just a smart parrot that's super sensitive to how it's trained. It looks like overfitting, to the point that the models become sensitive even to the answer symbols. Anyway, this paper is a great resource for learning how to structure input data for these models to get the best performance, and what to avoid. Okay, I think I found a good use for benchmarks :D While we can't change these models, we can change ourselves!

  • @sankyuubigan
    @sankyuubigan 5 months ago +1

    Very important topic about the "truthfulness" of information in tests! It seems that the "TruthfulQA" test also suffers from this? Make more videos about the truthfulness of tests and their answers. Great work.

  • @MichaelBarry-gz9xl
    @MichaelBarry-gz9xl 5 months ago +2

    This is easily mitigated by randomizing both the symbols and the order; such randomization should be the default, otherwise the model learns unwanted patterns. I would go one further and ask it each question twice (in a new context so it forgets the last), both randomized; it needs to get BOTH right to earn a point. Problem solved. (See the sketch after this thread.)

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 5 months ago +1

      That won't solve the problem of evaluation data being leaked into the training dataset. Whether it's intentional or inadvertent, the same contamination can occur. The way to solve this is to keep the evaluation dataset secret and not let anyone have access to it. That would require the evaluators to run the model locally, which closed models are not likely to accept.

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 5 months ago

      @zyxwvutsrqponmlkh True. I think it's important for everyone to have their own personal evaluation datasets, to evaluate the things they care about. But the biggest problem is chasing the leaderboards. I've looked through some of the datasets and the quality is extremely poor (and I mean really, really poor), and yet everyone relies on them, probably without even reading through them.

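A sketch of the "randomize everything, ask twice, require both right" idea from the comment above, assuming a hypothetical ask_model(prompt) call that runs each question in a fresh context; the symbol sets, prompt wording, and placeholder reply are illustrative:

```python
import random

SYMBOL_SETS = [list("ABCD"), list("1234"), list("WXYZ"), ["i", "ii", "iii", "iv"]]

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: one fresh-context call to the model under test.
    # Always answering "A" mimics a model that games position instead of knowing answers.
    return "A"

def ask_once(question, options, answer_idx, rng):
    symbols = rng.choice(SYMBOL_SETS)  # randomize the answer symbols
    order = list(range(len(options)))
    rng.shuffle(order)                 # randomize the answer order
    prompt = question + "\n" + "\n".join(
        f"{symbols[pos]}. {options[i]}" for pos, i in enumerate(order)
    ) + "\nAnswer with the symbol only."
    reply = ask_model(prompt).strip()
    gold_symbol = symbols[order.index(answer_idx)]
    return reply.startswith(gold_symbol)

def score(benchmark, seed=0):
    rng = random.Random(seed)
    points = 0
    for q in benchmark:
        # Two independently randomized passes; a point only if BOTH are correct.
        first = ask_once(q["question"], q["options"], q["answer_idx"], rng)
        second = ask_once(q["question"], q["options"], q["answer_idx"], rng)
        points += int(first and second)
    return points / len(benchmark)

benchmark = [{"question": "2 + 2 = ?", "options": ["3", "4", "5", "22"], "answer_idx": 1}]
print(score(benchmark))
```
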
  • @Hanush8619
    @Hanush8619 5 months ago

    Sir, please share your view on which is the most accurate model, so that we can find accurate information.

  • @kotcraftchannelukraine6118
    @kotcraftchannelukraine6118 5 months ago +3

    This is a very wrong way to go about creating a perfect AI model. There is a lot of training data, but the architecture itself is terrible. The transformer architecture is very good for some small text processing but nothing more. A really advanced architecture must have an amygdala for emotional processing, short-term working memory and long-term memory, a proper multimodal system, and many more brain functions. The most important step is developing an architecture that can solve both cognitive and non-cognitive tasks in real time, not just a next-token prediction system.

  • @antb533
    @antb533 5 months ago

    It's the same issue with code benchmarks such as HumanEval (EvalPlus), etc.

  • @zgintasz2
    @zgintasz2 5 months ago

    What matters more is the variance of the MMLU scores, not so much the rankings. If in one run the model scores 0.690 and in another run 0.691, it's not a big deal, even though the rankings could change if the competition is fierce in that score range.

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 5 months ago +1

      The variance can only come from getting answers wrong. If the answers are the same each time, the scores will be identical. So we need to fix the ordering issue by randomizing it. Once that's fixed, there should be no variance: either the model gets the question right 100% of the time regardless of the order, or it hasn't learned the answer. The variance is caused by the model learning to game the benchmark, i.e. "the answer is usually A, so I'll just always choose A."
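
One way to check whether a model is "just always choosing A" is to tally its predicted answer positions against the gold positions over an eval run. A small sketch, assuming you already have (predicted_index, gold_index) pairs from whatever harness you use; the example numbers are made up:

```python
from collections import Counter

def position_bias(results, n_options=4):
    """results: list of (predicted_index, gold_index) pairs from one eval run."""
    predicted = Counter(p for p, _ in results)
    gold = Counter(g for _, g in results)
    total = len(results)
    for i in range(n_options):
        print(
            f"option {chr(65 + i)}: "
            f"picked {predicted[i] / total:.1%} of the time, "
            f"is the gold answer {gold[i] / total:.1%} of the time"
        )

# A model leaning on position A shows a 'picked' rate for A far above its gold rate.
position_bias([(0, 0), (0, 2), (0, 1), (1, 1), (0, 3)])
```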

  • @hamidg
    @hamidg 5 months ago

    Each specialty has 100-200 questions?? And the LLM is only trained on 4 to 5? I mean, I thought it would be much larger than this.

  • @jmirodg7094
    @jmirodg7094 5 months ago

    An easy improvement could be to randomize the position of the answers in the benchmark to turn the bias into noise.

    • @rupakvignesh
      @rupakvignesh 4 months ago +1

      Why can't that be done during training as well?

    • @jmirodg7094
      @jmirodg7094 4 months ago +1

      @rupakvignesh Yes, that has to be done during training.

  • @frankjohannessen6383
    @frankjohannessen6383 5 months ago +1

    Could we use agent systems to evaluate in a more free-form way instead? The LLM you want to evaluate is given the question (without multiple choices) and is tasked with producing an answer. Another LLM is given the correct answer (or a list of valid answers) and is tasked with judging whether the first LLM's answer matches. It would be very important to use a good evaluator model, since it needs to minimize false positives and false negatives. For now, GPT-4 would probably be ideal.
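
A sketch of that free-form, judge-based setup, assuming two hypothetical calls: generate_answer(...) for the model under test and judge_call(...) for the evaluator model (e.g. GPT-4). The prompt wording, placeholder replies, and data schema are illustrative:

```python
def generate_answer(model: str, question: str) -> str:
    # Hypothetical placeholder: query the model under test with the bare question, no options.
    return "Paris"

def judge_call(prompt: str) -> str:
    # Hypothetical placeholder: query the evaluator model (e.g. GPT-4) and return its reply.
    return "YES"

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer(s): {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate express the same answer as the reference? "
    "Reply with exactly YES or NO."
)

def free_form_eval(model: str, dataset: list) -> float:
    correct = 0
    for item in dataset:
        candidate = generate_answer(model, item["question"])
        verdict = judge_call(JUDGE_TEMPLATE.format(
            question=item["question"],
            reference="; ".join(item["references"]),
            candidate=candidate,
        ))
        correct += int(verdict.strip().upper().startswith("YES"))
    return correct / len(dataset)

dataset = [{"question": "What is the capital of France?", "references": ["Paris"]}]
print(free_form_eval("my-model", dataset))
```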