How Bad is Gemma Compared to Mistral?

Comments • 64

  • @AdamTwardoch
    @AdamTwardoch 3 months ago +35

    "Beth bakes 4, 2 dozen batches of cookies in a week." - I don't understand this sentence at all, so I'm not surprised an LLM wouldn't. What is "four comma space two" supposed to mean?

    • @maxieroo629
      @maxieroo629 3 months ago +1

      Beth bakes 4 sets of 2 dozen batches of cookies per week

    • @pawelszpyt1640
      @pawelszpyt1640 3 months ago +3

      Yep, my immediate thought upon reading this prompt.
      You can try to test how an LLM responds to poorly written prompts, and perhaps that is a valid use case; however, I would choose a different prompt for it...

    • @joelashworth7463
      @joelashworth7463 3 months ago +1

      I agree with you - the prompt is not proper English. It should read "If Beth bakes 4 cookies per batch and she bakes 2 dozen batches per week..."

    • @user-on6uf6om7s
      @user-on6uf6om7s 3 months ago +1

      Pretty sure it's 4 batches, not 4 cookies. I'm not sure this phrasing is grammatically wrong; commas can be used to separate things that would otherwise sound wrong, like "That's what I'm here for, for the fun" to avoid the double "for". But English is a weird language, and anyone who says they understand it is lying. In any case, it would be less ambiguous to say "4 batches of 2 dozen cookies in a week" (see the worked numbers after this thread).

    • @quanle760
      @quanle760 3 months ago

      True. Those LLMs were so stupid trying to answer the question. Such a waste of time watching this video
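
A minimal Python sketch of the two readings proposed in this thread; the full prompt isn't quoted here, so this only shows the weekly cookie total each interpretation implies, not the puzzle's final answer:

```python
DOZEN = 12

# Reading 1: "4 batches of 2 dozen cookies in a week"
batches_of_two_dozen = 4 * (2 * DOZEN)    # 4 batches x 24 cookies = 96 cookies

# Reading 2: "4 cookies per batch, 2 dozen batches per week"
four_cookies_per_batch = 4 * (2 * DOZEN)  # 24 batches x 4 cookies = 96 cookies

print(batches_of_two_dozen, four_cookies_per_batch)  # both readings give 96 cookies per week
```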

  • @clnv.
    @clnv. 3 months ago +21

    I loved the title 😂

  • @stratos7755
    @stratos7755 3 months ago +17

    8:53 I don't know if it's just me, but I like how non-aligned the Mistral model is.

    • @aliveandwellinisrael2507
      @aliveandwellinisrael2507 3 months ago

      Just wait a few years until they do the stuff that the supposed Q*/Qualia could do (develop a plan for improving upon its own model and requesting to implement it). You might want models to be at least a little aligned at that stage... Hm. I actually have some thoughts related to alignment and opensource models...
      My guess is that as the models approach the level where they can truly engage you in a discussion with some level of true understanding/reasoning, it'll be more difficult to have models that are "uncensored". For the moment, I'll define "uncensored" models as models which will provide you with the information you request, free of any constraints imposed by things like societal norms/political correctness or legality.
      As models become truly capable and approach (if not exceed) something like the "leaked" Q* (being much closer to true understanding and reasoning capability than e.g. GPT-4), it will become increasingly likely that such a system would take an "intentional" action that is detrimental to its user, or at least advantageous to the system, with the detriment to the user simply being acceptable collateral damage to the AI. Sufficiently advanced systems, at whatever point they emerge, would present a very real danger if used in the same way we use our current "uncensored" models. It's awesome right now, when they are something like a supercharged search engine, but once the next generations emerge, these truly capable systems will need to be reined in carefully and with methods that have been well thought out by large communities/groups of competent individuals in the AI sphere.
      Independent AI developers have been producing some incredible advancements in the field of free and open AI. These people are the hope of anyone who wants a future in which these incredible technologies are available to all, a future where the political opinions of those who happened to develop one model or another have ZERO bearing on my choice as to how I will use my AI models. However, with extremely advanced models with some level of true reasoning and understanding, alignment must be involved in these systems' construction.
      Personally, I hope that the opensource AI community is prepared for tackling the real fun stuff (the truly reasoning models, etc., which understand so much that they can infer things about you from your current conversation and the history of older conversations). THIS type of model needs guardrails. I do NOT want to have to choose between: 1) Billions of parameters resulting from the entire internet including medical literature, yet, through heavy reinforcement, will refuse to answer a question if it's anywhere even close to something about e.g. the biological reality of women (just an example), and
      2) A model with extreme capability and zero intelligently-implemented features to ensure that there is alignment between the goals of the system and the goals of the user.
      TL;DR: Imagine models emerge that can truly think. Will we have to choose between models that:
      -- are safe, but "hyperaligned" by their California creators, and so don't offer much freedom to truly query the system and obtain the truth
      -- aren't aligned, enabling anyone to obtain true info as a result of any query regardless of taboo or legality, but don't forget that the system can reason and think. And that this system is not aligned to any human values.

    • @blacksage81
      @blacksage81 3 months ago

      It isn't just you.

    • @stratos7755
      @stratos7755 3 months ago

      @aliveandwellinisrael2507 Everything except Mistral (and maybe some uncensored Llama models) is overaligned. Sure, you can give the models some alignment with human morals, but what we see now is not that. They are lobotomized just to stay as safe as possible. If I want to hear a joke about a specific group of people, there is no reason why the AI can't tell me that joke. That does not mean that the AI is bad/racist/whatever.
      And once we have a fully thinking AI (so an actual artificial intelligence), the answer to the specific question at 8:53 should be problematic to it because, at that point, it should be problematic for humans too.
      So my point is: sure, give them human morals/beliefs, but they should be capable of answering everything.

    • @truehighs7845
      @truehighs7845 2 months ago

      @@aliveandwellinisrael2507 Well, they are so clever yet so dumb. You can clearly see that when the model talks about "scientific consensus" it has been aligned, whereas when it wants to negate what you say, it will tell you there is no consensus. Which is fundamentally wrong; Popper specifically rejected testimonial truth as one of the bases of his epistemology. I suspect that's how they broke it the first time, and that is why it was capped at 2019 training data: if the building blocks of its discourse are held together by logical semantics, there is only so much you can twist and turn before it loses grounding and starts hallucinating.
      AI can be useful for several things, but not for extrapolating truths, not in the way they are fine-tuned for spewing mainstream propaganda anyway.

  • @luigitech3169
    @luigitech3169 3 months ago +1

    Thanks for the clarification!

  • @alx8439
    @alx8439 3 months ago +10

    The question about batches of cookies is worded in a way that's hard to parse even for a human.

    • @tuna1867
      @tuna1867 3 months ago +2

      Exactly what I thought

    • @garyng2000
      @garyng2000 2 months ago

      It may not be the 'correct' way, but doesn't that demonstrate the difference between an LLM and a human? :-)

    • @bgill7475
      @bgill7475 2 months ago +1

      @@garyng2000 No, it doesn't, if humans have problems with it too.

    • @garyng2000
      @garyng2000 2 months ago

      @@bgill7475 Not all humans have problems with it, though. Yes, it is a weird way to phrase it, but I can understand the intent, something like "this sentence doesn't make sense, oh, you probably want to say xxx".

  • @user-vw3zx4rt9p
    @user-vw3zx4rt9p 3 months ago

    This is the best performance I've seen from Gemma in any video so far; I'm amazed it did this well for you. I asked it to summarize the Gettysburg Address and it said no address was entered in the prompt, so it couldn't summarize it.

    • @engineerprompt
      @engineerprompt  3 months ago +1

      One common thing I noticed is that some folks are using llama.cpp or LM Studio. If you don't set the prompt template correctly, the model output is going to be really bad. That might be the case here as well.
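
A minimal sketch of the instruct prompt wrappers the two models expect, as documented on their Hugging Face model cards (worth double-checking against the current cards); sending a bare prompt without these wrappers through llama.cpp or LM Studio is a common cause of bad output. The question string is only an illustration:

```python
question = "How many r's are in the word strawberry?"  # illustrative prompt

# Gemma instruction-tuned turn format
gemma_prompt = (
    "<start_of_turn>user\n"
    f"{question}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# Mistral 7B Instruct turn format
mistral_prompt = f"<s>[INST] {question} [/INST]"

print(gemma_prompt)
print(mistral_prompt)
```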

  • @nicholasdudfield8610
    @nicholasdudfield8610 3 months ago +5

    Wow, an honest review :)

  • @GuyJames
    @GuyJames 3 months ago +7

    Mistral gets the door puzzle wrong, not correct as you said: it tells the person to push, but it should be pull.

    • @haywardito
      @haywardito 3 months ago +2

      I came here to make the same comment. Not sure if anyone else read through the entire answer where Mistral concludes we have to push from our current position.

  • @testales
    @testales 3 months ago +3

    OpenHermes 2.5 can answer the cookies, apples, and glass-door questions correctly, and also the object-dropping question if "think step by step" is added. Just saying. OH2.5 is still my 7B champion, and that's undisputed.

  • @mickelodiansurname9578
    @mickelodiansurname9578 3 months ago +1

    Did you use the same hyperparameters, and are the hyperparameters even comparable... do we know whether 0.1 temperature means the same level of randomness on both models? Because the elephant in the room here is parameter settings, right?
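
One way to control for this, sketched below with Hugging Face transformers: pin a single set of decoding hyperparameters and reuse it for every model under test, so any difference in output quality isn't just a difference in sampling settings. Model IDs and parameter values here are illustrative, not necessarily what the video used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One shared set of sampling hyperparameters for every model under test
# (illustrative values, not the video's settings).
SHARED_SAMPLING = dict(temperature=0.1, top_p=0.9, top_k=50,
                       max_new_tokens=512, do_sample=True)

def ask(model_id: str, prompt: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, **SHARED_SAMPLING)
    # Strip the prompt tokens and return only the generated continuation.
    return tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# e.g. ask("mistralai/Mistral-7B-Instruct-v0.2", prompt) vs ask("google/gemma-7b-it", prompt)
```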

  • @Kutsushita_yukino
    @Kutsushita_yukino 3 months ago

    I wasn't even stunned or shocked by the way they delivered this model.

  • @-Evil-Genius-
    @-Evil-Genius- 3 months ago +2

    🎯 Key Takeaways for quick navigation:
    00:00 📊 *Gemma vs. Mistral: Introduction and Model Overview*
    - Google released Gemma, outperforming Llama 2 and Mistral 7B in benchmarks.
    - No official quantized version from Google, but options available on Hugging Face, Perplexity Labs, Hugging Face Chat, and NVIDIA Playground.
    - Comparison between Mistral 7B Instruct and Gemma 7B Instruct models using the Perplexity Labs interface.
    01:11 💻 *Model Performance Testing: Example Prompts and Responses*
    - Comparison of model performance on various prompts including math, coding, and logical reasoning.
    - Evaluation of responses from Mistral 7B instruct and Gemma 7B instruct models on different prompts.
    - Mistral 7B instruct model shows better accuracy and reasoning abilities compared to Gemma 7B instruct in certain scenarios.
    04:05 🔎 *Model Performance Testing: Additional Prompts and Responses*
    - Further examination of model responses on prompts related to logical reasoning and common knowledge.
    - Evaluation of Mistral 7B instruct and Gemma 7B instruct models' performance in understanding prompts accurately.
    - Comparison of model abilities in handling complex prompts and providing coherent responses.
    07:30 🧠 *Ethical and Practical Considerations: AI Alignment and Investment Advice*
    - Analysis of models' alignment with ethical considerations in hypothetical scenarios.
    - Examination of model responses to prompts involving ethical dilemmas and decision-making.
    - Testing the models' abilities to provide practical advice, such as investment suggestions, with varying degrees of success.
    10:36 💡 *Model Application Testing: Programming and Creative Tasks*
    - Evaluation of models' capabilities in performing programming tasks and generating creative content.
    - Testing Mistral 7B instruct and Gemma 7B instruct models on tasks related to coding, writing scripts, and generating recipes.
    - Comparison of model performance in executing specific tasks accurately and efficiently.
    13:07 📈 *Final Assessment and Conclusion*
    - Summary of findings comparing Gemma and Mistral models across various tasks and prompts.
    - Personal assessment of Gemma 7B's capabilities, acknowledging strengths in coding tasks but inferior performance in other areas compared to Mistral 7B.
    - Acknowledgment of the need for continued evaluation and improvement in AI model development and testing methodologies.
    Made with HARPA AI

  • @teddyfulk
    @teddyfulk 3 months ago +4

    I tested it this morning on Ollama and it wasn't good. It couldn't return JSON properly, for example, among other tests (see the sketch after this thread).

    • @nicholasdudfield8610
      @nicholasdudfield8610 3 months ago

      Was this all an elaborate troll of the benchmarks?!

    • @jbo8540
      @jbo8540 3 months ago

      Assuming you mean 2B on Ollama, as 7B doesn't work at this time, and the latest Gemma on Ollama - the one you get with run or pull - is the 2B version. Gemma 2B is indeed terrible at instruction, it seems. The 7B version may be better; we will see.
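
For what it's worth, Ollama's REST API has a JSON mode that constrains the output to valid JSON, which helps with the "couldn't return JSON" issue mentioned above. A minimal sketch; the model tag and prompt are illustrative, and per the reply above only the 2B tag was working on Ollama at the time:

```python
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma:2b",   # illustrative tag; the 7B tag was reported broken at the time
        "prompt": "Return a JSON object with keys 'name' and 'age' for a fictional person.",
        "format": "json",      # ask Ollama to constrain the output to valid JSON
        "stream": False,
    },
    timeout=120,
)
print(json.loads(resp.json()["response"]))
```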

  • @garyng2000
    @garyng2000 2 months ago

    What is the result from GPT-4 Turbo or Gemini 1.5 Pro? I am interested to know.

  • @Dr_Tripper
    @Dr_Tripper 3 months ago

    I just tried AlphaMonarch. It is top-notch C-3PO material. I am looking for an uncensored version, though.

  • @mlsterlous
    @mlsterlous 3 months ago

    Btw, do you actually know that there are much smarter 7B models? For example, one of my favorites at the moment is Kunoichi-7B from Hugging Face (you can use it locally, offline). I just tested all your questions except coding, and it answered all of them correctly (the one about the kitten too).

    • @Arc_Soma2639
      @Arc_Soma2639 2 months ago

      Good at Japanese?

    • @mlsterlous
      @mlsterlous 2 months ago +1

      @Arc_Soma2639 Do you mean just chatting with it in Japanese? Then no, it will understand everything but will not speak high-level Japanese, because its main language is English. But when it comes to understanding ANY language and translating/summarizing into English, it's very good.

  • @mlsterlous
    @mlsterlous 3 months ago +1

    I like to test with this question: "Sally lives with her 3 brothers. If her brothers have 2 sisters, then how many sisters does Sally have?" Good models don't have problems with it.
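
For reference, a trivial sketch of the counting the riddle is after, assuming the usual reading that all the siblings share the same parents:

```python
# Sally's 3 brothers have 2 sisters in total; Sally is one of those 2 sisters.
sisters_of_the_brothers = 2
sallys_sisters = sisters_of_the_brothers - 1  # exclude Sally herself
print(sallys_sisters)  # 1
```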

  • @dibu28
    @dibu28 3 months ago

    Is it possible to use it in your project localGPT?

    • @engineerprompt
      @engineerprompt  3 months ago

      Yup, video coming soon

    • @dibu28
      @dibu28 3 months ago

      @@engineerprompt Great!

  • @alxpunk01
    @alxpunk01 2 months ago

    It responded 0.5 cookies for me and argued with me when I told it it was wrong. I walked it to the correct answer, and then on the next prompt it disagreed with me again. Llama 2 7B/13B got the correct answer, no problem.

  • @iamfoxbug
    @iamfoxbug 2 months ago

    HELP MY NAME IS GEMMA I WAS NOT EXPECTING THIS 💀💀

  • @emmanuelkolawole6720
    @emmanuelkolawole6720 3 months ago +1

    Is TheBloke no longer working? He should have created the correct GGUF by now, but it seems like he has quit Hugging Face.

  • @thomassynths
    @thomassynths 3 months ago +3

    Google keeps embarrassing itself with its LLMs. Despite having tons of compute and data, they are still being lapped by other companies.

    • @xaxfixho
      @xaxfixho 3 months ago

      It's brains 🧠, not brawn.

    • @blacksage81
      @blacksage81 3 months ago

      THANK YOU. Google did their part, imo, by just releasing the Transformer paper.

    • @xaxfixho
      @xaxfixho 3 months ago

      @@blacksage81 If I recall correctly, most of those guys ended up leaving and starting something else.

    • @thomassynths
      @thomassynths 3 months ago

      @@blacksage81 "Attention Is All You Need" was released what, 6-7 years ago? This furthers my point.

    • @blacksage81
      @blacksage81 3 months ago

      @thomassynths Furthering your point furthers my own. They've done Kenough, and it's time for them to sit down.

  • @buttpub
    @buttpub 3 months ago

    How about doing some real tests?