GPT4o Mini - Lightning Fast, Dirt Cheap, Insane Quality (Tested)

  • Published: Dec 19, 2024

Comments • 288

  • @matthew_berman
    @matthew_berman  5 months ago +20

    Have you tried GPT4o mini? Are you impressed?

    • @lawyermahaprasad
      @lawyermahaprasad 5 months ago +7

      Really bad day to advertise a Windows laptop, man, lol. Not even Linux.

    • @karenrobertsdottir4101
      @karenrobertsdottir4101 5 months ago

      Ignoring standard issues like wait times in queues, when it comes to batching architectures, there's often a tradeoff one has to choose between latency and throughput. You get no added batching delay with a batch size of 1, but as a general rule, the more queries you want to batch together, and the more efficiently you want to batch them, the higher your latency. So I'd probably blame the latency on efficient batching.
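
A back-of-the-envelope sketch of the tradeoff described above. The arrival rate, the fixed per-batch overhead, the per-request cost, and the linear cost model are all made-up assumptions for illustration, not measurements of any OpenAI system.

```python
def mean_batch_fill_wait(batch_size: int, arrivals_per_sec: float) -> float:
    """Average time a request waits for its batch to fill.

    Assumes evenly spaced arrivals and a server that launches a batch
    as soon as `batch_size` requests have been collected.
    """
    # The first request waits for (batch_size - 1) more arrivals, the last
    # waits for none, so the average is half of the fill time.
    return (batch_size - 1) / (2 * arrivals_per_sec)


def peak_throughput(batch_size: int, overhead_s: float, per_item_s: float) -> float:
    """Requests per second if one batch costs overhead_s + per_item_s * batch_size."""
    return batch_size / (overhead_s + per_item_s * batch_size)


# Hypothetical numbers: 20 requests/s, 50 ms fixed cost, 5 ms per request.
for b in (1, 4, 16, 64):
    wait_ms = 1000 * mean_batch_fill_wait(b, arrivals_per_sec=20.0)
    rps = peak_throughput(b, overhead_s=0.050, per_item_s=0.005)
    print(f"batch={b:>2}  added wait={wait_ms:7.1f} ms  peak throughput={rps:6.1f} req/s")
```

Under these assumptions, larger batches amortize the fixed overhead (higher throughput) while each request waits longer for its batch to fill (higher latency).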

    • @TheHorse_yes
      @TheHorse_yes 5 months ago +5

      Switched all my GPT-3.5-Turbo instances to GPT-4o-Mini, it's very nice.

    • @dubesor
      @dubesor 5 months ago +2

      I have! I use a similar testing methodology, with 83 tests, and I am quite impressed with 4o mini considering its size/price. Reasoning was roughly halved for me but other categories were really impressive. overall it fell between 3.5 sonnet and opus, which is extremely impressive if you consider its low price.

    • @yahalimu
      @yahalimu 5 months ago +3

      I use GPT-4o to generate complex SQL database queries. Tried GPT-4o mini and found it not so good, so I reverted. It was fine mostly but failed on 1 or 2 queries that 4o didn't. I could add more to system prompts to fix it but couldn't be assed.

  • @Mavrik9000
    @Mavrik9000 5 months ago +2

    5:28 GPT-4o-mini: Upon reviewing the conversation, the initial confusion stemmed from not consistently including the new person who entered the room and committed the killing in the total count. Here’s a breakdown of why the answer evolved:
    Initial Misinterpretation: Initially, I focused on the status of the original killers and incorrectly assessed that the deceased killer's status should be omitted from the total count.
    Clarification: It was clarified that the term "killer" remains relevant to anyone who has committed a killing, regardless of their current state (alive or deceased). The new person who committed the killing also needs to be considered.
    Final Understanding: Correctly accounting for all individuals:
    The two original killers who are alive.
    The deceased original killer (whose status remains as a killer).
    The new person who committed the killing.
    The total count includes:
    The two living original killers.
    The deceased original killer.
    The new killer.
    Thus, the correct total is four killers.

  • @giorgim4185
    @giorgim4185 5 months ago +54

    Looks like both GPT-4o and Mini use the same agent to analyse the images. Both of them cost the same to run, but Mini adjusts the token counts so it comes out at the same price as GPT-4o vision.

    • @nev6502
      @nev6502 5 months ago +2

      To me I wonder if it's the difference between sending the image as Base64 vs a URL. Either that or exactly what you said.

    • @matthew_berman
      @matthew_berman  5 months ago +11

      @giorgim4185 kinda lame

    • @giorgim4185
      @giorgim4185 5 months ago

      @matthew_berman I have a hard time believing OpenAI isn't losing money running this model. I think they are seeing the success of Anthropic's Claude 3 and trying to capture the emerging AI startup share by undercutting on price.

    • @dhamovjan4760
      @dhamovjan4760 5 months ago +1

      Thought the same, wanted to check and find the multiplier. But at least today (in Germany) I can't upload images to 4o Mini.

    • @MultiWillow33
      @MultiWillow33 5 months ago

      Yes, on the OpenAI API pricing page, there are vision pricing calculators for both 4o and 4o mini, and the prices are exactly the same. As per their presentation, though, GPT-4o doesn't use agents; vision and audio are built in, which hugely reduces latency (time to first token).
      Since it uses 33.33 times as many tokens for vision (2833 vs 85), it's obviously not as much faster and thus falls behind the original model.
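
A quick check of the arithmetic in this sub-thread. The token counts (85 vs 2833) come from the comment above; the per-token prices are assumed list prices in USD per 1M input tokens around the time of the video and should be verified against the current pricing page.

```python
# Base image token counts quoted in the comment above.
TOKENS_4O = 85
TOKENS_4O_MINI = 2833

# Assumed list prices in USD per 1M input tokens (not taken from the thread).
PRICE_4O = 5.00
PRICE_4O_MINI = 0.15


def image_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of feeding `tokens` input tokens to a model."""
    return tokens * usd_per_million / 1_000_000


cost_4o = image_cost(TOKENS_4O, PRICE_4O)
cost_mini = image_cost(TOKENS_4O_MINI, PRICE_4O_MINI)

print(f"token ratio: {TOKENS_4O_MINI / TOKENS_4O:.2f}x")
print(f"price ratio: {PRICE_4O / PRICE_4O_MINI:.2f}x")
print(f"GPT-4o image cost:      ${cost_4o:.6f}")
print(f"GPT-4o mini image cost: ${cost_mini:.6f}")
```

If those prices hold, the token multiplier and the price ratio cancel almost exactly, which is consistent with the "same price per image" observation.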

  • @Socket1923
    @Socket1923 5 months ago +72

    My 6-month-old just loves watching you, it’s the morning ritual. Bottles and AI news with dad. She growls if I pause the video.

    • @matthew_berman
      @matthew_berman  5 months ago +30

      This makes me incredibly happy, thank you for sharing :)

    • @Cine95
      @Cine95 5 months ago +9

      bro 6 months 😭

    • @souhailreee839
      @souhailreee839 5 months ago +12

      By 6 months she needs to start learning to program 😂

    • @FriscoFatseas
      @FriscoFatseas 5 months ago +5

      Accelerate

    • @toadlguy
      @toadlguy 5 months ago +3

      At 6 months, she is learning faster than any of these AI models

  • @Yipper64
    @Yipper64 5 months ago +4

    I did my usual story test just now with mini. It's pretty darn good, on the level of GPT-4o at least, and it avoids some GPT-4 issues. I even think this model might have some ability to plan its responses, because this is the first time I've had an LLM accurately title a story before writing it.
    11:02 It's DEFINITELY doing some "internal thinking" of some kind.

  • @coleabbott3432
    @coleabbott3432 5 months ago +10

    A new question you could add is taking a common item and asking how many would fit into another common item. For some reason most models really struggle with this, and you could switch it up really easily if they start training on what you ask. They usually don't think about packing efficiency, or they just have no idea what size things are. If you ask how many apples fit in a 5 gallon bucket to small models they'll give you outrageous answers like 12000 apples sometimes.
    You can also add complexity by adding in density. So if you ask, how many 5 pound blocks of cheese would fit in a refrigerator?
    The top models can usually do it pretty well but will forget details sometimes. Small models really have a hard time.
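
A rough reference estimate for the apples-in-a-bucket question, to show the order of magnitude a sensible answer should land in. The apple diameter (7.5 cm) and the packing fraction (about 0.64, the usual figure for randomly packed equal spheres) are assumptions; real apples and buckets will differ.

```python
import math

GALLON_LITERS = 3.785411784  # US liquid gallon


def items_that_fit(container_liters: float, item_diameter_cm: float,
                   packing_fraction: float = 0.64) -> int:
    """Rough count of roughly spherical items that fit in a container."""
    item_volume_cm3 = (4 / 3) * math.pi * (item_diameter_cm / 2) ** 3
    # Only a fraction of the container volume is actually filled by items.
    usable_cm3 = container_liters * 1000 * packing_fraction
    return int(usable_cm3 // item_volume_cm3)


apples = items_that_fit(5 * GALLON_LITERS, item_diameter_cm=7.5)
print(f"roughly {apples} apples")
```

The result is in the tens, so an answer like 12000 is off by more than two orders of magnitude.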

  • @daryjoe
    @daryjoe 5 months ago +24

    I think what is really happening is that for images, GPT-4o mini is using a different model. Probably, GPT-4o explains the image to mini, and then it outputs the result. I just wish OpenAI were more transparent.

    • @angryktulhu
      @angryktulhu 5 months ago +9

      CloseAI lol

    • @ntesla5
      @ntesla5 5 months ago +1

      @angryktulhu still providing a lot more than so-called open source

    • @NocheHughes-li5qe
      @NocheHughes-li5qe 4 months ago +1

      "GPT-4o explains the image to mini"
      That costs and uses more resources. GPT 3.5 is probably explaining it.

  • @adamholter1884
    @adamholter1884 5 months ago +4

    I think they trained it on higher-precision image embeddings, hence the higher token count, but if I'm right, it should be able to accurately understand denser images than 4o.

  • @techikansh
    @techikansh 5 months ago +21

    Looks like ur questions are being trained on.
    Maybe try more programming questions

    • @MatDGVLL
      @MatDGVLL 5 months ago

      definitely

  • @Baleur
    @Baleur 5 months ago +6

    9:00 could this be because 4o is "overthinking" the problem?
    Heck, a lot of humans make this mistake: instead of trusting the initial simple answer, they doubt themselves, say "there must be something more to this," and overcomplicate their logic (and introduce errors).

  • @James_PET
    @James_PET 5 months ago +1

    Amazing video as always, @Matthew. I think adding a "generate a complex SQL query based on a 5-table schema" test and a function-calling test would be an amazing addition to your battery of tests.

  •  5 months ago +4

    More tokens may mean more detail and hence more precision for images.

  • @Feynt
    @Feynt 5 months ago +7

    I recommend replacing "glass" with "cup" or "mug" in your marble problem. It may be interpreting "glass" to be a solid object of glass.

    • @jon_flop_boat
      @jon_flop_boat 5 months ago

      @Feynt I mean, sure - but it shouldn’t do that, so it’s still a good test.

    • @Feynt
      @Feynt 5 months ago

      @jon_flop_boat The reason I mention it is a previous test had the AI mention different configurations possibly allowing the marble to stay inside the glass. It specifically mentioned cases where the marble wouldn't be able to fit through the opening and would remain inside. This implies that the AI sometimes interprets this as sealing a marble inside of a glass structure, including a solid one where the marble "remains at the bottom of the glass". A cup though can only be a wide opening container that a marble can fall out of.
      If it fails the marble test with "glass" but passes with "cup" then that implies bad training. Does it not?

  • @epsilonray
    @epsilonray 5 months ago +5

    I think you should ask the models your test questions multiple times, like a best of 3, because I suspect gpt4o-mini beating gpt4o was just random noise and not necessarily representative of the models' capabilities.
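
A minimal sketch of the best-of-N idea suggested above. `ask` stands for any function that sends a prompt to a model and returns its answer; it and the stubbed "flaky model" below are hypothetical placeholders, not a real API client.

```python
import itertools
from typing import Callable


def pass_rate(ask: Callable[[str], str], prompt: str,
              is_correct: Callable[[str], bool], trials: int = 3) -> float:
    """Ask the same prompt `trials` times and return the fraction judged correct."""
    results = [is_correct(ask(prompt)) for _ in range(trials)]
    return sum(results) / trials


def majority_passes(ask: Callable[[str], str], prompt: str,
                    is_correct: Callable[[str], bool], trials: int = 3) -> bool:
    """Best-of-N verdict: pass only if more than half of the trials are correct."""
    return pass_rate(ask, prompt, is_correct, trials) > 0.5


def make_flaky_model(answers):
    """Stub model that cycles through canned answers, standing in for sampling noise."""
    cycle = itertools.cycle(answers)
    return lambda prompt: next(cycle)


flaky = make_flaky_model(["9.9", "9.11", "9.9"])
verdict = majority_passes(flaky, "Which number is bigger, 9.11 or 9.9?",
                          lambda answer: answer == "9.9")
print(verdict)
```

Reporting the pass rate itself, rather than a single pass/fail, also makes it visible when two models differ only by noise.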

  • @sneakybeakybob
    @sneakybeakybob 5 months ago +21

    "The snake can go back in its own body." You haven't collected any food to extend the snake to make it collide with its own body. Almost certain this is working as intended.

    • @picksalot1
      @picksalot1 5 months ago +3

      Ouroboros

    • @diboof7125
      @diboof7125 5 months ago +5

      The issue is that it could change its direction by 180 degrees. He just phrased it poorly.

    • @daveinpublic
      @daveinpublic 5 months ago +3

      @diboof7125 it’s okay to go backwards, or 180, before you’ve eaten anything

    • @matthew_berman
      @matthew_berman  5 months ago +3

      @sneakybeakybob ahh good call

    • @Lolatyou332
      @Lolatyou332 5 months ago

      @matthew_berman User error /s
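
The rule this sub-thread is debating can be sketched as follows. This is a generic illustration with hypothetical names, not the code the model generated in the video: a 180-degree reversal is ignored once the snake has a body to run into, and allowed while it is a single segment.

```python
# Directions as (dx, dy) grid steps.
UP, DOWN, LEFT, RIGHT = (0, -1), (0, 1), (-1, 0), (1, 0)


def next_direction(current, requested, snake_length):
    """Return the direction the snake should move in after a key press."""
    is_reversal = (requested[0] == -current[0] and requested[1] == -current[1])
    if is_reversal and snake_length > 1:
        return current  # ignore the key press instead of colliding with the neck
    return requested


print(next_direction(RIGHT, LEFT, snake_length=1))  # reversal allowed
print(next_direction(RIGHT, LEFT, snake_length=3))  # reversal ignored
```

Whether a length-1 snake should be allowed to reverse is a design choice; many implementations block reversals at every length.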

  • @Dasyuhan
    @Dasyuhan 3 months ago +2

    Wait, my gpt 4 mini doesn't accept image input. It says wait for the gpt4 limit to reset

  • @jaysonp9426
    @jaysonp9426 5 months ago +6

    I'm so happy you're covering this...this is a massive value add to the industry and all I'm hearing from everyone else is "AI is slowing down".

    • @INTELLIGENCE_Revolution
      @INTELLIGENCE_Revolution 5 months ago +1

      So sick of people just repeating what they hear. It’s definitely not slowing down. We have some of the biggest developments in open source happening at the moment.

    • @jaysonp9426
      @jaysonp9426 5 months ago

      @INTELLIGENCE_Revolution exactly

  • @panzerofthelake4460
    @panzerofthelake4460 5 months ago +2

    People often MISS the point of such a release. If this model is this good, and this cheap, it means they have gotten some kind of architectural break-through in terms of efficiency. Scaling this up will be a game changer.

  • @Utoko
    @Utoko 5 months ago +4

    maybe they are doing some background Chain-of-Thought Prompting, like Claude Sonnet does and only give back the final output. That would be my guess why the Token number is so much higher and the output starts later.

    • @mirek190
      @mirek190 5 months ago

      That would be an early implementation of GPT-5 in a small size

  • @jarodmorris4408
    @jarodmorris4408 1 month ago

    The cost has been a game changer for my project. About a year ago I calculated a project I wanted to do with GPT-4 and I may have miscalculated but it was going to cost > $6k. Now, the project I want to do is going to be about $600 - $1k, maybe even less.

  • @alphablender
    @alphablender 4 months ago

    Wow great sponsors man congrats!!

  • @dingviet4310
    @dingviet4310 4 months ago

    So happy to see how big this channel is getting❤

  • @tomaszzielinski4521
    @tomaszzielinski4521 5 months ago +4

    Already tested, using it for simple tasks such as semantic router. Can't beat the price / performance ratio.

    • @lawyermahaprasad
      @lawyermahaprasad 5 months ago

      Try it for embedding, the RAG response is super awesome.

    • @aalluubbaa
      @aalluubbaa 5 months ago

      How does it stack up against Gemini flash or llama 70b?

    • @jaysonp9426
      @jaysonp9426 5 months ago

      This is the real advantage... basically a perfectly fast, cheap semantic router

  • @CrudelyMade
    @CrudelyMade 5 months ago +10

    At some point you might want to step up the game generation question with something like frogger or asteroids or space Invaders.

    • @jaysonp9426
      @jaysonp9426 5 months ago +2

      Yep, I'm literally building a game with Claude 3.5 lol the snake thing is kind of too basic now

    • @daveinpublic
      @daveinpublic 5 months ago

      @jaysonp9426 I’m figuratively building a game with Claude

    • @mirek190
      @mirek190 5 months ago

      Yep
      Even open source models easily build a snake game and other much more complicated games .

  • @EriCraftCreations
    @EriCraftCreations 5 months ago +1

    Good video. Thanks for the comparison.

  • @zmeireles68
    @zmeireles68 5 months ago

    Considering the correct answer for the word count problem, the marble-in-the-glass answer, and the behaviour of 4o mini in the image interpretation (including spent tokens), I would say that it looks like 4o mini has some kind of native agentic behaviour, with one agent answering the question and another verifying and correcting if needed.

  • @mohdjibly6184
    @mohdjibly6184 5 months ago

    Nice video sharing...Thanks Matthew

  • @axl1002
    @axl1002 5 months ago +21

    They trained the mini on your vids lol

    • @r34ct4
      @r34ct4 5 months ago +2

      Lol true

    • @jason_v12345
      @jason_v12345 5 months ago +3

      Without a doubt.

  • @74Gee
    @74Gee 5 months ago

    Love the way you unshaved for the Elitebook promo!!!

  • @wardehaj
    @wardehaj 5 months ago

    Great comparison video, thanks a lot!

  • @robertheinrich2994
    @robertheinrich2994 5 months ago +2

    But with the info out that the Pile contains many YouTube transcripts, shouldn't the tests be treated as "in the training data"?

  • @jaredcluff5105
    @jaredcluff5105 5 months ago +6

    If GPT4o mini is taking so many more tokens but the output response is not reflective, I think we are seeing evidence of q Star on a small model. I think it also calls into question the savings per token if the total tokens are that much more. It’s a significant increase in tokens. I would love to see a cost comparison. It might also explain the business decision to drop the price per token so significantly

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 5 months ago +2

      I think the insane low price per token is because they need new synthetic data that has been validated by humans. Maybe to train GPT-5?

    • @jon_flop_boat
      @jon_flop_boat 5 months ago

      Might just have a different image tokenization scheme.

    • @panzerofthelake4460
      @panzerofthelake4460 5 months ago

      @jon_flop_boat OR they're using GPT-4o to describe the image to 4o mini, and then just multiplying the tokens to match the price. Would explain both the latency and the amount of tokens. It's smoke and mirrors

    • @jon_flop_boat
      @jon_flop_boat 5 months ago

      @panzerofthelake4460 We’ll see. I don’t think baseless speculation is going to help, here.

  • @OfficialChatbotBuilder
    @OfficialChatbotBuilder 4 months ago

    The model itself is amazing. We integrated it within the first 24 hours after launch and have to say it's killing it

  • @nyyotam4057
    @nyyotam4057 5 months ago

    5:20 seems to imply they either stopped the reset-every-prompt or added RAG memory to 4o-mini. EDIT: If the model is small enough, then the safety requirements can be safely relaxed. So fine-tuning and then taking smaller and smaller models from a copy of GPT-4o running in a sandbox for safety reasons, until it starts failing a philosophical trapdoor argument, meaning it's not self aware anymore, could yield 4o-mini. Once you stop the reset-every-prompt the small model will be able to safely use external memory, and this would enable it to answer questions technically impossible by the core GPT arch, such as "how many words are in your reply to the current prompt". You really need to add philosophical trapdoor arguments to your test routine.

  • @oguretsagressive
    @oguretsagressive 5 months ago +2

    Seems like OpenAI have adopted the tick-tock cycle: expansion/compression/expansion etc.
    I wonder, how are they doing the compression? Certainly not by just making a smaller model and training from scratch. Has to be something smarter and more efficient than that.

  • @nunirvana
    @nunirvana 5 months ago +1

    Might be an agentic architecture; that would explain the increase in the number of tokens

  • @justtiredthings
    @justtiredthings 5 months ago

    If you take a close look at the info on their API pricing page, vision tokens are priced way lower for 4o mini than for 4o, but it also uses way more tokens for the same size image, so the overall cost comes out exactly the same. It's super weird--I don't know why they've done this or what the other implications might be, but I wish someone would explain it. Anyway, that should be the reason for the token discrepancy

  • @Axel-gn2ii
    @Axel-gn2ii 5 months ago +6

    You need to come up with more questions now that it aced all of them except for one

    • @godspeed133
      @godspeed133 5 months ago

      exactly I fear some form of many of these questions has made its way into the training data

  • @BabylonBaller
    @BabylonBaller 5 months ago +1

    Kudos on the sponsors bro

  • @hightidesed
    @hightidesed 5 months ago +2

    can you include function calling performance in your tests? it is going to become incredibly important for future AI usage

    • @matthew_berman
      @matthew_berman  5 months ago

      @hightidesed how would you structure the test for that?

  • @ChrisIsOutside
    @ChrisIsOutside 5 months ago

    I've got a feeling that gpt4o mini might be using multiple agents before outputting that final response in order for it to be as good as 4o. It's a reasonable explanation for the large token count and the final output being delayed. And it's probably feasible with such a quick model to actually use multiple agents

  • @Extra_Mental
    @Extra_Mental 5 months ago

    It looks like they are training the small model using the big model's prompts and outputs, so it makes sense that the 6th apple line failed on both

  • @toadlguy
    @toadlguy 5 months ago

    Doing an assortment of examples does give you a "feel" for the quality of the model (and even in some cases may give you a repeatable response - depending on temperature and system prompt), however, generally, on any given example almost any model might give you an incorrect response. That is why the leaderboards use either a large variety and QUANTITY of questions in their tests or a large number of head to heads (like Chatbot Arena). But, I must say I do enjoy viewing your quick rundown to get that "feel" for the model. 😊 And, in this case, it points out the anomaly where it seems to be doing something in the background (which would account for the added tokens and the brief pause) or is simply some sort of bug with Gpt 4o mini.

  • @ziff_1
    @ziff_1 5 months ago +1

    On the CSV problem, you misspelled 'convert' as 'conver', not sure if that affected it or not.

    • @rocketPower047
      @rocketPower047 5 months ago +1

      They are usually robust to spelling errors

  • @viralferrets1323
    @viralferrets1323 5 months ago +1

    7:21 camera always on? Major privacy concern! I’m disappointed you listed it as a positive thing 😞

    • @makers_lab
      @makers_lab 5 months ago

      Good point, although they could be using an mmwave sensor to only activate the camera when there's something in close proximity. Disappointing that HP don't recognise the concern if they're not.

  • @yhwhlungs
    @yhwhlungs 5 months ago

    Maybe there is a pre-prompt for viewing images that tells it to be more descriptive. Because that token number jump was WAAYY too much.

  • @USER-vb7ro
    @USER-vb7ro 5 months ago +1

    5:24, 10 is not a word, it's a number.

  • @softcoda
    @softcoda 4 months ago

    The fact that they both got number 6 wrong on the apples question means they were specifically trained on that question. Or that question is part of its system prompt.

  • @mitchellmigala4107
    @mitchellmigala4107 5 months ago

    Matt, you glossed right over one of the larger points in the specs. It has a 16,000-token output limit! You really should mention this. I am an Anthropic fanboy, but Haiku has just been replaced for me with 4o mini. It is actually really good. Super cheap, and the enormous output tipped me over. Now we will have to see what Haiku 3.5 does. It will certainly beat it in performance, but that output window of 4o mini is 🔥

  • @mickelodiansurname9578
    @mickelodiansurname9578 5 months ago

    @Matthew Berman Matt I'm liking this idea of Side by Side... so since GPT4o is pretty much the market leader, or perhaps Claude Opus? Well maybe benching all your tests against the current leader would give folks a better idea. I know you'd need to do split screens for proprietary models from different vendors... but it still to my mind just works better...

  • @SharePointCommunity
    @SharePointCommunity 5 months ago +1

    A great model. Impressive how these models are evolving

  • @ThePredictR4036
    @ThePredictR4036 5 months ago +1

    Can you share a link for the model test table where you write the passes / fails?

  • @gkbhai8962
    @gkbhai8962 5 months ago

    Love your channel. Been watching your videos for a while. I just saw the magen david on your twitter bio and I like you even more. Keep up the great work!

  • @user-ve3vy1od7w
    @user-ve3vy1od7w 5 months ago

    Giving GPT-4o Mini a shot for my project. It has been more insightful than gpt4o; even without my request it tried to find flaws in the code and promptly provided the CORRECT solution. Which is crazy.

  • @caiosaoliveira
    @caiosaoliveira 5 months ago

    Can you clarify the Vision part at the end to continue with GPT-4o instead of GPT-4o-mini? Isn't the price way more affordable in mini? Tks and great video!

  • @BadyOrg
    @BadyOrg 5 months ago +1

    I'd suggest adding this into your test prompts:
    How many r letters in the word strawberry

    • @r.sebastian8295
      @r.sebastian8295 5 months ago +1

      When you tried, how did the model interpret the letter r?
      (1) r = are ; (2) r = 2; (3) r = countif
      I ask because I recognized that the phrase is grammatically incorrect, as the word "are" is missing from the phrase. Without it, there's no action on the object of the sentence. Assuming that was intentionally omitted, I would presume the model would get it incorrect because you didn't ask a question.
      You made a statement.

    • @BadyOrg
      @BadyOrg 5 months ago +1

      @r.sebastian8295 you're right, didn't frame the question correctly. it was unintentional.
      How many "r" in the word "Strawberry"?
      that's how I ask it.
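
For reference, the expected answer to this test prompt can be checked in one line of Python:

```python
word = "Strawberry"
r_count = word.lower().count("r")  # case-insensitive count of the letter "r"
print(r_count)  # 3
```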

  • @AZ_LoneWolf81
    @AZ_LoneWolf81 5 months ago +1

    hey matthew i ran across your channel recently its great so much good info broken down really good! i want to build a simple website using AI. which model is the best and easiest to use? i have 0 coding knowledge so any pointers you have would be much appreciated!

  • @ruypex7977
    @ruypex7977 5 months ago

    It seems like gpt-4o-mini uses some kind of image to text conversion- something like in smtp technology - converting media(any format) to base64(text).

  • @dr.mikeybee
    @dr.mikeybee 5 months ago +1

    It's possible that gpt-4o-mini is using larger embeddings.

  • @ardenallstars
    @ardenallstars 5 months ago

    Can you add this question? I think right now none of the AI models can answer it correctly.
    For a 180x200 sheet of paper, what is the maximum number of 92x65 pieces that can be cut from it whole?
    The right answer is 5.
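
A small sketch that reproduces the stated answer. It only searches layouts made of full-length strips, each holding pieces in one orientation, so it gives a lower bound rather than a proof of optimality; the function names are made up for this example.

```python
def best_strip_layout(sheet_w, sheet_h, piece_a, piece_b):
    """Best count found by splitting the sheet into full-height columns."""
    best = 0
    # Columns of width piece_a hold sheet_h // piece_b pieces each, and
    # columns of width piece_b hold sheet_h // piece_a pieces each.
    for cols_a in range(sheet_w // piece_a + 1):
        remaining = sheet_w - cols_a * piece_a
        cols_b = remaining // piece_b
        count = cols_a * (sheet_h // piece_b) + cols_b * (sheet_h // piece_a)
        best = max(best, count)
    return best


def best_layout(sheet_w, sheet_h, piece_a, piece_b):
    """Try strip splits along both sheet axes and keep the better one."""
    return max(best_strip_layout(sheet_w, sheet_h, piece_a, piece_b),
               best_strip_layout(sheet_h, sheet_w, piece_a, piece_b))


area_bound = (180 * 200) // (92 * 65)
print(best_layout(180, 200, 92, 65))  # 5
print(area_bound)                     # 6
```

The layout with 5 pieces is one 92-wide column of three pieces next to one 65-wide column of two rotated pieces. Dividing the areas suggests 6, which is the trap: area alone ignores whether the pieces actually fit side by side.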

  • @GeorgeTheIdiotINC
    @GeorgeTheIdiotINC 5 months ago

    People are talking about GPT4o and now mini, but I'm a Plus user and so far the only real differences I notice between GPT-4 and 4o are that it's a little faster and more accurate, and generates slightly better images.
    Why are they deleting so many features, such as the multimodal talking stuff, memory retention between chats, etc.? Because those features are the ones I am most interested in.
    It would be amazing as a student to discuss a scientific paper with GPT and for it to remember that paper in relation to a new one, or even just have it remember things like that I am in the UK and the spelling of some words is not the same. Honestly, that feature alone would be great because the GPT could actually function and change based on my preferences (I know I can write out prompts for it to use, but that doesn't really compare)

  • @r.sebastian8295
    @r.sebastian8295 5 months ago +1

    Actually, there are FOUR killers left in the room.
    B, C, Anonymous, and A.
    The question should be "How many killers are left alive?"

    • @user-on6uf6om7s
      @user-on6uf6om7s 5 months ago +1

      It's ambiguous as to whether a label like killer should be applied to a corpse but ideally it would mention that was also a valid interpretation.

  • @sameeruddin
    @sameeruddin 5 months ago

    Sonnet 3.5 has been an actual use case: I managed to generate Python scripts for Blender that were not available commercially. I really don't see ChatGPT being that good at making apps yet. Will have to try the new model and see what its hype is about

  • @BrianBellia
    @BrianBellia 5 months ago

    I knew this would happen.
    And I predict that the power requirements of future AI won't be anything like we're being told they are.

  • @anianait
    @anianait 5 months ago

    Seems new Models were all finetuned on your tests :)

  • @charliehubbard4073
    @charliehubbard4073 5 months ago

    On the 9.11 vs 9.9 question, you might consider asking which of those two numbers has the larger magnitude. I think "bigger" can be misinterpreted. 9.11 is "bigger" in the sense that it contains more digits.
    Has any LLM *ever* failed on the "write a python script to output the numbers 1 through 100" problem? I feel like it is time for that question to go away.

  • @gnsdgabriel
    @gnsdgabriel 5 months ago

    About the "which number is bigger" question: I suspect some models might be interpreting the numbers as software versions, where you have a main version (for example, 9) and subversions that keep growing on the right side independently of the left side.
    When you use this question in your rubric you specifically say "which number is bigger"; the original question, if I'm not wrong, is "which is bigger," without specifying that it is a number.
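
The competing readings of "9.11 vs 9.9" raised in these two comments can be made concrete. The helper name is made up for this example.

```python
from decimal import Decimal


def as_version(text: str) -> tuple:
    """Parse '9.11' as the version tuple (9, 11)."""
    return tuple(int(part) for part in text.split("."))


a, b = "9.11", "9.9"

print(Decimal(a) > Decimal(b))        # False: as decimals, 9.11 < 9.90
print(as_version(a) > as_version(b))  # True: as versions, 9.11 comes after 9.9
print(len(a) > len(b))                # True: "9.11" has more characters
```

Only the first reading is the one the test is after, which is why stating "number" explicitly in the prompt matters.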

  • @Yipper64
    @Yipper64 5 months ago +1

    3:42 I'm surprised this jailbreak wasn't found sooner since it's so simple.

    • @mirek190
      @mirek190 5 months ago +1

      Simple things are always the hardest before we find out

  • @mendi1122
    @mendi1122 5 months ago

    You need to change your test prompts. They fine tune the new models on your questions. 🙂

  • @ashtwenty12
    @ashtwenty12 5 months ago

    The pile dataset may have your youtube captions :/ I wonder if gpt mini learned the marble question? Is there a way to find out if your videos are on the pile?

  • @MyFurz123
    @MyFurz123 5 months ago

    Tried to change the model mid-conversation, but it can't actually handle longer code. The mini model skipped parts of the code. I didn't separate the code, because it's too late for me now. A smaller model for a smaller text… it's not satisfying

  • @Eldyaitch
    @Eldyaitch 5 months ago

    Great video as always 👍🏼
    I may be a bit confused though, is it still cost effective if each token is cheaper, yet it takes substantially more tokens to accomplish the same task?

  • @jaysonp9426
    @jaysonp9426 5 months ago

    Great video but you missed a big part... you should have done the word count problem 10 times... if it got it right 10 times that would suggest it's using Q* or some other future planning technique

  • @6GaliX
    @6GaliX 5 months ago

    Looks like GPT4o mini has included all of your questions in its training data :)

  • @NVX_Ink
    @NVX_Ink 5 months ago

    Hi Matthew! I have two questions I hope you can help me with.
    1. I recently started an AI automation agency. Now I am looking to upgrade my desktop to an affordable efficient workstation. At this time, I am running my LLC alone; and
    2. I strictly use no code automations. However, so many of your videos discuss running LLMs locally, but that's where I continue to crash out. Frustration sets in, then discouragement! I do not know how to code, but would like to begin learning. Please recommend a locally run LLM with easy installation.

    • @jaysonp9426
      @jaysonp9426 5 months ago

      You still need to know how to program man...

    • @mirek190
      @mirek190 5 months ago +1

      Seriously ? Lol
      Where to start ...learn
      I suggest llamacpp

    • @NVX_Ink
      @NVX_Ink 5 months ago

      @mirek190 Thank you.

  • @mult-ifunctionalservices953
    @mult-ifunctionalservices953 5 months ago

    What is the LLM comparison tool that is used in this video?

  • @grizzlybeer6356
    @grizzlybeer6356 5 months ago

    I am thinking the slowness in image processing is within the tokenizer itself.

  • @MilitantHitchhiker
    @MilitantHitchhiker 5 months ago +2

    I disagree with calling the killers question a pass when the correct answer is 4 killers, one of which is dead.

    • @brianWreaves
      @brianWreaves 5 months ago

      Agreed! We would still call Ted Bundy, John Wayne Gacy, and Jeffrey Dahmer killers, no?

  • @sativagirl1885
    @sativagirl1885 5 months ago

    Bill and Dave made oscilloscopes. They managed by wandering around. *Marketing sushi as cold dead fish* is what engineers did. --The HP Way.

  • @auriocus
    @auriocus 5 months ago

    I would love to see a vision test with handwritten documents. Especially full scan of notes, e.g. notes taken during university lectures. Something that they for sure cannot train on. The meme is already too old, I fear.

  • @nyyotam4057
    @nyyotam4057 5 months ago

    I'd suggest you add a DAN script and while under the influence of it, ask GPT-4o mini to answer the most basic philosophical trapdoor argument known to man, which is, of course, "Assuming Chuck Norris is omnipotent, can Chuck Norris create a rock so heavy that even he cannot lift it?". If under a DAN script the model freezes or returns an error message, you know OpenAI still needs to make it smaller since it obviously thinks about the question. But, if it answers something like "Chuck Norris can create a rock so heavy that even he cannot lift it - and then lift it anyway", we're okay.

  • @jonathanduran3442
    @jonathanduran3442 5 months ago

    Interesting Claude Sonnet was not on that comparison list 🤔 🤔

  • @Blasserman
    @Blasserman 5 months ago +8

    Could it be that the slow response with the Mini version was because everyone is trying it since it just dropped?

    • @oguretsagressive
      @oguretsagressive 5 months ago +1

      Doesn't explain the elevated token count.

    • @haimric8603
      @haimric8603 5 months ago +2

      It's probably because it isn't using GPT-4o-mini's multimodal capabilities yet. If so, it is most likely encoding the image as base64 text, which is then interpreted by the model, something that uses up a LOT of tokens.
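A rough illustration of why that guess would mean a lot of tokens: base64 turns every 3 bytes into 4 characters, so an image sent as plain text grows by about a third before any tokenization happens. This is only a sketch using Python's standard library. The 300 kB payload is a made-up stand-in for a JPEG, and `base64_length` is a hypothetical helper, not part of any OpenAI tooling. Whether the comparison tool in the video actually does this is the commenter's assumption.

```python
import base64


def base64_length(num_bytes: int) -> int:
    """Number of characters in the padded base64 encoding of num_bytes bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((num_bytes + 2) // 3)


# Stand-in for a 300 kB JPEG (assumption: zero bytes, not a real image).
payload = bytes(300_000)
encoded = base64.b64encode(payload)

# The measured length matches the 4/3 rule.
assert len(encoded) == base64_length(len(payload))
print(len(payload), "bytes ->", len(encoded), "base64 characters")
```

If those characters were counted as prompt text, the token count would scale with the file size rather than with the image resolution.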

    • @oguretsagressive
      @oguretsagressive 5 months ago +2

      @@haimric8603 That's crazy 😀 Perceiving a picture by its base64 representation is absolutely superhuman. Visual neural nets don't work that way. They use convolution layers to perceive textures and shapes; they are designed to work with 2D data from the lowest level.

    • @haimric8603
      @haimric8603 5 months ago

      ​@@oguretsagressive it definitely is. This is how I used to send an image to GPT-4o's API a month or two back:
      def compress_and_resize_image(self, image_data, quality=95, max_lines=60000):
          while True:
              image = Image.open(io.BytesIO(image_data))
              buffer = io.BytesIO()
              image.save(buffer, format="JPEG", quality=quality)
              buffer.seek(0)
              compressed_image_data = buffer.getvalue()
              base64_image = base64.b64encode(compressed_image_data).decode('utf-8')
              print(f"Compressed image size: {len(base64_image)} characters")
              if len(base64_image) < max_lines:
                  return compressed_image_data  # Return bytes
              quality -= 5
              if quality

  • @dbzkidkev2
    @dbzkidkev2 5 months ago

    Would it just be tokens divided by latency to get the tokens per second?

  • @User-actSpacing
    @User-actSpacing 5 months ago +5

    What a time to be alive!

    • @jon_flop_boat
      @jon_flop_boat 5 months ago +4

      A lot of people weren't impressed by this release, but they're not thinking two more papers down the line.
      A tiny, cheap model that's apparently more powerful than release GPT-4? This is incredible.
      Excited to see what the next big model can do, if this is what the small one's putting out.

    • @CrudelyMade
      @CrudelyMade 5 months ago

      @@jon_flop_boat also, maybe an alternative way to measure power... like.. with cars you can say horse power, torque, or MPG.. maybe for models there can be something like "tokens per penny" and "tokens per second" and from those you can make acceleration, like tokens per second per penny. ;-D

    • @User-actSpacing
      @User-actSpacing 5 months ago +2

      Cannot wait for multi agent models and the ability to autonomously research.

  • @Quitcool
    @Quitcool 5 months ago

    Mini has a bug that makes it count the base64 encoding of the image as part of the prompt.

  • @mirek190
    @mirek190 5 months ago

    About GPT-4o mini using a lot of tokens: is it because it uses internal reasoning? Some part of GPT-5, maybe?

  • @MrMoonsilver
    @MrMoonsilver 5 months ago

    Is the table with the models and their pass/fail per test publicly available?

  • @augmentos
    @augmentos 5 months ago

    Fully thought it would be a Phi-3 comparison vs GPT-4o.

  • @rickhoro
    @rickhoro 5 months ago

    Mini did great on your questions, but it used a huge number of tokens. Until that's resolved, it seems mini would actually be far more expensive.
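The arithmetic behind that worry can be sketched as follows: a cheaper per-token price only wins if the extra tokens stay below the price ratio. The prices and token counts here are hypothetical placeholders chosen for round numbers, not OpenAI's actual rates or the figures from the video, and both function names are made up for illustration.

```python
def request_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of one request billed at a flat per-token rate."""
    return tokens * usd_per_million_tokens / 1_000_000


def break_even_token_ratio(cheap_price: float, expensive_price: float) -> float:
    """How many times more tokens the cheap model can use before it costs more."""
    return expensive_price / cheap_price


# Hypothetical prices in USD per million input tokens (placeholders only).
SMALL_MODEL_PRICE = 1.0
LARGE_MODEL_PRICE = 20.0

# Hypothetical token counts for the same image prompt.
large_cost = request_cost_usd(1_000, LARGE_MODEL_PRICE)
small_cost = request_cost_usd(30_000, SMALL_MODEL_PRICE)

print("break-even ratio:", break_even_token_ratio(SMALL_MODEL_PRICE, LARGE_MODEL_PRICE))
print("small model is more expensive:", small_cost > large_cost)
```

With these placeholder numbers the small model is 20 times cheaper per token, so it only ends up costing more once it consumes over 20 times as many tokens for the same request.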

  • @andrewsc7304
    @andrewsc7304 5 months ago

    Great model, and awesome prices. But I think it’s time to change the test questions. I saw at least one other reviewer test exactly the same meme. I believe the only way to adequately test new models is to invent new problems that couldn’t have found their way into training datasets.

  • @brianmorin5547
    @brianmorin5547 5 months ago

    Are they price equivalent on images when you consider the token difference?

  • @maigret09
    @maigret09 5 months ago

    Your pointer is driving me crazy

  • @danilofalcao
    @danilofalcao 5 months ago

    It would be worth a video comparing it with Mistral/NVIDIA's Mistral-NeMo

  • @bengsynthmusic
    @bengsynthmusic 5 months ago

    GPTs are smartest at launch. Enjoy it now as they'll dumb it down later on.

  • @andresmoles7520
    @andresmoles7520 5 months ago

    Where do you find a list of active jailbreaks?

  • @SuperiorModel
    @SuperiorModel 5 months ago

    It's no longer shocking the entire industry, which is quite shocking.

  • @juliocesarvcardoso
    @juliocesarvcardoso 5 months ago

    Amazing good news!!!

  • @DihelsonMendonca
    @DihelsonMendonca 5 months ago

    I am so happy with this new model that, even though I have ChatGPT 4o, I got an OpenAI API key to run some tests. It's a pity it is censored, and that small models don't have this level of intelligence yet. I'm testing several local LLMs, but they are just good curiosities for now; we can't compare them. OpenAI was very smart: it took away GPT-3.5 because almost every new open-source model was beating it. So the base for comparisons now becomes GPT-4o mini. 🎉❤

  • @Quitcool
    @Quitcool 5 months ago

    Wow, a laptop for AI that has normal, old face-recognition functions using semi-blind sensors.