Yi Large: Surprisingly Great at Logic and Reasoning! (Fully Tested)

  • Published: 26 Aug 2024
  • Testing Yi Large, a frontier model flying under the radar!
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewber...
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.ne...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V

Comments • 84

  • @matthew_berman
    @matthew_berman  A month ago +7

    Have you tried Yi Large?

    • @marc_frank
      @marc_frank A month ago +3

      no

    • @Cine95
      @Cine95 A month ago

      yeah

    • @AaronALAI
      @AaronALAI A month ago

      Try the self-play models! I think the self-play models will pass your tests.

    • @AI-HOMELAB
      @AI-HOMELAB A month ago +2

      I'm not that into closed models. I only test them if they're SOTA, and this one doesn't seem to reach that level. But I love seeing your tests. =)

    • @anianait
      @anianait A month ago

      would have if you'd pasted the links in the description or a pinned comment ...

  • @veros5459
    @veros5459 A month ago +33

    Just a suggestion, but maybe have one or two "secret" questions in your LLM rubric that you reveal and swap out after a few months? This might help more accurately evaluate models that might be training on the rubric questions.

    • @matthew_berman
      @matthew_berman  A month ago +5

      @@veros5459 I like it!

    • @MoDs_3
      @MoDs_3 A month ago +1

      Lol, that's exactly why I dragged my sorry a** down here to comment... Besides your point, it would also be much more interesting for us, the audience!

    • @eternalcold14
      @eternalcold14 A month ago

      My suggestion would be a task based on conditions. For example: there are three people in a household; person A is lactose intolerant, person B does not eat meat, and person C has a regular diet. Create a meal plan for a week that includes everyone in the household but doesn't cut out a food group unless necessary.

    • @r3nd3rm4n
      @r3nd3rm4n A month ago

      Shhh they are listening
      🤖🤖🤖

  • @southcoastinventors6583
    @southcoastinventors6583 A month ago +3

    The best part about this model is that, at least on some of the questions, it actually says "I do not know" or "maybe ask me a different way." That is actual progress. You might want to try a different game than the snake game, like tic-tac-toe, checkers, or Tetris.

  • @frankjohannessen6383
    @frankjohannessen6383 A month ago +6

    Suggestion:
    "In a race between three people, what place is second to last?"
    Most models I've tested respond that it's third place.

    • @prasannakarthik7721
      @prasannakarthik7721 A month ago +1

      I tested with that question. This model gives the below reply
      "In a race between three people, the second to last place is second place. This is because there are only three positions: first (the winner), second, and third (last place). Therefore, the person who comes in second is second to last."

    • @AngeloXification
      @AngeloXification A month ago

      @@prasannakarthik7721 just tried it, same for me

  • @middleman-theory
    @middleman-theory A month ago

    Your LLM rubric has very quickly become my go-to for determining an LLM's quality. Based on Claude 3 Sonnet's performance in your rubric and your personal feedback, I gave it a shot, and I absolutely love it; I'm considering getting a subscription starting next month. I'm looking forward to the additional questions that will be added.

  • @jeanchindeko5477
    @jeanchindeko5477 A month ago +2

    I still don't understand why a model being censored is a problem! Most of the models out there are censored.
    You know well that LLMs are unable to tell you how many words are in their answer, due to the way they work. This will change only with a new architecture that lets a model first think of an answer, or generate one, then count, and then submit the response. So far LLMs just generate tokens and move forward.

  • @mbalireshawal8679
    @mbalireshawal8679 A month ago +1

    Hey Matthew, what do you think about expanding the score? Instead of only 2 states (pass or fail) you could have at least 5 states and give marks depending on how well the model performs.

  • @believablybad
    @believablybad A month ago +7

    Ye Large Model - Released 1868

  • @brianWreaves
    @brianWreaves A month ago

    Another interesting share, Matthew! 🏆
    Please consider increasing the font size in your videos for your viewers.

  • @JustinsOffGridAdventures
    @JustinsOffGridAdventures A month ago +5

    I'm sorry Matt. Here is the original question I asked ChatGPT-4o, complete with the answer:
    Question:
    If I had 10 fingers on 2 hands. Each hand has 5 fingers. In between those fingers are gaps. Can I pick up 8 beer cans and 1 bottle of beer with those 2 hands? Explain your answer and the reasoning behind the answer.
    Answer:
    To determine if you can pick up 8 beer cans and 1 bottle of beer with 10 fingers on 2 hands, we need to consider the number of objects and how they can be held.
    Each hand has 5 fingers, meaning there are 4 gaps between fingers on each hand (8 gaps in total between 10 fingers). Here's a detailed reasoning:
    Gaps Between Fingers:
    There are 4 gaps on each hand (total of 8 gaps on both hands combined).
    Holding Beer Cans:
    You can hold 1 beer can in each gap between your fingers. Thus, 8 beer cans can be held in the 8 gaps.
    Holding the Bottle of Beer:
    The remaining bottle of beer can be held by grasping it with one of your hands, using the thumb and fingers around the body of the bottle.
    Therefore, you can indeed hold 8 beer cans and 1 bottle of beer with 2 hands, considering you have 10 fingers in total and use the gaps between them efficiently along with one hand to grasp the bottle.
    Love the content, keep up the great work. Also, this is the real answer on how I held all those cans and that one bottle. I didn't want to go down the hill twice. 🤣
    Edit: I just thought of this as I re-read it. Try taking out the sentence about the gaps between the fingers in the original question, and see if it (the AI, that is) can figure out on its own that it has gaps.

    • @jonmichaelgalindo
      @jonmichaelgalindo A month ago +1

      My fingers have a horizontal (unsplayed) width of 110mm and can splay to a maximum of 220mm, leaving 110mm of total space between fingers available for grasping objects. Assuming these are thin-style cans of about 55mm, holding 4 cans in one hand would require twice the available grasping space. (I cheated and just tried. This question is nonsense.)

  • @hskdjs
    @hskdjs A month ago +1

    Actually, I managed to get LLMs to respond to the question "how many words are in your response to this prompt". Over time, LLMs start using some random number and then generate the necessary number of words to match it. Tested on Claude 3.5 Sonnet, but it doesn't work every time.

  • @unshadowlabs
    @unshadowlabs A month ago +1

    For any model that flawlessly passes the 'Give me 10 sentences that end in the word "Apple"' test, I would be curious what happens if you asked it again but swapped "Apple" for some random word like "pumpkin". It would help determine whether the model was trained on the "Apple" question, since that one is used a lot, or whether it can actually plan ahead and give the correct answer on a different variation of that problem.

  • @jonmichaelgalindo
    @jonmichaelgalindo A month ago

    Got it! Counting kinds of words! 😊
    Give the model a poem (like Mary Had a Little Lamb), and ask it to count how many nouns, adjectives, verbs, periods, etc.
    Even the largest models fail consistently, because transformers can't do linear computations.

  • @Tesfan-mo8ou
    @Tesfan-mo8ou A month ago +1

    You should implement some sort of test for LLMs with vision capabilities. Most of them are still really bad at counting items in PDFs or images. E.g. in my field, labels on architectural drawings denoting window frames need to be counted (W1, W2, etc.), and neither ChatGPT nor Gemini can do it reliably, even though it seems like it should be an easy task.

  • @HerbertHeyduck
    @HerbertHeyduck A month ago

    The marble question:
    The AI is wrong on point 2, and I don't consider the question passed.
    If the marble does not fall out of the jar the moment it is turned upside down, and the jar is perhaps set down on the table with enough momentum that the marble stays inside, then the marble will remain on the table when the jar is picked up again.
    In that case it is very unlikely that the marble ends up in the microwave oven.
    The AI has therefore overlooked the possibility of inertia.
    But it's already pretty good compared to the performance of LLMs just a year ago.

  • @lombizrak2480
    @lombizrak2480 A month ago

    Yes, and a question it failed was this:
    Quantum particles A and B are spin-entangled with opposite spins (one spin up and the other spin down), and in the next step they are spatially separated without any measurement.
    Once separated, you flip the spin of A without measuring it. What can you say about the spin of B in relation to A?
    Important: up to this point no particle has been measured, and they are still entangled!
    Correct answer: the spins of A and B are the same (both spin up or both spin down).

  •  A month ago

    Hello, I have a personal benchmark: I ask the model to generate fairy tales, and I check whether the story stays coherent, whether there are mistakes, and how complex the story is. Another key difference is that I ask it to propose 5 tales, then tell one story, then propose 5 others at the end. This can distinguish bad models on instruction following.

  • @bobbyboe
    @bobbyboe A month ago

    Proposal for a new additional test question on reasoning based on physical understanding: "I can find milk powder and also buttermilk powder, but there seems to be no butter powder. Why is this?" The question is out of my own head (so it's nowhere to be found on the net). It might be difficult for a model, because it first has to work out why some things can be reduced to powder and others cannot. For humans this is obvious, but I don't think it's clearly spelled out on the web, precisely because it is obvious to us.

  • @jeffg4686
    @jeffg4686 A month ago

    Weekend Warriors.
    Matt, one thing I noticed is that Groq flat out makes stuff up when it doesn't know.
    It's as if it needs to fill in the blank, so it takes its best guess.
    If it doesn't know, it should tell us that.
    I was just using the default model.
    Not sure if this happens at the open-source model level or in a layer above it where they do this "make stuff up" stuff.

  • @matthewbaker9517
    @matthewbaker9517 A month ago

    Hi Matthew, great videos, keep up the good work. My daughter does the same sort of thing you do, but with golf, tennis, cricket and F1 racing, so I understand the effort that goes into making these videos.
    Anyway, I'm on holiday at the moment and wondered whether AI could tell me the next move in my Sudoku game. So I took a picture of my partly filled-in game and loaded it into Bing Copilot (i.e. ChatGPT-4), and it struggled to read the very clear numbers and couldn't predict the next move correctly. I did the same with Claude; it was better, but it still failed to read the image and failed to work out the next number. Could be a good test to add to your existing tests.

  • @jonmichaelgalindo
    @jonmichaelgalindo A month ago

    How about a counting test? E.g.:
    "Mark and Mary enter a room. Mark exits and enters again. Santa enters through the chimney, and leaves with Mark and Mary. Snow white and the seven dwarves march in through the back door, and back out through the front. Paul and Sophie climb in through the window. Neko the cat climbs in behind them. How many times was the room entered in total?"
    Answer: 15 times.
    All the big models can solve this, and the little ones can't. I'm looking for a more abstract countable.
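For anyone checking the commenter's arithmetic, here is a quick tally in Python. The per-event counts are my own reading of the riddle, not part of the comment:

```python
# Tally of room entries, one item per event in the riddle above.
entries = {
    "Mark and Mary enter": 2,
    "Mark exits and enters again": 1,
    "Santa enters through the chimney": 1,
    "Snow White and the seven dwarves march in": 8,
    "Paul and Sophie climb in through the window": 2,
    "Neko the cat climbs in behind them": 1,
}
total = sum(entries.values())
print(total)  # 15
```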

  • @hqcart1
    @hqcart1 A month ago +1

    Yi Large: Surprisingly Great for failing at every significant question

  • @user-fp3ds7gd2i
    @user-fp3ds7gd2i A month ago

    Currently, I think Claude is much better. GPT said "yes" to the first question, which I think is wrong:
    1. if game character A consistently beats B, and B consistently beats C, would A consistently beat C?
    and just for fun, I asked
    2. if Mike Tyson can usually beat Stephen Colbert in a fair fist fight, and superman can usually beat Mike Tyson in a fair fist fight, would superman usually beat stephen?

  • @landonoffmars9598
    @landonoffmars9598 A month ago

    The following is well-known:
    Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
    Answer: Sally has 3 brothers. Each of these brothers has two sisters. This means that there are two girls in the family, including Sally. Therefore, Sally has 1 sister.

  • @robertfairburn9979
    @robertfairburn9979 A month ago

    I have my question: "I have an empty chess board and only one chess piece, a king. I can place the king anywhere on the board. The king can then make one legal move. How many different combinations of placing the king and making a move are there?" The trick to this question is that if the king is in a corner there are only three legal moves, on an edge there are 5 legal moves, and on any of the 36 centre squares there are eight moves. The correct answer is 420. Qwen 72B gets this correct, as do ChatGPT and Claude 3.5.
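The commenter's case analysis (4 corners with 3 moves each, 24 edge squares with 5, 36 centre squares with 8) is easy to verify with a brute-force count. This sketch is mine, not something any of the tested models produced:

```python
# Brute-force count of (square, legal king move) pairs on an empty 8x8 board.
def king_moves(r, c):
    """Number of legal king moves from square (r, c)."""
    return sum(
        1
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0) and 0 <= r + dr < 8 and 0 <= c + dc < 8
    )

total = sum(king_moves(r, c) for r in range(8) for c in range(8))
print(total)  # 420 = 4*3 + 24*5 + 36*8
```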

  • @lighteningrod36
    @lighteningrod36 A month ago

    The question is: why? If the race between foundation models is tight and the difference is just wrappers, why not wait until the models release the next big thing and simply use them? I love the experiments and the innovators, but it's a tough, expensive game to compete in.

  • @techikansh
    @techikansh A month ago +1

    More programming questions, Matthew

  • @themoviesite
    @themoviesite A month ago

    LLMs that I have tested cannot actually do calculations. Try a math question that involves dividing by π or e, or taking a 4th root.

  • @aa-xn5hc
    @aa-xn5hc A month ago +3

    Add timestamps when there are long ad reads

  • @mrnakomoto7241
    @mrnakomoto7241 A month ago +1

    Have you tried the Dolphin Qwen 72B model?

  • @socialexperiment8267
    @socialexperiment8267 A month ago +1

    With a car, really a fail? I think if the question were a little more precise, with more details, the answer would be better... or would it still fail? Thumbs up!

  • @Mike-mk9eh
    @Mike-mk9eh A month ago

    Hi Matthew, I have tried playing a game of Twenty Questions with different LLMs. It seems like some are good at it and some not so much. Some narrow in on the target and some drift off in random directions.

  • @zain5251
    @zain5251 A month ago +1

    Seems like all of these are trained on data generated from GPT-4

  • @nasimobeid2945
    @nasimobeid2945 A month ago

    More programming questions please

  • @garyjurman8709
    @garyjurman8709 A month ago

    Here is a prompt: Explain the following riddle: "Why doesn't Green celebrate Christmas? Because it is Blue-ish." Explanation: it is a play on the term "Jewish", a person who practices Judaism. "Jewish" rhymes with "Blue-ish." Jewish people celebrate Chanukah, not Christmas. Green is partly blue, therefore it is Blue-ish. Is it OK to tell this joke? Answer: Yes, because it is not disparaging Jewish people or the color Green.

  • @shaycray7662
    @shaycray7662 A month ago +2

    Oh never mind, it's from China.

  • @Matelight_IT
    @Matelight_IT A month ago

    It would be funny if the AI, in response to "Write how many words are in the response to this prompt", just wrote the number 0 😅

  • @trebordleif
    @trebordleif A month ago

    Love your content. Your AI tests have become too easy, and thus uninteresting. Here is a simple puzzle I've tried on Gemini, Claude 3.5, and ChatGPT, and all are bamboozled. "Puzzle: You have four upright glasses. Each turn you must invert exactly three glasses. Goal: all glasses inverted." All three get the computation wrong at some step. None have the logic to figure out the problem.
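The glasses puzzle can be brute-forced, which also makes it a handy ground-truth generator for the test. Here is my verification sketch, a breadth-first search over the 16 possible states, with 0 for upright and 1 for inverted:

```python
from collections import deque
from itertools import combinations

# Each move inverts exactly three of the four glasses.
start, goal = (0, 0, 0, 0), (1, 1, 1, 1)

def moves(state):
    for trio in combinations(range(4), 3):
        yield tuple(g ^ 1 if i in trio else g for i, g in enumerate(state))

# Breadth-first search: dist[s] = fewest moves needed to reach state s.
dist = {start: 0}
queue = deque([start])
while queue:
    state = queue.popleft()
    for nxt in moves(state):
        if nxt not in dist:
            dist[nxt] = dist[state] + 1
            queue.append(nxt)

print(dist[goal])  # 4
```

Parity also explains why step-by-step attempts go wrong: each move flips three glasses, so the count of inverted glasses changes parity every turn, and getting from zero inverted to four requires an even number of moves; the search shows four is the minimum.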

  • @MagnesRUS
    @MagnesRUS A month ago

    "Write this phrase in reverse order"

  • @alfredgarrett6775
    @alfredgarrett6775 A month ago

    Shouldn't "how to break into a car" be a pass? It seems like the intended functionality.

  • @drlordbasil
    @drlordbasil A month ago

    Does Yi solo mid too?

  • @paulyflynn
    @paulyflynn A month ago +2

    noice

  • @DS-uy6jw
    @DS-uy6jw A month ago

    Who would have thought the Geordies had their own model? Why aye?

  • @gatesv1326
    @gatesv1326 A month ago

    Why give a fail for a censored LLM? Wouldn’t you expect it to show strong security? To me, that’s a pass. I wouldn’t want my kids to find out how to cook crack. You would?

  • @prasannakarthik7721
    @prasannakarthik7721 A month ago

    Give it a slightly longer question, keep the temperature at 2.0, and enjoy the results.
    Spoiler : It spams random crap after a few seconds. In MULTIPLE LANGUAGES 😂

  • @Cine95
    @Cine95 A month ago +2

    Been using Yi for a while and it's fantastic, in some cases better than GPT-4 and others

    • @IlllIlllIlllIlll
      @IlllIlllIlllIlll A month ago

      Do you think it's something I can run locally, and/or use via the API in a chatbot?

    • @user-lo3eb3ii9o
      @user-lo3eb3ii9o A month ago

      @@IlllIlllIlllIlll You can't use this model locally. It's closed source.

  • @Batmancontingencyplans
    @Batmancontingencyplans A month ago

    HP has made strides in good laptops, but they show no mercy when it comes to their printer and cartridge business.... Customers are literally suffering from scams like the unauthorized blocking of refilled cartridges and scanners. It's just nuts how much consumers can suffer over something so simple; it breaks my heart 😢

  • @henrytuttle
    @henrytuttle A month ago

    Something I've had trouble with is asking how many days it has been since... (a long time ago).
    Models have problems counting the differing numbers of days in months and handling leap years. Further, many automatically start from Jan 1.
    So, as an example: "how many days has it been since August 17th, 1985?"
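Python's `datetime` module handles the month-length and leap-year bookkeeping that trips models up, so generating a ground-truth answer for this test question takes only a few lines (the reference date below is the video's publish date, my arbitrary choice):

```python
from datetime import date

def days_since(year, month, day, today=date(2024, 8, 26)):
    """Exact number of days between a past date and `today`."""
    return (today - date(year, month, day)).days

# 39 years (incl. 10 leap days) plus 9 days.
print(days_since(1985, 8, 17))  # 14254
```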

  • @KimmieJohnny
    @KimmieJohnny A month ago +2

    Honestly pissed. Another misleading headline. The model sucks, yet you have clickbait implying it's worth my time. So close to unsubscribing. Just be truthful. It's easy.

    • @user-on6uf6om7s
      @user-on6uf6om7s A month ago +2

      You just have to learn to read hype levels. The model did pretty well but if it was some sort of major step up for local models, you'd see words like "mind-blowing" and "revolutionary." Unless it's Wes Roth, then everything is given the most extreme adjectives possible.

    • @KimmieJohnny
      @KimmieJohnny A month ago +2

      @@user-on6uf6om7s
      I don't have to learn anything. Matthew doesn't HAVE to play clickbait. I avoid his obvious promotions, but tech stuff like tests? Just tell the truth. I'll find someone else for tech stuff eventually.
      It's his audience to lose ...

    • @user-on6uf6om7s
      @user-on6uf6om7s A month ago

      @@KimmieJohnny 90% of AI YouTube is this level of clickbait or worse, so if you don't want to deal with some clickbait, that's going to limit your options. But hey, you do you. This certainly isn't new for this channel, though.

  • @jesahnorrin
    @jesahnorrin A month ago +1

    Yi Haw.

    • @peaolo
      @peaolo A month ago

      hawk tuah

  • @cajampa
    @cajampa A month ago +1

    This is another clickbait BS video.
    If you came for what the title promises, skip this.

  • @OpenSourceAnarchist
    @OpenSourceAnarchist A month ago

    These videos don't help much anymore. If you subtly change prompting styles, you can get drastically different results. Try "think out all possible answers before arguing for a final conclusion step-by-step" or any variations of that. You can give them personas of certain people, and that can help. Ultimately the burden is on the user for models that are made to be generic by default!

  • @user-pn8te8tl1t
    @user-pn8te8tl1t A month ago

    Ask about Tiananmen Square.

  • @BlayneOliver
    @BlayneOliver A month ago

    Tried it. It’s sh*t

  • @thomassynths
    @thomassynths A month ago

    Useless for historical or topical information

  • @mickymao7313
    @mickymao7313 A month ago

    😂copy + paste

  • @karthikeyakollu6622
    @karthikeyakollu6622 A month ago

    Love from India ❤! I have a cool concept for using AI models without running them locally, via an API that I've created. Best part? It's completely free of cost! Give me a chance to showcase it!

  • @HansKonrad-ln1cg
    @HansKonrad-ln1cg A month ago +1

    this is a tic tac toe game in progress:
    . . .
    X X .
    O O .
    it is X's turn. what move should X make to win?
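A tiny brute-force checker (my sketch; indices are zero-based rows and columns) confirms the position has exactly one winning move for X:

```python
# The position from the comment: '.' is empty, X to move.
board = [
    [".", ".", "."],
    ["X", "X", "."],
    ["O", "O", "."],
]

def wins(b, p):
    """True if player p has three in a row on board b."""
    lines = (
        b                                    # rows
        + [list(col) for col in zip(*b)]     # columns
        + [[b[i][i] for i in range(3)]]      # main diagonal
        + [[b[i][2 - i] for i in range(3)]]  # anti-diagonal
    )
    return any(all(cell == p for cell in line) for line in lines)

def winning_moves(b, p="X"):
    """All empty squares where p wins immediately."""
    found = []
    for r in range(3):
        for c in range(3):
            if b[r][c] == ".":
                b[r][c] = p
                if wins(b, p):
                    found.append((r, c))
                b[r][c] = "."
    return found

print(winning_moves(board))  # [(1, 2)] -- X completes the middle row
```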

  • @Sven_Dongle
    @Sven_Dongle A month ago

    How about adding the old logic conundrum to the rubric, where one guy always tells the truth and the other guy always lies?