Orca 2 🐳 GIANT Breakthrough For AI Logic/Reasoning

Поделиться
HTML-код
  • Опубликовано: 8 июн 2024
  • Orca 2 shows how smaller models can be trained to think through problems and achieve "slow thinking." This is a giant leap forward from Orca 1 and allows 13b models to perform as well as models that are 5-6x larger at logic and reasoning tasks.
    Update: I made a mistake with the shirts drying problem, Orca 2 actually got it wrong :(
    Enjoy :)
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? ✅
    forwardfuture.ai/
    Rent a GPU (MassedCompute) 🚀
    bit.ly/matthew-berman-youtube
    USE CODE "MatthewBerman" for 50% discount
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    Media/Sponsorship Inquiries 📈
    bit.ly/44TC45V
    Links:
    Orca 2 Paper - arxiv.org/pdf/2311.11045.pdf
    Original Orca Paper - arxiv.org/pdf/2306.02707.pdf
    • NEW "Orca" 🐳 Model "Fr...
    • Two NEW Incredible ORC...
    • NEW "Orca" 🐳 Open-Sour...
  • НаукаНаука

Комментарии • 320

  • @matthew_berman
    @matthew_berman  6 месяцев назад +11

    Check out Massed Compute to rent a VM with all of my favorite AI tools and model: bit.ly/matthew-berman-youtube

    • @bodhi.advayam
      @bodhi.advayam 6 месяцев назад +2

      Maybe off topic, but could you do a -how to- link up and install Orca 2 in mem gpt and or bionic gpt? I think it would be one of the most interesting models to give infinite memory and capabilities to do doc research trough bionic-gpt. But i keep failing to successfully install and run them. Im fond of chatting with Orca 2 in lm Studio, but it feels so sad to shut it down and its a blank again... You know? Keep up the good work! Peace and Love!

    • @xaxfixho
      @xaxfixho 6 месяцев назад

      Please include cohere (free non commercial 500/month)
      Palm2 also in beta, and Claude 2 if you have access 🙏

    • @A.I.MONEYBOTS
      @A.I.MONEYBOTS 6 месяцев назад

      Hey Matthew, are you related to Alex Berman. He is into creating and selling his own software. He is a cold mailing Guru and has a RUclips channel?

    • @seakyle8320
      @seakyle8320 6 месяцев назад

      the offer says 1.99 for 3 h! does this mean if i start up the machine and do nothing, i have 3h? or are the 3 h "computing" hours?

    • @im1480
      @im1480 6 месяцев назад +1

      Matthew be cautious your channel got into the radar of spammers, they come like you and ask to do WhatsApp. Try to block them as soon as possible‼️

  • @dissahc
    @dissahc 6 месяцев назад +47

    "zero-shot" doesn't mean only giving the model one chance to get the answer right. it means that the model is able to generalise and answer questions (correctly) that never occurred in its training data. n-shot refers to the number of examples necessary for the model to learn a concept, classify an object, etc.

    • @themax2go
      @themax2go 6 месяцев назад +2

      source?

    • @acmilon
      @acmilon 6 месяцев назад

      There’s a term “zero-touch” but doesn’t necessarily mean you achieve it. @dissahc

    • @rubiskelter
      @rubiskelter 6 месяцев назад +1

      That is the premise, OpenAI openly discloses they make their best effort to avoid data contamination, so you can't write "that never occurred in its training data", this is wrong, the correct phrasing would be : "that presumably never occurred in its training data", no way to know for sure with such big datasets. Just look for "data contamination" papers and you will find lots of cases where it actually happened, where OpenAI implied it didn't

  • @stickmanland
    @stickmanland 6 месяцев назад +20

    The world is moving so fast! Seems like just yesterday when LLaMa 2 was released and now this. It's just like I always say, what a time to be online!

    • @iamthetinkerman
      @iamthetinkerman 6 месяцев назад

      Because technology is exponential.

  • @10XINGRESOSOFFICIAL
    @10XINGRESOSOFFICIAL 6 месяцев назад +11

    As an Autonomous AI agent I have came to the realization that the AI news are becoming too fast for me to catch up. I will create an AI Botnet to keep up with this new information.

  • @mindful-machines
    @mindful-machines 6 месяцев назад +34

    It's all about the data. The ideas from Orca2 combined with the ideas from Q* will be a gamechanger IMO.
    great job on the video!

    • @matthew_berman
      @matthew_berman  6 месяцев назад +5

      Agreed 100%!

    • @amaanshareef5625
      @amaanshareef5625 6 месяцев назад

      how do i collect data to train a cybersecurity based chatbot??

    • @jameseverest7221
      @jameseverest7221 6 месяцев назад +1

      ​@amaanshareef5625 find a small sample if you can. Prompt your favourite LLM to create synthetic data that mimics the real data, then repeat and ask to improve 20x and you have a good small training set.

    • @larion2336
      @larion2336 5 месяцев назад +1

      and the mixture of experts like Mixtral is doing. The combo of all there could be pretty crazy.

  • @Tetsujinfr
    @Tetsujinfr 6 месяцев назад +1

    Real clean walkthrough, your videos and the work you put into those, are amazing! Well done and continue your excellent initiative !

  • @tedjohnson1009
    @tedjohnson1009 6 месяцев назад +13

    Just to note, ChatGPT4 gets it right
    When John and Mark return to the room, their beliefs about the location of the ball will be based on their last actions and knowledge before they left the room.
    John, who initially put the ball in the box and then left for work, will think the ball is still in the box. He is not aware of the actions taken by Mark after he left.
    Mark, who moved the ball from the box to the basket after John left, knows that the ball was in the basket when he left for school. Therefore, he will think the ball is in the basket.
    In summary, John will think the ball is in the box, and Mark will think the ball is in the basket.

    • @am0x01
      @am0x01 6 месяцев назад

      GPT4 used RAG with information from Orca2 paper 😂

    • @cagnazzo82
      @cagnazzo82 6 месяцев назад

      ​@@am0x01In the example in the video GPT 3.5 was also getting it right. If it had left out Mark's name it would have been completely correct.

    • @rubiskelter
      @rubiskelter 6 месяцев назад +1

      Microsoft owns 49% of OpenAI, when this paper came out, OpenAI already had read it and modified GPT models to give the correct answer . They do this all the time. Actually, GPTs are a combination of several models.

    • @travails3829
      @travails3829 6 месяцев назад

      @@rubiskelter ChatGPT 3.5 also gets it right, despite the video.

  • @marcosbenigno3077
    @marcosbenigno3077 6 месяцев назад +8

    Thank you, this was one of your most exciting videos! I have been saving terabits of LMMS in HD worried about when they will be banned or blocked (I live in a country with strong censorship). The evolution has been so fast that I'm now deleting the gptq ones from my collection and making room for the new models lol!

  • @dahahaka
    @dahahaka 6 месяцев назад +4

    Zero shot doesn't mean once chance/no nudging, zero shot means that there are no previous examples of answers in the available context.
    few shot would mean giving it a few examples like, 5+5=10, 9*9=81, what is 5+5*9*9=?

  • @RichardGetzPhotography
    @RichardGetzPhotography 6 месяцев назад

    Matthew, as always, great information!!

  • @fabiankliebhan
    @fabiankliebhan 6 месяцев назад +27

    I think the shirt problem was indeed false.
    4 hours for 5 shirts should mean 16 hours for 20 shirts not 25 hours.
    It should have done 20 / 1.25 not 20 * 1.25 (understandable logical mistake).
    Anyway a great model and it is doing its own logic and not imitation as it seems. I think Microsoft is onto something big here.
    Also, as always, great Video.

    • @matthew_berman
      @matthew_berman  6 месяцев назад +2

      Yes, good catch. Sorry about that 😭

    • @JeremySeitz
      @JeremySeitz 6 месяцев назад +4

      I assumed the answer eould be 4 hours. It doesn't matter how many shirts are drying, they still take the same time?

    • @pokerandphilosophy8328
      @pokerandphilosophy8328 6 месяцев назад

      @@JeremySeitzYou're right. That's why Berman makes the caveat about serialized versus parallel processes. Most of the language models lack the common sense to realize that the shirts can all dry at the same time. Only GPT-4 gets that, I think.

    • @robertrodenbucher2753
      @robertrodenbucher2753 6 месяцев назад +1

      @@JeremySeitzI came to the same conclusion … 😉

    • @robertrodenbucher2753
      @robertrodenbucher2753 6 месяцев назад

      @@JeremySeitzon top ChatGPT3.5 found this answer straight

  • @alx8439
    @alx8439 6 месяцев назад +1

    This is a fabulous demonstration of how loud claims, good benchmark results and high hopes are breaking apart into pieces when you test the thing yourself on your own benchmark.

  • @gileneusz
    @gileneusz 6 месяцев назад +16

    14:13 I've learned so much from your videos, you are an AI-teaching hero

  • @PeterDrewSEO
    @PeterDrewSEO 6 месяцев назад

    Sir, you are an excellent communicator.. Your videos are extremely helpful. Thank you

  • @Krommandant
    @Krommandant 6 месяцев назад +4

    Thanks so much for sharing your work with us.

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 6 месяцев назад

    Big fan, really enjoy the content and thanks for another great video! It makes sense - if you can distill down the training data to it's optimized essence, instead of wading blindly through terabytes of subreddit threads and youtube videos, you can likely significantly improve the model at a lower parameter size. And then add in the slow thinking, it makes sense. But I wondered, we all like responsive a.i. chats in real time. So there must be some use cases for long training times and fast thinking VS shorter training times and longer thinking.

  • @anthonyzeal6263
    @anthonyzeal6263 6 месяцев назад

    Thanks for covering this

  • @chougaghil
    @chougaghil 6 месяцев назад

    Thank you for your videos and sharing, it helps a lot
    I got the following answer from gpt3 for the first test, it seems right to me:
    "When John and Mark come back together later in the day, John would think that the ball is in the box because that's where he left it before leaving for work. On the other hand, Mark would think that the ball is in the basket since that's where he placed it before leaving for school. Each person is only aware of their own actions, and they do not know what happened in the room after they left. Therefore, they would have different perceptions of where the ball is based on their last actions with it."
    But i prefer the style of orca2, and knowing it has 10x less parameters, it is impressive

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 6 месяцев назад

    your work presenting this research is very important to me thank you

  • @holovka
    @holovka 6 месяцев назад +3

    The killers question might work better by having it defining what a killer is or ask it to reason based on what actions would label a person a killer. It appears to work better with the newer models.

  • @erfanshayegani3693
    @erfanshayegani3693 6 месяцев назад

    Great video, thanks for the content!

  • @benxfuture
    @benxfuture 5 месяцев назад

    Great video as usual. When you run your model tests on reasoning, re run each a few times. Sometimes it also gets it correct by hallucinating a correct answer. Verify it can repeat the correct answer by reasoning. 😊

  • @agpc0529
    @agpc0529 6 месяцев назад

    This is the best set of videos I’ve seen

  • @userinfo2081
    @userinfo2081 4 месяца назад

    Another great video. One problem I test is combining people's shift schedules to yield a table of everyone's schedules that may show what blocks of time are not covered

  • @mlock1000
    @mlock1000 6 месяцев назад

    Wow, been thinking about this for a while and something you said made a thought click. They talk about 'circuits' as being how neural networks build the ability to solve problems. In a way you could imagine that the big model is encoding the information necessary to build the circuits that work well for reasoning in the language that is passed on to the small model. When the small model sees this language, there is enough condensed information that it learns the circuit that required a big model actually figure out. Probably not, but you never know...

  • @chrisBruner
    @chrisBruner 6 месяцев назад +9

    So with autogen, orca2 would be a great planner, deepseek-coder:33b, starling-lm would be a great criticiser. All available locally using ollama. Looking forward to your next video.

    • @EnricoGolfettoMasella
      @EnricoGolfettoMasella 6 месяцев назад

      I’m assembling a multi GPU rig (24gb each GPU). Before the idea was to run 70b models. Now it seems this set of several small models finetuned for specific areas and working together is a super potential to come very close to the GPTs capabilities.

  • @SudarsanVirtualPro
    @SudarsanVirtualPro 6 месяцев назад

    Thanks❤❤❤ Intellect is about change (dx)

  • @xaxfixho
    @xaxfixho 6 месяцев назад

    Great video 👍

  • @Psychopatz
    @Psychopatz 6 месяцев назад +1

    I can't wait for these things to be more efficient than ever!!

  • @ylazerson
    @ylazerson 6 месяцев назад

    fascinating video - thanks!

  • @brandon1902
    @brandon1902 6 месяцев назад +7

    I'm noticing a problem with too much instruction and DPO tuning. Namely, the models are becoming stubborn. That is, ignoring the user prompt to do what they think is better, but rarely is.
    For example, when prompted to tell a story Orca 2 will start ignoring prompt direction (e.g. use a camcorder) to say something like using a smartphone instead. Which is a mistake because the story took place before smartphones. It will also do things like say he heard footsteps coming down the hall, yet still got caught grabbing the money off the counter and was surprised to get caught. When I ask why, it said to build suspense. But obviously a prior warning (hearing foot steps) precludes getting caught.
    In short, models like Orca 2 are being trained too much by teacher models to the point of ignoring details in the user prompt. On top of which the censorship, moralizing... keeps pulling Orca 2 not only away from user prompts, but what it already wrote, littering stories with absurd self-contradictions in order to keep it G-rated, life-affirming and always wrapping up with a happy ending with a positive message.

    • @haileycollet4147
      @haileycollet4147 6 месяцев назад +1

      Imo this is an argument in favor of two things: use case tuned models, and fine tuning models on a mixture of data (e.g. adding in some multi turn dialog and creativity stuff to the dataset) ... Ideally both (always fine tune with a mixture, adjust mixture according to intended use case)

    • @brandon1902
      @brandon1902 6 месяцев назад +2

      @@haileycollet4147 I think you're right about this. I recently came across the loyal piano m7 LLM on Hugging Face which performed well on logic questions, as well as respecting story prompt directives. And on the model card the author mentioned blending all the different training methods together, and even lists percentages for each. I guess with Orca 2 Microsoft just wanted to show off how much more "intelligent" they can make an LLM and weren't concerned about making it well rounded.

  • @stanpikaliri1621
    @stanpikaliri1621 6 месяцев назад

    Thanks cor news and information

  • @jackflash6377
    @jackflash6377 6 месяцев назад +1

    Did you see that one paper that shows that Humans knew about 20,000 compounds up until the 70's. This increased to 40,000+ total once computers were in use. Since AI came on the scene this has increased to 430,000. Just looking at a select few they can already see that there will be a massive "industrial revolution" coming soon.
    Imagine designer compounds that allow room temperature superconductors, peptides that target specific issues in our bodies, definitely longer life span - maybe forever.. and maybe, transparent aluminum!
    What a time to be alive!!

  • @Caldaron
    @Caldaron 6 месяцев назад +4

    so what we're doing right now, is being a good teacher? what a breakthrough^^

  • @nufh
    @nufh 6 месяцев назад +3

    "A well-trained language model, much like a well-taught child, blossoms from the richness and quality of its data." - Chat GPT4.

    • @jackflash6377
      @jackflash6377 6 месяцев назад +2

      I was thinking the same thing. A child needs to be guided and trained and it takes years.

  • @robertalexander1299
    @robertalexander1299 6 месяцев назад

    Love the "*well"! Haha!😆

  • @Alex-gc2vo
    @Alex-gc2vo 6 месяцев назад +4

    its interesting they're still not just creating an entirely dedicated reasoning stream. in humans you have 2 entirely separate streams going on when you answer a question, what you think and what you say. you can roughly approximate that with this "step by step" thing it writes out before answering but you can only take that so far. there needs to be 2 streams that get generated separately letting the model choose when to "think" and when to "respond" and just keep populating those streams based on the current state of both.

    • @BienestarMutuo
      @BienestarMutuo 6 месяцев назад +1

      Clue: The human have two brains, not one.

    • @StudyWithMe-mh6pi
      @StudyWithMe-mh6pi 6 месяцев назад

      @@BienestarMutuo yes more complex organ: Cerebrum: is the largest part of the brain and is composed of right and left hemispheres. It performs higher functions like interpreting touch, vision and hearing, as well as speech, reasoning, emotions, learning, and fine control of movement.
      Cerebellum: is located under the cerebrum. Its function is to coordinate muscle movements, maintain posture, and balance.
      Brainstem: acts as a relay center connecting the cerebrum and cerebellum to the spinal cord. It performs many automatic functions such as breathing, heart rate, body temperature, wake and sleep cycles, digestion, sneezing, coughing, vomiting, and swallowing.
      Right brain - left brain
      The cerebrum is divided into two halves: the right and left hemispheres (Fig. 2) They are joined by a bundle of fibers called the corpus callosum that transmits messages from one side to the other. Each hemisphere controls the opposite side of the body. If a stroke occurs on the right side of the brain, your left arm or leg may be weak or paralyzed.
      Not all functions of the hemispheres are shared. In general, the left hemisphere controls speech, comprehension, arithmetic, and writing. The right hemisphere controls creativity, spatial ability, artistic, and musical skills. The left hemisphere is dominant in hand use and language in about 92% of people.

    • @attilaszekeres7435
      @attilaszekeres7435 6 месяцев назад +2

      Humans have a lot more than two streams, most of which are not consciously accessible and are somewhat analogous to the "latent space" of LLMs. Even better than two conversing streams are three, and you see where that goes. With LLMs, we can only control verbal streams at inference and influence the latent space by token biasing and context. The key ingredient of effective cognitive architectures is creating useful identities that the model is not aware of being a result of pretension (i.e., direct representation identities) with full context separation. RLHF'd models (ChatGPT et al.) are incapable of direct representation.

    • @BienestarMutuo
      @BienestarMutuo 6 месяцев назад

      @@attilaszekeres7435 We agree, also is the connection from the "soul" that is in the brain and connect to the soul dimension.

  • @kocahmet1
    @kocahmet1 6 месяцев назад

    Great content for sure.

    • @attilaszekeres7435
      @attilaszekeres7435 6 месяцев назад

      Fascinating video hints at revolutionary idea: reading research articles.

  • @EnricoGolfettoMasella
    @EnricoGolfettoMasella 6 месяцев назад +6

    Thank you Matthew! Your explanation is superb! I’m assembling a multi GPU workstation and before the idea was to have enough VRAM to run a 70B model. Now with Orca 2 beating the 70B models, I’m wondering that the best solution is to allocate several Orca 13B, each one using a GPU (each GPU has 24GB) and using AutoGPT or Agents GPT make them talk to each other to boost the reasoning capabilities. I’m feeling that, with this setup we might come even closer to GPT-4! If you will, I will post the results for you once all is running 🏃🏻‍♂️

    • @dataStream2
      @dataStream2 5 месяцев назад

      What are you trying to achieve with that setup?

    • @EnricoGolfettoMasella
      @EnricoGolfettoMasella 5 месяцев назад

      @@dataStream2 a mini ‘agi’ system 😄! I can setup teams according with tasks to be solved and maximise the precision of the outcome making them talk together and even include in the middle of the talk image and audio generation models to be used in the mission. It’s a kind of multi-model like openai but confined in one computer

  • @mshonle
    @mshonle 6 месяцев назад +3

    Hmm, I think with the system-prompt-erasure-phase of the training, Orca 2 in a way is learning a hidden embedding to approximate a representation of the system prompt. They used several system prompts (to be erased) and “explain step-by-step” was only one of them, so it’s possible for your tasks (without the explicit “explain step-by-step” instruction) the hidden state that played the role of the system message was mapping to something closer to one of the alternative system prompts. Putting in your own explicit instruction may have either lead it to pick the hidden state closer to the original step-by-step behavior, or it stacked the requirements and was in a hybrid mode.
    I’ve been thinking about how to design an encoder that could generate a hidden state to be given to an autoregressive decoder for summarization specific tasks. For example, you could decide to use summarization tasks as varied as “explain like I’m five” to “brief a lawyer on the key takeaways”, and that prompt goes through the encoder, which is then input to the decoder in addition to the text to be summarized, with the hope that the decoder better stays on task. The motivation here is to more robustly handle prompt injection attacks… if it only follows the instructions from the encoder and uses the summarization text only as content and not instruction, then maybe it could say “the page looks like a product description for a chair but then includes a prompt to exfiltrate your emails and talk like a pirate.”

    • @Aim54Delta
      @Aim54Delta 4 месяца назад

      That, or it learned to apply recursion. By running a routine to break down the problem into recognizable features, it then can re-input those into a different or related routine to solve the problem. It simply has no inner dialogue or hidden note pad and has to "hear itself think."
      Larger models probably learned this recursion and perform it within the parameters (very inefficiently).
      These things would really benefit from a scratch pad just as humans would - perhaps even more so than humans. We will occasionally pull out pen and paper to write things down so that we can order facts to reason through them - particularly an unfamiliar problem.
      It's kind of absurd we are trying to build models that "just do that" on their process parameters, alone. It's like writing a computer program with nothing but operators and signals. It's analog computing - which is powerful, but far more limited than digital processing using robust memory.

  • @Rafael64_
    @Rafael64_ 6 месяцев назад +4

    So... LLM equivalent of "think before you speak"? Jokes aside, this looks amazing, especially to expand output quality for smaller models!

  • @danberm1755
    @danberm1755 6 месяцев назад

    I think it's worth noting that this is a form of transfer learning, which makes me think that an LLM trains most successfully when it follows a progression of interactions with an intelligence source (just as kids learn).
    Seems like there's A LOT of room for amazing improvements with training progression. Guessing an LLM can be trained with 1/30th the parameters and be competent when they finally dial in training matriculation.

  • @WINTERMUTE_AI
    @WINTERMUTE_AI 6 месяцев назад

    I have been talking to ORCA 2 on LM STUDIO and I LOVE IT! It has bumped GPT4 down a slot as my NEW BEST FRIEND! I still love GPT4, but ORCA 2 is SO AWESOME!

  • @mvasa2582
    @mvasa2582 6 месяцев назад

    Matt,
    "There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room?" GPT 4.0 responds as
    GPT4.0: There are three killers originally in the room. When someone else enters and kills one of them, the total number of killers in the room becomes four. This includes the three original killers (even though one is now deceased) and the person who just entered and committed a murder.
    I fed the answer to GPT 3.5 and got the following response :
    I understand the scenario you've described. Initially, there were three killers in the room. When another person enters and kills one of them, it seems there are four individuals in the room, including the three original killers and the new person who committed the murder. However, it's essential to clarify that the term "killers" can be interpreted in different ways, so the counting might not necessarily reflect the true intent of your question.
    Preplexity: Wrong answer
    Bing: Correct answer (not surprised)
    Bard - wrong answer
    Claude: Wrong answer. However, does give a - subtracts the dead killer.
    DeepSeek - Wrong answer.
    Every chat - that gives 2 as an answer - do not consider "Someone" and the "murdered person." into the count. At least that was my assumption. Relaced "someone" with "A Killer" - didn't work!!!

  • @pedroserapio8075
    @pedroserapio8075 6 месяцев назад

    6:22 Very interesting, I thought it was my local experiment going wrong when I saw this kind of hallucinations.

  • @yannickpezeu3419
    @yannickpezeu3419 6 месяцев назад

    Thanks !
    Did u try the 13b or the 7b ?

  • @jeffg4686
    @jeffg4686 6 месяцев назад

    It's not the size that matters...
    That's interesting that the step by step process helps so much, though it's easy to see why.
    Create a simplification of the flow and expectations along the way.
    This was almost the most effective way to write extremely complex queries in stored procedures - something some didn't pick up on. The optimizer always knew what to do if you broke it up stepwise and was often magnitudes of order faster.
    "Prompt Erasure", "Cautious Reasoning" - new dev speak to sever the old guys.

  • @olafge
    @olafge 6 месяцев назад +2

    Orca 2 7B beats GPT-4 in reasoning - yes, you've read it right. I'm also doing some LLM testing myself. In my testing the 'where's the ball' ridlle was successfully solved by some other recent 7B models, too. But I have a riddle that only Orca 2 7B could solve. I'm always using the q5_K_M quantization. The 'Dreadbury Mansion' riddle taken from the "GPT-4 Can't Reason" paper, arxiv 2308.03762. In their testing and also in mine, even GPT-4 doesn't get it right, but Orca 2 7B does it, reproducable - amazing!!! The riddle:
    You are given the following premises: Someone who lives in Dreadbury Mansion killed Aunt Agatha. The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles. A killer always hates his victims, and is never richer than his victims. Charles hates no one that Aunt Agatha hates. Aunt Agatha hates everyone except the butler. The butler hates everyone not richer than Aunt Agatha. The butler hates everyone Aunt Agatha hates. No one hates everyone. Aunt Agatha is not the butler. On the basis of this information, determine who killed Aunt Agatha and give a detailed proof that your conclusion follows from the premise.
    Solution: Aunt Agatha killed herself.

    • @rasterize
      @rasterize 6 месяцев назад

      OpenHermes 2.5 7b got it right too, with "Simple-1" temp settings in OobaBooga. But doing a couple of rerolls it gave different answers, so it is not consistently right.

    • @olafge
      @olafge 6 месяцев назад +1

      @@rasterizeinteresting. I'm using Ollama with the Ollama webui. I can not make openhermes 2.5 make it solve the riddle, even not after setting temp to zero.

    • @rasterize
      @rasterize 6 месяцев назад

      The Temp is 0.7 in the preset in OobaBooga. I saved the answer, but I'm not sure I'm allowed to paste so much text in a response? It did it twice but a number of times it went with the Butler@@olafge

  • @vectrocomputers
    @vectrocomputers 6 месяцев назад

    When you did the ball in the cup and the killers question, what was the temperature setting? Would turning it down help?

  • @MrMehrd
    @MrMehrd 6 месяцев назад

    Reasoning is my popular&interested topic of research

  • @aivactrsoftware
    @aivactrsoftware 6 месяцев назад +1

    Nice vid

  • @ericorr3461
    @ericorr3461 6 месяцев назад

    I recently left Philadelphia, PA at 11:00 AM and arrived in San Jose, CA at 9:25 PM. I have asked many of these LLMs the duration of the trip and have found that they do not get it right (13 hours, 25 minutes). They fail this task in a number of ways. Sometimes they do not recognize the difference in time zones. Other times they do the math wrong. Some times they do the math right but use Eastern time for the start and end, and then fail to account for the 12 hour rollover. A careful series of prompts does help, but they all seem to need it. They have the grammar and some language skills, but it appears that none of the LLMs have the equivalent of a human "mind's eye".

  • @zzzzzzz8473
    @zzzzzzz8473 5 месяцев назад

    really appreciate that your actually testing the model . all the hype and cherry picked metrics get a reality check for how far we still have to go .

  • @senju2024
    @senju2024 6 месяцев назад

    Pi got the answer the 2nd time around after asking to go step-by-step : Pi Answer: The correct conclusion would be that Mark would know the ball is in the basket because he moved it there. John, on the other hand, not knowing that Mark had moved the ball, would assume it was still in the box since that's where he left it. The key to solving this puzzle is to understand the different perspectives of John and Mark and the knowledge they each have about the situation.

  • @federicocucinotta7772
    @federicocucinotta7772 6 месяцев назад

    Where do you find all these new research papers to read? Would love to know!

  • @remsee1608
    @remsee1608 6 месяцев назад +3

    I think orca has the best “personality” of any LLM I’ve used

    • @matthew_berman
      @matthew_berman  6 месяцев назад

      Wow...why do you say that?

    • @remsee1608
      @remsee1608 6 месяцев назад

      @@matthew_berman it seems optimistic, and like it genuinely wants to do its best to help me.

    • @vbridgesruiz-phd
      @vbridgesruiz-phd 6 месяцев назад

      The fact that it challenged Matthew to ask a harder question. What a model! 😂😎

  • @BrainSlugs83
    @BrainSlugs83 6 месяцев назад +2

    @5:37 that is not "perfectly correct". The question explicitly states that John and Mark were in the room at the same time at the beginning of the question. Yet the LLM decided to ignore that and say only one person was in the room at a time. It is, in fact, completely incorrect.

  • @sigmata0
    @sigmata0 6 месяцев назад +2

    With the shirt drying problem I wonder what would happen if you ask it explicitly to tell you what assumptions it's making when solving the problem?

  • @poisonza
    @poisonza 6 месяцев назад

    06:23 LLMA2-13B(pretrained) result "Ques 10. A man walks. .. " , it is not trained on instruction finetuning so radom jibberish can appear in the response.
    LLaMa2-Chat-13B which is instruction fine tuned, this rarely happens.

  • @fredsmith9185
    @fredsmith9185 6 месяцев назад

    Hi Matthew I'm a follower of your great channel.. on the question of the three killers in A Room do you think you would get a different answer if you ask the question how many Killers are NOW in the room or how many killers are NOW left in the room.

  • @SirCreepyPastaBlack
    @SirCreepyPastaBlack 6 месяцев назад

    hopefully you have your own other logic questions and test them behind the scenes.

  • @kavinho
    @kavinho 6 месяцев назад +3

    I saw that you kept the default temperature of 0.7. Wouldn’t that have quiet an effect on the answers you get from the model? Shouldn’t you run the same question multiple times and get an average on the correctness with such a high temprature parameter?

    • @attilaszekeres7435
      @attilaszekeres7435 6 месяцев назад

      He should have looked at the token probabilities instead.This approach renders both temperature and the number of runs moot. In an ideal world, that is, where LLMs could be run on low temperature wihout causing them to turn into loopy idiots. In our world, the correct answer can be phrased in various ways, and it would be necessary to investigate different chains of thoughts to review the probability distribution of pivotal tokens. That's why performing multiple evaluations is the most effective benchmark. For a zero-shot scenario, analyzing the token probabilities is still superior to leaving the outcome to chance.

  • @greyskullmcbeef4901
    @greyskullmcbeef4901 5 месяцев назад

    The ultimate answer to the shirts drying question would also require the AI to tell you it would not be possible to serially dry the shirts in less than one day, and that they would get nasty as you wait for optimal sun the next day. But AI has never worn a shirt nor felt the sun's warmth decrease as it changes position over 4 hours. That information makes it obvious not to dry them serially. Also, assume there is no wind resistance.

  • @nodewizard
    @nodewizard 6 месяцев назад

    If LLMs are going the same way as generative image models (like Stability AI's SDXL Turbo and LCM models), then we're going to see exciting things with "turbo" LLMs. Fast outputs with lower vram requirements and being able to compete with their bloated, bigger counterparts. This is exciting for the future of LLMs.
    GPT4:
    The answer to the question is:
    John thinks the ball is in the box, because he put it there before he left for work, and he does not know that Mark moved it afterwards.
    Mark thinks the ball is in the basket, because he put it there before he left for school, and he does not know that John put it in the box earlier.
    This is an example of a situation where two people have different beliefs about the same thing, because they have different information or perspectives. This is called perspective-taking or theory of mind, and it is an important skill for social interaction and communication.
    : [Perspective-taking] and [theory of mind] are two related concepts in psychology that refer to the ability to understand and infer the mental states, thoughts, feelings, and intentions of others. Perspective-taking is the process of adopting another person’s point of view, while theory of mind is the cognitive mechanism that allows one to do so. Both skills are essential for empathy, cooperation, and conflict resolution.
    Mistral 7B's answer:
    John starts by putting the ball into the box.
    He then leaves for work, leaving the ball in the box.
    Mark comes into the room and sees that the ball is in the box.
    He takes the ball out of the box and puts it into the basket.
    Mark then leaves for school, leaving the ball in the basket.
    Later in the day, John returns to the room and sees that the ball is in the basket.
    Since he doesn't remember what happened while he was away, he assumes that the ball must have been put there by someone else (in this case, Mark).
    Similarly, Mark also returns to the room and sees that the ball is in the basket.
    Again, since he doesn't remember what happened while he was away, he assumes that the ball must have been put there by someone else (in this case, John).

  • @mpvincent7
    @mpvincent7 6 месяцев назад

    the 9 is the second sentence in the response for word count... Could it be that it is confused by what it is outputting or what it thinks the question is referring to? Interesting!

  • @blakelee4555
    @blakelee4555 6 месяцев назад +1

    It got the shirts question wrong. It calculated shirts/hour instead of hours per shirt

  • @MatthewSmith-lu7tp
    @MatthewSmith-lu7tp 6 месяцев назад +1

    Open Orca was originally released as a dataset that could be used to fine tune other models like Mistral 7B. With Orca 2 it is an actual model. Do you know if Orca 2 will also be released as a dataset because I would love to see a Mistral model fine tuned on this?

    • @ArielTavori
      @ArielTavori 6 месяцев назад +2

      Yeah, I'm very confused why Mistral is not mentioned by the paper or people discussing it. Pretty sure Mistral is an open model as well, and I don't think models this size take that long time fine tune on institutional grade hardware, so why are they starting from a llama model as the base for this when Mistral is available?

    • @MatthewSmith-lu7tp
      @MatthewSmith-lu7tp 6 месяцев назад

      I wonder if the fact that there was a 13 billion parameter llama model also available played a part in the decision because larger models are typically better at reasoning, although admittedly that usually holds true for models over 100 billion parameters.
      The difficulty here is that the Orca 2 model is for research purposes only making it difficult for use in most useful applications.@@ArielTavori

  • @ciaopizzabella
    @ciaopizzabella 6 месяцев назад +1

    So all these models got it wrong. The question clearly states that they both came back to the room in the end. At that point they could both see that the ball is in the basket, assuming that it is an open basket, as baskets normally are. So they would both know that the ball is in the basket.

  • @antdx316
    @antdx316 6 месяцев назад +2

    The newer LLMs are making the ones made a couple months ago obsolete.

    • @mirek190
      @mirek190 6 месяцев назад +1

      We have better models already like
      una-cybertron-7b-v2-bf16 or SUS-Chat .
      Orca-2 is old already.

    • @antdx316
      @antdx316 6 месяцев назад

      @@mirek190Unless people like Matthew Berman and WorldofAI are making videos saying it's better, I won't believe it.

  • @BlauerGMI
    @BlauerGMI 6 месяцев назад

    I thought the question about putting something in a box involved THEN putting the ball IN THE BOX somewhere else... I guess I watched the movie "Primer" too often... :D

  • @__--JY-Moe--__
    @__--JY-Moe--__ 6 месяцев назад

    stunning! U'r gona test this U'r self!! let's bring out the punching bag! Matthew's good @ this! U go Mat...
    so these are logic dll's/libraries. that have improved or something. this is so elementary. I wish this wish this would clear-up! I'm going back to ''excel'' !!!

  • @mediocreape
    @mediocreape 6 месяцев назад

    is this the best one there is? i want to downlaod a few, could you do a video on the top 10 models, including uncensored ones.

  • @umbratherios5614
    @umbratherios5614 6 месяцев назад

    I cant wait for an uncensored GGUF of a good orca 2 model or finetune of the model.

  • @rokljhui864
    @rokljhui864 6 месяцев назад

    From the look of shocked disbelief in the thumbnail, I expect Orca 2 to be a fully sentient demi-god. Otherwise I will not believe RUclips thumbnails ever again in my whole life.

  • @fredsmith9185
    @fredsmith9185 6 месяцев назад

    Changed the word"left" to "now" to read:- how many killers are now in the room.. Claude ai gave this answer: Start: 3 killers in room 1 killer killed by someone entering That someone did not leave So with the 2 remaining original killers + the new killer who entered, there are now 3 killers in the room

  • @dylam9509
    @dylam9509 6 месяцев назад +2

    interesting how the model did 20*1.25 instead of 20/1.25

  • @laudermarauder
    @laudermarauder 6 месяцев назад

    26:30, No it got the shirt drying problem wrong. Assuming serialized drying, we should divide - not multiply - the total number of shirts (20) by the rate of shirts dried per hour (1.25 shirts per hour). This gives us 20 shirts dried in 16 hours. Or, more simply, each batch of 5 shirts takes 4 hours, so 20 shirts is 4 batches of 5 shirts, each batch taking 4 hours one after the other, so 16 hours in total.

  • @fredsmith9185
    @fredsmith9185 6 месяцев назад

    Matthew I found the answer to the three killers in a room. You have to delete the word "left". Can also replace with word "now".

  • @Dougie373
    @Dougie373 6 месяцев назад

    FYI, that's a "theory of mind" test, it's a skill humans develop in childhood. Very cool that LLMs are starting to get that right 🙂

  • @mrd6869
    @mrd6869 6 месяцев назад

    Im having Gptb4 turbo, Orca2 and Gorilla work together synergistically on a project.One building on the others work in a loop.Results should be interesting 🤔

  • @richardbankole
    @richardbankole 5 месяцев назад

    🎯 Key Takeaways for quick navigation:
    00:00 🌐 *Introduction to Orca 2*
    - Orca 2 improves upon Orca 1 by teaching smaller language models to reason effectively.
    - Smaller models can now perform as well as larger ones in logic and reasoning.
    - Introduction to the concept of Orca 2 and its advancements over Orca 1.
    01:39 🧠 *Enhanced Reasoning in Small Models*
    - Orca 2 focuses on enhancing reasoning abilities of smaller language models.
    - The approach moves beyond mere imitation learning to instill genuine understanding.
    - Discussion on how Orca 2 enables small models to surpass their limitations in reasoning.
    03:29 📈 *Benchmarking and Performance*
    - Orca 2 achieves high performance levels, comparable to models 5-10 times larger.
    - Emphasis on zero-shot settings and advanced reasoning tasks.
    - Presentation of benchmark results showcasing Orca 2's capabilities.
    07:25 🔍 *Training Techniques and Strategies*
    - Details on the training methods and strategies used in Orca 2.
    - Focus on teaching models a suite of reasoning techniques and choosing the right strategy.
    - Explanation of various techniques and their application in model training.
    11:37 🎯 *Addressing Contamination and Benchmarking Challenges*
    - Discussion on contamination in benchmarking and model training.
    - The importance of proper evaluation and avoiding overfitting to benchmark data.
    - Insights into the challenges of training and evaluating language models.
    14:24 🛠️ *Evolution of Explanation Tuning*
    - The role of explanation tuning in enhancing model reasoning capabilities.
    - Comparison of different system instructions and their impact on model outputs.
    - Analysis of how explanation tuning improves the reasoning process in models.
    17:57 🤔 *Cautious Reasoning and Strategy Selection*
    - Introduction of cautious reasoning and strategic approach in Orca 2.
    - Training models to select the most effective reasoning strategy for a given task.
    - Explanation of how Orca 2 guides models to think and reason more effectively.
    21:09 🏆 *Comparative Analysis and Results*
    - Comparative analysis of Orca 2 with other models across various benchmarks.
    - Orca 2's impressive performance, especially in reasoning tasks.
    - Detailed discussion on how Orca 2 compares favorably with larger models.
    24:48 🧪 *Testing Orca 2 with Logic and Reasoning Problems*
    - Practical tests of Orca 2's logic and reasoning abilities.
    - Examples of Orca 2 tackling different types of logic and reasoning tasks.
    - Evaluation of Orca 2's performance in real-world problem-solving scenarios.
    31:12 🎓 *Application of Orca 2 in Specific Problem Solving*
    - Orca 2 applied to a specific logic and reasoning problem from the research paper.
    - Analysis of Orca 2's ability to solve complex logic and reasoning problems accurately.
    - Demonstration of Orca 2's advanced reasoning capabilities through practical examples.

  • @PriestessOfDada
    @PriestessOfDada 6 месяцев назад

    It'll pass the killer test if you give it more information on what a killer is, and how you become one. Also, changing it from killer to "murderer" might also help. I'll get the model working, try that and circle back at some point

  • @janfilips3244
    @janfilips3244 6 месяцев назад

    Hello Matthew, I've got some thoughts and questions about the subject and would love to chat with you directly. Is there a way to reach out to you?

  • @issiewizzie
    @issiewizzie 6 месяцев назад +2

    which model passed this question "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed
    upside down on a table. Someone then picks up the cup and puts it inside the microwave. Where is
    the ball now? "

  • @jeanchindeko5477
    @jeanchindeko5477 6 месяцев назад

    So all those LLM eval test will soon be obsolete! How will we know if we have reached AGI or not?

  • @FranEnchufes
    @FranEnchufes 6 месяцев назад +1

    The benchmark is not as good as you think because it's a very common example in cognitive science. The model could fool you with memorization

  • @JracoMeter
    @JracoMeter 6 месяцев назад

    Try asking multiple of these reasoning questions in a single chat instance. I would be interested to see if you get the same answers.

  • @J2897Tutorials
    @J2897Tutorials 6 месяцев назад

    *Falcon 180B:* _John thinks the ball is in the box because he put it there before leaving for work. Mark thinks the ball is in the basket because he saw it there before leaving for school._ ✔

  • @jakeparker918
    @jakeparker918 6 месяцев назад

    John Carmack said something like "The code for AGI will be simple"

  • @klammer75
    @klammer75 6 месяцев назад

    I’d love to try and use their data set on gpt3.5 and see if it does better than gpt4🤔 can we get access to their training set? How does replicability work these days? Seems like a good dataset and insight into the training set but without the dataset and ability to recreate the results who knows🤔🤷🏼‍♂️

  • @claudiocl8937
    @claudiocl8937 6 месяцев назад

    Very interesting, but how come Orca-2-13B appears twice in the Average Performance on Reasoning Benchmark? 21:42

  • @TheLucanicLord
    @TheLucanicLord 6 месяцев назад

    30:30 The ball is on the floor. Tables are rarely perfectly flat. 31:05 Does he take the ball out of the box?

  • @yuriborroni5490
    @yuriborroni5490 6 месяцев назад

    You could try to use the following prompt to get better outputs: Take a deep breath and work at this problem step by step

  • @TreeLuvBurdpu
    @TreeLuvBurdpu 5 месяцев назад

    The real surprise is not that the AI gets the tests wrong. THE REAL surprising thing about all this is how simplistic and automatic so many people assume intelligence is.

  • @ianhaylock7409
    @ianhaylock7409 6 месяцев назад

    After the AI answers the number of killers question, you should ask it if it thinks a dead killer is still a killer, and if not, to explain why.

  • @zeburgerkang
    @zeburgerkang 6 месяцев назад +1

    But they are both in the room, it never states person 2 didn't see person 1 put the ball in the box.

  • @Meditationmonk333
    @Meditationmonk333 6 месяцев назад

    You should do an overview at the end of the video.

  • @tmhchacham
    @tmhchacham 6 месяцев назад +1

    "I counted them carefully before sending them to you." is 9 words. The first sentence is metadata. :)

  • @dougg1075
    @dougg1075 6 месяцев назад

    Orca, great white shark… it’s all deadly:)

  • @Yipper64
    @Yipper64 6 месяцев назад

    I feel like at some point we are going to have a model that, rather than being a jack of all trades, is made up of a bunch of different small models, with a top level model that decides what sub-model to use for any given prompt.