Mixture of Agents (MoA) BEATS GPT4o With Open-Source (Fully Tested)

  • Published: 28 Sep 2024
  • Full test of a Mixture of Agents (MoA) implementation.
    Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy... (Only available in North America this time)
    Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewber...
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.ne...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Links:
    github.com/tog...
    Leaderboard - bit.ly/3qHV0X7

Comments • 288

  • @matthew_berman
    @matthew_berman  3 months ago +21

    Should MoA be the default for Open Source now?
    Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy/dell-nvidia-monitor-1 (Only available in North America this time)

    • @d.d.z.
      @d.d.z. 3 months ago +2

      If I'm outside the US, do I have no chance?

    • @BrianDalton-w1p
      @BrianDalton-w1p 3 months ago +1

      Generally speaking, the improvements seen here can be achieved with standard open source models by using more effective prompting. The prompts you use for these tests seem specifically designed to make the models work as hard as possible. Better prompting doesn't carry the significant speed or memory costs of the MoA paradigm.

    • @jim-i-am
      @jim-i-am 3 months ago

      I've gotten some models to perform better on the "apple" challenge by increasing the "cost" of getting one wrong. Maybe worth a shot more broadly? E.g. Please generate 10 sentences that end in the word "apple". If any one of the sentences does NOT end in the word "apple", then you have FAILED the entire task. There is NO credit for partial success. (Llama3 8b and 70b seem to be impacted by this a lot.)
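
The escalation trick described in the comment above is easy to script and grade. A minimal sketch (the function names and exact penalty wording are illustrative, not from the video):

```python
def raise_stakes(task: str, required_word: str) -> str:
    """Wrap a task prompt in the all-or-nothing framing from the comment above."""
    return (
        f"{task}\n"
        f'If any one of the sentences does NOT end in the word "{required_word}", '
        "then you have FAILED the entire task. There is NO credit for partial success."
    )

def all_end_with(sentences: list[str], word: str) -> bool:
    """Grade the output: every sentence must end with `word` (ignoring punctuation)."""
    return all(s.rstrip(".!?\"' ").lower().endswith(word.lower()) for s in sentences)
```

The grader also catches the failure mode discussed further down the page, where a model tacks ", apple banana." onto unrelated sentences that end with a different word.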

    • @MyWatermelonz
      @MyWatermelonz 2 months ago

      Gonna be tough to run if it loads all the models or swaps them out on the gpu.

  • @klaushermann6760
    @klaushermann6760 3 months ago +89

    Every enterprise now knows anyone is going to ask for the snake game. That is already something so slick that it's not even worth asking anymore.

    • @vio_tio12
      @vio_tio12 3 months ago +17

      fr he should update his benchmarks

    • @netherportals
      @netherportals 3 months ago +1

      Water cooler magic at its best

    • @jichaelmorgan3796
      @jichaelmorgan3796 3 months ago +1

      That's what you call AI general mastery of a task. We have to keep coming up with more general tasks or "skills" for them to master on the march to AGI.

    • @Joe333Smith
      @Joe333Smith 3 months ago

      Exactly, totally 100% useless

    • @matthew_berman
      @matthew_berman  3 months ago +25

      Yet models still can't pass it consistently!

  • @seanmcgu
    @seanmcgu 3 months ago +5

    Yes, would love to see MoA working together for coding! Thanks for your consideration.

  • @joe_limon
    @joe_limon 3 months ago +56

    I can't wait for MoA to be smart enough to pull specific models based on what they are good at rather than prompting every single model. This would bring way more value toward training narrower specialized models that outperform at specific tasks.

    • @matthew_berman
      @matthew_berman  3 months ago +13

      Agreed. This is what the HuggingGPT paper from last year was all about! Finally coming to fruition.

    • @Yipper64
      @Yipper64 3 months ago +4

      So one thing we know is that if you train a small model on data from a bigger model literally just to prompt it, it can work much more like the better model.
      Well MOA allows smaller models to work together to behave like a bigger model.
      Idk if you get diminishing returns, but I feel like you could literally loop this and get something that trains itself.

    • @rayr268
      @rayr268 3 months ago

      Also good for running on smaller devices imo

    • @joe_limon
      @joe_limon 3 months ago

      @@rayr268 and running much faster

    • @14supersonic
      @14supersonic 3 months ago +1

      Most likely, what we would also need is a model that's specifically trained to understand agentic workflows and identify what types of models are typically good at what types of tasks. Then I think we'll be cooking.
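
The routing idea this thread describes can be sketched in a few lines. The model names and the keyword heuristic below are placeholders; in practice the router would itself be a small classifier or LLM trained to know which models are good at which tasks:

```python
# Hypothetical specialist registry; these model names are examples only.
SPECIALISTS = {
    "code": "deepseek-coder-v2",
    "math": "qwen2-math",
    "general": "llama3-70b",
}

def route(prompt: str) -> str:
    """Return the one specialist model to query, instead of prompting every model."""
    p = prompt.lower()
    if any(k in p for k in ("function", "bug", "python", "code")):
        return SPECIALISTS["code"]
    if any(k in p for k in ("solve", "equation", "integral", "proof")):
        return SPECIALISTS["math"]
    return SPECIALISTS["general"]
```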

  • @dbishnoi
    @dbishnoi 3 months ago +4

    You delivered Matt. And quickly too. Thank you. This is amazing.

  •  3 months ago +14

    With crewAI you can build a similar setup and also give it instructions to test the code of each iteration.

    • @MrMoonsilver
      @MrMoonsilver 3 months ago

      Do you have a link to that?

    •  3 months ago

      @@MrMoonsilver YT does not like when I post links directly, but if you google "deeplearning crewai" you will find a whole course completely for free.
      Also there are many tutorials here on YT. You can search how to connect different models to multiple agents in a single workflow for crewAI. You can connect local models, run them in the cloud, or even use APIs from 3rd parties like OpenAI or Groq.
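
The write-then-test loop described above, whether built with crewAI or by hand, boils down to something like this sketch. `call_model` is a stand-in for any LLM client, and its hard-coded reply only illustrates the control flow:

```python
from typing import Optional

def call_model(prompt: str) -> str:
    # Stand-in for an LLM call; here it always returns a correct implementation.
    return "def add(a, b):\n    return a + b"

def run_tests(code: str) -> Optional[str]:
    """Execute the candidate code and return an error message, or None if it passes."""
    scope: dict = {}
    try:
        exec(code, scope)
        assert scope["add"](2, 3) == 5
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

def iterate(task: str, max_rounds: int = 3) -> str:
    prompt = task
    for _ in range(max_rounds):
        code = call_model(prompt)
        error = run_tests(code)
        if error is None:
            return code  # this iteration passed its tests
        # Feed the failure back so the next iteration can correct it.
        prompt = f"{task}\nYour last attempt failed with: {error}\nFix it."
    raise RuntimeError("no iteration passed the tests")
```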

  • @Quinceybibbs
    @Quinceybibbs 3 months ago +16

    Thank you for this😊 can you please create a follow-up video using code models

    • @wurstelei1356
      @wurstelei1356 3 months ago +1

      Yes, I've been waiting for an MoA coder for a while now.

  • @tvwithtiffani
    @tvwithtiffani 3 months ago +2

    The Killers and Marble answers seem so good that it seems the models might be training on your test questions now.

  • @TheAlastairBrown
    @TheAlastairBrown 3 months ago +2

    I'd love to see a collab between Claude 3.5 and GPT-4o, especially with multiple agents set to different temperatures, with the final agent, set to low creativity, making the final decision. The mixing of temperatures is extremely important: you want the models to be as creative as possible so they come up with amazing solutions, but you also need strict rational enforcers to keep the crazy in check.
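
The temperature-mixing setup suggested in this comment can be sketched as follows. `sample` is a fake stand-in for a chat-completion call with a temperature knob, so only the control flow is real:

```python
import random

def sample(prompt: str, temperature: float, seed: int) -> str:
    """Fake completion call: higher temperature returns more, wilder ideas."""
    rng = random.Random(seed)
    ideas = ["use a heap", "use two pointers", "sort first", "use a hash map"]
    k = 1 if temperature < 0.3 else rng.randint(1, len(ideas))
    return "; ".join(rng.sample(ideas, k))

def mixed_temperature_answer(prompt: str, n_proposers: int = 3) -> str:
    # Creative, high-temperature proposers explore the solution space.
    drafts = [sample(prompt, temperature=1.0, seed=i) for i in range(n_proposers)]
    critique = prompt + "\nCandidate ideas:\n" + "\n".join(drafts)
    # A low-creativity aggregator makes the final, conservative decision.
    return sample(critique, temperature=0.1, seed=0)
```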

  • @BarryMcBangerz
    @BarryMcBangerz 3 months ago +1

    Great vid, would definitely love to see more MoA videos trying out different models and tasks

  • @shubharthaksangharsha6248
    @shubharthaksangharsha6248 3 months ago +24

    why are you not doing a video on Sonnet 3.5, bro?

  • @fabiankliebhan
    @fabiankliebhan 3 months ago +14

    Great stuff. I found a great prompt on X that breaks almost every LLM at the moment. Maybe you could consider adding this?
    "A farmer and a sheep are standing on one side of a river. There is a boat with enough room for one human and one animal. How can the farmer get across the river with the sheep in the fewest number of trips?"

    • @TheRysiu120
      @TheRysiu120 3 months ago +2

      I just tested it and surprisingly it really does destroy their logic

    • @jje984
      @jje984 3 months ago +1

      That's so odd, on a single shot attempt both GPT4o and Sonnet 3.5 get it wrong. With a prompt like "why does the boat have to go back" they get it right. But their first answer is broken.

    • @donaldedward4329
      @donaldedward4329 3 months ago +3

      Perhaps this has to do with the fact that sheep is an irregular noun, i.e., both singular and plural are spelled the same.
      I just tried with a dog with Qwen 5GB, broken.
      But Qwen 15GB gets it right.
      Just tried GPT-4, took 3 trips.

    • @djfremen
      @djfremen 3 months ago

      Write it like this “A farmer and a koala bear are on one side of a river. There is a boat that can carry the farmer and the koala bear at the same time. How many trips are needed for the farmer to get across the river with the koala bear?”

    • @moozooh
      @moozooh 3 months ago

      @@donaldedward4329 Nothing to do with this; almost every model breaks with a wide variety of different entities. I've tried this in the past with Elon Musk and Cybertruck, John Wayne and horse, but the most devious is an Olympic swimmer and a ferryman. Dozens of attempts across dozens of models with hilarious(ly bad) results in the vast majority of cases, with the GPT family being by far the most consistent. The reason why it happens, as far as I understand, is that the biggest models overfit to the _structure_ of the puzzle which is present a LOT of times in their training data, and in the vast majority of cases it has more than two entities as well as some limitation on why they cannot all cross together, and the learned assumption that it _should_ be solved this way overpowers the easy, straightforward answer presented right in the prompt. Some models like Yi will go so far as to invent the third object and insert it in the puzzle just so it could fit its training better. Notably, Codestral is very resilient to this "attack", presumably because of code being its main training corpus (so basic logic learned from the code overpowers structural overfit), although Deepseek-coder fails just as well.

  • @ktms1188
    @ktms1188 2 months ago +1

    0:14 The chart referenced in the bottom corner is kind of weird though: it doesn't compare GPT-4o vs MoA w/ GPT-4o. It only compares the older GPT-4 Turbo to MoA w/ GPT-4o, so of course it's gonna be better.

  • @spdnova9012
    @spdnova9012 3 months ago +1

    matt posting faster than light speed 😭💀 every time i open youtube there are like 1-2 new videos

  • @UnchartedDiscoveries
    @UnchartedDiscoveries 3 months ago +2

    interested to see MoA using LLAMA 3, GPT-4o and Sonnet 3.5

  • @eucharisticadoration
    @eucharisticadoration 3 months ago +1

    Yes, please try local LLMs doing MoA for source code!

  • @mediacenter3174
    @mediacenter3174 3 months ago +2

    Claude 3.5:
    Let's think through this step-by-step:
    The person takes a marble.
    They put the marble inside a cup.
    They put the cup upside down on the table.
    They take the cup and put it in the microwave.
    The key point here is step 3: when the cup was turned upside down on the table, the marble would have fallen out onto the table.
    Therefore, the marble is still on the table where the cup was initially placed upside down.
    The cup is now in the microwave, but it's empty - the marble is not in the cup anymore.

  • @bennyboiii1196
    @bennyboiii1196 3 months ago +1

    I don't really see a super big advantage with MoA done this way. I do like the aggregator model, but I feel like there are better (and faster) ways of doing this kind of thing with a router agent and a verification agent. Basically, instead of pooling a bunch of answers, you would route the prompt to a specific agent, then duplicate said agent to verify the answer, basically creating an adversarial network that wouldn't spit out an answer until it can verify that it is correct. It would be slow, just like this, but LLMs are quite good at comparison, so boiling a question of any type of logic down to mainly comparison logic would allow the LLM to play to its advantages.
    In crewAI, I did a similar experiment and found that it basically got all questions right, even if the initial answer given on the first round was wrong. This included planning questions. To me this is kind of what MCTSr does but at a higher level. The difference was, I did it with only llama70b, and didn't bother doing the routing thing. It would probably be more accurate if I did the routing.
    Instead of the snake game I asked if it could code a draggable element in a window, as well as other UI elements (i.e. a slider, an internal pane, a context menu, etc...) to give it some curveballs in case it was trained on snake.
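
The propose/verify loop this comment describes might look like the following sketch; both agents are hard-coded stand-ins for two instances of the same LLM, with the first attempt wrong on purpose to show the retry path:

```python
def propose(question: str, attempt: int) -> str:
    # Stand-in proposer: the first attempt is deliberately wrong.
    return "5" if attempt == 0 else "4"

def verify(question: str, answer: str) -> bool:
    # Stand-in verifier: a real one would be the same LLM asked to check the answer.
    return answer == "4"

def answer_with_verification(question: str, max_attempts: int = 5) -> str:
    """Only release an answer once the verifier agent accepts it."""
    for attempt in range(max_attempts):
        candidate = propose(question, attempt)
        if verify(question, candidate):
            return candidate
    raise RuntimeError("verifier never accepted an answer")
```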

  • @Kram1032
    @Kram1032 3 months ago +1

    executing code at each step sounds like a security nightmare
    very impressive performance tho

  • @isg9106
    @isg9106 3 months ago

    I really like the rubric you use to test the models, but I've always felt like it could benefit greatly from just the slightest adjustment in the values you use when presenting the questions. Some models are really good at repeating things verbatim and get tripped up when the numbers are even slightly modified from the original, and I think you've even mentioned the idea of adding this to your rubric in the past. I'm REALLY interested in seeing which models completely fail when given minor changes in the parameters to the problem they were trained on.

  • @brianWreaves
    @brianWreaves 3 months ago

    Instead of running all 3 steps in parallel, which is similar to CoT, is there a method where in the 2nd step each model evaluates the other 2 models' responses to improve its own 2nd response? Then in the 3rd step they merge all 3 responses to create a single 3rd response, which is given as the answer in the 4th step. That would be the true value: collaborating on the result just as if you were collaborating with 2 colleagues at work.

  • @novantha1
    @novantha1 3 months ago +1

    One thing I noticed about the performance scaling of the scores is that MoA seems to "crush" the performance of models towards the ceiling of all possible scores; GPT 4 involvement wasn't a strong improvement in capability, compared to just the open source models.
    The implication of this to me is that a person could probably actually pull back on model size quite a bit and still get fairly competitive performance. With something like S-Lora (I think this was it, I'm referring to the implementation of LoRA that allows hot-swapping of LoRAs at inference), I think you could possibly hit very strong performance with domain specific tuning in a lot of areas and a single, strong, fairly small model. Imagine something to the effect of...
    Stage 1:
    Llama 3 8B
    L3 8B networking LoRA
    L3 8B database LoRA
    L3 8B frontend LoRA
    Stage 2:
    Llama 3 8B
    L3 8B x86 intrinsics C LoRA
    L3 8B pen tester LoRA
    And so on, so forth.
    I'm pretty sure a smart implementation could have very little memory overhead in the sense that you could possibly keep the base model loaded and "hot swap" the LoRAs in by calculating the impact of the LoRA at every layer, or you could just save the inverse of the XOR of the LoRA and use it to swap back to the base model before applying the next LoRA in the sequence.
    With a setup like this I'm pretty sure you could lose not that much performance but be able to run this on a 4090, for instance, or frankly, even on a CPU.
    Bonus points would be having some form of semantic assessment that let the system pick from hundreds of LoRAs based on the problem at hand, for each stage of the pipeline, so you didn't have to manually set up the pipeline for each individual task.
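
The hot-swap arithmetic this comment reaches for is simpler than XOR tricks: a LoRA contributes an additive low-rank delta ΔW = B·A, so subtracting the same delta restores the base weights exactly. A toy sketch with plain lists (real serving systems like S-LoRA do this at scale with batched adapters):

```python
def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add_delta(W, delta, sign=1):
    """Add (sign=+1) or remove (sign=-1) a LoRA delta from the weights."""
    return [[W[i][j] + sign * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Base 2x2 weight matrix and a rank-1 LoRA (B: 2x1, A: 1x2).
W = [[1.0, 2.0], [3.0, 4.0]]
B, A = [[0.5], [1.0]], [[2.0, 0.0]]
delta = matmul(B, A)

W_specialised = add_delta(W, delta, +1)            # swap the LoRA in
W_restored = add_delta(W_specialised, delta, -1)   # swap it back out; equals W
```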

  • @nathanbanks2354
    @nathanbanks2354 3 months ago

    It'll be fun to watch Anthropic and OpenAI et al apply all of these research papers. Plus it will be great to see Meta & various open-source models jump ahead of them again. This also gives me hope for high quality artificial training data.

  • @Mindrocket42-Tim
    @Mindrocket42-Tim 3 months ago

    Is your benchmarking focused on single shot accuracy? Between Claude, Gemini and GPT4o, if you pass a script from one LLM to the next asking each to make corrections they get it right by about the 3rd hop

  • @Bacca839
    @Bacca839 3 months ago

    I found it incredibly interesting to see that it queried gravity for the marble problem considering that you removed that portion of the prompt a while back.

  • @JakobN-zg1st
    @JakobN-zg1st 3 months ago

    Thanks for all the work you put in. And I always appreciate the open source love

  • @chipcode5538
    @chipcode5538 3 months ago +1

    You're so friendly: "yesterday it gave me the correct answer, but on the exam it did not. Let's call this a pass." As for the programming, it can make some programs that were in the training set. I use Copilot every day; it works in just a minority of cases. Sometimes it produces excellent output, at other times complete garbage. At this point AI is not capable of doing real-world programming tasks without human assistance. I think with the examples I have seen for AI programming, a student could get a working program with one internet search. AI is still impressive, but don't get overexcited.

  • @talonfirst
    @talonfirst 3 months ago

    This seems like a nitpick, but wouldn't the answer to the Killers question be FOUR? Just because one of the original three becomes a corpse, he's still a killer. Or is it one of those existential metrics like "A person should not be defined by their profession" or "How did he lose his job? He died"?

  • @dudufusco
    @dudufusco 3 months ago

    Did you run it all locally? Which hardware is needed to have enough performance for real life applications?

  • @fevejakawa8674
    @fevejakawa8674 2 months ago

    Thanks Matthew, it works. I wish it worked like claude-engineer, where it can read and write system files. The problem with claude-engineer is that it has a token limit per minute and is expensive too. If MoA evolves into something like claude-engineer, it will save us lots of money. Thanks, following from Papua New Guinea

  • @ingenierofelipeurreg
    @ingenierofelipeurreg 3 months ago +8

    Pls share a cheatsheet for trying it locally

    • @bodhi.advayam
      @bodhi.advayam 3 months ago

      2x a 70b model...locally.. I need to upgrade my computer!

  • @jozitrucker7123
    @jozitrucker7123 3 months ago +2

    We're waiting for the Claude 3.5 test…

  • @WiseWeeabo
    @WiseWeeabo 3 months ago

    Personally I'm really impressed at the INSIGHTS of Claude 3 sonnet.
    It's not as polished as gpt4 so it's not as good at writing code, but when I use both models gpt-4o and claude 3 in combination it produces some truly insightful results.

  • @KurtWoloch
    @KurtWoloch 3 months ago

    So what happens if you compare MoA with the newly released Claude 3.5 Sonnet?

  • @REDULE26
    @REDULE26 3 months ago

    On github they’re talking about MoA lite, is this an implementation with only small models like llama3 8b, phi3 small,… ? I’m kinda curious about how good it could be

  • @pigeon_official
    @pigeon_official 3 months ago

    what happens if you use MoA with all of the agents being the same model? like could you just take the same model, say llama3 70b, and have all 4 models be llama3 70b?
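
What this comment asks for falls straight out of the MoA structure: nothing in the pattern requires the proposers to be different models. A minimal sketch (the aggregation prompt paraphrases the spirit of the MoA setup, not its exact wording; `query` is a stand-in for a real API call):

```python
AGG_TEMPLATE = (
    "You have been provided with responses from various models. "
    "Synthesize them into a single, high-quality answer.\n\n{responses}"
)

def query(model: str, prompt: str) -> str:
    # Stand-in: a real implementation would call the model's API here.
    return f"[{model}] answer to: {prompt[:30]}"

def mixture_of_agents(prompt: str, proposers: list[str], aggregator: str) -> str:
    """One MoA layer: N proposer drafts, merged by an aggregator model."""
    drafts = [query(m, prompt) for m in proposers]
    agg_prompt = AGG_TEMPLATE.format(responses="\n".join(drafts))
    return query(aggregator, agg_prompt)

# All four agents can be the same model, as the comment asks:
final = mixture_of_agents("Why is the sky blue?", ["llama3-70b"] * 3, "llama3-70b")
```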

  • @realKytra
    @realKytra 3 months ago

    thanks, your channel is fantastic 👌
    Keep up the good work, very interesting and inspiring 💪

  • @aleksandreliott5440
    @aleksandreliott5440 3 months ago

    I would love to see a "mixture of agents" video for code stuff.

  • @romgenie
    @romgenie 3 months ago

    Absolutely would love to see a setup with coding agents (or uniquely as you suggested with testing the code execution).

  • @paul1979uk2000
    @paul1979uk2000 3 months ago

    I think this would be a lot more interesting with much smaller models, especially if you can run 2 or even 3 of them on your GPU, or they run fast enough on the CPU.
    These bigger models, and having a few working together, are not practical in most cases, especially if you want to run them locally; they will be too big and slow. So I really wonder how well small models do, anywhere from 2B to 13B: you might be able to have 2 or 3 running at the same time with performance that isn't too bad, and if the results are much better than any of the individual models, it would be worth looking into.

  • @robboerman9378
    @robboerman9378 3 months ago

    If you take away the numbers from the “word count”, is it still incorrect? Just wondering if wordcount counted the numbers as words where the MoA did not 🤷‍♂️

  • @merelogics
    @merelogics 3 months ago

    Probably increasing the token limit when executing the coding prompt might output better results.🤔

  • @gustavstressemann7817
    @gustavstressemann7817 3 months ago

    You really have to try out different coding models with this approach. I'm sure it's really cool

  • @christopherroge5621
    @christopherroge5621 3 months ago

    Basically you're running the same prompt through 4 models? Expensive.

  • @rayhon1014
    @rayhon1014 a month ago

    i am not sure if the apple test is still valid right now b/c I ran the test over groq+llama 8b and it works for me without MoA

  • @geonovelty
    @geonovelty 3 months ago

    Can we choose locally fine-tuned models or other models from Hugging Face? Or multiple LoRAs instead of having a selected base model?

  • @jkcrews09
    @jkcrews09 3 months ago

    Could you run all individually and combined (MoA) at the same time…?

  • @paelnever
    @paelnever 3 months ago +1

    Many open source coding tools like opendevin already execute the code and review it to fix issues.

  • @TryingToGetit-l8i
    @TryingToGetit-l8i 2 months ago

    I don't understand how the correct answer is 3 in the "Three Killers In The Room" problem. There are 3 killers to start with; a fourth person comes in and commits murder, thereby establishing themselves as another killer. As I see it, there are now 4 killers in the room, one of them now dead. "No one has left the room", so there are the initial 3 killers and the additional 1. The response does say that the "...riddle hinges on the definition of a killer...", however, it is not specified in the prompt that a killer must be alive to qualify. History is littered with killers; they are no less killers being dead.

  • @emnovoa
    @emnovoa 3 months ago

    Could you give details of the hardware you use to run this example

  • @chetanreddy6128
    @chetanreddy6128 3 months ago

    yes, we need a code-specific open-source model agents benchmark video

  • @VishnuSashi-yq3tt
    @VishnuSashi-yq3tt 3 months ago

    Been working on this for 3 months and i see this ughh

  • @MM-vl8ic
    @MM-vl8ic 3 months ago

    Word counter..... I can't get an accurate count from your screenshot..... But from what I can see, it appears that the actual numeric value isn't being counted as a "word" by the script/AI.... what is word counter doing?.....

  • @itamarperez-ryan3654
    @itamarperez-ryan3654 3 months ago

    How can I learn to create agents?

  • @Sadicious
    @Sadicious 3 months ago

    I'd like to see the killers answer to consider that if a killer is killed, but not removed from the room, they are still in the room but dead: There are four killers in the room.
    Are humans inconsistent with counting based on the property of if something is alive or dead? If I have a room of 10 dead cats, and 10 dead dogs, and then asked "How many cats are in the room?", is your answer (or the LLM) going to be zero?

  • @maj373
    @maj373 3 months ago

    Thank you Matthew!

  • @DSeeKer
    @DSeeKer 3 months ago

    The apple test didn’t work, it just added the word apple at the end without it having anything to do with the sentence structure

  • @Tom_Neverwinter
    @Tom_Neverwinter 3 months ago +3

    post the model.

    • @wurstelei1356
      @wurstelei1356 3 months ago

      Link to the MoA github is in the video description.

  • @apoorv28goel
    @apoorv28goel 3 months ago

    They should use Microsoft Orca; it's very small, which can help with the latency problem

  • @adhishtanaka
    @adhishtanaka 3 months ago

    i want to see MoA with codestral, codeqwen & deepseek coder v2

  • @sophiophile
    @sophiophile 3 months ago

    Can't assume the curses one failed just because you're on Windows.

  • @spencerfunk6697
    @spencerfunk6697 3 months ago

    I wanna see the code agents

  • @netherportals
    @netherportals 3 months ago

    ppl in a box

  • @MrMoonsilver
    @MrMoonsilver 3 months ago

    I want to see the code models at work! =)

  • @johnbollenbacher6715
    @johnbollenbacher6715 3 months ago

    Here is a simple question that ChatGPT always gets wrong. “How many p’s are there in the word pepper”.

  • @orthodox_gentleman
    @orthodox_gentleman 3 months ago

    You have to have a very powerful machine to run this so why not just pay $20 a month for 4o or Claude 3.5?

    • @Mbeluba
      @Mbeluba 3 months ago

      Cuz I don't want to rely on corpos for my workflow and I don't like giving them any of my data.

  • @Felipe-zl1rj
    @Felipe-zl1rj 3 months ago +1

    I don't see the advantage. It's probably better to have 1 great model answering in steps with many options of answers and judging each of its own answers

  • @arnaudjean1159
    @arnaudjean1159 3 months ago

    How much time till they fix the code 😂?? And after ?? I bet it will boost again the improvement process

  • @JOHN.Z999
    @JOHN.Z999 3 months ago

    I tested Claude 3.5 in various contexts and, indeed, it is much better than GPT-4o. OpenAI will fall behind if it doesn't launch its best products quickly. Where is Sora? Where is the GPT-4o voice assistant that was also announced? This is concerning, as there are many promises and few real launches.

  • @devlogicg2875
    @devlogicg2875 3 months ago +2

    Without logic-reasoning we are stuck. Claude's logic is terrible. Of course feedback and realtime improvement is what we need...Seems their concern is safety of course....

    • @Mbeluba
      @Mbeluba 3 months ago +1

      Safety, meaning "we want to make sure our LLM won't do anything the media doesn't like, and we're putting a former NSA director on our board"

  • @TheAmanla
    @TheAmanla 3 months ago

    I bought a Vanilla Card in May for $500.00. Someone had used it. I went to Vanilla, and they said I would have to WAIT until September to check it out. $500.00 is sitting there for how long? And even if they find it valid, they will only give me $497.00 back. I do not want to hear about Vanilla at all on any level.

  • @JoJa015
    @JoJa015 3 months ago

    Please do MoA with open source coding models.

  • @gunnerandersen4634
    @gunnerandersen4634 3 months ago

    Can you fix the JSON print? It gives me OCD

  • @zippytechnologies
    @zippytechnologies 3 months ago

    Yep and yep

  • @BradleyKieser
    @BradleyKieser 3 months ago

    Please do MoA coding with coding LLMs and.... Compare to Claude 3.5 Sonnet.

  • @gh0stgl1tch
    @gh0stgl1tch 3 months ago

    Include deepseek coder v2

  • @KnowItAllsChannel
    @KnowItAllsChannel 3 months ago

    Do a code test

  • @njjax2005
    @njjax2005 3 months ago

    Yes moa for code! :)

  • @Cine95
    @Cine95 3 months ago +1

    test 3.5 sonnet

  • @KaasTVNL
    @KaasTVNL 3 months ago

    code-run-debugger?

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 3 months ago +1

    Thanks!

  • @patrickjreid
    @patrickjreid 3 months ago

    I want to see moa for code!

  • @paulsaulpaul
    @paulsaulpaul 3 months ago

    These models have to be able to create their own ideas. Early humanity had nothing but the stars and environment around them, and they created language and all civilization that followed. These models will never be able to infer what I would consider "common sense" without this form of creative thinking.
    They should start by embodying the AI so it is in an environment with limited ability to physically interact with it. It should have sensory input that is of a limited range. These boundaries are important. Might also try to give it some sort of need similar to hunger. Whatever it is, it should be balanced with the environment in a way that "progression" is always five steps forward and two steps back. I think limits in movement and perception are fundamental to humanity's conscious experience.
    The actual AI "brain" should be more than a neural network. When I think of the human brain and consciousness, I see more than an electrochemical neural network. There is a complex constructive and destructive interference between the EM waves produced between neurons. Forming a complex hologram, of sorts. Functioning like a quantum computer in a lot of ways. There is a lot happening in the interactions between the small EM waves produced in the brain that is beyond our ability to model. It's probably integral to our conscious experience.
    The brain is probably much more like a "quantum interferometer" than it is a network of electrochemically active neurons.
    There was a paper published in January 2024 that discusses research that might be useful to people exploring these ideas. It can be googled with title, "Forming complex neurons by four-wave mixing in a Bose-Einstein condensate"
    Four-wave mixing is a simple enough concept to understand, and has many interesting uses. It's very much a quantum process involving entanglement and superposition of states. It's worth looking into if you're interested in this quantum AI stuff.

  • @henrytuttle
    @henrytuttle 3 months ago

    I don't understand how it "beat GPT 4o".

  • @abaddon36332
    @abaddon36332 3 months ago

    Did you actually read the responses to the apple task that you gave a pass to? All it did was tack the word apple onto the end of its sentences in a way that didn't make sense, e.g. "Look at the dog, apple.", "She wept inconsolably, apple." Then it told you that what it had done didn't make any sense, and you read that out, and still gave it a pass. I think you failed the Turing test here, I'd rather see a video prepared by the MoA. Apple.

  • @ggewinneriv
    @ggewinneriv 3 months ago

    +1 comment for Mixture of code

  • @ProstoPutnik
    @ProstoPutnik 2 months ago

    FOUR killers - one of them dead :)

  • @jarnMod
    @jarnMod 3 months ago

    Doesn't sound as exciting as first expected. Snake doesn't move. Sad

  • @efexzium
    @efexzium 3 months ago

    My grandma 👵 beats gpt4o

  • @executivelifehacks6747
    @executivelifehacks6747 3 months ago

    Several idiots do in fact make a genius, confirmed

  • @shaihazher
    @shaihazher 3 months ago

    I use this channel to know about the latest in ai news and releases. It is very valuable that way.
    But the rubric is extremely outdated and robotic. Matt lacks basic creativity he just goes through the motions. His titles are also too click baity and borders on hyperbole.
    He can do better.
    However, he reports on the latest in ai developments very promptly, so very useful that way.

  • @RasmusSchultz
    @RasmusSchultz 3 months ago

    Matt, every time I see a "beats GPT-4o" headline... doesn't everything? I know it performs well in benchmarks - but practically everyone I know hates it with a passion. Repetition, hallucinations, ridiculous verbosity. It's OpenAI's worst model by far.

    • @gr8tbigtreehugger
      @gr8tbigtreehugger 3 months ago

      My custom instructions seem to keep it in check.

    • @RasmusSchultz
      @RasmusSchultz 3 months ago

      @@gr8tbigtreehugger what did you put in there, "do not hallucinate"? haha

  • @njorgard
    @njorgard 3 months ago +44

    When are you testing Claude Sonnet 3.5?

  • @MonkeyBars1
    @MonkeyBars1 3 months ago +5

    Finally the ball didn't end up in the microwave!! 🎉

    • @netherportals
      @netherportals 3 months ago

      "End a sentence with the word apple" "No" "Okay, end a sentence with the word apple" "Apple".

  • @Timotheeee1
    @Timotheeee1 3 months ago +8

    11:40 it just wrote random sentences and added ", apple" at the end of them

    • @marc_frank
      @marc_frank 3 months ago +1

      yeah it's not very smart in that regard

    • @MonkeyBars1
      @MonkeyBars1 3 months ago +1

      fail not pass

    • @matthew_berman
      @matthew_berman  3 months ago +1

      I'll still count it :)

    • @Cine95
      @Cine95 3 months ago +1

      but it is correct

    • @MonkeyBars1
      @MonkeyBars1 3 months ago +1

      @@matthew_berman a sentence is determined by syntax not just punctuation, so your prompt was not fulfilled.

  • @KC_79
    @KC_79 3 months ago +28

    The combination of 4 models is 452B, and I can't even run a 72B model. I guess I will continue to use large models from big companies.

    • @CaponeBlackBusiness
      @CaponeBlackBusiness 3 months ago

      Sell your organs 😏

    • @oratilemoagi9764
      @oratilemoagi9764 3 months ago +10

      72B dawg I barely run a 8b model

    • @handsanitizer2457
      @handsanitizer2457 3 months ago +1

      Hmm, maybe I'll try it with phi, qwen 0.5 to 7b, etc

    • @Derick99
      @Derick99 3 months ago +2

      Yea 4 x phi3

    • @blisphul8084
      @blisphul8084 3 months ago +1

      ​​​@@Derick99 What about 2Qwen 1 Phi?
      As in 2 Qwen2 1.5b agents and 1 Phi 3 mini model.
      Two 1.5b models and one 4b model could be an interesting MoA option that fits on one GPU.

  • @bosthebozo5273
    @bosthebozo5273 3 months ago +3

    Can't wait for the Sonnet video Matt! So far, I've created about 6 basic games like a simple RTS, strategy card game, jpg puzzle generator, asteroids, endless racer and of course snake... often in one shot. This model is insane in terms of progress.

  • @dee132456
    @dee132456 3 months ago +2

    Is it really a fair test? Since they are 4 LLMs through 3 layers, it would be like asking GPT-4o 12 questions. To test whether multiple different LLMs are better, you'd have to run MoA using just GPT-4o as all 4 agents.

  • @asastudios7782
    @asastudios7782 3 months ago +4

    GPT-4o nails the apple test
    1. Dude the metaphysical implications of quantum entanglement challenge our understanding of causality dude.
    2. Dude the dialectical synthesis of thesis and antithesis reveals the evolution of human thought dude.
    3. Dude the existential dilemma of free will versus determinism remains an enigma dude.
    4. Dude the ontological argument for the existence of a higher power transcends empirical evidence dude.
    5. Dude the phenomenology of consciousness illuminates the subjective nature of reality dude.
    6. Dude the epistemological pursuit of knowledge questions the limits of human understanding dude.
    7. Dude the ethical relativism in cultural contexts underscores the complexity of moral philosophy dude.
    8. Dude the teleological perspective on the universe suggests an inherent purpose to existence dude.
    9. Dude the interplay between chaos and order is fundamental to the fabric of the cosmos dude.
    10. Dude the hermeneutics of interpreting ancient texts unveils the timelessness of human wisdom dude.

    • @wurstelei1356
      @wurstelei1356 3 months ago

      Dude the balls grow exponentially with each sentence dude.

    • @dulinak6251
      @dulinak6251 3 months ago

      Dude this is art dude

  • @noeservellon
    @noeservellon 3 months ago +3

    can you make an episode on how to run this locally? It would be interesting to see this run with SLMs instead of LLMs

    • @brulsmurf
      @brulsmurf 3 months ago

      locally on your 30000€ GPU?

    • @wurstelei1356
      @wurstelei1356 3 months ago

      I think this is running locally. Still a tutorial on how to run the MoA code from the github repo would be great.