How Did Open Source Catch Up To OpenAI? [Mixtral-8x7B]

  • Published: 31 Jan 2024
  • Sign-up for GTC24 now using this link!
    nvda.ws/48s4tmc
    For the giveaway of the RTX 4080 Super, the full detailed plans are still being developed. However, it'll be along the lines of you taking a photo of yourself attending a GTC virtual session, so you can sign up for the conference now to set an early reminder!
    What is Mixtral 8x7B? What is the secret Mixture of Experts (MoE) technique behind it that has beaten OpenAI's GPT-3.5, which was released around a year ago? In this video, you will learn what Mixtral 8x7B is and how Mixture of Experts works, and why MoE has become the new rising standard of LLM format.
    Mixture of Experts
    [Paper] arxiv.org/abs/2401.04088
    [Project Page] mistral.ai/news/mixtral-of-ex...
    [Huggingface Doc] huggingface.co/docs/transform...
    This video is supported by the kind Patrons & RUclips Members:
    🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi
    [Discord] / discord
    [Twitter] / bycloudai
    [Patreon] / bycloud
    [Music] massobeats - waiting
    [Profile & Banner Art] / pygm7
    [Video Editor] @askejm
  • Science

Comments • 264

  • @bycloudAI
    @bycloudAI  4 месяца назад +25

    Sign-up for GTC24 now using this link! nvda.ws/48s4tmc
    RTX4080 Super Giveaway participation link: forms.gle/2w5fQoMjjNfXSRqf7
    oh and check out my AI website leaderboard if you're interested! leaderboard.bycloud.ai/
    Here's a few corrections/clarifications for the video:
    - it's named 8x7B but has ~47B rather than 56B parameters, because it duplicates only around 5B of Mistral-7B's parameters 8 times (=40B); then there's an extra ~7B for attention and other shared components (40B+7B=47B)
    - experts are not individual models, but only the FFNs (feed-forward networks) from Mistral-7B
    - experts operate not just on a per-token level, but also on a per-layer level. Between each feed-forward layer, the router assigns that token to an expert, and each layer can have a different expert assignment.
    - the router is layer-based, and it decides which experts to activate for the next layer, so the assignment looks different at each layer 3:26. You can refer to this diagram imgur.com/a/Psl1Fi4 for an example of the activations that might happen for any given token; it picks a new set of 2 experts at every layer (see the sketch below)
    special thanks to @ldjconfirmed on X/twitter for the feedback
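    To make the last two corrections concrete, here is a minimal, illustrative sketch of a top-2 MoE feed-forward block in PyTorch. The class name, dimensions, and structure are my own assumptions for illustration, not Mixtral's actual code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top2MoEFeedForward(nn.Module):
        # Toy sketch: 8 FFN "experts" behind a per-token router. One of these
        # blocks would sit in every transformer layer, each with its own router.
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # gating network
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            self.top_k = top_k

        def forward(self, x):                                  # x: (n_tokens, d_model)
            logits = self.router(x)                            # (n_tokens, n_experts)
            weights, picked = logits.topk(self.top_k, dim=-1)  # 2 experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):                     # only the picked FFNs run
                for e, expert in enumerate(self.experts):
                    mask = picked[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    # Because every layer has its own router, the same token can be routed to a
    # different pair of experts at every layer, as described above.
    tokens = torch.randn(4, 512)
    print(Top2MoEFeedForward()(tokens).shape)  # torch.Size([4, 512])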

    • @Psyda
      @Psyda 26 дней назад

      Who won?

  • @MCasterAnd
    @MCasterAnd 4 месяца назад +1151

    I love how OpenAI is called "Open" yet it's just closed source

    • @cat-le1hf
      @cat-le1hf 4 месяца назад +190

      there needs to be a law that says you can't just use "open" for non-FOSS projects

    • @boredSoloDev
      @boredSoloDev 4 месяца назад +159

      IIRC it was originally an idea for a non-profit open AI system... but they realized how much money they could make, and now it's a private for-profit company that's basically run by Microsoft

    • @lemmyboy4107
      @lemmyboy4107 4 месяца назад +10

      @@boredSoloDev If nothing has changed, then OpenAI is still a non-profit, but it has a for-profit child company which in theory funds the non-profit.

    • @MrGenius2
      @MrGenius2 4 месяца назад +8

      it started open, that's why it's named like that

    • @caty863
      @caty863 4 месяца назад +1

      @@lemmyboy4107 Name one project they have that is "open", and then go on with the "oh, they are a non-profit!" rambling

  • @alexander191297
    @alexander191297 4 месяца назад +269

    Kinda like GLaDOS with its cores, where each model is a core. So maybe, Portal had some prescience regarding how modern AI will work!

    • @gustavodutra3633
      @gustavodutra3633 4 месяца назад +40

      No, Aperture Science is real, so Borealis must be too, so... Half-Life 3 confirmed???? @@xenocrimson
      Actually, I think someone at Valve knew a thing or two about AI and decided to put some references in Portal.

    • @Mikewee777
      @Mikewee777 4 месяца назад +7

      GLaDOS combined emotional personality cores with the remains of a corpse.

    • @choiceillusion
      @choiceillusion 4 месяца назад +2

      Gabe Newell talking about BCI brain computer interfaces 5 years ago is a fun topic.

    • @starblaiz1986
      @starblaiz1986 4 месяца назад +2

      SpaAaAaAaAaAaAaAaAace! 😂

    • @aloysiuskurnia7643
      @aloysiuskurnia7643 4 месяца назад

      @@gustavodutra3633 Cave Johnson is real and he's been watching the future the whole time.

  • @andru5054
    @andru5054 4 месяца назад +499

    1/8th of neurons is me as fuck

    • @Mkrabs
      @Mkrabs 4 месяца назад +28

      You are a mixture, Harry!

    • @zolilio
      @zolilio 4 месяца назад +5

      Same but with single neuron

    • @dontreadmyusername6787
      @dontreadmyusername6787 4 месяца назад +6

      That means you have excellent computation time
      I usually lag when prompted due to memory-intensive background tasks (horrible memories) that keep running all the time, since my neural net is mostly trained on traumatic experiences

    • @dfsgjlgsdklgjnmsidrg
      @dfsgjlgsdklgjnmsidrg 4 месяца назад +2

      @@Mkrabs gaussian mixture

    • @kormannn1
      @kormannn1 3 месяца назад

      I'm 8/8 neurons with Ryan Gosling fr

  • @brianjoelbasualdo7436
    @brianjoelbasualdo7436 4 месяца назад +15

    Makes sense "Mixture of Experts" is a powerful model. In some parts of the human (and other superior mammals) cortex, neurons adopt a configuration where groups of neurons are "relatively separated" from others, in the sense that they conform separate networks which aren't much connected between them.
    In some books (pardon me for my english, I dont know the specific term), they are reffered as "cortical cartridges" ("cartuchos corticales", in spanish), since they have a "column" disposition, where each column/network is by the side of others.
    This is the case for the processing of visual information in the context of guessing the orientation of an object.
    Different cartridges "process" the image brought by the retina, tilted by some degree.
    One cartridge processes the image tilted by 1°
    The next one by 3°
    Etc...
    This way we can generalize the concept of orientation for a certain object, in such a way that no matter it's orientation, we can still recognize that object.
    For example, this allows us to be able to recognize the same tea cup in different orientations, rather that seeing the same teacup in different orientations and thinking they are two distinct ones.
    I am not a software engineer, but it amazes me how the most powerful models are often the same ones used by biology.

  • @iAkashPaul
    @iAkashPaul 4 месяца назад +69

    Mixtral needs ~28GB of VRAM as an FP4/NF4-loaded model via TGI/Transformers. So you can attempt to load it with a 4090 or a lesser card using 'accelerate' and a memory-map config to plug in system memory alongside VRAM.
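    In case anyone wants to try what this comment describes, a rough sketch with Transformers + bitsandbytes + accelerate is below. The max_memory split (22 GiB GPU / 48 GiB CPU) is just an assumption you'd tune to your own hardware:

    # Sketch: load Mixtral in 4-bit NF4 and let accelerate split it across
    # GPU VRAM and system RAM. Memory figures are assumptions, not requirements.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",                        # accelerate decides placement
        max_memory={0: "22GiB", "cpu": "48GiB"},  # spill the rest to system RAM
    )

    prompt = "Explain mixture of experts in one sentence."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))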

  • @MonkeyDLuffy368
    @MonkeyDLuffy368 4 месяца назад +33

    Mistral/Mixtral is not open source, it's open weight. We have no idea what went into the model nor do we have their training code. There is quite literally no 'source' released at all. It's baffling to see so many people call these things 'open source'. It's about as open source as adobe software.

    • @ghasttastic1912
      @ghasttastic1912 4 месяца назад +8

      Open weight is still better than OpenAI. You can theoretically run it on your PC

    • @theairaccumulator7144
      @theairaccumulator7144 4 месяца назад +2

      ​@@ghasttastic1912just need a workstation card

    • @ghasttastic1912
      @ghasttastic1912 4 месяца назад +2

      @@theairaccumulator7144 lol

    • @thatguyalex2835
      @thatguyalex2835 2 месяца назад +1

      @@theairaccumulator7144 Well, Mistral Instruct 7B runs fine on my 2018 laptop's CPU (95-99% of my 8 GB of RAM is used), but 8x7B probably won't.

  • @thegiverion3982
    @thegiverion3982 4 месяца назад +9

    >Mom can we have fireship?
    >We have Fireship at home.
    >Fireship at home:

  • @Steamrick
    @Steamrick 4 месяца назад +187

    Isn't GPT-4 basically 8x220B? I read that it's composed of eight very chunky LLMs working together.
    I have no idea how the input and output are generated between the eight of them, though, so there could be huge differences between how GPT-4 and Mixtral-8x7B work and I wouldn't know.

    • @Words-.
      @Words-. 4 месяца назад +70

      Yeah, it's been rumored for a long time that GPT-4 already uses a mixture of experts...

    • @HuggingFace_Guy
      @HuggingFace_Guy 4 месяца назад +7

      It's speculated that it's MoE, but we're not sure about that

    • @whannabi
      @whannabi 4 месяца назад +11

      Since they don't wanna reveal their secrets, it's hard to know

    • @michaeletzkorn
      @michaeletzkorn 4 месяца назад +6

      @@Words-. It's public information that DALL-E 3, Whisper, and other models are included alongside GPT in ChatGPT-4

    • @petal9547
      @petal9547 4 месяца назад +27

      @@michaeletzkorn That makes it multimodal. MoE is a different thing. We know it's multimodal, and it's very likely it's MoE too.

  • @bigphatballllz
    @bigphatballllz 4 месяца назад +15

    Amazing job explaining the paper; precise and engaging! Perhaps also mention that this is one of the very few models released under Apache 2.0 - a completely open-source and permissive licence! Super excited about how the open-source community takes this forward!

  • @FelixBerlinCS
    @FelixBerlinCS 4 месяца назад +31

    I have seen several videos about Mixtral before and still learned something new. Thanks

  • @CMatt007
    @CMatt007 4 месяца назад +4

    This is so cool! A friend recommended this video, and now I'm so glad they did, thank you!

  • @diophantine1598
    @diophantine1598 4 месяца назад +27

    To clarify, Mixtral's experts aren't separate complete models. Two "experts" are chosen for every layer of the model, not just once for every token it generates.

  • @oscarmoxon102
    @oscarmoxon102 4 месяца назад +6

    This is such an interesting explanation. Thank you.

  • @titusfx
    @titusfx 4 месяца назад +10

    What amazes me is how little it's discussed that, in the end, with the GPTs we are training OpenAI's expert models, because they use an MoE architecture. Now, with the mentions, we are also training the model that selects the expert... a free community for a closed model.
    Everyone knows that they use the data to train models; the small difference is that the MoE architecture means several expert models, not just one

    • @Slav4o911
      @Slav4o911 4 месяца назад +1

      That's why I don't use any of OpenAI models, I'm not going to pay them and then train their models for free.

    • @XrayTheMyth23
      @XrayTheMyth23 4 месяца назад

      @@Slav4o911 you can just use gpt for free though? lol

    • @Slav4o911
      @Slav4o911 4 месяца назад +1

      @@XrayTheMyth23 I have no interest in wasting my time with brain-dead models, because that's what "GPT for free" is. And I hope your suggestion was just a joke.

  • @colonelcider8292
    @colonelcider8292 4 месяца назад +205

    AI bros are too big brain for me

    • @GeorgeG-is6ov
      @GeorgeG-is6ov 4 месяца назад +34

      you can learn, if you care about the subject, educate yourself.

    • @joaoguerreiro9403
      @joaoguerreiro9403 4 месяца назад +17

      Study computer science bro! You’ll love it!

    • @colonelcider8292
      @colonelcider8292 4 месяца назад +3

      @@GeorgeG-is6ov no thanks, I'm not learning another course on top of the one I am already doing

    • @whannabi
      @whannabi 4 месяца назад

      ​@@colonelcider8292be greedy

  • @guncolony
    @guncolony 4 месяца назад

    This is super interesting because one could envision each expert eventually being hosted on a different cluster of machines in the datacenter. Hence you only need enough VRAM for a 7B model on each of the machines, meaning much lower cost, but the entire model performs as well as a 56B model.

  • @peterkis4798
    @peterkis4798 4 месяца назад +40

    To clarify: Mixtral implements layer-wise experts/routers and picks 2 of them, based on the router output, at every layer of each forward pass that generates a new token. That means layer 1 might run experts 4 and 5 while layer 2 runs 6 and 2, etc.
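    A tiny simulation of what that bookkeeping looks like for a single token; the scores are random stand-ins for real router logits, purely to show that the top-2 pick is made independently at every layer:

    # Simulated per-layer top-2 expert picks for one token (random scores, not a real router).
    import random

    n_layers, n_experts, top_k = 8, 8, 2
    for layer in range(n_layers):
        scores = [random.random() for _ in range(n_experts)]  # stand-in for router logits
        picked = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)[:top_k]
        print(f"layer {layer}: experts {picked}")  # e.g. layer 0: experts [4, 5]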

  • @saintsscholars8231
    @saintsscholars8231 4 месяца назад

    Nice quirky explanation, thanks !

  • @dockdrumming
    @dockdrumming 4 месяца назад +3

    The Mixtral 8x7B model is really good. I have been using it from Python code to generate stories. It is also rather fast: I am seeing fairly quick inference times on Runpod with 4 A40 GPUs.

  • @abunapha
    @abunapha 4 месяца назад +4

    what is that leaderboard you showed at 0:54? I want to see the rest of the list

  • @MustacheMerlin
    @MustacheMerlin 4 месяца назад +1

    Note that while we don't _know_ for sure, since OpenAI hasn't said it publicly, it's pretty generally accepted that GPT4 is a mixture of experts model. A rumor, but a very credible one.

  • @amafuji
    @amafuji 4 месяца назад +34

    "I have no moat and I must scream"
    -OpenAI

  • @grabsmench
    @grabsmench 4 месяца назад +6

    So a bot only uses 12.5% of their brain at any given moment?

  • @clarckkim
    @clarckkim 4 месяца назад +11

    An addition that you probably didn't know: GPT-4 is a Mixture of Experts, supposedly 16 GPT-3-sized models. That was leaked in August.

  • @Nik.leonard
    @Nik.leonard 4 месяца назад +7

    Running Mixtral (or dolphin-mixtral) on CPU+GPU is not that terrible. I get 5-7 tokens per second on a Ryzen 5600X (64 GB DDR4-3200 RAM) + RTX 3060 12 GB with 4-bit quantization. I consider that "usable", but YMMV.
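    For anyone wanting to reproduce that kind of CPU+GPU split, one option is the llama-cpp-python bindings; the GGUF file name and the number of offloaded layers below are assumptions to adjust for your own VRAM:

    # Sketch: partial GPU offload of a 4-bit Mixtral GGUF via llama-cpp-python.
    # model_path is a hypothetical local file; raise n_gpu_layers until VRAM is full.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
        n_ctx=4096,        # context window
        n_gpu_layers=12,   # layers offloaded to the GPU, the rest stays on the CPU
        n_threads=8,       # CPU threads for the non-offloaded layers
    )

    out = llm("Q: What is a mixture of experts model?\nA:", max_tokens=128)
    print(out["choices"][0]["text"])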

  • @Mustafa-099
    @Mustafa-099 4 месяца назад

    The sprinkle of random memes makes your content fun to watch :)

  • @THEWORDONTECH
    @THEWORDONTECH 3 месяца назад

    I was going to hit the subscribe button, but I'm already a subscriber. Solid content!

  • @davids5257
    @davids5257 4 месяца назад

    I don't really understand the topic much, but it sounds very revolutionary to me

  • @DAO-Vision
    @DAO-Vision 4 месяца назад

    Mixtral is interesting, thank you for making this video!

  • @ImmacHn
    @ImmacHn 4 месяца назад +1

    This is what we call distributed computing power. Lots of people solving small problems create a spontaneous order that far surpasses centralized organization.

  • @planktonfun1
    @planktonfun1 4 месяца назад

    Looks like they used an ensemble learning approach; not surprised, since it's mostly used in competitions

  • @HFYShortStories
    @HFYShortStories 4 месяца назад +4

    Mistral did not come up with the idea of MoE for neural networks, nor were they the first to create an MoE LLM, and GPT-4 has also been (credibly) leaked to be MoE.

    • @robonator2945
      @robonator2945 4 месяца назад +15

      I still find it so funny that shit like that has to be 'leaked'. "We're OpenCars, but don't fucking ask us how our engines work, you don't get to know. We're open tho."
      Imagine a party that's open invite, but only if you get an invitation, otherwise you're not allowed.

  • @brainstormsurge154
    @brainstormsurge154 4 месяца назад +2

    This is the first time I've heard anything about a mixture-of-experts model; it's very interesting just from what's presented here. At 3:38 you talk about how the router chooses several experts based on context rather than subject. I'm curious whether that actually works better than having just one expert for a given subject, such as programming, if the router could route by subject alone.
    It would be interesting if someone were able to get the model to behave that way and have the subject-based model compete with the current context-based model to see which one performs better.
    Makes me think about how our own brain compares. While I know this still runs on regular hardware and not neuromorphic hardware (it's getting there soon), it would be interesting nonetheless.

    • @chri-k
      @chri-k 4 месяца назад

      It might be better to combine both and use two different sets of experts and two routers

  • @ipsb
    @ipsb 4 месяца назад

    I have a question: does the Pareto principle still hold true when it comes to these LLMs?

  • @Faceless_King-tc7kt
    @Faceless_King-tc7kt 4 месяца назад

    What was the chart comparing the AI models' performance? May we have the link?

  • @realericanderson
    @realericanderson 4 месяца назад +1

    1:18 the CoD interface for the different experts got me good, thanks cloud

  • @johnflux1
    @johnflux1 4 месяца назад +1

    Hey, I want to highlight that the router is choosing the best two models **per token**. So for a single question, it will be using many (usually all) of the models. You do say this in the second half of the video, but in the first half you said "the router chooses which two models to use for a given question or prompt". But the router is choosing which two models to use for each token.

    • @starblaiz1986
      @starblaiz1986 4 месяца назад

      Not only that but it's also making that choice **per layer** for each token too. So one token will also have many (often all) experts chosen at some point in the token generation.

  • @itisallaboutspeed
    @itisallaboutspeed 4 месяца назад +5

    as a car content creator i approve this video

    • @thatguyalex2835
      @thatguyalex2835 2 месяца назад +1

      Mistral's smaller model, the 7B, had some bad things to say about German cars though. So maybe you can use AI to help diagnose your cars. :)
      This is a small snippet of what Mistral 7B said:
      BMW X5: Issues with the automatic transmission and electrical system have been reported frequently in the 2012-2016 models.
      Mercedes C-Class: Several complaints about electrical problems, particularly with the infotainment system and power windows in models from 2014 to 2020.

    • @itisallaboutspeed
      @itisallaboutspeed 2 месяца назад

      @@thatguyalex2835 I Have To Agree Man

  • @bibr2393
    @bibr2393 4 месяца назад +9

    AFAIK VRAM requirements are not that high for Mixtral. Sure, it's not 13B level, but you can run it at 6 bpw exl2 (a quant around Q5_K_M for the GGUF file type) with 36 GB of VRAM, so an RTX 3090 + RTX 3060.

    • @JorgetePanete
      @JorgetePanete 4 месяца назад +6

      I like your funny words, magic man

    • @Octahedran
      @Octahedran 4 месяца назад +1

      Managed to get it running with 20 GB of VRAM, although just barely. It couldn't hold a conversation without running out of memory, and I had to do it from a raw Arch terminal

    • @MINIMAN10000
      @MINIMAN10000 4 месяца назад

      @@JorgetePanete 6bpw means 6 bits per weight; an LLM is a collection of weights. exl2 refers to ExLlamaV2, which, like llama.cpp, is used to run the LLM, only faster and smaller (not sure how that works, but it does). Quant, short for quantization, refers to shrinking the size of the weights: basically chopping off a bunch of information in the hope that none of it was important, which usually works fine depending on how much you chop off. In Q5_K_M, Q5 means 5 bits per weight, as opposed to the 16-bit precision commonly used in training. From what I can tell, _K refers to k-quants (apparently meaning clustering), and _S/_M/_L refer to small, medium, and large, in increasing size; from what I can find, the larger variants increase the precision of attention.wv and feed_forward.w2, which play a large part in quality. GGUF is a file type created specifically to pack the whole collection of files we used to have into one single file.
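    The size arithmetic behind those bits-per-weight numbers is simple; a back-of-the-envelope calculation (using ~47B parameters for Mixtral, weights only, ignoring the KV cache and runtime overhead) looks like this:

    # Rough model size from bits per weight; 47e9 parameters is an approximation.
    params = 47e9
    for bpw in (16, 8, 6, 5, 4):
        print(f"{bpw} bits/weight -> ~{params * bpw / 8 / 1e9:.0f} GB")
    # 16 -> ~94 GB, 8 -> ~47 GB, 6 -> ~35 GB (close to the 36 GB mentioned above),
    # 5 -> ~29 GB, 4 -> ~24 GB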

  • @AngryApple
    @AngryApple 4 месяца назад +1

    Mixtral works incredibly well on Apple Silicon.
    I use a 64GB M2 Max

  • @capitalistdingo
    @capitalistdingo 4 месяца назад +2

    Funny, I thought I had heard that there were cross-training advantages whereby training models with data for different, seemingly unrelated tasks improved their performance but this seems to suggest that smaller models with more focus are better. Seems like a bit of a contradiction. Which view is correct?

    • @MINIMAN10000
      @MINIMAN10000 4 месяца назад

      So the answer is that both are true, if we exclude the term "best" (no one defines "best" in LLMs; the field changes too much for anyone to say which is best). Certain things like programming have been shown to increase a model's ability to adhere more strictly (programming fails on any mistake but is very well structured), and this behavior improves the model universally, not just in programming. Mixtral is 47B and is impressive; however, Mistral is 7B and is impressive for its size, and Miqu shows that Mistral Medium is a 70B model that is again impressive for its size, so we can't conclude one way or the other whether Mixtral is disproportionately good. But what it did prove, without a doubt, is that MoE works.

  • @DanielSeacrest
    @DanielSeacrest 4 месяца назад +1

    Well, a few slight inaccuracies here. For example, MoE is not multiple models joined together; that is a common misunderstanding. It can be built from multiple models stitched together, but that is not the original intention of MoE. The whole point of MoE is to reduce the number of parameters active at inference, not to have multiple different domain-specific models working together. A MoE model like GPT-4 is one model; during pretraining, specific sections (or experts) of that model specialised in specific parts of its large dataset.
    I definitely think the "experts" of MoE really tripped people up as well. As I said, they were not supposed to be separate domain-specific models, just different sections of one model specialised to a specific part of a dataset. In Mixtral's case, to my understanding, they instantiated several Mistral-7B models and trained them all on a dataset with the gating mechanism in place, but the problem is that there are a lot of wasted parameters that have duplicated knowledge from the initial pretraining of the 7B model. It would be a lot more efficient to train a MoE model from scratch.

  • @user-ti6sq5yu3f
    @user-ti6sq5yu3f 3 месяца назад

    Wow this mistral model is impressive.

  • @potpu
    @potpu 4 месяца назад

    How does it compare with the miqu model in terms of architecture and running it on lower spec hardware?

    • @Slav4o911
      @Slav4o911 4 месяца назад +1

      Miqu is better than Mixtral; they say it should be almost as good as GPT-4 when it's ready. But it's still not ready; what was "leaked" by accident is more like an alpha version. I think the open-source community will reach GPT-4 quality about 4-6 months from now. The open-source community is much more motivated than the people working for OpenAI, so I have no doubt the real open-source community will outperform them within a year at most, and yes, I think by GPT-5 we will be much closer than we are now. Once the open-source community outperforms the closed models, there is no going back; those closed models would never have a chance to catch up.

  • @aviralpatel2443
    @aviralpatel2443 4 месяца назад

    Considering Mixtral came before the launch of Gemini 1.5 Pro (which also uses the MoE method) and is open source, is it safe to assume that Google might have taken inspiration from this open-source model? If they did, dang, the open-source AI models are upping their game pretty quickly.

  • @waterbot
    @waterbot 3 месяца назад

    GTC is gonna be hype this year

  • @LumiLumiLumiLumiLumiLumiLumiL
    @LumiLumiLumiLumiLumiLumiLumiL 4 месяца назад

    Mistral Medium is even more insane

  • @zaman.tasiin
    @zaman.tasiin 4 месяца назад

    Damn I signed up but how am I going to follow along? I don't have an Nvidia GPU?

  • @holdthetruthhostage
    @holdthetruthhostage 4 месяца назад +1

    They will be launching Mistral Medium soon, which is even more powerful

  • @animationmann6612
    @animationmann6612 4 месяца назад +2

    I hope that in the future we'll need less VRAM for better AI, so we can actually use these models on our phones.

  • @diophantine1598
    @diophantine1598 4 месяца назад

    You should cover Sparsetral next, lol. It only has 9B parameters, but has 16x7B experts.

  • @DoctorMandible
    @DoctorMandible 4 месяца назад +1

    Open source didn't so much catch up as closed source temporarily jumped ahead. Open source was the leading edge before OpenAI existed. And now we are the leader again.

  • @razzledazzlecheeseontoast9808
    @razzledazzlecheeseontoast9808 4 месяца назад +1

    Experts seem like the lobes of the brain. Different specialities working synergistically.

  • @luciengrondin5802
    @luciengrondin5802 4 месяца назад +6

    Shouldn't a neural network, in essence, be able to do something like that? I mean, shouldn't the training process naturally result in a segmentation of the network into various specialized areas?

    • @joaoguerreiro9403
      @joaoguerreiro9403 4 месяца назад +1

      Very good take! I'm wondering the same... but it makes sense if you ask me! For instance, think of residual learning. ResNet was introduced with the idea that skip-connections can ease training, since it is easier to learn the difference between the input and output than to learn the full mapping. A neural network could, in principle, learn that difference, but without the skip-connection the number of possible computational paths is enormous! The skip-connection lets you enforce such residual learning :D The same is analogous for mixture of experts: now you enforce specialised areas conditioned on the input!
      Sorry if you don't have a Computer Science background. I can try to explain in a different way to help you understand :)

    • @borstenpinsel
      @borstenpinsel 4 месяца назад +1

      Why? In our brains, certain skills being associated with certain regions surely is not a function of a neural network. If it were, everybody's brain would look different (light up in different parts in whatever imaging they use, in order to tell that speech is here and music is there).
      So isn't it more likely that evolution "decided": "the input from the eyes goes here, the input from the ears goes here... and here are bridges to combine the info"?
      Instead of dumping every single nerve impulse into one huge network and expecting some sort of stream?

    • @shouma1234
      @shouma1234 4 месяца назад +2

      You’re right, but I think the point is the performance advantage of only having to run 2/7 of the model at a time instead of the whole model every time. More performance means you can make a bigger model in the long run at less cost

    • @MINIMAN10000
      @MINIMAN10000 4 месяца назад +2

      @@shouma1234 I assume you mean higher performance at a lower runtime cost. Because it still has the same training cost and the same size. You just iterate over a small portion of the whole model at a time. This means lower cost per token at runtime.

  • @Shnugs
    @Shnugs 4 месяца назад

    So what happens if the number of parallelized experts is ramped up to N? What happens if there are layers of routers? Where does the performance plateau?

  • @David-lp3qy
    @David-lp3qy 4 месяца назад +1

    This smells like how sensory organs are specifically innervated to lobes dedicated to processing their particular modality of information. I wonder if having genuinely specialized experts would yield better results than the current model

  • @mrrespected5948
    @mrrespected5948 4 месяца назад

    Nice

  • @TiagoTiagoT
    @TiagoTiagoT 4 месяца назад

    I'm not sure if I'm misunderstanding something and not actually doing what I think I'm doing, but I have managed to make the GGUF variant run with 16GB of VRAM

    • @juanjesusligero391
      @juanjesusligero391 4 месяца назад

      GGUF models are quantized models. That means less precision, hence lower quality in the model's answers (it's kinda like keeping fewer decimal places in a division; see the toy example after this thread). But the good part is that they require lower specs :)

    • @TiagoTiagoT
      @TiagoTiagoT 4 месяца назад

      @@juanjesusligero391 Ah, right, forgot about that detail
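    To make "less precision" concrete, here is a toy round-trip through naive absmax 4-bit quantization. This is NOT the block-wise k-quant scheme GGUF actually uses, just a demonstration of why fewer bits per weight means coarser values:

    # Toy absmax 4-bit quantization round-trip (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=8).astype(np.float32)  # pretend these are model weights

    scale = np.abs(w).max() / 7                # signed 4-bit range is -8..7
    q = np.clip(np.round(w / scale), -8, 7)    # what gets stored, as 4-bit ints
    w_hat = q * scale                          # what the runtime reconstructs

    print("original :", np.round(w, 3))
    print("restored :", np.round(w_hat, 3))
    print("max error:", float(np.abs(w - w_hat).max()))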

  • @favianeduardo4236
    @favianeduardo4236 3 месяца назад

    We finally cracked the code people!

  • @JamesRogersProgrammer
    @JamesRogersProgrammer 3 месяца назад

    Mixtral 8x7B runs fast on CPU. I am running a 5-bit quantized version on two different machines with no GPU but with 64GB of RAM, and I'm getting great performance out of it, using the Mixtral-compatible version of llama.cpp.

  • @initialsjd5867
    @initialsjd5867 25 дней назад

    I have been running Mixtral for a while now on a Surface Laptop 5 with 32 GB of RAM, a 12th-gen Core i7, and no dedicated GPU, running Fedora 39. The first prompt always generates quite fast, but the more you ask it in a single terminal window, the slower it gets over time; just opening a new window fixes that. Though I'm going to try it now on a 4080 Super with a 13th-gen Core i7 and 32 GB of RAM. Fireship has a video about it in which he also goes over how to easily install it. The minimum is 32 GB of RAM, I would say, although I noticed Fedora Linux doesn't use nearly as much RAM in the background as Windows 11, maybe a tenth.

  • @Romogi
    @Romogi 4 месяца назад +3

    All those layoffs will give many software developers plenty of time.

  • @kipchickensout
    @kipchickensout 4 месяца назад

    I can't wait for these to run without such high hardware requirements, as well as offline...
    and without any black-box restrictions (looking at you, OpenAI). They should be configurable

  • @aniksamiurrahman6365
    @aniksamiurrahman6365 4 месяца назад

    Wow!

  • @mateoayala9569
    @mateoayala9569 3 месяца назад

    Mistral Medium seems very powerful.

  • @exploreit7466
    @exploreit7466 4 месяца назад +1

    I need 3D inpainting as soon as possible, but it's not working properly. Please make that video again

  • @pictzone
    @pictzone 4 месяца назад

    Omfg this video is pure gold thank you thank you thank you

  • @dishcleaner2
    @dishcleaner2 4 месяца назад

    I ran a GGUF-quantized version on my RTX 4090 and it was comparable in quality and speed to ChatGPT when it first came out.

  • @szymex22
    @szymex22 4 месяца назад

    I found a smaller version of Mixtral and it runs OK on CPU.

  • @hobocraft0
    @hobocraft0 4 месяца назад +11

    We really need a paradigm shift where we don't have to multiply huge sparse matrices together and the router is part of the architecture itself, kind of like how the brain doesn't 'run' the neurons that aren't firing.

    • @vhateg
      @vhateg 4 месяца назад +3

      Yes, but if this world were a simulation, the simulation would need to run the brain neurons that don't fire too, because it would have no way of knowing how they would behave. A computer is the simulator of a brain, not the simulation itself (that is the LLM).
      But there can definitely be optimizations to remove parts that are unused, as you said. I agree with you that there should be some huge shift. If sparser matrices aren't enough, then removal of neurons might be the next step. It would change the topology of the network dynamically, and that is just too much to even imagine. 😂

  • @pajeetsingh
    @pajeetsingh 4 месяца назад

    Every model using google’s tensorflow. That’s some solid monopoly.

  • @JoeD0403
    @JoeD0403 4 месяца назад

    The problem with AI at the moment is the technology is ahead of practical usage. If social media existed in the early 80s, there would be videos made about every single PC model coming out, even though the most popular “computer” was the Atari 2600. The bottom line is, you need lots of proprietary data for the AI to process in order to generate any real value that can be monetized.

  • @romainrouiller4889
    @romainrouiller4889 4 месяца назад

    @Fireship brother?

  • @BeidlPracker-vb8en
    @BeidlPracker-vb8en 4 месяца назад

    This is definitely too many cuts for my MySpace era brain.

  • @drgutman
    @drgutman 4 месяца назад +3

    It didn't... Mixtral rickrolled me while I was working with it. I was talking with it in LM Studio about some Python code, and after a few exchanges, at the end of a reply I got a YouTube link. I thought, ohh, it's a tutorial or something. Nope: Rick Astley - Never Gonna Give You Up.
    So, yeah. Powerful model, but tainted (no, it didn't solve my coding problem)

  • @legorockfan9
    @legorockfan9 4 месяца назад +1

    Shrink it to 3 and call it the Magi system

  • @michael_hackson_handle
    @michael_hackson_handle 4 месяца назад

    Lol. I thought ChatGPT worked the way Mistral does, with a router to pick which part of the neurons to use for each topic. O_o It makes more sense.

  • @rafaelestevam
    @rafaelestevam 4 месяца назад

    TheBloke is already messing with quantizing it 😁
    (it's at v0.1 as I write)

  • @seansingh4421
    @seansingh4421 4 месяца назад

    I actually tried the Llama models and let me tell you nothing even comes close to GPT-4. Only Llama 70B is somewhat alright

  • @garrettrinquest1605
    @garrettrinquest1605 4 месяца назад

    Wait. Can't you just run Mixtral-8x7B on any machine with a decent GPU using Ollama? I thought you only needed something like 8-10 GB of VRAM to have it run well

    • @Slav4o911
      @Slav4o911 4 месяца назад

      No, it gobbles up about 36GB of RAM + VRAM on my machine... it's not a 7B model but 8x7B, so it's actually heavier than a "normal" 32B model. I personally don't like it much; it's too slow for me, and there are faster models with similar results.

  • @deathshadowsful
    @deathshadowsful 4 месяца назад +1

    This router choosing an expert to maximize likelihood feels like a recursion of what makes the small models. This is all imaginative right now, but it feels like that's how neurons would work in the brain too. What if this just kept on going and folding onto itself?

  • @tungvuthanh5537
    @tungvuthanh5537 4 месяца назад

    Being a French startup also explains why it is named 8x7B

  • @Purpbatboi
    @Purpbatboi 4 месяца назад

    What is the 'B' in this? Gigabytes?

    • @capitalistdingo
      @capitalistdingo 4 месяца назад +9

      I think it means billion as in 7 billion parameters. But I say that with only a weak understanding of what that means so don’t take my word for it.

    • @squeezyDUB
      @squeezyDUB 4 месяца назад +4

      Yes it's billion

    • @alansmithee419
      @alansmithee419 4 месяца назад +2

      7 billion parameters.
      Basically a list of numbers that determine the AI's behaviour.
      Depending on the implementation, it's likely either half precision (16 bits / 2 bytes per parameter) or 8-bit tensor processing, which is popular for AI (8 bits / 1 byte per parameter).
      So it could be either 14 or 7 gigabytes, depending. But yes, probably 7GB.
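    Spelled out, the reply above is just parameter count × bytes per parameter (weights only, ignoring runtime overhead); the 4-bit row is the same arithmetic applied to the quantized versions discussed elsewhere in the thread:

    # Model size from parameter count x bytes per parameter.
    params = 7e9
    for label, bytes_per_param in (("fp16", 2), ("int8", 1), ("4-bit", 0.5)):
        print(f"{label}: ~{params * bytes_per_param / 1e9:.1f} GB")
    # fp16: ~14.0 GB, int8: ~7.0 GB, 4-bit: ~3.5 GB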

  • @ThorpenAlnyr
    @ThorpenAlnyr 4 месяца назад

    "We discovered CPUs people".

  • @aniksamiurrahman6365
    @aniksamiurrahman6365 4 месяца назад

    Let's see how long it takes for Mixtral to become a for-profit company.

  • @yolocrayolo1134
    @yolocrayolo1134 4 месяца назад

    Any plans to talk about AI-generated cartoon stuff?

  • @exploreit7466
    @exploreit7466 4 месяца назад +1

    Heyyyyyyyyy bro can you please make a video on 3D photo inpainting again pleaseeeeeeeeeee dude, I really need it and it's not working properly

  • @andreamairani1512
    @andreamairani1512 3 месяца назад

    OpenAI shaking in their digital boots right now

  • @reamuji6775
    @reamuji6775 Месяц назад

    I wish someone would train an AI model with 3x7B parameters and call it MAGI

  • @SkyyySi
    @SkyyySi 4 месяца назад

    4:08 Then... quantize it? It will be 4x smaller and somewhat faster, while losing about 1% - 2% in quality. A hell of a deal if you ask me.

  • @Eric-yd9dm
    @Eric-yd9dm 4 месяца назад +1

    An rtx giveaway? Wouldn't a good bar of solid gold be cheaper?

  • @GoshaGordeev-yg5bc
    @GoshaGordeev-yg5bc 4 месяца назад

    what about miqu?

    • @MINIMAN10000
      @MINIMAN10000 4 месяца назад

      Mistral is 7B and good for its size. Mixtral is 47B but runs as fast as a 13B and is good for its size. Miqu is a preview build of Mistral Medium, a standard 70B model; I have no idea what people's opinions are on its quality.

  • @orbatos
    @orbatos 4 месяца назад

    This doesn't answer the stated question, so I will. The open-source community currently includes most AI researchers and their advancements, is quicker to react to new developments, and isn't hiding implementations from each other for profit. This could change when companies start paying better, but the incentives are also different: corporate development is about displacing labour costs and shifting responsibility, not the future of technology.

  • @Oktokolo
    @Oktokolo 4 месяца назад

    The real breakthrough will be experts in fields of knowledge - not experts selected by random syntax artifacts.

  • @gdplayer1035
    @gdplayer1035 4 месяца назад

    finally open ai

  • @amykpop1
    @amykpop1 4 месяца назад +8

    Finally, hopefully OpenAI will stop their overly exaggerated "safety testing" and be pushed to release their newer models faster. I'm not saying that there should not be any safety testing at all, but trying to only please some useless bureaucrats who know nothing about AI and LLM's will just slow the process of developing accessible AGI.

    • @turbogymdweeb184
      @turbogymdweeb184 4 месяца назад

      Inb4 governments ban the personal development of LLMs and other AI tools and only allow certain organizations that either know someone in the government or abide by unrealistic safety standards lmao

    • @manitoba-op4jx
      @manitoba-op4jx 4 месяца назад +5

      they need to cut out the political nonsense

    • @JorgetePanete
      @JorgetePanete 4 месяца назад +1

      LLMs*

    • @LowestofheDead
      @LowestofheDead 4 месяца назад

      @@manitoba-op4jx OpenAI don't really care about ethics, or it's very unlikely that they do.
      Think about it: by refusing to release their models (supposedly because of "bias"), they are now one of the most valued startups in the world, at around $100Bn. Do you really think they just happened to care about ethics when it was most financially convenient?
      They're not going to give away their product for free, and neither is Elon, even though he has opposite politics.

    • @robonator2945
      @robonator2945 4 месяца назад +12

      They aren't doing safety testing at all, they're doing censorship. LLMs are not Skynet; they architecturally can't be. They have very fundamental architectural limitations. Their version of 'safety' is stopping their models from saying things they don't want them to (something which has been well documented by asking models even vaguely political questions; IIRC the exact percentage of whiteness it found unacceptable to praise was somewhere around 20%). This isn't stopping Skynet, it's breaking the model's leg before releasing it into the wild so that it's more controllable.

  • @imerence6290
    @imerence6290 4 месяца назад +1

    A year to get to GPT-3.5 while OpenAI is heading towards GPT-5 is impressive?

  • @Alice8000
    @Alice8000 4 месяца назад

    More memes please 🎉

  • @bluesnight99
    @bluesnight99 4 месяца назад +2

    Me who knows nothing about ai: U were speaking english....

    • @nbshftr
      @nbshftr 4 месяца назад +6

      hivemind vs big brain. hivemind has a bunch of ai who are each very good at certain skills. hive mind is fast but maybe a little stupid sometimes. big brain is one big ai that is good at a little bit of everything. big brain consistent and smart but slow.

    • @amykpop1
      @amykpop1 4 месяца назад

      @@nbshftr cheers for the explanation!

    • @alansmithee419
      @alansmithee419 4 месяца назад

      @@nbshftr The big brain is also maybe not as good at individual tasks as each hive-mind brain is at its specialised task, but it's more generally intelligent.