Qwen QwQ 2.5 32B Ollama Local AI Server Benchmarked w/ Cuda vs Apple M4 MLX

  • Published: Nov 28, 2024

Comments • 31

  • @frankjohannessen6383  5 hours ago  +2

    It's the first open model that has perfectly solved a logic puzzle I've asked a lot of models. I also like the very verbose answers; that way you can verify that it didn't just reach the answer by a lucky guess. As for the inconsistency, I think that is because of the very long responses: a few low-probability tokens early on can send it far off course. So it should probably be run at a very low temperature.

    • @DigitalSpaceport  5 hours ago

      Oh, I didn't adjust the temperature on it, good call (see the example below)! This is by far the best assistive model for thoughtful explorations I've found. It's very correctable and almost feels like I'm working with a human.
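
      A minimal sketch of running it at a low temperature through the Ollama API (the
      qwq tag and the prompt are placeholders; adjust to whatever model tag was pulled):

        curl http://localhost:11434/api/generate -d '{
          "model": "qwq",
          "prompt": "your logic puzzle here",
          "stream": false,
          "options": { "temperature": 0.1 }
        }'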

  • @dijikstra8  11 hours ago

    Very cool that this kind of model is open-sourced and can be run locally given sufficient resources. I think this bodes well for the future: as we get more specialized chips in our computers, we could have very competent local, personalized models, e.g. for coding. It's also very interesting to see an open Chinese model perform like this, from a geopolitical point of view.

    • @DigitalSpaceport  5 hours ago

      Yes, this being open is pretty wild. The commitment of the Qwen team is awesome. I'm eager for Llama 4 as well.

  • @DigitalDesignET  1 day ago  +5

    We need to try Aider in architect mode, with Qwen-Coder 32B/72B as the coder and QwQ 32B as the architect. What do you think?

    • @DigitalSpaceport  1 day ago  +3

      This sounds interesting, and Aider looks approachable too. I'm going to try to get it running (a starting-point command is sketched below).
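
      A possible starting point, assuming Aider's --architect and --editor-model options
      and an Ollama backend (the model tags here are assumptions):

        export OLLAMA_API_BASE=http://127.0.0.1:11434
        aider --architect --model ollama/qwq --editor-model ollama/qwen2.5-coder:32b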

  • @Eldorado66  21 hours ago  +2

    You should try this out with LM Studio. It has always worked best for me and is much easier to customize, especially when it comes to loading the model. Open WebUI has some issues, and its connection to Ollama can be pretty laggy, especially at the start.

  • @andrepaes3908  1 day ago  +4

    Great analysis! Good insight to see the 3090s running at almost 2x the speed of the M4 Max. Also interesting to see that the QwQ context allocates the same amount of VRAM as the model itself: for the 32B Q8 it's 34 + 34 GB, and for the 32B Q4 it's 20 + 20 GB. That's way more than Qwen 2.5 Coder 32B's context consumes! Any thoughts on why?

    • @DigitalSpaceport  1 day ago  +1

      @andrepaes3908 I don't have any firm insight as to why, but I've seen variation among models, just not like this. I did try setting num_gpu to 2 and running the Q8, but it spilled out of VRAM. Could be a software thing, but it's notable. If you observe something different, let me know; I'm always suspicious of a potential software issue. (A quick way to probe the context allocation is sketched below.)
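
      One way to probe this: load the model with a smaller num_ctx and compare what
      ollama ps reports afterwards (the qwq tag is a placeholder):

        curl http://localhost:11434/api/generate -d '{
          "model": "qwq",
          "prompt": "hi",
          "options": { "num_ctx": 8192 }
        }'
        ollama ps    # reports the loaded size and the CPU/GPU split

      Repeating this at num_ctx 32768 should make the context's share of the allocation
      visible.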

  • @thcleaner22  3 hours ago

    With the 8-bit model on an M1 Ultra with mlx-lm:
    2024-11-29 20:22:25,189 - DEBUG - Prompt: 147.551 tokens-per-sec
    2024-11-29 20:22:25,189 - DEBUG - Generation: 14.905 tokens-per-sec
    2024-11-29 20:22:25,189 - DEBUG - Peak memory: 35.314 GB
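
    For anyone wanting to reproduce numbers like these, a minimal mlx-lm invocation
    (the mlx-community/QwQ-32B-Preview-8bit repo name is an assumption):

      pip install mlx-lm
      python -m mlx_lm.generate --model mlx-community/QwQ-32B-Preview-8bit \
          --prompt "Write a haiku about GPUs" --max-tokens 256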

  • @TheZEN2011  1 day ago  +1

    I played with QwQ a little bit. I don't know what to think of it quite yet. Qwen Coder seems to work better for coding. But yeah, QwQ is kind of lively in its thinking process.

  • @aidanpryde7720  20 hours ago

    OMG, that PowerShell GPU monitor is so cool. Any chance you can share what program/script it is?

    • @DigitalSpaceport  15 hours ago

      It's the nvtop command. I'm not sure if it runs in PowerShell, but let me know if you find out. It's shown here running on Linux via my SSH terminal (install sketch below).
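
      For reference, nvtop is packaged by most Linux distributions and only needs a
      terminal over SSH:

        sudo apt install nvtop   # Debian/Ubuntu; other distros package it too
        nvtop                    # live per-GPU utilization and VRAM usage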

  • @UCs6ktlulE5BEeb3vBBOu6DQ  1 day ago  +1

    For the P40 crowd, Q8 with 2x P40 gives me 8 t/s.

    • @DigitalSpaceport  1 day ago

      Did the full model fit into the two of them at 32769 context?

    • @UCs6ktlulE5BEeb3vBBOu6DQ  23 hours ago

      @DigitalSpaceport I have a tiny RTX A2000 12GB in there for larger models. But it would fit without it, because nvidia-smi reports the VRAM usage as 16 GB of 24 GB on each P40 and 8 GB of 12 GB on the A2000.
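
      Those per-card numbers can be pulled in one line with nvidia-smi, e.g.:

        nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv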

  • @thaifalang4064  1 day ago  +4

    On an M1 Max: 15.5 t/s at 4-bit and 9.3 t/s at 8-bit (LM Studio, Qwen_QwQ-32B-Preview_MLX-8bit).

    • @DigitalSpaceport  1 day ago

      @thaifalang4064 Thanks for adding more data points. Did you observe the RAM allocation? It seems like a very RAM-hungry model.

  • @manofmystery5709  1 day ago

    I've read that someone found a way to string together multiple 4090s using PCIe (they don't support NVLink). Would that be a configuration you could set up on consumer motherboards and PSUs?

    • @DigitalSpaceport  1 day ago

      Ollama/llama.cpp does it automagically over PCIe for inference workloads. You need NVLink for training, but not really for inference. These 3090s are just running off the PCIe bus (see the sketch below).
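
      A llama.cpp-style sketch of splitting layers across two GPUs over plain PCIe,
      assuming a CUDA build (the GGUF filename is a placeholder):

        ./llama-cli -m QwQ-32B-Preview-Q4_K_M.gguf -ngl 99 \
            --split-mode layer --tensor-split 1,1 -p "hello"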

  • @BeastModeDR614  3 hours ago

    Athene-V2 is a 72B-parameter model that is much better, and it's available in Ollama. I can run it locally on my 48GB M3 Max using the 72b-q3_K_L model version (command below).
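
    Assuming the tag named above is the one in the Ollama library, pulling and running
    it would look like:

      ollama run athene-v2:72b-q3_K_L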

  • @thingX1x  1 day ago

    The camera was shaking so much in the intro it almost gave me motion sickness, lol. But cool content!

  • @A_Me_Amy  22 hours ago

    I keep thinking that small, properly optimized models are best.

  • @blisphul8084  18 hours ago

    Imagine running this Chinese AI model on a Chinese Moore Threads GPU. If Nvidia keeps stalling on VRAM, perhaps we'll see that soon.

    • @DigitalSpaceport  5 hours ago

      I didn't think about that until now, but you have a good point. The VRAM moat is understandable in practical terms, but definitely not secure for Nvidia.

  • @rogerc7960  1 day ago

    Runs on just a CPU!

    • @DigitalSpaceport  1 day ago

      Super slow from what I saw, but yeah, you can also run a 405B low quant on CPU provided you have the RAM. It's just too slow to be useful. (A CPU-only invocation is sketched below.)
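
      A CPU-only run can be forced in llama.cpp by offloading zero layers (the GGUF
      filename is a placeholder):

        ./llama-cli -m QwQ-32B-Preview-Q4_K_M.gguf -ngl 0 -p "hello"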

  • @大支爺  5 hours ago

    No APU will be able to beat a 3090/4090 for at least 10 years.