It's the first open model that has perfectly solved a logic puzzle I've asked a lot of models. I also like the very verbose answers; that way you can verify it didn't just get to the answer by a lucky guess. As for the inconsistency, I think that's because of the very long responses. A few low-probability tokens early on are probably sending it far off course, so it should probably be run at a very low temperature.
Oh, I didn't adjust my temp on it, good call! This is by far the best assistive model for thoughtful explorations I've found. Very correctable, and it almost feels like I'm working with a human.
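For anyone who wants to pin the temperature down explicitly, here is a minimal sketch using Ollama's REST API from Python. The model tag "qwq" and the 0.1 temperature are assumptions; use whatever tag you pulled and tune the value to taste.

    # Minimal sketch: ask Ollama for a completion at a very low temperature.
    # Assumes Ollama is serving on its default port and the model is tagged "qwq"
    # (both are assumptions; adjust to your setup).
    import json
    import urllib.request

    payload = {
        "model": "qwq",
        "prompt": "Work through this logic puzzle step by step: ...",
        "stream": False,
        "options": {"temperature": 0.1},  # keep early low-probability tokens in check
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])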
Very cool that this kind of model is open sourced and can be run locally given sufficient resources. I think this bodes well for the future: as we get more specialized chips in our computers, we could have very competent local, personalized models for e.g. coding. It's also very interesting, from a geopolitical point of view, to see an open Chinese model perform like this.
Yes, this being open is pretty wild. The commitment of the Qwen team is awesome. I'm eager for Llama 4 also.
We need to try Aider in architect mode, with Qwen Coder 32B/72B as the coder and QwQ 32B as the architect. What do you think?
This sounds interesting, and Aider looks approachable too. I'm going to try to get it running.
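A sketch of what that pairing might look like from the command line, wrapped in Python. The flag names (--architect, --model, --editor-model) and the ollama/ model prefixes are assumptions based on Aider's documentation, so check aider --help on your install before relying on them.

    # Sketch: drive Aider's architect mode with local Ollama models.
    # Flag names and model tags below are assumptions; verify against `aider --help`.
    import subprocess

    subprocess.run([
        "aider",
        "--architect",                                 # QwQ plans the change...
        "--model", "ollama/qwq",                       # ...as the architect model
        "--editor-model", "ollama/qwen2.5-coder:32b",  # Qwen Coder writes the edits
        "main.py",                                     # hypothetical file to work on
    ])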
You should try this out with LM Studio. It’s always worked best for me and is much easier to customize, especially when it comes to loading the model. Open WebUI has some issues and the connection to Ollama, especially at the start, can be pretty laggy.
Great analysis! Good insight to see the 3090s running at almost 2x the speed of the M4 Max. Also interesting to see that the QwQ context allocates about the same amount of VRAM as the model itself: for the 32B Q8 it's roughly 34+34 GB, and for the 32B Q4 it's 20+20 GB. That's way more than the Qwen Coder 2.5 32B context consumes! Any thoughts on why that is?
@andrepaes3908 I don't have any firm insight as to why, but there is variation I've seen among models, just not like this. I did try setting num_gpu to 2 and running the Q8, but it spilled out. Could be a software thing, but it's notable. If you observe something different, let me know. I'm always suspicious of a potential software issue.
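One plausible contributor is the KV cache, which grows linearly with context length. A back-of-the-envelope sketch, assuming QwQ-32B-Preview keeps Qwen2.5-32B's layout (64 layers, 8 KV heads, 128-dim heads, all assumptions here) and an fp16 cache:

    # Rough KV-cache sizing. The layer/head/dim numbers are assumptions based on
    # Qwen2.5-32B; they are illustrative, not confirmed QwQ specs.
    layers, kv_heads, head_dim = 64, 8, 128
    bytes_per_value = 2                  # fp16 cache
    ctx = 32768                          # tokens of context

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K + V
    total_gib = per_token * ctx / 2**30
    print(f"{per_token / 1024:.0f} KiB per token -> ~{total_gib:.0f} GiB at {ctx} tokens")
    # ~256 KiB per token -> ~8 GiB at 32768 tokens, before compute buffers

That alone doesn't reach the 20-34 GB reported above, so per-GPU compute buffers or a software-side over-allocation seem plausible on top of it.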
With the 8-bit model on an M1 Ultra with mlx-lm:
2024-11-29 20:22:25,189 - DEBUG - Prompt: 147.551 tokens-per-sec
2024-11-29 20:22:25,189 - DEBUG - Generation: 14.905 tokens-per-sec
2024-11-29 20:22:25,189 - DEBUG - Peak memory: 35.314 GB
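For anyone wanting to reproduce numbers like these, a minimal sketch with the mlx-lm Python API. The model name is an assumption; point it at whichever MLX 8-bit quant you actually downloaded.

    # Minimal sketch: generate with mlx-lm and print throughput/memory stats.
    # The repo name below is an assumption; substitute your local MLX 8-bit quant.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/QwQ-32B-Preview-8bit")
    generate(
        model,
        tokenizer,
        prompt="Explain, step by step, why the sky is blue.",
        max_tokens=512,
        verbose=True,  # prints prompt/generation tokens-per-sec and peak memory
    )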
I played with QwQ a little bit. I don't know what to think of it quite yet; Qwen Coder seems to work better for coding. But yeah, QwQ is kind of lively in its thinking process.
OMG, that PowerShell GPU monitor is so cool. Any chance you can share what program/script it is?
It's the nvtop command. I'm not sure if it runs in PowerShell, but let me know if you find out. It's shown here running in Linux via my SSH terminal.
For the P40 crowd, Q8 with 2x P40 gives me 8 t/s.
Did the full model fit into the two at 32768 context?
@DigitalSpaceport I have a tiny RTX A2000 12GB in there for larger models, but it would fit without it, because nvidia-smi reports the VRAM usage as 16 GB of 24 for both P40s and 8 out of 12 GB for the A2000.
On an M1 Max: 15.5 t/s at 4-bit, 9.3 t/s at 8-bit (LM Studio, Qwen_QwQ-32B-Preview_MLX-8bit).
@thaifalang4064 Thanks for adding more data points. Did you observe the RAM allocation? Seems like a very RAM-hungry model.
I've read that someone found a way to string together multiple 4090s using PCIe (they don't support NVLink). Would that configuration be possible to set up on consumer motherboards and PSUs?
The ollama/llama.cpp software does it automagically over PCIe for inference workloads. You need NVLink for training, but not really for inference. These 3090s are just running off the PCIe bus.
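If you ever want to control the split yourself instead of relying on Ollama's automatic placement, llama-cpp-python exposes the same knobs llama.cpp uses. The GGUF path and the 50/50 split below are assumptions for illustration.

    # Sketch: split a GGUF model across two GPUs over plain PCIe (no NVLink needed
    # for inference). Path and split ratios are assumptions; tune for your cards.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./qwq-32b-preview-q4_k_m.gguf",  # hypothetical local file
        n_gpu_layers=-1,           # offload all layers to GPU
        tensor_split=[0.5, 0.5],   # proportion of the model placed on each GPU
        n_ctx=32768,
    )
    out = llm("Q: Why doesn't inference need NVLink? A:", max_tokens=128)
    print(out["choices"][0]["text"])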
Athene-V2 is a 72B-parameter model that is much better, and it's available in Ollama. I can run it locally on my 48GB M3 Max (the 72b-q3_K_L version).
The camera was shaking so much in the intro it almost gave me motion sickness, lol. But cool content!
@thingX1x Sorry, should have fed the camerawife first.
I keep thinking that properly optimized small models are best.
This is really good for a 32B Q4, imho.
Imagine running this Chinese AI model on a Chinese Moore Threads GPU. If Nvidia keeps stalling on VRAM, perhaps we'll see that soon.
I didn't think about that until now, but you have a good point. The VRAM moat is understandable in practice, but definitely not secure for Nvidia.
Runs on just a CPU!
Super slow from what I saw, but yeah, you can also run a 405B low quant on a CPU provided you have the RAM. Just too slow to be useful.
No APUs will be able to beat a 3090/4090 for at least 10 years.