Love your videos. Keep them coming!
Thanks very much! I will for sure.
Good work mate, keep it up!
Thanks very much! Been meaning to send you a note!
It was a very fun and instructive video. It would be interesting to see a comparison of the same game coded by the free or paid versions of the other providers.
Thanks very much! I have used the same prompt in some separate testing videos for paid providers, though perhaps next time I will throw in a quick comparison in the video as well!
@OminousIndustries yes, you're right. Apologies for not checking before asking. But a 1-to-1 comparison would still be cool. Maybe in the future you could paste links to related videos in the description down below 😊
@@marcomerola4271 I agree, direct comparisons are a good idea. Good thought on the additional links. I will keep note of this for future videos with similar testing conditions!
Nice test! I was planning to build a dual 3090 system, but I guess I need to reconsider. This was slower than expected. Would a dual 5090 setup perform only somewhat better, or closer to 2x?
I can't speak to this aside from what I have experienced, but FWIW I have had EXL2 70B models running in text-gen-webui that were much faster than this. I have seen some discussion on the speed differential between GGUF and EXL2, but I am not knowledgeable enough to make any definitive statements on this - just personal anecdotes.
Not sure how much faster, but a dual 4090, let alone a dual 5090, should be a nice speed increase based on some of the user benchmarks I have seen on r/localllama.
Better late than never. Sweet model. Been using the 8-bit, and the 128K context really smokes.
You are giving me a quantization inferiority complex mentioning the 8-bit LOL! It is a rather impressive model indeed.
Would be interesting to know the difference between Llama 3.0, 3.1, 3.2, and 3.3 in 4-bit quant. I've got hardware running 70B in 8 bits, but I still can't make the jump from 3.0 to 3.1 or 3.2; from my own testing it seems like 3.0 with 8K context is still superior to the 128K models (although I didn't test 3.3 yet). I'm testing on real-world use cases.
I would assume they benchmark better going from .0 to .1/.2/.3 etc., but like you say, real-world use cases are often more important than benchmarks for folks like us.
On these locally hosted models, it would be interesting to know how many tokens per second you're getting back.
Good thought, I will try to get speed results for local testing in the future.
@OminousIndustries Just run: ollama run modelname --verbose
It will give you full statistics after each response, including tokens per second.
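For example (the model name below is just a placeholder for whatever you have pulled):

ollama run llama3.3:70b --verbose
# After each response Ollama prints timing stats; the "eval rate" line is the
# generation speed in tokens/s, and "prompt eval rate" covers prompt processing.
# Exact labels can vary a little between Ollama versions.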
Come on man, where's that mic you've been talking about? I know you can afford it. /s
When you add it, your vids will level up. Thanks for the walkthroughs!!!
I spent the mic budget on a ChatGPT Pro subscription LOL. Thanks for the kind words. I actually have a nice AKG mic I used to use for music-related tasks, so perhaps I will hook that up to the system and use it for screen recording audio.
Is AnythingLLM managing the hosting of the local model? I am interested in running multiple GPUs and am trying to figure out the best way forward as far as performance goes.
No, Ollama is handling the hosting of the local model here; AnythingLLM is just providing a user interface for interacting with the model.
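To make the split a bit more concrete: any front end (AnythingLLM, a script, plain curl) is just calling Ollama's local HTTP API, which listens on port 11434 by default. A minimal sketch, assuming a Llama 3.3 70B model has already been pulled under that tag:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
# The JSON response also includes eval_count and eval_duration (in nanoseconds),
# so tokens per second works out to eval_count / (eval_duration / 1e9).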
@@Bijanbowen Ah, good to know, thanks.
What do you recommend? Linux (and which distro), or Windows?
I personally prefer Ubuntu, but if someone is used to Windows and does not want to have to troubleshoot a lot, it might not be a bad idea to stick with Windows haha
How much RAM did it take?
It was using about 19 GB on one card and 22 GB on the other, so a total of roughly 41 GB for this Q4_K_M quant.
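For reference, one quick way to see the per-card usage while the model is loaded (assuming NVIDIA GPUs with the standard driver tools installed):

nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
# Prints per-GPU memory usage; the ~19 GB and ~22 GB figures above come from
# the memory.used column of each card.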