Since I test Qwen-2.5-Coder-32b with oTToDev in this video, I wanted to say: Come join our Discourse community for oTToDev, our fork of Bolt.new!
thinktank.ottomator.ai
Also, I'm working on more super big things behind the scenes:
ottomator.ai
Best vid showing what Qwen 2.5 32b can do. I have it on my Mac Studio after watching your video and am very much looking forward to putting it in my pipelines.
@@jaredcluff5105 Thanks Jared! Sounds great, I hope it works great for you too!
You're an amazing young man, keep up the good work. I value your videos; I keep going back to them like a valuable library, which they are. Thanks for the high value you bring to each video.
Young man 🍼 .
Uncensored and offline = the best
Agreed!
I could swear you're Matthew Berman's son. :D
@@lancemarchetti8673 haha that's a first but I see it a bit 😂
😄😄😄
😂 can’t unsee.
Very true
Unlawful son ?
Hey Cole, great video as always! Curious: for Mac machines with the new M4 chips, what would be the minimum requirement for the 32B model, let's say? The Q models are an interesting touch, that's for sure!
Thank you! For the new Mac machines, for Qwen-2.5-Coder-32b I'd recommend the M4 Max chip and at least 64 GB of unified memory.
@@ColeMedin damn you better write some legit code to get back $5000 CAD worth of tokens lol. Ah well good thing there's pay per use for us peasants :D
Best one on Qwen 2.5 Coder 32B, thanks for the great sharing, wishing you the best in the future
Thank you, you too!!
Seen some of your stuff before but just liked and subbed. Love the no fluff, no hype, actually useful approach. Thanks for not giving us a Snake app in Python and then just emoting about how amazing AI is.
Good job with the eval agent!
Do you plan to bring in an additional layer of agents, managed by an orchestrator?
Bringing in the swarm concept into langgraph will be awesome
Great Counter-FOMO Operations video Cole ->> not letting the latest, super cool and sexy proprietary closed-source AI feature or company make us think we're missing out by going local. I am curious how much context control you have with Qwen: this file, this module, the entire code base. If "context is King" is true, then we creators need help adding the right amount of context for the task while being cognizant of the GPU resources available.
Interesting. I saw a couple of other channels using Qwen 2.5 32B and it failed at some simple to moderate code.
@@JimWellsIsGreat Wow that's interesting, could you call out a specific video or two? I've only seen good results from others so I'd love to check it out!
@@ColeMedin The AIcodeking video; it seemed to fail in Aider and Cline. I'm not sure why, but I imagine it has something to do with the way they implement tools, file edits, etc. When I used it with Cline it opened the file and started writing it, and then somewhere in the middle it just kept repeating text over and over. Looks like it's working pretty well with oTToDev though!
Yeah, same here, I saw the video and it was really disappointing to see. But your review turned out good
Will be interesting to see how well it runs on my Pi5 8GB. The last few LLM models have been pretty good.
Impressive. Local AI maybe won't be on par with online models, but it's already pretty useful
Your fork is awesome! I wish it could help with deployment, which is one of the biggest pain points.
Thank you! And that is one of the features we are hoping to implement really soon!
New subscriber! I have been testing this new Qwen 2.5 32b Coder LLM under Ollama for nearly 72 hours and it runs well on my old Dell G-7!
That's fantastic!
looks awesome
Could you create a video about building different levels of PC builds for running AI locally? For example, I dunno if 2 x Nvidia RTX 3080s is better than 1 x Nvidia RTX 4080. Also, it's difficult to understand the bottlenecks; is it enough to have a beefy GPU? I would like to invest all my money as effectively as possible, and this is relevant right now because Black Friday is coming. 😁
I would just like to know how you are getting the code to run on a specific GPU, if possible, or is it just the first GPU?
I really appreciate this suggestion! I'll certainly consider making this video before Black Friday, though I'd have to figure out how to fit it into my content calendar!
Whether 2x of a slightly weaker GPU beats 1x of a slightly stronger one depends a lot on the models you want to run. If a model can fit into one GPU, it'll be much faster than having to split between 2 GPUs. Whether a model can fit comes down to the VRAM of the GPUs.
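If it helps, here's a very rough back-of-the-envelope sketch for checking whether a model fits on one card - treat the numbers as approximations, since quantization overhead and KV cache vary a lot:

```python
def approx_vram_gb(params_billion: float, bytes_per_param: float, overhead_gb: float = 2.0) -> float:
    """Very rough estimate: weight memory plus a couple GB for KV cache/activations."""
    return params_billion * bytes_per_param + overhead_gb

# 32B model at ~4-bit quantization (~0.5 bytes/param) vs 8-bit (~1 byte/param)
print(approx_vram_gb(32, 0.5))  # ~18 GB -> squeezes onto a single 24 GB card
print(approx_vram_gb(32, 1.0))  # ~34 GB -> has to be split across two cards
```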
@ColeMedin Thanks for summing up the key differences of 1 and 2 GPU setups! Also great to hear that we might get good hints for making our Black Friday shopping lists. Thanks for the great content and for taking local AI into account!
further to your commentary, I would love to see a spec build of your rig on pcpartpicker or something
I am planning on sharing that soon!
I tested it with Cline, and no, it is not as capable as Claude 3.5 Sonnet. It has problems understanding and properly responding to Cline.
Thanks for testing that out! To be fair, with Cline it's probably prompted under the hood specifically to work with Claude models. I know from oTToDev and other experiences it definitely is beneficial to prompt different models in different ways.
@ColeMedin well, have to find the system prompt for Claude, to tweak Cline then. A totally free system would be great for experimentation.
How did you test it with Cline? It only accepts multimodal models, right?
I'm using Cline many hours each day, and my experience is that it works well with Claude, not bad with the GPT family, but has difficulty with other models. Just the nature of the tool, if you look under the hood. I'm looking into forking it to experiment more and get results with different models.
Thanks Cole! I am using Bolt.new with the paid version and it is quite a nightmare to prompt and prompt and prompt a lot; the UI breaks a lot. Do you recommend switching to Qwen and oTToDev or waiting a bit? I am not a dev but can build some stuff.
I like your videos.
Just a heads up.
Vite is pronounced "veet", like "feet" but with a v. It's French for "fast".
Ahh yes, thank you for pointing this out!
Can we use Qwen 2.5 32B hosted in a HuggingFace Space in oTToDev? It would be awesome since not everyone has a GPU that supports 32B-parameter models.
@@GuilhermeHKohnGAnti Great question! HuggingFace isn't supported for oTToDev and I'm not sure if it would be a practical implementation like some of the other providers, but it would be good to look into! You can use Qwen 2.5 Coder 32b through OpenRouter though and it's super cheap, as cheap as GPT-4o-mini
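If you do go the OpenRouter route, it exposes an OpenAI-compatible API, so a minimal Python sketch looks roughly like this (the model slug is my assumption - double-check it against OpenRouter's model list):

```python
from openai import OpenAI

# OpenRouter speaks the OpenAI chat completions API, just with a different base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder - use your own key
)

completion = client.chat.completions.create(
    model="qwen/qwen-2.5-coder-32b-instruct",  # assumed slug, verify on OpenRouter
    messages=[{"role": "user", "content": "Write a React component for a todo list."}],
)
print(completion.choices[0].message.content)
```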
Did you change the context window for the model or did you just use the standard 2k in oTToDev?
I meant this: PARAMETER num_ctx 32768?
@@jesusjim Great question! I did change the context window to make it bigger. Not necessary for oTToDev anymore since we included this parameter within the code to instantiate Ollama, but it was necessary for me to do for my custom coded agent.
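For anyone wanting to do the same, a minimal sketch of that Modelfile approach looks like this (the base model tag is just whatever you pulled locally, so treat it as an assumption):

```
# Modelfile - sketch for bumping the context window of a local Ollama model
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
```

Then "ollama create qwen2.5-coder-32k -f Modelfile" builds a variant you can run and select like any other local model.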
This is awesome Cole 😂🎉
@@ContentVibeio Thank you 😁
Nice!! Though, how about vLLM instead of Ollama?
Great suggestion! We don't support that now but that would be a good addition!
This is awesome, What about opencoder ?
Thanks! I'd love to try Opencoder too!
Should I be fine with a 3060 12 GB and 64 GB of RAM? I am able to run Codestral just fine
limited to 14B like mine
@ i’m able to run 22b models just fine
Really... gotta go see what that was all about
@@gamez1237 I would try for sure, but it might be pushing it a bit! You could always try the Q_2 version as well that I referenced in the video.
What are the minimum requirements to run this model at an acceptable performance? My PC isn't really powerful, and I'd like to know the minimum upgrade I need to run AI models.
I have a very similar system with 2x 3090s and a Threadripper 3960X. My qwen2.5-coder:32b-q8_0_in32k doesn't do the things yours does... it fails miserably like your second tested model.
Just tested now: qwen2.5-coder:32b-q8_0_in32k: 2.2 tokens/sec vs qwen2.5-coder:32b-base-q8_0: 19 tokens/sec in the Ollama prompt. So increasing the context window to 32k makes the inference way slower
Yeah it makes sense because with oTToDev the prompt is actually bigger than the default context size of 2k tokens! Does it fail for you even with the increased context limit?
Mine seems to be working on a single 3090 with PARAMETER num_gpu 100
PARAMETER num_ctx 15000
PARAMETER num_batch 64
PARAMETER num_thread 10
SYSTEM OLLAMA_NUM_PARALLEL=1
Anything more than that and it crashes. But the speed is great because it just fits in the vram. Even when using up all 15k of context.
So if you need more than 15k context and you want full GPU, this is not the model for you.
Yo! I don't have a GPU capable of running this, but I do know you can rent GPUs for a few hours or days or whatever. Think there is a Linux distro we could upload to a rented GPU with all the configuration to run Qwen 32B and deploy quickly to utilize the GPU, then export that code out of the machine? Maybe a VirtualBox image or VM that's ready to go off the rip? I'm still struggling with how to get my local LLM setup and running because I ate paint chips when I was little
Maybe even an email preconfigured with Proton Mail or something so people could email out the code they generate on rented GPUs or servers? This would be really, really cool and helpful, and it's common in the crypto world, especially for the new GPU/CPU hybrid chains for mining
I love your thoughts here! So basically you are thinking of a way to have a machine in the cloud that isn't accessible to everyone directly, but people would have a way to send requests into the LLM with something like email to use Qwen-2.5-Coder-32b for larger coding tasks?
You could definitely do this even with just a GPU instance running Qwen-2.5-Coder-32b behind an API endpoint that you could create a frontend around! Or just host oTToDev itself for people to use which is something I am looking into doing!
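Just to sketch the shape of that, here's a hedged example of a thin API in front of the GPU box - all names here (the file, the route, the port) are hypothetical, it's only meant to show the idea:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import requests

app = FastAPI()

class CodeRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: CodeRequest):
    # Forward the prompt to the Ollama server running on the rented GPU instance.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5-coder:32b", "prompt": req.prompt, "stream": False},
        timeout=600,
    )
    return {"code": r.json()["response"]}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000 (filename is hypothetical)
```

A simple frontend (or even an email bot like you described) could then POST to that endpoint and ship the generated files back before the rental expires.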
@ Yeah, a lot of people are renting clusters in the Xen and X1 community, which is GPU intensive. I think they were renting out rigs on… I forget, but basically make it so someone could download a VM that's preconfigured to run oTToDev - save the virtual machine, rent a server (or in your case your PC) out for isolated GPU access, and run that VM on your rented GPU server. Then email or Dropbox the files you create back to your regular system before your GPU rental rig expires
Been waiting for you to make this one and what it means for oTToDev. This makes things a lot more interesting
Also saw you have a custom Qwen-2.5-Coder-32b for oTToDev.. what changes are you playing with?
It sure does!
I haven't been playing with anything big yet, the main thing was increasing the context length since Ollama by default only has 2k tokens for the context length for any model. That isn't necessary for oTToDev anymore actually since we handle that config in code, but I needed that for my other agent testing.
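For anyone curious what handling it in code looks like, here's a rough Python sketch against Ollama's REST API - passing num_ctx per request in options does the same thing as the Modelfile parameter:

```python
import requests

# Sketch: one-off completion against a local Ollama server with a larger context window.
# Assumes Ollama is on the default port and qwen2.5-coder:32b has already been pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:32b",
        "prompt": "Refactor this function to use async/await: ...",
        "stream": False,
        "options": {"num_ctx": 32768},  # overrides Ollama's 2k default for this request
    },
    timeout=600,
)
print(response.json()["response"])
```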
Yeah, but it failed for me. I used Qwen Coder for my Flutter project, and it messed up all the Dart code. 3.5 Sonnet is just on another level in comparison.
That sucks... I had good experience with qwen2.5 72B and 3.5 sonnet. I'll try out 32B today though. To be honest, 3.5 Sonnet was good also except for my Swift projects.
@@DesiChichA-jq8hx Super interesting, I appreciate you sharing! I wonder if you found a big issue with Qwen 2.5 Coder where it isn't fine tuned on much Dart code? Certainly could be, I'm not surprised that Sonnet is still clearly better with some languages.
@@ColeMedin The AICodeKing channel directly said the model is bad.
Your channel - the model is good.
Which channel goes to Veilguard then?
That failure also happens to me with the 7B model :( I have a higher-than-recommended system even for the 14B model, and the installation went perfectly, but I guess you gotta work on your fork of Bolt.new. I fixed the issue by refreshing the container in Docker, or a page restart generally works, but you gotta work on it man.
We are working on the prompting for sure, but really it comes down to a lot of local LLMs still struggling with the larger prompts necessary for Bolt!
Nice, the project is getting better, but I "only" got the resources to run 14B models, sad
What is your VRAM, RAM, and NVMe SSD? With custom code, you can split the load between VRAM, RAM, and NVMe SSD. I'm doing that in Windows, but my NVMe is 7400 MB/sec with DirectStorage enabled, so it is fast. I'm able to run Qwen2.5 32B bf16, which is not quantized. No Ollama, no LM Studio.
@@benkolev4290 14B is still great! And you can use OpenRouter if you want to access larger open source models through an API!
@@AaronBlox-h2t that’s super interesting! What are your tokens per second metrics and time to first token, if you don’t mind me asking?
Thanks!
How are things in Lutsk?
@@igorshingelevich7627 OK + -
32B still. That's large. I'll be excited if anything below 12B can be as good. That's the point, I'd say, where the federated agentic model wins.
Yeah that's totally fair!
I do not see Qwen 2.5 in the bolt.new-any-llm local fork via OpenRouter. To run it with Bolt, do I need to install Ollama and Qwen 2.5?
🔥🔥🔥
I'm curious though, is this new coder good at Python/AI/ML programming or mostly Java/TypeScript/etc…?
It's really good at Python/ML programming!
Can you try the 14b please
Yes I certainly will be diving into the 14b version as well!
But at 18ct in/out for the API, you could still run it all the time agentically, right? Or any other reasons why you need to run it locally for the agentic stuff you're talking about?
Yeah that is true, though the cost still does add up! Running models locally also helps when you need private AI (your company has compliance requirements, you have intellectual property you want to keep safe, etc.), and you can fine-tune local LLMs on your own data!
Qwen 2.5:32b runs really slow for me... I'm on M4 Max 32GB so not an ideal spec, but by comparison codellama:32b was fast and quite usable. Guess I need to experiment with some of the quant models
Just joined the Discourse chat so I'll stop leaving replies on YT!
You don't have to stop posting on YouTube but I appreciate you joining the Discourse! Yeah, for Qwen-2.5-Coder-32b I would recommend 64 GB, but the quant models would be good to test!
Please mention hw requirements for Mac too...😎
To be honest I'm not an expert at Mac, but I'd say you'd want at least 64 GB of unified RAM with the new M4 chip.
I just can't seem to get a preview window going; is there a place for troubleshooting this particular version? I'm a noob as you can see.
It would be wonderful if users had a service that's like a fusion of oTToDev and Novita AI: choose a GPU, choose an LLM, then create projects...
Yes!!! That is certainly one of the end goals!
Sorry... me again. 😂
So... my brain was thinking about GPUs and resources with LLMs. What I was thinking is: if, say, Qwen had 10 languages, could you essentially split that model into ten and give each language to an agent to specialise in? Like an advanced swarm of agents. So when you run a prompt, each agent is specialised in one language, but they come together to dev the code, and each model only uses the GPU for its part of the code. Maybe use a large LLM just for structure, then the agents produce the code one at a time.
I'm really about quality over speed for my agents. I'd happily prompt in the morning and have it all singing and dancing by the night. Thoughts?
Yes you could certainly do that, I love your thoughts here! You would just have to have a router agent that takes in each request and determines which agent to send the request to based on the language.
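Just to make it concrete, here's a toy sketch of that router - everything in it is hypothetical (real routing would probably use an LLM call or the file extension instead of keywords):

```python
# Hypothetical specialists: each would wrap a model/prompt tuned for one language.
def python_agent(task: str) -> str:
    return f"[python specialist] {task}"

def typescript_agent(task: str) -> str:
    return f"[typescript specialist] {task}"

AGENTS = {"python": python_agent, "typescript": typescript_agent}

def router(task: str) -> str:
    """Naive router: pick a specialist based on keywords, fall back to Python."""
    lowered = task.lower()
    for language, agent in AGENTS.items():
        if language in lowered:
            return agent(task)
    return python_agent(task)

print(router("Write a TypeScript hook that fetches user data"))
```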
Is it possible to edit existing project in your bolt.new ?
Not yet but that is something we are looking to have implemented very soon for the project!
Does it do dart language well?
I haven't tested myself with Dart yet actually! One other person reported it not doing the best with Dart but I'd try yourself and see!
After entering the prompt, it just writes the code in the chat interface instead of the code interface. Why? I never got this to work with Ollama
Which model are you using? Really small models will do this because they can't handle the bolt prompting.
What hardware are you using
I have two 3090 GPUs and 128 GB of RAM!
Can you pls try the 14b? 🙏🏻
I am planning on doing this in the near future for more content!
I wish I could run the 32b model faster on my machine.
I'd try the 14b parameter model if you can! Or use OpenRouter for API access to the models!
We need one sharp model, uncensored and fast, to be the new unicorn. If it's not uncensored, it doesn't work for me
Hey. For some reason, whenever I ask Ollama to create something, it gives me code instead of using the implemented code editor. I did what you told us to do to fix it by changing the message length, but it still didn't work
Interesting... I'm guessing the model you are using is too small (3b/1b). Which model are you using?
11:57 can Monday be used instead of Asana?
Yeah certainly! They have a great API too!
I saw other people test this and it's bad with Cline and with Aider as well. They are promising stuff same as Mistral but it's not performing. They have cooked benchmark tests.
Yeah I've heard! I think it's because Cline and Aider are prompted more specifically for Claude 3.5 Sonnet and other larger models. I could be wrong though, I know this model is certainly starting to be very polarizing haha
Can you code a React Native / Node.js mobile app with that Bolt??
Bolt is just meant for web development right now! But this would be so cool to have in the future!
All versions are quantized… even fp16… the full version is fp32 and needs at least two huge Nvidia Tesla cards.
Can anyone help me get through this installation for a fee? Please feel free to reach out. Thanks in advance
I'm not sure about the usefulness of AI tools like this. I mean when do we really need to generate simple projects from scratch? The reality is that we have a large codebase already and want AI to help us evolve it. And all within our IDE of course. No tool can do this yet afaik.
The benchmarks don't matter. I've tried it and it works much worse than GPT-4o
It depends on the tasks for sure!
I have seen some demos of Qwen-Coder and they were quite disappointing in terms of the quality of the code. Benchmarks are not realistic
Yeah benchmarks are not realistic often, though I have had great experience with Qwen-2.5-Coder-32b. I'm surprised you've seen demos where it isn't good tbh! A lot of others I have talked to have had good experience as well. Super weird this model is so polarizing haha
First from Latin America
Use Windsurf AI, thank me later
I will take a look!
Hi Cole. Pls refer to the email I sent you. Appreciate it man.
I will take a look!
AICodeKing is the only one honest about how bad this model is.
The model has been working great for me, it's super interesting it isn't for others! I'm just sharing what I've experienced even with examples in the video with oTToDev and my agent, certainly not being dishonest!
Oh come on... can you please ask AI how you can make a "normal" face for a thumbnail??? I've clicked it because the topic interests me, DESPITE the face you've made... almost skipped the video because of it...
I use normal faces too often, have to switch it up once in a while 😂
@ there is always someone complaining about something I guess 😉
All sizes and families of Qwen2.5 (not just the coders) are very bad at everything other than coding and math. Even Qwen2.5 72b instruct hallucinates so frequently and egregiously about so many different topics that it's less than useless to the general population. Qwen didn't discover anything new. All they did was get/steal tons of math and coding synthetic data (largely from Claude) and train obsessively on it. We shouldn't be applauding them for doing this, let alone exclaiming that they're changing the game. Open source AI isn't going anywhere until the coders learn to judge and review general purpose AI models, rather than selfishly focusing on their AI powered coding tools. More and more companies will do what Qwen did with Qwen2.5, effectively ending open source AI for the general population, forcing them to exclusively use proprietary AI models.
Very interesting perspective Brandon - I really appreciate you sharing and putting a lot of thought into it! I honestly have to disagree but I'm very curious to hear your thoughts more.
I actually think having a bunch of different LLMs that are fine tuned for specific tasks (coding, creative writing, etc.) is a good thing for the open source LLM ecosystem, so I welcome having model families like Qwen that are really good at some things, even if it means they aren't as good as a generalized model. Of course, as a coder and "AI agent" guy myself I'm biased towards focusing on Qwen as you mentioned, but there are certainly other models out there for other niches.
I also think that Qwen is a fantastic example of training a model to do fantastic things with synthetic data (code generated by Claude), which is a huge topic in the space right now and one of the big ways LLMs are going to keep getting better.
In general I simply love the idea of fine tuned models trained on synthetic data - it really is the frontier of LLM advancements right now especially for open source.
Let me know your thoughts on this - I'm all ears man!
@@ColeMedin A little code & math improves LLMs, but overtraining code & math really does destroy the broad functionality of an LLM. Like you, I hope they continue making specialized math, coding... LLMs with synthetic data, but they really have to leave their instruct/chat versions alone. For example, Q2.5 hallucinates far more than Q2.
When I start "Bolt.new-any-llm" with the AI model qwen2.5-coder:32b, I get this error: "There was an error processing your request." What can I do now? I hope you know the answer to this.
What is the error you get in the terminal where you started the site or in the developer console in the browser?
@@ColeMedin This error: "There was an error processing your request"
I have sent you an email because I can show it better there.
I am running Ollama behind an nginx proxy (incl. Traefik) and I had to increase the timeouts. Furthermore, I can't get this model to run stable while actually using the full 30k context window on a 3090, even with a lowered batch size. @colemedin are you sure all layers are in the GPU? And how does your Modelfile look?