Bravo... it seems you always have something really useful
Thanks! Will keep them coming
This Open-source model is truly revolutionary!🔆🔆
@@ragibhasan2.0 it's not Open Source
@@MarvijoSoftware I mean DeepSeek R1
🤗
It’s suuuuuch a relief to see your videos in feed, regarding something I’m interested in, and know instantly I WILL NOT be disappointed. RUclips is a long game and the cream always rises my good man. 💪
@@JayBallentine 🙏🏾 I appreciate you my good man!
It's so satisfying to see the spreadsheet of performance at the end 👍
😃
Appreciate your videos. Thanks for the comparisons and insights
@@TheBuzzati I appreciate your viewing 🙏🏾
#1 my ass. I tested it with 10+ prompts and it hallucinated a lot. It would suddenly stop generating responses in AI Studio. As of today, it's quite buggy and unreliable. I wouldn't recommend using it.
@@MrParad0x yep, I don't trust LM Arena in the slightest, especially after it ranked the weak o1-mini so high for coding
Can you compare OpenHands with the agents you've already tested? I’d like to see how it stacks up against them using DeepSeek V3 or R1.
@@moidrugag okay
Thank you
Please use this voice in every video, and also please do a Cursor (DeepSeek R1) vs Windsurf (Sonnet 3.5) video.
Alright, I'll queue it up. The problem is that Cursor + R1 don't support Composer. Cursor vs Windsurf: ruclips.net/video/duLRNDa-CR0/видео.html
I don't trust Chatbot Arena; they put Claude 3.5 Sonnet in 11th place
I also don't trust it, but I get it. It's not devs who vote on coding tasks, and people just cast random votes. That's why I believe actual benchmarks are needed, like the ones we run on the channel and the ones Aider runs. The problem with the Aider benchmarks is that LLMs can train on them because they're public
❤ Gemini 2.0
Thanks for the nice video. Please make a comparison of Roo Cline vs Aider vs Cursor
@@andrewandreas5795 A Roo-Cline video is incoming soon, after the R1 Architect video in a larger codebase
How the f is it number one on the arena? Like one or two days after release? I had to wait a lot longer to see Phi-4 on that list.
I asked myself the same question when it had just been released! So quick, who voted? Bots? Something might be off
No, DeepSeek R1 is number one; it's better than the new Gemini 2.0 Flash Thinking
I agree. That's the LMArena leaderboard which is based on random people voting
lol... those aren't three Rs, Gemini 😂😂
@@abdusalamolamide 😂
I think the new Gemini models have been tuned pretty well in terms of human preference (at least I like the newer models more than their older ones).
Claude imo is usually #1, then everyone else is about the same. However, from my usage of the model, it seems like Gemini 2.0 exp 1206's responses get pretty bad/mediocre after 60k tokens of context.
All models get dumber as tokens increase. Also, they start to output random characters past a certain context length