New Video Generation AI model - ruclips.net/video/CwvN2Ccddgk/видео.html
I have noticed this - OpenAI just tries to steal the spotlight whenever Google does something
attention is all you need maybe
always
@pixelperfectpravin lmao the reference
I think the smaller context window, for now, is about letting people know about these models, letting them use them, and getting their feedback. Then later they'll turn it into another Pro model, like Gemini 2 Pro. So this could be a test model (as the name already suggests), ready to go commercial very soon.
Another model coming just one week later means they are speeding up their work on them.
The more awesome thing is that Google uses its own TPU chips. Sam and OpenAI will just burn Microsoft's cash flow on capex costs.
Interestingly enough, Magnus is actually mentioned 3 times on that page (the 3rd time it's just 'Carlsen'). But, again, the 3rd mention is not covered in the chunk of text you copy-pasted into the prompt; it came later in the sources. But I wonder if he was mentioned in that chunk a third time as well, just not by name..? An LLM should probably be able to catch that. I'm too lazy to check though, so I'll just leave this thought here
Woah. That's very interesting. Let me check it again.
Really appreciate you covering this
Chatbot Arena must be horribly broken if Claude models are not top in coding.
The McKinsey comment is hilarious and so true
That one result it's not (tied for) leading on is not, as you say, "for style"; it's rather the leaderboard's attempt at _controlling_ for style, so that only substance counts -- so essentially exactly the reverse. It's saying that when you discount human preferences for, say, a longer response even when it isn't saying anything more, or for the way it uses markdown, etc., Gemini drops a rank to #2 in the overall category. Still pretty impressive, of course.
This is how Chatbot Arena describes it in their blogpost about the criterion:
" The goal here is to understand the effect of style vs substance on the Arena Score. Consider models A and B. Model A is great at producing code, factual and unbiased answers, etc., but it outputs short and terse responses. Model B is not so great on substance (e.g., correctness), but it outputs great markdown, and gives long, detailed, flowery responses. Which is better, model A, or model B?
The answer is not one dimensional. Model A is better on substance, and Model B is better on style. Ideally, we would have a way of teasing apart this distinction: capturing how much of the model’s Arena Score is due to substance or style."
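Roughly the mechanism behind that "style control", as a hedged sketch (this is not LMSYS's actual code; the feature set, skills, and coefficients below are made up): fit a Bradley-Terry-style logistic regression in which per-battle style differences, such as response length, get their own coefficient alongside the model skill terms, so the skill estimates reflect substance rather than presentation.

```python
# Toy illustration of style-controlled preference ratings (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_models, n_battles = 4, 2000
skill = np.array([0.0, 0.3, 0.6, 1.0])     # hypothetical latent skills
a = rng.integers(0, n_models, n_battles)   # model shown as "A" in each battle
b = rng.integers(0, n_models, n_battles)   # model shown as "B"
len_diff = rng.normal(0, 1, n_battles)     # style feature: length difference A - B

# Simulated voters reward both skill and sheer length.
p_win = 1 / (1 + np.exp(-(skill[a] - skill[b] + 0.5 * len_diff)))
y = rng.binomial(1, p_win)

# Design matrix: +1/-1 indicators for the two models, plus the style covariate.
X = np.zeros((n_battles, n_models + 1))
X[np.arange(n_battles), a] += 1
X[np.arange(n_battles), b] -= 1
X[:, -1] = len_diff

clf = LogisticRegression(fit_intercept=False).fit(X, y)
print("skill estimates (style-controlled):", clf.coef_[0][:n_models])
print("style (length) coefficient:", clf.coef_[0][-1])
```

With the style coefficient soaking up the length effect, a model can't gain rank just by being wordier or prettier.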
I regularly follow your channel, keep it up. But do your thumbnails always have to feature Elon Musk and Sam Altman when talking about their businesses? There are a lot of hard-working people behind them who are building these companies.
Gemini was "refined" into oblivion it has little to nothing to offer compared to other models i tried it myself and it was reduced to nothing more then a lazy persons prompt gadget sadly enough
so, I could make a bot that runs my model locally, asks the same question remotely, and regardless of which answer is actually "best", simply votes for the one that matches my local output, skewing the results to make my model look best...
And this is why clever people can't have nice things.
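Concretely, the flaw being described needs nothing more than a similarity check. Here is a minimal sketch of just the matching logic (no arena interaction; the strings are made up), to illustrate why anonymous preference voting can be gamed:

```python
# Given the two anonymous answers in a battle, pick whichever is closer to
# what your own model produces locally, regardless of quality.
from difflib import SequenceMatcher

def pick_own_model(local_answer: str, answer_a: str, answer_b: str) -> str:
    """Return 'A' or 'B' depending on which response looks like the local model's."""
    sim_a = SequenceMatcher(None, local_answer, answer_a).ratio()
    sim_b = SequenceMatcher(None, local_answer, answer_b).ratio()
    return "A" if sim_a >= sim_b else "B"

print(pick_own_model("Paris is the capital of France.",
                     "Paris is the capital of France.",
                     "The capital of France is Paris."))  # -> 'A'
```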
Your fav Indian chess player, apart from Vishy?
Arjun Erigaisi - mostly, at this point! But nothing is so solid, tbh. I used to root a lot for Nepo (RUS), but I think his time for championships is gone! Wesley is another favorite player - very humble!
Arena votes are not really a good way to assess models. They're subjective.
McKinsey employees - Mujhe kyu toda? (Why did you break me?) 😂😂
Gemini is leading within the margin of error, but nevertheless it is leading
Try this prompt... Watch it fail dismally. Only o1-preview comes close to success. (A quick checker for the pairing rules is sketched after the example list below.)
Find pairs of words where:
1. The first and last letters of the first word are different from the first and last letters of the second word. For example, "TeacH" and "PeacE" are valid because:
The first letters are "T" and "P" (different).
The last letters are "H" and "E" (different).
2. The central sequence of letters in both words is identical and unbroken. For example, the central sequence in "TeacH" and "PeacE" is "eac".
3. The words should be meaningful and, where possible, evoke powerful, inspiring, or thought-provoking concepts. Focus on finding longer words for a more varied and extensive list.
Examples
1. Banged Danger
2. Bated Gates
3. Beached Reaches
4. Belief Relied
5. Blamed Flames
6. Blamed Flamer
7. Blazed Glazer
8. Blended Slender
9. Bolted Jolter
10. Boned Toner
11. Braced Traces
12. Branded Grander
13. Braved Craves
14. Braved Graves
15. Braver Craved
16. Brushed Crusher
17. Busted Luster
18. Busted Muster
19. Causes Paused
20. Chased Phases
21. Chaser Phased
22. Cracked Tracker
23. Craved Graves
24. Crated Grates
25. Creamy Dreams
26. Created Greater
27. Dared Bares
28. Dancer Lanced
29. Dreamed Creamer
30. Fabled Tables
31. Faith Baits
32. Fallen Baller
33. Favoured Savourer
34. Famed Gamer
35. Famed Cameo
36. Fared Cares
37. Fasten Master
38. Fated Gates
39. Faved Caves
40. Feared Bearer
41. Fiery Piers
42. Fired Tires
43. Flared Glares
44. Flashed Clashes
45. Flipped Slipper
46. Foamed Roamer
47. Folded Bolder
48. Founder Sounded
49. Gifted Lifter
50. Gleaned Cleaner
51. Graced Traces
52. Hades Wader
53. Hardened Gardener
54. Hated Fates
55. Laced Racer
56. Laced Races
57. Lasted Faster
58. Leader Beaded
59. Leaves Heaved
60. Lighted Fighter
61. Lives Given
62. Manned Banner
63. Mailer Sailed
64. Mended Bender
65. Missed Kisses
66. Mounted Counter
67. Moved Lover
68. Named Games
69. Paced Laces
70. Paced Racer
71. Paced Races
72. Pained Gaines
73. Painted Fainter
74. Parched Marches
75. Placed Glaces
76. Plates Slated
77. Popes Roped
78. Races Faced
79. Racer Laced
80. Rarer Cares
81. Rated Dates
82. Raver Waves
83. Rested Tester
84. Saved Waver
85. Seated Beater
86. Sailer Wailed
87. Sainted Painter
88. Seeder Needed
89. Slayer Played
90. Tainted Painter
91. Tamed Games
92. Tailed Raider
93. Teach Peace
94. Tested Fester
95. Tinker Linked
96. Tired Siren
97. Traced Graces
98. Treated Greater
99. Warmed Farmer
100. Wasted Baster
101. Watched Catcher
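A minimal checker for rules 1 and 2, assuming "central sequence" means everything between a word's first and last letters, as the TeacH/PeacE example suggests (rule 3 about meaning and inspiration can't be checked mechanically):

```python
def central(word: str) -> str:
    """Everything between the first and last letters, lowercased."""
    return word[1:-1].lower()

def is_valid_pair(first: str, second: str) -> bool:
    a, b = first.lower(), second.lower()
    return (
        len(a) >= 3 and len(b) >= 3
        and a[0] != b[0]              # rule 1: first letters differ
        and a[-1] != b[-1]            # rule 1: last letters differ
        and central(a) == central(b)  # rule 2: identical central sequence
    )

# spot-check a few pairs from the list above
for pair in [("TeacH", "PeacE"), ("Banged", "Danger"), ("Lives", "Given")]:
    print(pair, is_valid_pair(*pair))
```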
Many of the new models now were simply trained to pass benchmark questions. For example, the new Qwen 2.5 model (which was once a favorite for coding) passed all the benchmark questions; you would think it's better than Anthropic's Claude, but it's complete trash when used in real life 😅
Absolutely right, 😂
attention is all you need
Gave it a shot. Sucked as usual. Was hoping deep thinking with web access would give it the edge, but it sure didn't... 4o (not o1) did a much better job...
Gemini is useless; it is so censored that it refuses to do things like hash a simple password.
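For context, the refused task is a one-liner with Python's standard library (the password string below is made up, and real password storage would use a salted KDF such as hashlib.scrypt rather than a bare hash):

```python
import hashlib

# bare SHA-256 of a hypothetical password string, just to show how trivial the task is
print(hashlib.sha256("hunter2".encode()).hexdigest())
```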
I uploaded an entire textbook into the context window and now I ask questions of it daily with useful results.
Me too. I use it daily in similar and other ways. Sliders down and it does its job in a huge context.
It's not that impressive at coding