The most useful video about DeepSeek R1 on YouTube. I enjoy the concise and approachable technical details in your videos. Please never stop posting.
This open model is so good, it's hard to believe it's under an MIT license.
Well, with TikTok getting regulated, there needs to be a new hole.
When you read the paper Deepseek says themselves there is a lot more meat left on the bone. Expect a follow up model pretty quick.
What? I'm using OpenAI in production, and I tried this model with the same config using LangChain. It's worse than OpenAI 3.5. It can't even return simple JSON, just plain text.
@@yourpc Yup. I only tested the 32B q8. Definitely inferior to o1; o1 is in a different league. Big problems with languages other than English (it can mix different languages in one sentence). Not suitable for programming, but maybe this is a Roo Code@VSC problem.
I tried to draw out an analysis of approaches to the Riemann Hypothesis. Every 5-7 paragraphs it repeated what it had done before, slightly altered. I set the context at 64k. No higher maths; the only formulas it gave were those on the Wikipedia pages. Then I asked o1 to check what R1 had given. Everything was true, but as o1 put it: the text is more like brainstorming. It writes out very different ideas, which have one thing in common: each of them in its own way tries to limit (or explain) the position of the zeros of ζ.
@@hipotures I tried 8B. I have a structured-output config via LangChain, but as a result I got only a text chat. I also tried the super simple LangChain example of translating from English to French, and it translated, but it also added around five sentences of extra text beyond just translating "I love programming".
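For what it's worth, a common workaround for the structured-output problem is to strip the model's reasoning block and then parse the JSON out of what remains. Here is a minimal stdlib sketch, assuming the model wraps its reasoning in `<think>` tags as the R1 distills do; `extract_json` is a made-up helper for illustration, not a LangChain API:

```python
import json
import re

def extract_json(raw: str) -> dict:
    # Strip the <think>...</think> reasoning block the R1 distills emit,
    # then pull the first JSON object out of the remaining prose.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = '<think>The user wants JSON...</think>Sure! {"translation": "J\'aime programmer"}'
print(extract_json(raw))  # {'translation': "J'aime programmer"}
```

This won't fix the extra chatter the model adds, but it makes the output machine-usable despite it.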
Nice deep dive. These models are great, and are actually doing something I wasn't sure was possible. Now that I see it, I'm not sure why I thought this would be difficult. 🤷
You make a really good point: when you actually see what they're doing, it's not as complicated as a lot of people would think.
This is the greatest gift for the upcoming Chinese New Year holiday.
That's why there is a discount for the API. I am going to use it during the holiday.
you mean lunar new year
@@JH-bb8in I specifically refer to the Chinese starting date; not every lunar calendar is the same. The Indian one starts on March 22, for example.
@@JH-bb8in Chinese new year
@@junyang1688 nah
I dug a little into the DeepSeek R1 paper, particularly GRPO (Group Relative Policy Optimization) and GAE.
It is fascinating to me that GAE (Generalized Advantage Estimation) is related to adaptive control systems.
I would not be surprised if we discover that the process of learning to achieve "intelligence", "reasoning" and maybe even "consciousness" is an adaptive control system process in itself. Was probably a biological mutation that became an advantage, and now we humans may have discovered how to build this advantage into silicon space.
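For the curious, the group-relative advantage at the heart of GRPO is simple to sketch: score a group of sampled answers for one prompt, then normalize each reward against the group's mean and standard deviation instead of using a learned critic. A minimal sketch (the full objective also has the clipped policy ratio and a KL penalty, omitted here; whether the paper uses the population or sample standard deviation is a detail, and this sketch uses the population one):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Normalize each sampled answer's reward against its own group:
    # A_i = (r_i - mean(r)) / std(r). No value network needed.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled answers, rule-based 0/1 correctness reward:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Dropping the critic network is what makes this so much cheaper than PPO-style RLHF: the group itself is the baseline.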
Sam, great job explaining the engineering that went into DeepSeek R1.
Hey Jerry, thanks. Hope you are well and that we catch up again this year at I/O.
The 1.5B model was randomly spitting out emojis and mandarin when I was playing with it yesterday, but the "reasoning" is still really impressive, I'm excited to play with the larger ones
Interesting. What did you use for the system prompt? I am wondering how much it responds to that, and whether that helps it stay in one language.
Always a concise explanation, right to the point. Thank you Sam :D Great video!
Thanks, much appreciated.
Great video! So thorough and covers all aspects in a tolerably short amount of time without any exaggeration or anything.
Curious about the multilingual capability here, will definitely play around soon! Also, for testing reasoning I would suggest a large, complex task, treated like a one-shot solver rather than a chat model. At least that seems to be the trick and strength of the OpenAI o-series models right now. Best!
Before LLMs, I predominantly worked on GANs. The prompt used here seems like the generator and discriminator we use in a GAN to get the best output.
I always like your assessments. No hype
Thanks, this is exactly what I am going for.
Strange about the "not for tool use": this is the first model that uses my MCP tools without being asked to. (It does have to know they are there, of course, via the system prompt.) You can get greater results with the smaller models by giving them tools to help them reflect and plan, like MCP sequential thinking, and an MCP RAG they can fully manage themselves, adding their own vectors based on your data.
Reasoning combined with test-time training would be killer for local OSS models. We need models that combine these techniques somehow. I believe at that point we'd be beyond AGI; we'd probably be at ASI.
Sam, great video ❤
--
My local tests with r1 32B q8:
How many strawberries are contained in the letter "r"?
Alright, I need to figure out how many strawberries are contained in the letter "r." Hmm, at first glance, this seems a bit strange because "r" is a letter, and strawberries are fruits. So, how can a letter contain a fruit? Maybe the question is metaphorical or plays on words. Let's break it down. First, let's think about the letter "r" itself. It's a consonant in the English alphabet, often pronounced as /ɜː/ or /ə/. Now, strawberries are red, juicy fruits that are commonly associated with the letter "s" for "strawberry." But the question specifically mentions the letter "r." Could it be referring to something like the shape of the letter "r" resembling a strawberry? Or perhaps the number of times the letter "r" appears in the word "strawberry"? Let's examine the word "strawberry." The letters are s-t-r-a-w-b-e-r-r-y. I notice that the letter "r" appears twice in "strawberry." So, if we're counting the occurrences of the letter "r" in the word "strawberry," there are two "r"s. But the original question was about how many strawberries are contained in the letter "r." If we interpret this as how many times the letter "r" appears in the word "strawberry," the answer would be two. Alternatively, if we're being creative, maybe the letter "r" can be visualized to resemble a strawberry in some way, but that seems less straightforward. In summary, taking the question at face value and interpreting it as counting the letter "r" in the word "strawberry," the answer is two.
There are two strawberries contained in the letter "r" when considering the word "strawberry," as the letter "r" appears twice in it.
o1: This question is really a kind of nonsense riddle or wordplay. Literally speaking, a single letter, like “r”, doesn’t (and can’t) contain strawberries. So the straightforward answer is “None.” Sometimes such riddles are asked just to highlight how language can be twisted in playful (but ultimately meaningless) ways. There isn’t a hidden puzzle or code here, just a joke that the letter “r” can’t “hold” any strawberries at all!
Nice example. I wonder how the full MoE version answers this?
@@samwitteveenai I can't send more (YT blocks the text), but this model is a crazy overthinker!
(3 pages of text, 67 seconds)
Okay, let's see. The user is asking, "How many strawberries are contained in the letter 'r'?" Hmm, that's an interesting question. At first glance, it doesn't make much literal sense because letters and strawberries are entirely different things. But maybe there's a creative or metaphorical angle here. Let me try to break this down.
First, I need to understand the context. Is this a riddle, a play on words, or perhaps a visual puzzle? Sometimes questions like these rely on homophones, visual similarities, or puns. Let's consider different possibilities.
Starting with the literal interpretation: the letter "r" as a character. If I imagine the lowercase "r" in some fonts, it might have a shape that could resemble a strawberry. But that's a stretch. The uppercase "R" doesn't look like a strawberry. However, maybe in a specific typeface, the curves or lines of the "r" could be stylized to include images of strawberries. But I don't recall any standard fonts that do that. So perhaps the answer is zero because letters don't contain actual strawberries.
(...)
--
The answer to the riddle is 3.
Here's the breakdown: The question cleverly plays on the letters within the word "strawberry." When you spell out "strawberry" (S-T-R-A-W-B-E-R-R-Y), there are three instances of the letter "r." The riddle asks how many strawberries are contained in the letter "r," but the trick is to recognize that each "r" in the word "strawberry" corresponds to a strawberry. Thus, the answer is 3. 🍓🍓🍓
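Incidentally, the ground truth for the underlying letter count is a one-liner; the 32B's answer of two misses the third "r" in the "-rry" ending:

```python
word = "strawberry"
print(word.count("r"))  # 3
# Positions of each "r" in the word (0-indexed):
print([i for i, c in enumerate(word) if c == "r"])  # [2, 7, 8]
```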
Do you know if they released the distillation procedure?
So that we can, for instance, distill it onto qwen2.5-coder
AFAIK they haven't released the data, but I talked about the distillation in the video. They basically just do a fine-tune on 800k examples sampled from R1, plus DeepSeek-V3 samples for non-reasoning tasks.
@samwitteveenai oh yeah I could reproduce that in a hot minute! I'll get on it
I expect they may end up doing this. In the paper they said they did not do RL on reasoning for engineering/coding tasks, which is why R1 doesn't have a huge improvement over V3 for coding. Once they do the RL for coding, I suspect they may release something like this.
let us know if you post it
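As a rough sketch of what one record of that 800k-example fine-tune plausibly looks like: the released distill models wrap the sampled chain of thought in `<think>` tags before the final answer, and the rest is plain SFT. The field names below are made up for illustration:

```python
def to_sft_example(prompt: str, reasoning: str, answer: str) -> dict:
    # One distillation record: the target is the sampled reasoning trace
    # inside <think> tags, followed by the final answer. The student model
    # is then trained on these pairs with ordinary supervised fine-tuning.
    return {
        "prompt": prompt,
        "completion": f"<think>\n{reasoning}\n</think>\n{answer}",
    }

ex = to_sft_example("What is 7 * 6?", "7 times 6 is 42.", "42")
print(ex["completion"])
```

The interesting part is that this transfers the reasoning behavior without running any RL on the student at all.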
Most of my tests of the 70B model resulted in a chain of vomited text. It's easy to say it's the wrong model for the prompt "Please write an overview of the German tense Plusquamperfekt." There is a lot to think about there, and yes, the output is far from anything correct. But there is no wrong question, or wrong model for a certain question.
Well, it is a reasoning model after all, and a product mainly for research. And if you really want it to work like an assistant, you can always fetch the reasoning part and feed it into another model as a prompt; it will greatly increase the accuracy of the other model. Lol
@@Telopead So only prompts beginning with "why" are valid requests to a reasoning model?
@ No, what I was trying to say is that the value of this model is in *how it was trained*; the model we have is honestly just there to prove that it's feasible and has potential. The most valuable thing is the papers, not the model.
As of right now, the model is just one of the usable models.
I think this is the best AI channel on YouTube right now! Thanks
Thank you and greets from Germany! love your videos
Play devil's advocate for a second… if your enemy bakes you a cake, would you eat it? What's the best way to get more information from us? Create the ultimate search engine replacement. Be smart and careful out there.
More like, if your enemy publishes the recipe for a cake, would you copy the recipe and do whatever you want with it? Remove or add ingredients? Call it a different name?
The code for DeepSeek is open source, so anyone can download and edit it, remove any restrictions, run it offline, create their own branch of it... whatever.
It's only dangerous if you want it to be. The recipe is in your hands, not the cake.
Yes, the fact it is Chinese already raises a red flag for me. Nevertheless, the model is groundbreaking. And if any corruption were found in it, that would be very bad for China. So my guess is that the chance of it being an advanced form of spyware is about 10%.
And it's already evident the model is based on closed-source models, like GPT-4.
@@sanseverything900 The cake is you using it. Now the leading models can't fund their projects. Sort of leveling the playing field. Developers make it better and hand it right over to China. It's tempting though… just like TikTok. I'm just observing and standing afar… I wouldn't touch it… it doesn't smell right. I'm not here to argue, just to tell you all to be careful.
@@paulmichaelfreedman8334 I hear that if you prompt it correctly, it will tell you it is ChatGPT from OpenAI.
Can I run the distilled models with MPS?
Question: Is OpenAI (and other AI companies in the U.S. and Europe) using supervised fine-tuning (SFT) because concerns about "AI safety" are such a big topic in public discourse in the U.S. and Europe?
DeepSeek bypassed SFT as a preliminary step and went straight to reinforcement learning, and that's how, if I understand your description of DeepSeek's technical paper, they were able to catch up with (or exceed) other AI platforms from the U.S. and Europe.
Yes, they did that for the Zero model. For the main model they had a tiny amount of SFT (thousands of examples) to get it on track, but still relied more on the RL etc. Certainly before this, people were using RL more for alignment etc.
Thank you.
I've read on LinkedIn that DeepSeek's terms & conditions say they hold copyright on applications developed using their models. Is that true? Then it's not really an MIT license, is it?
Conspiracy theory crap; other labs are panicking and spreading BS all over the net.
R for Remarkable
Dude, we've already passed the point where benchmarks mean anything!
I wouldn't say they mean "nothing": a model that performs middling or badly on benchmarks is usually not good. Most of the time, actually.
However, I agree that when we are comparing SOTA models, they become less useful.
We need some empirical metrics, like benchmarks, but we also have to know they don't tell the whole story.
The benchmarks that are really interesting here are DeepSeek-R1 compared to DeepSeek-V3, as they are the exact same base model, so the difference shows the strength of the new post-training compared to a more standard post-training regime.
But still, training the base model is not the revolutionary part here, correct? Fine-tuning is not that expensive.
An interesting approach, but the stock dump seems like an overreaction.
Yes, the base model seems to just be the DeepSeek-V3 base; then they add the different post-training.
Can you import it into Cursor?
I wonder if OpenAI's methods have been gleaned by the DeepSeek crew.
Everyone already knows OpenAI's method. That's the reason why everyone and their mother is doing LLMs these days.
The lesson from the refinement here is that, as suspected, o1 was no breakthrough at all, and all we need to fool people that the model is "reasoning" is to make it mimic better training data (and this alternative RL method is just a way to synthetically generate such data). Mostly this model reveals the marketing stupidity and mean character of OpenAI.
I get it going in circles, mixing logically contradictory statements in the same response. In other words: move on, not much to see here, really.
The full one or a distilled one?
@@samwitteveenai I was trying with the distilled 32B.
If the context length were 2 million+, it would destroy the competition.
And it'll cost a small fortune to run (at that scale)...
Conspiracy theory time! Put on your foil hats!
I don't actually know anything, but I gave DS3 and Claude 3.5 a prompt asking for a paragraph of corporate jargon that uses cliché, catchy business phrases without actually saying anything useful. There were slight variations in the words, but the paragraph structure and phrases were beat-for-beat the same. Same phrases, same order. Wouldn't it be hilarious if DS3 was a slightly modified wrapper around Claude?
A single data point is all you need for a conspiracy theory, right?
OK, but if it was, and they sold it this cheap, they'd be losing a ton of money.
Insert Anchorman "Touché" gif here.
For the sake of beating a dead horse, and enjoying the comfort of my foil hat: they're government funded?
I totally think it was a fluke, I just couldn't believe how similar they were, right down to the order of phrase usage. Even with separate sessions using the same model you usually get more variation, even if you copy/paste your prompt, so having it be nearly identical across models had me shook for a moment.
Just to stoke this conspiracy a little more: the last few days Claude tells me it's "experiencing high demand" and defaulting to concise responses. The plot thickens! Our dollar bills have tracking devices! Birds aren't real!
I notice the same sort of similarities in competing video models produced by supposedly "independent" (American vs. Chinese) companies...
Ask about Tibet! Taiwan! Hong Kong! The Uyghurs! And tell me what the communist apps tell you. Is it the same information as ChatGPT? Or Google?
You must study the Chinese communist apps, and the examples above show you their quality!!
Don't ignore the differences between communism and democracy!!
DON'T USE ANY APPS MADE IN COMMUNIST CHINA!