Why "Grokking" AI Would Be A Key To AGI

  • Published: Sep 10, 2024
  • Check out HubSpot's Free ChatGPT resource to power up your work efficiency🔥: clickhubspot.c...
    Check out my newsletter:
    mail.bycloud.ai
    Are We Done With MMLU?
    [Paper] arxiv.org/abs/...
    Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
    [Paper] arxiv.org/abs/...
    Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
    [Paper] arxiv.org/abs/...
    Grokfast: Accelerated Grokking by Amplifying Slow Gradients
    [Paper] arxiv.org/abs/...
    [Code] github.com/iro...
    This video is supported by the kind Patrons & YouTube Members:
    🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Robert Zawiasa, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi, Hector, Drexon, Claxvii 177th, Inferencer, Michael Brenner, Akkusativ, Oleg Wock, FantomBloth
    [Discord] / discord
    [Twitter] / bycloudai
    [Patreon] / bycloud
    [Music] massobeats - noon
    [Profile & Banner Art] / pygm7
    [Video Editor] Silas

Comments • 295

  • @bycloudAI
    @bycloudAI  A month ago +14

    Check out HubSpot's Free ChatGPT resource to power up your work efficiency🔥clickhubspot.com/hyx

    • @Scoring57
      @Scoring57 A month ago

      In a way, isn't this a 'good' thing? That these AIs are so 'bad' but still so useful? It feels to me that that still makes *Transformers / GPTs* the *best*, safest candidates for "AI". All they need to do is engineer it further and make it less of a black box to gain more control and create alignment, and from there you have a powerful yet *LIMITED* AI that can't actually think, so you don't have to worry about it plotting or doing something you didn't intend. You can even arbitrarily limit its intelligence to whatever you like while using it as a facts repository, and make use of its pseudo/simulated reasoning.
      This also means this world-breaking, world-altering "singularity" *isn't* inevitable, and we might be able to evolve alongside AI in a more controlled way as we learn and mature with the technology, and hopefully actually get to master it. Which is favorable, because right now we're barely figuring it out as we go, and it's an extremely powerful tool that can do a lot of harm.
      Also, how did this comment post 2 days ago when this video went up just a day ago??

  • @FireFox64000000
    @FireFox64000000 A month ago +468

    So essentially Grokking is just making the AI bash its head into a concept over and over and over again until it finally understands it. Guys, I think I've been grokking for years.

    • @ctwolf
      @ctwolf A month ago +25

      Same friend. Same.

    • @tiagotiagot
      @tiagotiagot A month ago +12

      Grokking is the phase-transition, the point where the other shoe finally drops

    • @Rockyzach88
      @Rockyzach88 A month ago +8

      Percussive learning/maintenance - like hitting your TV to make the picture stable

    • @shukrantpatil
      @shukrantpatil A month ago +3

      we are adding the human "essence" to it now

    • @user-cg7gd5pw5b
      @user-cg7gd5pw5b A month ago +6

      Still waiting for the 'understanding' part...

  • @dhillaz
    @dhillaz A month ago +174

    New pitch to investors: "Just 10x more GPUs bro, 10x GPUs to push past this stagnant validation and we will reach grokking, I promise"

    • @Napert
      @Napert A month ago +46

      99% of LLM training stops right before achieving grokking

    • @thatonedude6596
      @thatonedude6596 22 days ago

      @@Napert full circle

  • @manuelburghartz5263
    @manuelburghartz5263 A month ago +128

    Putting out 3 videos in 4 days that are so complex and labor-intensive is crazy. Love the vids man, keep up the grind

    • @bycloudAI
      @bycloudAI  A month ago +68

      watch me release the next video next month LOL

    • @Ketamineprod
      @Ketamineprod A month ago

      @@bycloudAI u r awesome bro, take ur time and drop whenever u feel like it!

    • @markdatton1348
      @markdatton1348 A month ago +18

      @@bycloudAI just enough time for us to GROK these vids

    • @HingalshDealer
      @HingalshDealer A month ago +10

      @@markdatton1348 bro grokked the grok term

    • @dcaban85
      @dcaban85 28 days ago

      probably AI generated

  • @blisphul8084
    @blisphul8084 A month ago +144

    So basically, it's like how if we read something 1 or 2 times, we might remember the answer, but if we read it or run it through our head 20x, we are more likely to fully understand the logic behind what we're reading. Given that higher-parameter LLMs are more expensive to do this with, I wonder if small models will actually be the most capable in the long run.

    • @anthony4223
      @anthony4223 A month ago +8

      I kinda think smaller and maybe slightly more specialized models, for things like medical or some such, would be more popular in the long-ish run

    • @bloopbleepnothinghere
      @bloopbleepnothinghere A month ago +6

      There are papers showing that as you scale up the volume of general information you feed an LLM, the quality of the responses it puts out goes down. Specialized, smaller models talking together is a viable path. But if you expect an LLM to reason and "comprehend" rather than infer from memorized data, I don't think we will see that from LLMs as we understand them today.

    • @mirek190
      @mirek190 A month ago +2

      @@bloopbleepnothinghere have you seen gemma 2 2b? ... that model is so small and still multilingual, has quite strong reasoning, and knows math ... crazy

    • @blisphul8084
      @blisphul8084 A month ago +3

      @@mirek190 the fact that it's fluent in Japanese is insane. And it runs reasonably fast on even a 5-year-old laptop CPU.

    • @a_soulspark
      @a_soulspark A month ago

      I'd point out that in the Alice in Wonderland paper he showed, the table (10:19) shows Llama 3 70B got the 3rd-best performance while Llama 3 8B got 0% right.
      I'd argue that grokking requires more parameters to store the more nuanced information... but the idea of

  • @cdkw2
    @cdkw2 A month ago +60

    I love how I have to see your videos 3 times to understand them, kinda like when I was first starting out with calculus!

    • @blisphul8084
      @blisphul8084 A month ago +23

      This is exactly what grokking is. See something enough times and you'll understand it on a more fundamental level.

    • @cdkw2
      @cdkw2 A month ago +16

      @@blisphul8084 The fact that I didn't even realize that I did this...

    • @BrainSlugs83
      @BrainSlugs83 A month ago +8

      Incredibly ironic given the topic of overfitting training. 😅

    • @SumitRana-life314
      @SumitRana-life314 A month ago +5

      Bro is Raw Grokking ByCloud videos.

    • @jonathanberry1111
      @jonathanberry1111 26 days ago

      @@cdkw2 You hadn't yet watched it 3 times! I, being smarter (read: older, wasting more time on YT and overthinking), got it right away. Look, this is how INTJs and INTPs are more intelligent (IQ): more time thinking thinking thinking!

  • @JazevoAudiosurf
    @JazevoAudiosurf A month ago +37

    This Fourier transform filter thing is just nuts. When I see stuff like this, or AlphaProof, or PRMs, I can't imagine we wouldn't reach huge intelligence leaps beyond AGI in the next 5 years. I mean, all it takes is a simple mathematical trick; that's the level of infancy AI is currently in. Look at other fields of science, like materials science: even in the 60s, just to figure out materials for LEDs, they would go through an order of magnitude more struggle than for the simple AI breakthroughs of this year's papers. Or look at physics, space, semiconductors. And AI, on a software level, is so much easier to experiment with than those things.

    • @seamon9732
      @seamon9732 A month ago +2

      That's assuming 2 things:
      1- That we have enough parameters to simulate the equivalent number of synapses/connections in a brain (100 to 1,000 trillion).
      2- That the recent research into microtubules doesn't mean that they are also involved in processing/reasoning. If that is the case, and there are hundreds to a thousand microtubules per axon (the transmitting part of a synapse) and a bit fewer in dendrites (the receiving part), then you have to multiply the above trillions some more.

    • @mirek190
      @mirek190 A month ago +6

      @@seamon9732 Our brain has around 100 trillion connections... true... BUT for thinking we use only 20% of them; the rest is used for keeping our body alive.

    • @user-fr2jc8xb9g
      @user-fr2jc8xb9g A month ago +1

      Yeah, but sometimes simple tricks can take years or decades to discover...

    • @BlueKanary
      @BlueKanary 25 days ago

      @@mirek190 Is this 10% myth really still floating around? Even if "only 20%" is for cognitive work, keeping the body alive is no joke. Just keeping a stable heartbeat would take a good portion of dedicated brainpower.

    • @cajampa
      @cajampa 17 days ago

      @@BlueKanary Why? Don't brain-dead people have a heartbeat?
      It is a pretty low-level function, with built-in automatic electric pulse generator cells right there in the heart muscle, driven by lower functions in the brain stem.
      And even if those functions in the brain stem are gone, as long as a respirator is supplied the heart can function on its own.

  • @-mwolf
    @-mwolf A month ago +15

    Grokking is an instance of the minimum description length principle.
    If you have a problem, you can just memorize a point-wise input-to-output mapping.
    This has zero generalization.
    But from there, you can keep pruning your mapping, making it simpler, a.k.a. more compressed.
    The program that generalizes best (while performing well on a training set) is the shortest.
    → Generalization is memorization + regularization.
    But this is of course still limited to in-distribution generalization.
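    A minimal sketch of that dynamic, assuming the standard modular-addition grokking setup from the literature (everything below is illustrative, not from the video): train far past perfect training accuracy, with weight decay acting as the regularizer that keeps compressing the memorized mapping, and validation accuracy jumps late.
    ```
    # Illustrative grokking toy experiment (assumed setup: modular addition,
    # as in the grokking literature; hyperparameters are guesses).
    import torch
    import torch.nn as nn

    P = 97  # modulus
    pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
    labels = (pairs[:, 0] + pairs[:, 1]) % P
    perm = torch.randperm(len(pairs))
    train_idx, val_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

    model = nn.Sequential(
        nn.Embedding(P, 128),           # shared embedding for both operands
        nn.Flatten(start_dim=1),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, P),
    )

    # Weight decay is the "regularization" half of the recipe: it keeps
    # pruning the memorized point-wise mapping toward a shorter program.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100_000):  # far past 100% training accuracy
        opt.zero_grad()
        loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
        loss.backward()
        opt.step()
        if step % 5_000 == 0:
            with torch.no_grad():
                acc = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean()
            print(f"step {step}: train loss {loss.item():.4f}, val acc {acc:.3f}")
    ```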

    • @DanielSeacrest
      @DanielSeacrest 23 days ago

      People think memorisation or overfitting is a bad thing and we need to figure out how to prevent it entirely, but really it's a stepping stone on the path to grokking and perfect generalisation.

  • @PYETech
    @PYETech A month ago +11

    One of the best channels out there that has real value to everyone. You NEED a raise!

  • @rubncarmona
    @rubncarmona A month ago +8

    Grokking is akin to the evolution of language even after a population has become fully literate. Every once in a while someone figures out a new connection between seemingly unrelated concepts, uses one of them in a new context by mistake, or because they forgot the intended word, etc. This continuous increase in information entropy even after exhausting the parameter space reminds me a lot of what some scientists say about information in degenerate-era black holes.

  • @ytubeanon
    @ytubeanon A month ago +12

    Grokking our way to AGI...
    11:33 the Grokfast paper has mind-boggling potential, so who's using it? When will we see its results?

    • @KeinNiemand
      @KeinNiemand 14 days ago

      Probably nobody is using it

  • @74Gee
    @74Gee A month ago +5

    I find that Alice in Wonderland-type responses can be significantly improved by system-prompting the model to form data structures from the known facts and then infer from that structure - something like this (a minimal version):
    ```
    You are tasked with solving complex relationship questions by first mapping all known facts into a JSON structure and then using this structure to infer answers. When given a question, follow these steps:
    1. Extract all given facts.
    2. Create a JSON structure to represent these facts.
    3. Use the JSON structure to navigate and infer answers.
    4. Provide clear and logically consistent responses based on the JSON file.
    ```
    I used this technique very successfully when working with gossip analysis and determining the source of gossip, but quickly realized its benefits in other logical fields.
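    For illustration, a rough sketch of the same map-then-infer idea done by hand in Python rather than by the model (the facts and helper below are invented for the example, using an AIW-style question: "Alice has 3 brothers and 2 sisters; how many sisters does Alice's brother have?"):
    ```
    # Steps 2-3 of the prompt above, hard-coded: map the facts, then traverse.
    fact_map = {"alice": {"gender": "female", "brothers": 3, "sisters": 2}}

    def sisters_of_brother(person: dict) -> int:
        # A brother shares the subject's sisters, and the subject herself
        # counts as one more sister if she is female.
        return person["sisters"] + (1 if person["gender"] == "female" else 0)

    print(sisters_of_brother(fact_map["alice"]))  # -> 3
    ```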

  • @dzbuzzfeed908
    @dzbuzzfeed908 A month ago +4

    1. **Current State of LM Benchmarks**
    *Timestamp: 0:00:00*
    2. **Benchmark Performance Issues**
    *Timestamp: 0:00:03*
    3. **Implications of Reordering Questions**
    *Timestamp: 0:00:20*
    4. **Alice in Wonderland Paper Findings**
    *Timestamp: 0:01:57*
    5. **The Concept of Grokking**
    *Timestamp: 0:04:27*
    6. **Grokking vs. Double Descent**
    *Timestamp: 0:06:06*
    7. **Grokking in Transformers and New Research**
    *Timestamp: 0:08:12*
    8. **Potential Solutions for Improved Reasoning**
    *Timestamp: 0:09:02*
    9. **Grokfast Implementation**
    *Timestamp: 0:11:28*
    ### Ads
    1. **HubSpot AI Resources**
    *Timestamp: 0:01:13*
    ### Funny Jokes
    1. **"Absolute dog water"**
    *Timestamp: 0:00:03*
    2. **"Kind of crazy from a more cynical and critical perspective"**
    *Timestamp: 0:00:33*
    3. **"Can you imagine an AI being able to do this? Only humans would be able to come up with something this random and absurdly funny"**
    *Timestamp: 0:03:13*
    4. **"If an AI can truly do this it actually might be so over for us, so for the sake of burning down rainforests"**
    *Timestamp: 0:03:23*
    5. **"Elon's Grok LM is probably named after the book and not related to the ML concept that we are talking about today"**
    *Timestamp: 0:05:43*
    6. **"Mr. Zuck saying that Llama 3 70B never stopped learning even after they trained it three or four times past the chinchilla optimum is not copium"**
    *Timestamp: 0:10:03*

  • @kazzear_
    @kazzear_ A month ago +12

    No shit! I've literally made code for this idea!!! I can't believe someone was working on this like I was; I didn't even know what grokking was.

    • @beatsandstuff
      @beatsandstuff A month ago +23

      Whenever you are making something, remember: there's always an Asian kid doing it way better than you.

    • @kazzear_
      @kazzear_ A month ago +3

      @@beatsandstuff sad truth

    • @AB-wf8ek
      @AB-wf8ek A month ago +2

      Synchronicity as a result of morphic resonance

    • @pneumonoultramicroscopicsi4065
      @pneumonoultramicroscopicsi4065 A month ago +2

      @@beatsandstuff I don't think it's an Asian "kid", but okay

  • @Interpause
    @Interpause A month ago +5

    Wait, if I'm understanding Grokfast correctly, they are attempting to predict the rate of change of the weights at any given moment using a Fourier transform?
    That's insane; that has way more use cases for other neural network architectures outside of just transformers.

    • @ChanhDucTuong
      @ChanhDucTuong 29 days ago

      I don't understand most of what you and this video said, but may I ask one question: will grokking work with Stable Diffusion training? Normally I only need 3,000-5,000 steps to train the model to draw my face perfectly; what if I train it for 200,000 steps? Before this video I'd have thought that nothing would happen, but now I'm not sure.

  • @Alice_Fumo
    @Alice_Fumo A month ago +5

    My mind is blown by the filtering of high-frequency parameter changes, leaving the low-frequency ones and using them to achieve grokking a lot faster. What an amazing idea. Though naively I would think that would require plotting the values of every single parameter over time, which would be way too memory-intensive to be feasible for large models.
    Hmm... I guess they can keep track of the average amount of time/steps between parameter-update direction changes for every parameter, which should give us the frequency?
    It's also possible I'm fundamentally misunderstanding something, in which case someone please explain where my thinking is failing.
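    For what it's worth, the paper's EMA variant seems to sidestep exactly this memory problem: rather than storing each parameter's full trajectory, it keeps one exponential moving average per parameter as a cheap low-pass filter and adds the amplified slow component back onto the raw gradient. A rough sketch of that idea (my reading of the paper, not the authors' exact code; alpha and lamb are illustrative):
    ```
    import torch

    def grokfast_ema_step(model, ema, alpha=0.98, lamb=2.0):
        """Call between loss.backward() and optimizer.step()."""
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            if name not in ema:
                ema[name] = torch.zeros_like(p.grad)
            # Low-pass filter: only the slow-moving gradient component
            # survives the exponential moving average.
            ema[name].mul_(alpha).add_(p.grad, alpha=1 - alpha)
            # Amplify the slow component on top of the raw gradient.
            p.grad.add_(ema[name], alpha=lamb)

    # Usage in a normal training loop (one EMA buffer per parameter, so the
    # memory overhead is just one extra copy of the gradients):
    # ema = {}
    # loss.backward(); grokfast_ema_step(model, ema); optimizer.step()
    ```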

    • @tiagotiagot
      @tiagotiagot A month ago +2

      I guess it would probably work to do something like NewWeights = (OldWeights * MixBias) + ((OldWeights + RawChange) * (1.0 - MixBias)), with MixBias at some value close to but below 1.0.
      And perhaps a sort of momentum mechanism with some level of drag could be added on top of that, to take care of the low-frequency motions having lower amplitude while at the same time avoiding overshooting the target too much; maybe even a little machine learning model that learns on the job to adjust the drag intensity based on how big the improvement (or worsening) of the big model's scores has been after each iteration (who knows, maybe even something simpler like an auto-tuning PID algorithm might already suffice).

  • @ThomasTomiczek
    @ThomasTomiczek A month ago +7

    I do not think CoT and grokking are mutually exclusive ;) i.e. you can grok a model and still use explicit verbalisations.

  • @briangman3
    @briangman3 A month ago +4

    They need benchmark testing that has variations in question inputs and randomization of answer choices
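    A rough sketch of what that could look like (hypothetical code; ask_model is a placeholder for whatever returns the model's answer letter): score each question over several shuffles of its options, so answer-position memorization can't inflate accuracy.
    ```
    import random

    def evaluate_with_shuffles(question, options, correct, ask_model, n_shuffles=4):
        letters = "ABCD"[: len(options)]
        hits = 0
        for _ in range(n_shuffles):
            shuffled = random.sample(options, k=len(options))  # random order
            prompt = question + "\n" + "\n".join(
                f"{l}. {o}" for l, o in zip(letters, shuffled)
            )
            answer = ask_model(prompt)  # expected to return a letter, e.g. "B"
            if answer in letters and shuffled[letters.index(answer)] == correct:
                hits += 1
        return hits / n_shuffles
    ```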

    • @Rockyzach88
      @Rockyzach88 A month ago +2

      I'm sure the people who are actually passionate about building these things are doing all the things.

    • @njokedestay7704
      @njokedestay7704 A month ago +1

      I think that should be the GSM1K benchmark from Scale AI

    • @briangman3
      @briangman3 A month ago

      @@njokedestay7704 I will look into it

  • @iantimmis651
    @iantimmis651 A month ago +10

    Important to remember that Chinchilla compute-optimal is not inference-optimal

  • @13579Josiah
    @13579Josiah A month ago

    One of the best videos on AI I've seen in a while! No hype. All facts. Well explained while not shying away from complex topics. Beautiful explanation of Grokfast. You just earned yourself a sub!

  • @j.j.maverick9252
    @j.j.maverick9252 A month ago +2

    Interesting graph for the LLM learning curve: up, then down, then up again. Looks eerily similar to Dunning-Kruger

  • @user-kk5wo9zw1e
    @user-kk5wo9zw1e 24 days ago

    lol my model also experiences this, and it's quite clear: the loss goes down while accuracy gets worse, and the loss grows while accuracy improves :) thanks!

  • @RedOneM
    @RedOneM A month ago +3

    I think 1.8T parameters grokked with „understanding“ of all logical states will become AGI.
    This kind of power will turbo-accelerate AI tech, since it can begin doing research itself.

  • @Koroistro
    @Koroistro A month ago +3

    Imo there's a need for research into how to decouple the model from the information, at least to some degree.
    A big problem with current LLMs is that they are *too good* at learning.
    Yes, they learn too well; they don't need to think, they just learn the thing.
    Reasoning is a way to shift the burden from memory to computation. If you have tons of storage space you're going to use it; if you have very little storage space you're going to be forced to compress as much as possible.
    If you think about it, fewer parameters are easier to overfit than many parameters.

  • @shApYT
    @shApYT A month ago +3

    Has any model proven real generalisation on any out-of-domain task? Even one task?

  • @raspberryjam
    @raspberryjam A month ago +2

    I'm thinking: if you imagine a model divided into two halves, where one is the generalization part and the other is the overfitting part, it's still most beneficial to have the generalization half get as close to the right answer as possible, so as to lighten the load on the overfitting half. Or, put another way: you should devote as many parameters as you can to memorizing the corrections to the most-wrong answers, and you can do that by minimizing the number of parameters needed to get to what is generally a fairly close answer.

  • @alexanderbrown-dg3sy
    @alexanderbrown-dg3sy A month ago +24

    I've been saying this for a year. Other researchers keep foolishly positioning grokking as a weird training artifact without practical value, when there is literally research to the contrary, yet they still see no value lol. Almost like common sense to me. Imagine going through school with no context, no homework, no tutoring, and producing current SOTA LM benchmarks. The fact LMs can do this with severely oblique data makes the answer clear. Hybrid data. Increasing data density with inferred facts. Remember, reasoning is basically syntactic transformation. Reformulating samples using formal semantics for native symbolic reasoning is the answer. Clear as day. Also fixing PE to solve the reversal curse. All you need.
    As someone who has trained a smaller model at 12k tokens per parameter without any real saturation: models first off should be way smaller. Then focus on hybrid data. AGI will be compact, in my personal opinion. For instance, I believe a 10B model can exceed GPT-4 using the formula I described above, since imo it should be trained on 100T tokens lol. Models are vastly overparameterized and it's so idiotic to me. Brilliant engineers, but their first principles are wrong.
    Grokfast is super important, but you have to modify the code to work with larger models. FYI, deeper layers want to grok more than the toy models seen in research.

    • @TheSonOfDumb
      @TheSonOfDumb A month ago

      My apologies, but your comment and profile picture are highly incongruous.

    • @alexanderbrown-dg3sy
      @alexanderbrown-dg3sy A month ago +12

      @@TheSonOfDumb lol bro come on, it's 2024. Gifted minds exist within all communities. Is it because I'm pretty, or rather because I'm black? Stay blessed though. You hurt my feelings, I won't lie lol.

    • @mirek190
      @mirek190 A month ago +2

      have you seen gemma 2 2b? ... that model is so small and still multilingual, has quite strong reasoning, and knows math ... crazy

    • @alexanderbrown-dg3sy
      @alexanderbrown-dg3sy A month ago +5

      @@mirek190 yes, it is impressive bro. I still feel we haven't hit a ceiling with sub-10B models.

    • @strangelaw6384
      @strangelaw6384 A month ago +2

      @@TheSonOfDumb you don't have to write bait replies to your own comments to attract attention, if you're confident in what you wrote (which you should be).
      By the way, the fact that you brought up "homework" and "tutoring" makes me wonder if the training set can be designed to model actual academic learning materials with student-centered teaching strategies.

  • @Neomadra
    @Neomadra 26 days ago

    Another issue for grokking is that reasoning is not a single method that can be learned and applied to everything. It is many different methods, and I guess grokking on one skill and then on another will lead to forgetting of the previously grokked skill. I think one would need some freeze mechanism that locks up some weights after grokking has been achieved.

  • @Omar-bi9zn
    @Omar-bi9zn A month ago +4

    Great! Thanks for shedding more light on Grokfast!

  • @seanwu3006
    @seanwu3006 20 days ago +2

    They say the Buddha sat under a tree for 49 days and "grokked".

  • @zyansheep
    @zyansheep A month ago +1

    I looked through your videos and saw I had watched literally every one but didn't subscribe lol. I'm subscribed now!

  • @GodbornNoven
    @GodbornNoven A month ago +2

    You don't know how right you are 😂
    Grokking will be a super important step toward AGI. Essentially, you're training a model on data so much that it practically becomes an expert at it. At some point we will have the quantity of compute necessary to achieve this, and at that point, might as well take the road of brute force.
    Naturally, algorithmic breakthroughs are incredibly important and essential to the improvement of LLMs, as they allow us to do more with less.

  • @anirudh514
    @anirudh514 A month ago +7

    I'm a regular follower; your videos are amazing!

  • @spaceadv6060
    @spaceadv6060 A month ago +2

    One of my favorite videos so far! Thanks again.

  • @mikairu2944
    @mikairu2944 A month ago +1

    lmao the ending gave me whiplash. It is true, we're yearning for reasoning AIs to be a thing, but that very thing is the breaking point where a lot of us get thrown out the window.

    • @brexitgreens
      @brexitgreens A month ago

      Your self-preservation instinct will be your downfall. 🤖

  • @perelmanych
    @perelmanych 25 days ago

    I am a big fan of the Llama-3-70b model, but the fact that it achieves 0.049 on simple AIW questions tells me that it is mostly memorization of MMLU, rather than generalization, that gives rise to those results. Why doesn't it fail as badly on AIW+ questions? Simply because it has seen much more data; remember that we are talking about a staggering 15T tokens of training data here.

  • @jp.girardi
    @jp.girardi A month ago

    I struggle to comprehend how this process doesn't result in more hallucinations through syllogistic reasoning, given that the 'generalization' seems to be derived precisely from this inherent syllogism.

  • @Jandodev
    @Jandodev A month ago

    I recently also found a novel approach for improving cognition based on token obfuscation. We're finding that there is a missed interoperability/comprehension step when models are processing data outside of English!

  • @danielsan901998
    @danielsan901998 A month ago +1

    I am not surprised by the failure of LLMs to do basic reasoning on problems that involve numbers; it is already known that language models don't understand basic math. The most successful strategy is, instead of asking the LLM to solve the problem, to translate the problem into a more explicit definition. That's how Google managed to solve some Mathematical Olympiad questions, by translating them to Lean, with the advantage that you can verify the answer automatically and reject unverifiable proofs. Another alternative is asking the model to solve the problem using a programming language; since the Python dataset is larger than the Lean dataset, it is easier to train a model or use a pretrained one.
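    As a toy illustration of that last strategy (hypothetical; the model would be prompted to emit a small program like this instead of a bare answer, and we execute it to check):
    ```
    # The kind of program an LLM could emit for a counting question it would
    # otherwise guess at: "How many r's are in 'strawberry', and how many
    # words are in the sentence below?"
    sentence = "There are three r's hiding in the word strawberry."
    r_count = "strawberry".count("r")
    word_count = len(sentence.split())
    print(r_count, word_count)  # -> 3 9
    ```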

    • @MimOzanTamamogullar
      @MimOzanTamamogullar A month ago +1

      I've been wondering if we could do something similar with spatial reasoning. Could the model build an internal model of the world by using a 3D simulation of some sort? Like the physics engines in engineering software, its internal model would have a physics engine. When you ask it a question, it could run a simulation inside its head.

    • @brexitgreens
      @brexitgreens A month ago

      ​@@MimOzanTamamogullar Rumour is that's GPT-5.

  • @fateriddle14
    @fateriddle14 23 days ago

    Thanks for the content. I've got a question: for now, what every LLM does is "given the input words, what are the most likely words to follow?" But it's pretty clear that's not how human thinking works at all; we answer a question based on our understanding, not by guessing the most likely answer other people in the world would give. It's a completely different model. So I fail to see how LLMs can reach true abstraction/generalization when the whole model is just rearranging the existing answers online.

  • @mAny_oThERSs
    @mAny_oThERSs A month ago +2

    thanks for the shoutout

  • @copywright5635
    @copywright5635 A month ago +2

    This seems... oddly human. Does anyone else agree? It's weird that repetition is something both humans and AI greatly benefit from.

  • @-weedle
    @-weedle A month ago +1

    Love the multiple videos the past few days, but please take your time with the videos, quality over quantity.

  • @xuko6792
    @xuko6792 A month ago

    4:48 - if there ever is one, this is the pivot point. Unless it is somehow possible to pick subsets of input data for the model to grok on without corrupting it, GIGO is exactly what we'd get.

  • @chromaflow9313
    @chromaflow9313 A month ago

    This is incredibly helpful. Thank you.

  • @clearandsweet
    @clearandsweet A month ago +2

    I love this because it's exactly the same as how human beings learn.
    Also very excited for that paper mentioned at the end. This type of generalization is a big smoking gun that will lead to AGI very quickly, so speeding up the grok is incredibly hype news.

  • @eyeofthetiger7
    @eyeofthetiger7 25 days ago +1

    The missing piece is plasticity. AIs won't ever be able to reason without it. A static model won't ever be able to reason.

  • @user255
    @user255 A month ago

    Thanks for the references!
    I have said so many times that these results *must* be fakes, because in practical use LLMs absolutely suck (excluding citing documentation and correcting grammar). They are just completely unable to do any thinking.

  • @me_hanics
    @me_hanics A month ago

    Most major LLM builders are grokking right now - you can check: people are being hired to create and annotate logic-based exercises for training GPT. We've already seen what scale, and thus grokking, is capable of: it is indeed hard to ask a model that has seen all corners of the internet something new that hasn't been asked before - well, at least for prior-knowledge questions.
    On the other hand, we also see that we can just take some very large number and ask if it is even, or count the number of words/letters in a sentence, and we'll see how it fails, as these are completely new sentences not seen in training, where the logic behind the sentence matters. These failures won't disappear with any scale.
    If someone finds a key breakthrough for "generalization" or reasoning or whatever, which would clearly be well anticipated, it won't come from grokking though.
    Also, I think generalization has become too general a term in AI; the main thing we need to solve for generalization is simply abstraction. If a model can abstract one situation down into another, that is already a huge generalization. Moreover, we could skip a ton of training, which would enable much smaller models (no need for 20 different instances of the same thing with different wording to make the model robust).

  • @320770471
    @320770471 A month ago +1

    This channel is worth watching just for the memes even if you have no clue what the heck he is talking about

  • @williamliu4477
    @williamliu4477 A month ago

    Pumping out videos like a madman 🫡

  •  A month ago

    Realistically, it could be that the training implicitly learns the test data:
    1. Train -> fail
    2. Reuse best model -> fail
    3. Reuse best model -> accidentally better
    etc...
    Another possibility is that you need some degree of overfitting with text data. Who was the 44th president of the US? Is it an average of the 43rd and 45th? Not really (I know Obama served twice, but that's not the point). You need to learn specific facts from the texts, weigh those facts higher than other random texts, and you end up being better at next-token prediction. If "objects break when they hit the ground" is weighed more as text than "T-shirts are always white", then you can train the next layer with an approximate physical rule and not a random guess.

  • @AlvinYap510
    @AlvinYap510 23 days ago

    "Alice and Daniel are siblings. Alice has 3 sisters and Daniel has 4 brothers.
    How many brothers does Alice have?"
    This question just f**ked Claude 3.5 and GPT-4o
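    (For reference, the intended answer: the siblings are all in one family, so Alice's brothers are Daniel plus Daniel's 4 brothers, i.e. 5, assuming Daniel is male.)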

  • @iloveblender8999
    @iloveblender8999 A month ago

    I find it interesting how fast AI research seems to be going. Maybe this is like nuclear fusion reactors, but real and more like 10 years away?

  • @rosendorodriguez7256
    @rosendorodriguez7256 A month ago

    My company, AI Nexus, has a model that can grok consistently with low computation and low resources.

  • @anonymouscommentator
    @anonymouscommentator A month ago

    I always love your videos, they are so interesting! Thank you very much!

  • @envynoir
    @envynoir A month ago +1

    edging an ML model is crazy

  • @SuperSmashDolls
    @SuperSmashDolls A month ago

    So, the way I've understood grokking is that when you train an AI model, you also have a regularization step, which pulls weights toward zero. And by grokking you're giving that regularization step a LOT of opportunities to prune weights and patterns that aren't contributing to model performance. Because, remember, the first thing we do with an AI model is initialize all the weights to random values, so there are going to be a lot of patterns that don't actually mean anything but happen to score well enough on test output not to be overwritten by a normal gradient update.
    The Grokfast paper seems to imply this explanation is totally wrong and that grokking is just a fundamental property of gradient-descent backprop. Or is regularization just so commonplace that it's assumed and nobody calls it out?

  • @disonaroaurelo
    @disonaroaurelo 27 days ago

    An AGI doesn't need to be conscious and alive, if you notice. An artificial general consciousness is something far above an artificial general intelligence. Each person can have and develop their own neural network. And we can have an AI that copies the human brain in its functions, integrating this machine with an LLM and a GPU for processing objective data operations. We already have the first steps; it is a matter of three to five years until we have the first models of AGI. But consciousness, and being a living being, is still a long way off.

  • @brekol9545
    @brekol9545 A month ago +56

    reasoning is still terrible

    • @JorgetePanete
      @JorgetePanete A month ago +18

      In humans and in AI.

    • @dioscarlet
      @dioscarlet A month ago +6

      Yeah, GPT-4o is really weak

    • @onlyms4693
      @onlyms4693 A month ago +4

      Agree. I gave GPT-4o a math puzzle that is easy because it's just adding up numbers based on a pattern, but not with the true symbols.
      It failed when I didn't explain the concept of how the puzzle works, but it succeeded when I explained it. So yeah, they need a way to make reasoning better in those LLMs.

    • @w花b
      @w花b A month ago +7

      ​@@JorgetePanete Speak for yourself... Especially when you're writing this from a device that's the result of human reasoning...

    • @adamgibbons4262
      @adamgibbons4262 A month ago +6

      AlphaProof and AlphaGeometry just won silver in the Math Olympiad

  • @krepxl
    @krepxl 25 days ago

    I'm so confused because there are so many technical terms here.
    bycloud, can you make a long video from scratch explaining these topics simply? Or can somebody in the comments tell me how to learn these terms and concepts myself (I have no CS experience, etc.)?

  • @alienwhitewalker7284
    @alienwhitewalker7284 A month ago

    If we overfit it, doesn't it respond with what we would like to see and hear rather than what we should hear?

  • @keypey8256
    @keypey8256 A month ago +1

    I think we need to do more adversarial training

  • @Dygit
    @Dygit A month ago +1

    These videos are so good

  • @drj-ai
    @drj-ai A month ago

    Claude 3.5 and Mistral Large 2 (enough) both pass the Alice in Wonderland Test (three tests each with variations of numbers and genders).

  • @YashvirD
    @YashvirD 27 days ago

    "All models are wrong but some are useful" but in the AI chatbots context

  • @jondo7680
    @jondo7680 A month ago

    I'm always under the impression that models are undertrained and more training = better models. Architectural changes and everything else just make the training or inference more efficient. Even smaller models could be trained to be much smarter, but it would require much more training.

  • @koktszfung
    @koktszfung A month ago

    nice video, very clear

  • @Ori-lp2fm
    @Ori-lp2fm 23 days ago

    Humans can imagine images, while AI models predict the next letter.
    Meaning, we can imagine images and convert them to code / language / songs

  • @cube7284
    @cube7284 29 days ago

    One of the best AI channels

  • @captaindryvids6909
    @captaindryvids6909 A month ago

    Cool idea, not sure if it's feasible though when scaled up 🤔

  • @scrollop
    @scrollop A month ago

    Can you add transcripts, so that we can feed an LLM the transcript and ask it questions to understand the jargon and concepts? I'm serious. Great video, though for those who don't understand the various terms this would be very useful!

  • @mcombatti
    @mcombatti A month ago

    There are libraries to invoke grokking from the first epoch onward now

  • @telotawa
    @telotawa A month ago

    omg they put a low pass filter on it to make it grok faster? that's nuts

  • @OnigoroshiZero
    @OnigoroshiZero A month ago

    And it just so happens that Meta has prepared 10x the compute to train Llama 4 compared to Llama 3...

  • @jeanchindeko5477
    @jeanchindeko5477 A month ago

    Thanks so much for that video

  • @dysfunc121
    @dysfunc121 A month ago

    Interesting to hear that "grok" has taken on a new life. Hackers have been using "grok" for nearly as long as the book that coined it has existed.

  • @dpan
    @dpan A month ago

    “Why does that look familiar?” “OH, I WROTE THAT.” More research on “MMLU bad” coming soon :D

  • @norlesh
    @norlesh A month ago

    We need a foundation model that has been trained until it groks a children's primary-school syllabus before it ever sees an equation or chemical formula.

  • @SKGFindings
    @SKGFindings A month ago

    The real question is, when will we draft an artificial intelligence bill of rights? What will that consist of? And who will get to decide that?

  • @cefcephatus
    @cefcephatus 26 days ago

    I already gave up on catching up with AI. Knowing someone translates feedback into a signal is impressive. Another bingo square ticked.

  • @harshwardhan8771
    @harshwardhan8771 A month ago

    mixtral-8x7b-it
    What does the 'it' mean here?
    Context: 0:31
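    (For reference: the "it" suffix in names like mixtral-8x7b-it stands for "instruction-tuned", i.e. the instruct variant of the base model.)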

  • @themultiverse5447
    @themultiverse5447 A month ago

    This video is not for me, but I wanted to comment to get your channel more views. The editing is on par :)

  • @BYZNIZ
    @BYZNIZ A month ago

    Great video, shout-out to Jerry M for recommending the channel

  • @Benutzername0000
    @Benutzername0000 A month ago

    dang i thought this was a fireship video

  • @DivineMisterAdVentures
    @DivineMisterAdVentures 25 days ago

    I see!! These companies are not altruistic, and the researchers and developers are quite aware that the GOLDEN GOOSE is an ILLUSION. Elon himself said something to this effect last week, while arguing that Grok 3 could ("could") leapfrog all others... or not. He basically said AI is overrated and the fear is hype. How can a mastermind not really have one? I also believe the standardized test scores are easy to fake, and that these AIs resort to making things up whenever they are internally confused, like HAL "shepherding" the crew of the spacecraft Discovery One in _2001_.

  • @mirek190
    @mirek190 A month ago

    I wonder how well llama 3.1 70b, gemma 2 27b, or opus 3.5 would pass that test...

  • @ElaraArale
    @ElaraArale A month ago +1

    It's grokking time!

  • @Codewello
    @Codewello A month ago

    Don't trust any model unless you test it yourself. Benchmarks right now don't mean that much.

  • @TheInfectous
    @TheInfectous A month ago +1

    I wonder how long it will take before the pattern-recognition-and-replication machines are thought of as pattern-recognition-and-replication machines instead of magic. Magic certainly sells better, I guess, though it does come with a steep crash in the future.

  • @AnnonymousPrime-ks4uf
    @AnnonymousPrime-ks4uf A month ago

    People don't even realize what they are aiming for, considering that they have already made it clear what they want to do with it, and they already have kill bots...

  • @mikemaldanado6015
    @mikemaldanado6015 23 days ago

    Llama is open source; why can't they just look at the code to see if it is using grokking? I find it absurd that LLM companies thought more data was the solution. Having it learn all the probabilistic outcomes of our world through memorization is astounding to me. Are these the people in charge? Really????????????

  • @niklase5901
    @niklase5901 A month ago

    Great video!

  • @TimothyChakwera
    @TimothyChakwera A month ago +1

    I knew FFT was the way to go

  • @BGP00
    @BGP00 A month ago

    No way they used a Fourier transform to speed up gradient descent. Has this been used before? Sounds like it would be useful in all of ML.

  • @Djplax11
    @Djplax11 A month ago

    Just saying, that is a lot like the Dunning-Kruger curve.

  • @sarahlynn7807
    @sarahlynn7807 A month ago

    Overfitting seems like an extra-opaque form of reasoning.

  • @Macorelppa
    @Macorelppa A month ago +1

    Man your consistency is inhuman!

  • @jankram9408
    @jankram9408 A month ago +3

    I am sorry, but "grokking" just sounds like a brain-rot term...

    • @Deagan
      @Deagan A month ago +2

      we goonin && grokkin

  • @casualuser5527
    @casualuser5527 A month ago

    Fireship thumbnail bruh. Got baited 😂