Why "Grokking" AI Would Be A Key To AGI

  • Published: 21 Nov 2024
  • Check out HubSpot's Free ChatGPT resource to power up your work efficiency🔥: clickhubspot.c...
    Check out my newsletter:
    mail.bycloud.ai
    Are We Done With MMLU?
    [Paper] arxiv.org/abs/...
    Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
    [Paper] arxiv.org/abs/...
    Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
    [Paper] arxiv.org/abs/...
    Grokfast: Accelerated Grokking by Amplifying Slow Gradients
    [Paper] arxiv.org/abs/...
    [Code] github.com/iro...
    This video is supported by the kind Patrons & YouTube Members:
    🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Robert Zawiasa, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi, Hector, Drexon, Claxvii 177th, Inferencer, Michael Brenner, Akkusativ, Oleg Wock, FantomBloth
    [Discord] / discord
    [Twitter] / bycloudai
    [Patreon] / bycloud
    [Music] massobeats - noon
    [Profile & Banner Art] / pygm7
    [Video Editor] Silas

Comments • 297

  • @bycloudAI · 3 months ago +14

    Check out HubSpot's Free ChatGPT resource to power up your work efficiency🔥clickhubspot.com/hyx

    • @Scoring57 · 3 months ago

      In a way isn't this a 'good' thing? That these AIs are so 'bad' but still so useful? It feels to me that that still makes *Transformers / GPTs* the *best*, most safe candidates for "AI". All they need to do is further engineer it and make it less of a black box to gain more control, create alignment, and from there you have a powerful yet *LIMITED* AI that can't actually think, so you don't have to worry about it plotting or doing something you didn't intend. You can even arbitrarily limit its intelligence to whatever you like while using it as a facts repository and make use of its pseudo/simulated reasoning.
      This also means this world-breaking, world-altering "singularity" *isn't* inevitable, and we might be able to evolve alongside AI in a more controlled way as we learn and mature with the technology and hopefully actually get to master it. Which is favorable, because right now we're barely figuring it out as we go and it's an extremely powerful tool that can do a lot of harm.
      Also how did this comment post 2 days ago and this video went up just a day ago??

  • @FireFox64000000 · 3 months ago +487

    So essentially Grokking is just making the AI bash its head into a concept over and over and over again until it finally understands it. Guys, I think I've been grokking for years.

    • @ctwolf · 3 months ago +25

      Same friend. Same.

    • @tiagotiagot · 3 months ago +12

      Grokking is the phase-transition, the point where the other shoe finally drops

    • @Rockyzach88 · 3 months ago +8

      Percussive learning/maintenance - like hitting your tv to make the picture stable

    • @shukrantpatil · 3 months ago +3

      we are adding the human "essence" to it now

    • @user-cg7gd5pw5b · 3 months ago +6

      Still waiting for the 'understanding' part...

  • @dhillaz · 3 months ago +190

    New pitch to investors: "Just 10x more GPUs bro, 10x GPUs to push past this stagnant validation and we will reach grokking, I promise"

    • @Napert · 3 months ago +47

      99% of LLM training stops right before achieving grokking

    • @thatonedude6596 · 3 months ago

      @@Napert full circle

    • @petrkinkal1509 · 1 month ago

      The more you buy the more you grokk.

  • @manuelburghartz5263 · 3 months ago +129

    Putting out 3 videos in 4 days that are so complex and labor intensive is crazy. Love the vids man, keep up the grind

    • @bycloudAI · 3 months ago +68

      watch me release the next video next month LOL

    • @Ketamineprod · 3 months ago

      @@bycloudAI u r awesome bro, take ur time and drop whenever u feel like it!

    • @markdatton1348 · 3 months ago +18

      @@bycloudAI just enough time for us to GROK these vids

    • @HingalshDealer · 3 months ago +10

      @@markdatton1348 bro grokked the grok term

    • @dcaban85 · 3 months ago

      probably ai generated

  • @blisphul8084 · 3 months ago +149

    So basically, it's like how if we read something 1 or 2 times, we might remember the answer, but if we read it or run it through our head 20x, we are more likely to fully understand the logic behind what we're reading. Given that higher parameter LLMs are more expensive to do this with, I wonder if small models will actually be the most capable in the long run.

    • @anthony4223 · 3 months ago +8

      I kinda agree; maybe smaller and slightly more specialized models, for things like medical and such, would be more popular in the long-ish run.

    • @bloopbleepnothinghere · 3 months ago +6

      There are papers showing that as you scale up the volume of general information you feed an LLM, the quality of the responses it puts out goes down. Specialized, smaller models talking together is a viable path. But if you expect an LLM to reason and "comprehend" rather than infer from memorized data, I don't think we will see that from LLMs as we understand them today.

    • @mirek190 · 3 months ago +2

      @@bloopbleepnothinghere have you seen the gemma 2 2b .... that model is so small and still multilingual and has quite strong reasoning and knows math ... crazy

    • @blisphul8084 · 3 months ago +3

      @@mirek190 the fact that it's fluent in Japanese is insane. And it runs reasonably fast on even a 5 year old laptop CPU.

    • @a_soulspark · 3 months ago

      I'd point out that in the Alice in Wonderland paper he showed, the table (10:19) shows Llama 3 70B got the 3rd-best performance while Llama 3 8B got 0% right.
      I'd argue that grokking requires more parameters to store the more nuanced information... but the idea of

  • @cdkw2 · 3 months ago +60

    I love how I have to see your videos 3 times to understand them, kinda like when I was first starting out with calculus!

    • @blisphul8084 · 3 months ago +23

      This is exactly what grokking is. See something enough times and you'll understand it on a more fundamental level.

    • @cdkw2 · 3 months ago +16

      @@blisphul8084 The fact that I didn't even realize that I did this...

    • @BrainSlugs83 · 3 months ago +8

      Incredibly ironic given the topic of overfitting training. 😅

    • @SumitRana-life314 · 3 months ago +5

      Bro is Raw Grokking ByCloud videos.

    • @jonathanberry1111 · 3 months ago

      @@cdkw2 You hadn't yet watched it 3 times! I, being smarter (read, older, wasting more time on YT and overthinking) got it right away. Look, this is how INTJ's and INTP's are more intelligent (IQ), more times thinking thinking thinking!

  • @JazevoAudiosurf · 3 months ago +39

    this fourier transform filter thing is just nuts. when i see stuff like this or alphaproof or PRMs i can't imagine we wouldn't reach huge intelligence leaps beyond AGI in the next 5 years. i mean all it takes is a simple mathematical trick, this is the level of infancy that AI is currently in. i mean you look at other fields of science like fucking material science and there even in the 60s just to figure out materials for LEDs, they would go through an order of magnitude more struggle than for the simple AI breakthroughs of this year's papers. or look at physics, space, semiconductors. and AI on a software level is so much easier to experiment with than those things

    • @seamon9732 · 3 months ago +2

      That's assuming 2 things:
      1- That we have enough parameters to simulate the equivalent number of synapses/connections in a brain (100 to 1,000 trillion).
      2- That the recent research into microtubules doesn't mean that they are also involved in processing/reasoning. If that is the case, and there are hundreds to a thousand microtubules per axon (the transmitting part of a synapse) and a bit fewer in dendrites (the receiving part), then you have to multiply the above trillions some more.

    • @mirek190 · 3 months ago +6

      @@seamon9732 Our brain has around 100 trillion connections... true... BUT for thinking we are using only 20% of them... the rest is used for keeping our body alive.

    • @AfifFarhati · 3 months ago +1

      Yeah, but sometimes simple tricks can take years or decades to discover...

    • @BlueKanary · 3 months ago

      @@mirek190 Is this 10% myth really still floating around? Even if "Only 20%" is for cognitive work, keeping the body alive is no joke. Just keeping a stable heartbeat would take a good portion of dedicated brainpower

    • @cajampa · 2 months ago

      @@BlueKanary Why? Don't brain-dead people have a heartbeat?
      It is a pretty low-level function, with built-in automatic electric pulse generator cells right there in the heart muscle, driven by lower functions in the brain stem.
      And even if those functions in the brain stem are gone, as long as a respirator is supplied the heart can function on its own.

  • @PYETech · 3 months ago +11

    One of the best channels out there that has real value to everyone. You NEED a raise!

  • @-mwolf · 3 months ago +16

    Grokking is an instance of the minimum description length principle.
    If you have a problem, you can just memorize a point-wise input to output mapping.
    This has zero generalization.
    But from there, you can keep pruning your mapping, making it simpler, a.k.a. more compressed.
    The program that generalizes the best (while performing well on a training set), is the shortest.
    → Generalization is memorization + regularization.
    But this is of course still limited to in-distribution generalization.

    • @DanielSeacrest · 3 months ago

      People think memorisation or overfitting is a bad thing and we need to figure out how to prevent it in its entirety, but really it's a stepping stone onto the path of grokking and perfect generalisation.
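
    For the curious, here is a minimal sketch of that memorize-then-compress dynamic in the classic grokking setup (modular addition with heavy weight decay); the hyperparameters are illustrative, not taken from any paper:
    ```
    import torch
    import torch.nn as nn

    P = 97  # learn (a + b) mod P from half of all pairs, as in grokking experiments
    pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
    labels = (pairs[:, 0] + pairs[:, 1]) % P
    perm = torch.randperm(len(pairs))
    train, val = perm[: len(perm) // 2], perm[len(perm) // 2:]

    model = nn.Sequential(
        nn.Embedding(P, 128),  # embed each operand
        nn.Flatten(),          # (N, 2, 128) -> (N, 256)
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, P),
    )
    # strong weight decay is the regularization pressure that slowly
    # compresses the memorized mapping into a general circuit
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

    for step in range(100_000):
        loss = nn.functional.cross_entropy(model(pairs[train]), labels[train])
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 1000 == 0:
            with torch.no_grad():
                acc = (model(pairs[val]).argmax(-1) == labels[val]).float().mean()
            # typical curve: train loss hits ~0 early; val accuracy jumps much later
            print(f"step {step}: train loss {loss.item():.3f}, val acc {float(acc):.3f}")
    ```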

  • @13579Josiah · 3 months ago

    One of the best videos on AI I’ve seen in a while! No hype. All facts. Well explained while not shying away from complex topics. Beautiful explanation of fast Grok. You just earned yourself a sub!

  • @seanwu3006 · 3 months ago +4

    They say the Buddha sat under a tree for 49 days and "grokked".

  • @74Gee · 3 months ago +5

    I find that Alice in Wonderland-type responses can be significantly improved when system prompting the model to form data structures from the known data and then infer from that structure - something like this (a minimal version):
    ```
    You are tasked with solving complex relationship questions by first mapping all known facts into a JSON structure and then using this structure to infer answers. When given a question, follow these steps:
    1. Extract all given facts.
    2. Create a JSON structure to represent these facts.
    3. Use the JSON structure to navigate and infer answers.
    4. Provide clear and logically consistent responses based on the JSON file.
    ```
    I used this technique very successfully when working with gossip analysis and determining the source of gossip but quickly realized its benefits in other logical fields.
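
    A quick sketch of how one might wire that system prompt into a chat call (assuming an OpenAI-style Python client; the model name is a placeholder):
    ```
    from openai import OpenAI  # assumes the OpenAI Python client; any chat API works

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "You are tasked with solving complex relationship questions by first "
        "mapping all known facts into a JSON structure and then using this "
        "structure to infer answers. When given a question, follow these steps:\n"
        "1. Extract all given facts.\n"
        "2. Create a JSON structure to represent these facts.\n"
        "3. Use the JSON structure to navigate and infer answers.\n"
        "4. Provide clear and logically consistent responses based on the JSON file."
    )

    question = "Alice has 3 sisters and 2 brothers. How many sisters does Alice's brother have?"
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model you have access to
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
    ```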

  • @dzbuzzfeed908 · 3 months ago +4

    1. **Current State of LM Benchmarks** (0:00:00)
    2. **Benchmark Performance Issues** (0:00:03)
    3. **Implications of Reordering Questions** (0:00:20)
    4. **Alice in Wonderland Paper Findings** (0:01:57)
    5. **The Concept of Grokking** (0:04:27)
    6. **Grokking vs. Double Descent** (0:06:06)
    7. **Grokking in Transformers and New Research** (0:08:12)
    8. **Potential Solutions for Improved Reasoning** (0:09:02)
    9. **Grokfast Implementation** (0:11:28)
    ### Ads
    1. **HubSpot AI Resources** (0:01:13)
    ### Funny Jokes
    1. **"Absolute dog water"** (0:00:03)
    2. **"Kind of crazy from a more cynical and critical perspective"** (0:00:33)
    3. **"Can you imagine an AI being able to do this? Only humans would be able to come up with something this random and absurdly funny"** (0:03:13)
    4. **"If an AI can truly do this it actually might be so over for us, so for the sake of burning down rainforests..."** (0:03:23)
    5. **"Elon's Grok LM is probably named after the book and not related to the ML concept that we are talking about today"** (0:05:43)
    6. **"Mr. Zuck said that Llama 3 70B never stopped learning even after they trained it three or four times past the Chinchilla optimum; that is not copium"** (0:10:03)

  • @rubncarmona · 3 months ago +8

    Grokking is akin to the evolution of language even after a population has become fully literate. Every once in a while someone figures out a new connection between seemingly unrelated concepts, uses one of them in a new context by mistake or because they forgot the intended word, etc. This continuous increase in information entropy even after exhausting the parameter space reminds me a lot of what some scientists say about information in degenerate-era black holes.

  • @ytubeanon · 3 months ago +11

    Grokking our way to AGI...
    11:33 the Grokfast paper has boggling potential, so who's using it? when will we see its results?

    • @KeinNiemand · 2 months ago

      Probably nobody is using it

  • @kazzear_ · 3 months ago +12

    No shit! I've literally made code for this idea!!! I can't believe someone was working on this like I was; I didn't even know what grokking was.

    • @beatsandstuff · 3 months ago +22

      Whenever you are making something, remember, there's always an asian kid doing it way better than you.

    • @kazzear_ · 3 months ago +3

      @@beatsandstuff sad truth

    • @AB-wf8ek · 3 months ago +2

      Synchronicity as a result of morphic resonance

    • @pneumonoultramicroscopicsi4065 · 3 months ago +2

      ​@@beatsandstuff i don't think it's an asian "kid" but okay

    • @itsiwhatitsi · 20 days ago

      @@beatsandstuff or an AI

  • @Interpause · 3 months ago +5

    wait, if I'm understanding Grokfast correctly, they are attempting to predict the rate of change of weights at any given moment using a Fourier transform?
    that's insane, that has way more use cases for other neural network architectures outside of just transformers

    • @ChanhDucTuong · 3 months ago

      I don’t understand most of what you and this video said but may I ask 1 question: Will grokking work with Stable Diffusion training? Like normally I only need 3000-5000 steps to train the model to draw my face perfectly, what if I train it to 200000 steps? Before this video I’d thought that nothing will happen but now I’m not sure.

  • @j.j.maverick9252 · 3 months ago +2

    Interesting graph for the LLM learning curve: up, then down, then up again. Looks eerily similar to Dunning-Kruger.

  • @alexanderbrown-dg3sy · 3 months ago +24

    I’ve been saying this for a year. Other researchers keep foolishly positioning grokking as a weird training artifact without practical value. When there is literally research to the contrary, yet they still see no value lol. Almost like common sense to me. Imagine going through school with no context, no homework, no tutoring and producing current SOTA LM benchmarks. The fact LM can with severely oblique data makes the answer clear. Hybrid data. Increasing data density with inferred facts. Remember reasoning is basically syntactic transformation. Reformulating samples using formal semantics for native symbolic reasoning is the answer. Clear as day. Also fixing PE to solve the reversal curse. All you need.
    As someone who trained smaller model at 12k tokens per parameter without any real saturation. Models first off should be way smaller. Then focus on hybrid data. AGI will be compact in my personal opinion. For instance I believe a 10B model can exceed gpt4 using the formula I described above. Since imo it should be trained on 100T tokens lol. Models are vastly overparameterized and it’s so idiotic to me. Brilliant engineers but their first principles are wrong.
    Grokfast is super important but you have to modify the code to work with larger models. FYI deeper layer wanna grokk more than toy models seen in research.

    • @TheSonOfDumb · 3 months ago

      My apologies, but your comment and profile picture are highly incongruous.

    • @alexanderbrown-dg3sy · 3 months ago +12

      @@TheSonOfDumb lol bro come on, it's 2024. Gifted minds exist within all communities. Is it because I'm pretty, or rather because I'm black? Stay blessed though. You hurt my feelings, I won't lie lol.

    • @mirek190 · 3 months ago +2

      have you seen the gemma 2 2b .... that model is so small and still multilingual and has quite strong reasoning and knows math ... crazy

    • @alexanderbrown-dg3sy · 3 months ago +5

      @@mirek190 yes it is impressive bro. I still feel we haven’t hit a ceiling with sub-10B models.

    • @strangelaw6384 · 3 months ago +2

      @@TheSonOfDumb you don't have to write bait replies to your own comments to attract attention. If you're confident in what you wrote (which you should be).
      By the way, the fact that you brought up "homework" and "tutoring" makes me wonder if the training set can be designed to model actual academic learning materials with student-centered teaching strategies.

  • @spaceadv6060 · 3 months ago +2

    One of my favorite videos so far! Thanks again.

  • @mikairu2944 · 3 months ago +1

    lmao the ending gave me whiplash. It is true, we're yearning for reasoning AIs to be a thing, but that very thing is the breaking point where a lot of us get thrown out the window.

    • @brexitgreens · 3 months ago

      Your self-preservation instinct will be your downfall. 🤖

  • @Alice_Fumo · 3 months ago +5

    My mind is blown at the filtering of high-frequency parameter changes, leaving the low-frequency ones and using them to achieve grokking a lot faster. What an amazing idea. Though naively I would think that would require plotting the values of every single parameter over time which would be way too memory-intensive to be feasible for large models.
    Hmm.. I guess they can keep track of the average amount of time/steps between parameter update direction changes for every parameter, which should give us the frequency?
    It's also possible I'm fundamentally misunderstanding anything, in which case someone please explain where my thinking is failing.

    • @tiagotiagot · 3 months ago +2

      I guess it would probably work to do something like NewWeights = (OldWeights * MixBias) + ((OldWeights + RawChange) * (1.0 - MixBias)), with MixBias at some value close to but below 1.0.
      And maybe a sort of momentum mechanism with some level of drag could be added on top of that, to take care of the low-frequency motions being lower amplitude while at the same time avoiding overshooting the target too much; maybe even have a little machine-learning model that learns on the job to adjust the drag intensity based on how big the improvement (or worsening) of the big model's scores has been after each iteration (who knows, maybe even something simpler like an auto-tuning PID algorithm might already suffice).
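
    For reference, the released Grokfast code does not need to store full parameter trajectories: it keeps one exponential moving average per parameter (the slow, low-frequency component of the gradient) and amplifies it before each optimizer step. A minimal sketch of that EMA variant, simplified from the paper's description (defaults illustrative):
    ```
    import torch

    def gradfilter_ema(model, ema, alpha=0.98, lamb=2.0):
        """Low-pass filter the gradients: track an EMA per parameter (the slow
        component) and add an amplified copy of it back into the gradient."""
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            if name not in ema:
                ema[name] = torch.zeros_like(p.grad)
            ema[name] = alpha * ema[name] + (1 - alpha) * p.grad
            p.grad = p.grad + lamb * ema[name]
        return ema

    # usage inside a training loop (sketch):
    #   ema = {}
    #   loss.backward()
    #   ema = gradfilter_ema(model, ema)  # amplify slow gradients before stepping
    #   optimizer.step(); optimizer.zero_grad()
    ```
    So the extra memory cost is one gradient-sized buffer per parameter tensor, not a full history.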

  • @zyansheep · 3 months ago +1

    I looked through your videos and saw I had watched literally every one but didn't subscribe lol. I'm subscribed now!

  • @raspberryjam · 3 months ago +2

    I'm thinking: if you imagine a model divided into two halves, where one is the generalization part and the other is the overfitting part, it's still most beneficial to have the generalization half get as close to the right answer as possible, so as to lighten the load on the overfitting half. Or put another way: you should devote as many parameters as you can to memorizing the corrections to the wrongest answers, and you can do that by minimizing the number of parameters needed to get to a generally fairly close answer.

  • @ThomasTomiczek · 3 months ago +7

    I do not think that CoT and grokking are mutually exclusive ;) i.e. you can grok a model and still use explicit verbalisations.

  • @briangman3 · 3 months ago +4

    They need benchmark testing that has variations in question inputs and randomization of choices.

    • @Rockyzach88 · 3 months ago +2

      I'm sure the people who are actually passionate about building these things are doing all the things.

    • @njokedestay7704 · 3 months ago +1

      I think that should be the GSM1K benchmark from Scale AI

    • @briangman3 · 3 months ago

      @@njokedestay7704 I will look into it

  • @Koroistro · 3 months ago +3

    Imo there is a need for research on how to decouple the model from the information, at least to some degree.
    A big problem with current LLMs is that they are *too good* at learning.
    Yes, they learn too well; they don't need to think, they just learn the thing.
    Reasoning is a way to shift the burden from memory to computation. If you have tons of storage space you're going to use it; if you have very little storage space you're going to be forced to compress as much as possible.
    If you think about it, fewer parameters are easier to overfit than many parameters.

  • @iantimmis651 · 3 months ago +9

    Important to remember that Chinchilla compute-optimal is not inference-optimal.

  • @Jandodev · 3 months ago

    I recently also found a novel approach for improving cognition based on token obfuscations. We're finding that there is a missed interoperability comprehension step when models are processing data outside of English!

  • @RedOneM · 3 months ago +3

    I think 1.8T parameters grokked with "understanding" of all logical states will become AGI.
    This kind of power will turbo accelerate AI tech, since it can begin research itself.

  • @Omar-bi9zn · 3 months ago +4

    Great! Thanks for shedding more light on Grokfast!

  • @williamliu4477 · 3 months ago

    Pumping out videos like a madman 🫡

  • @GodbornNoven · 3 months ago +2

    You don't know how right you are 😂
    Grokking will be a super important step to AGI. Essentially, you're training a model on data so much it practically becomes an expert at it. At some point, we will achieve the quantity of compute necessary to achieve this. At that point, might as well take the road of brute force.
    Naturally, algorithmic breakthroughs are incredibly important and essential to the improvement of LLMs, as they allow us to do more with less.

  • @chromaflow9313 · 3 months ago

    This is incredibly helpful. Thank you.

  • @Neomadra · 3 months ago

    Another issue for grokking is that reasoning is not a single method that can be learned and applied to everything. It is many different methods, and I guess grokking on one skill and then on another will lead to forgetting of the previously grokked skill. I think one would need some freeze mechanism that locks some weights once grokking has been achieved.

  • @shApYT · 3 months ago +3

    Has any model proven real generalisation on any out-of-domain task? Even one task?

  • @anirudh514 · 3 months ago +7

    I'm a regular follower of yours; your videos are amazing!

  • @copywright5635 · 3 months ago +2

    This seems... oddly human. Does anyone else agree? It's weird that repetition is something both humans and AI greatly benefit from

  • @jp.girardi · 3 months ago

    I struggle to comprehend how this process doesn't result in more hallucinations through syllogistic reasoning, given that the 'generalization' seems to be derived precisely from this inherent syllogism.

  • @xuko6792 · 3 months ago

    4:48 - if there ever is one, this is the pivot point. Unless it is somehow possible to pick subsets of input data for the model to grok on without corrupting it, GIGO (garbage in, garbage out) is exactly what we'd get.

  • @-weedle · 3 months ago +1

    Love the multiple videos the past few days, but please take your time with the videos, quality over quantity.

  • @320770471 · 3 months ago +1

    This channel is worth watching just for the memes even if you have no clue what the heck he is talking about

  • @mAny_oThERSs · 3 months ago +2

    thanks for the shoutout

  • @perelmanych · 3 months ago

    I am a big fan of the Llama-3-70b model, but the fact that it achieves 0.049 on simple AIW questions tells us that it is mostly memorization of MMLU, rather than generalization, that gives rise to these results. Why doesn't it fail as much on AIW+ questions? Simply because it has seen much more data; remember that we are talking about a staggering 15T tokens of training data here.

  • @cube7284 · 3 months ago

    One of the best AI channels

  • @Dygit · 3 months ago +1

    These videos are so good

  • @fateriddle14 · 3 months ago

    Thanks for the content. I've got a question: for now, what every LLM does is "given the input words, what are the most likely words following them?" But it's pretty clear that's not how human thinking works at all; we answer a question based on our understanding, not by guessing the most likely answer other people in the world would give. It's a completely different model. So I fail to see how LLMs can reach true abstraction/generalization, when the whole model is just rearranging the existing answers online.

  • @koktszfung · 3 months ago

    nice video, very clear

  • @dysfunc121 · 3 months ago

    Interesting to hear "grok" has taken on a new life. Hackers have been using "grok" for nearly as long as the book that coined it has been around.

  • @norlesh · 3 months ago

    We need a foundation model that has been trained until it has grokked a children's primary school syllabus before it ever sees an equation or chemical formula.

  • 3 months ago

    Realistically, it could be that the training implicitly learns the test data.
    1. Train -> fail
    2. Reuse best model -> fail
    3. Reuse best model -> accidentally better
    etc...
    Another possibility is that you need some degree of overfitting with text data. Who was the 44th president of the US? Is it an average of the 43rd and 45th? Not really (I know Obama served twice, but that's not the point). You need to learn specific facts from the texts, weigh those facts higher than other random texts, and you end up being better at next-token prediction. If "objects break when they hit the ground" as text is weighted more than "T-shirts are always white", then you can train the next layer with an approximate physical rule, and not a random guess.

  • @eyeofthetiger7 · 3 months ago +1

    The missing piece is plasticity: a static model won't ever be able to reason.

  • @anonymouscommentator · 3 months ago

    i always love your videos, they are always so interesting! thank you very much!

  • @clearandsweet · 3 months ago +2

    I love this because it's exactly the same as how human beings learn.
    Also very excited for that paper mentioned at the end. This type of generalization is a big smoking gun that will lead to AGI very quickly so speeding up the grok is incredibly hype news

  • @rosendorodriguez7256 · 3 months ago

    My company, AI Nexus, has an algorithm that can grok consistently with low computation and low resources.

  • @brekol9545 · 3 months ago +56

    reasoning is still terrible

    • @JorgetePanete · 3 months ago +18

      In humans and in AI.

    • @dioscarlet · 3 months ago +6

      Yeah gpt4o is really weak

    • @onlyms4693 · 3 months ago +4

      Agree. I gave gpt-4o a puzzle math problem that is easy because it's just adding up numbers based on a pattern, but not with the true symbols.
      It failed when I did not explain the concept of how the puzzle works, but it succeeded when I explained it. So yeah, they need a way to make reasoning better in those LLMs.

    • @w花b · 3 months ago +7

      @@JorgetePanete Speak for yourself... Especially when you're writing this from a device that's the result of human reasoning...

    • @adamgibbons4262 · 3 months ago +6

      AlphaProof and AlphaGeometry just won silver in the Math Olympiad

  • @Benutzername0000 · 3 months ago

    dang i thought this was a fireship video

  • @jeanchindeko5477 · 3 months ago

    Thanks so much for that video

  • @SuperSmashDolls · 3 months ago

    So, the way I've understood grokking is that, when you train an AI model, you also have a regularization step, which reduces weights towards zero. And by grokking you're giving that regularization step a LOT of opportunities to prune weights and patterns that aren't contributing to model performance. Because, remember, the first thing we do with an AI model is initialize all the weights to random values, so there's going to be a lot of patterns that don't actually mean anything but happen to score well enough on test output to not be overwritten by normal gradient update.
    The Grokfast paper seems to imply this explanation is totally wrong and that grokking is just a fundamental property of gradient descent backprop. Or is regularization just so commonplace that it's just assumed and nobody calls it out?

  • @TimothyChakwera · 3 months ago +1

    I knew FFT was the way to go

  • @danielsan901998 · 3 months ago +1

    I am not surprised about the failure of LLMs to do basic reasoning with problems that involve numbers; it is already known that language models don't understand basic math. The most successful strategy is, instead of asking the LLM to solve the problem, to translate the problem into a more explicit definition. That's how Google managed to solve some Mathematical Olympiad questions by translating them to Lean, with the advantage that you can verify the answer automatically and reject unverifiable proofs. Another alternative is asking the model to solve the problem using a programming language; since the Python dataset is larger than the Lean dataset, it is easier to train a model or use a pretrained one.

    • @MimOzanTamamogullar · 3 months ago +1

      I've been wondering if we could do something similar with spatial reasoning. Could the model build an internal model of the world by using a 3D simulation of some sort? Like the physics engines in engineering software, its internal model would have a physics engine. When you ask it a question, it could run a simulation inside its head.

    • @brexitgreens · 3 months ago

      ​@@MimOzanTamamogullar Rumour is that's GPT-5.
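
    A sketch of that "translate to code" strategy; `call_llm` is a hypothetical helper standing in for whatever chat API you use, and exec'ing model output like this is unsafe outside a sandbox:
    ```
    def call_llm(prompt: str) -> str:
        """Hypothetical helper: send `prompt` to your chat model, return its reply."""
        raise NotImplementedError

    question = "Alice has 3 sisters and 2 brothers. How many sisters does Alice's brother have?"
    prompt = (
        "Translate this word problem into a Python function solve() that "
        "returns the numeric answer. Output only code.\n\n" + question
    )

    code = call_llm(prompt)
    namespace = {}
    exec(code, namespace)        # run the generated program (sandbox this in practice!)
    print(namespace["solve"]())  # the answer is computed, not guessed token by token
    ```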

  • @niklase5901 · 3 months ago

    Great video!

  • @drj-ai · 3 months ago

    Claude 3.5 and Mistral Large 2 (enough) both pass the Alice in Wonderland Test (three tests each with variations of numbers and genders).

  • @user255 · 3 months ago

    Thanks for the references!
    I have said so many times that these results *must* be fakes, because in practical use LLMs absolutely suck (excluding citing documentation and correcting grammar). They are just completely unable to do any thinking.

  • @me_hanics · 3 months ago

    Most major LLM builders are grokking right now - you can check: people are being hired to create and annotate logic-based exercises for training GPT. We've already seen what scale, and thus grokking, is capable of: it is indeed hard to ask any model that has seen all corners of the internet something new that hasn't been asked before - well, at least for prior-knowledge questions.
    On the other hand, we also see that we can just take some very large number and ask if it is even, or count the number of words/letters in a sentence, and we'll see how it fails, as these are completely new sentences not seen in training, where the logic behind the sentence matters. These failures won't disappear with any scale.
    If someone finds a key breakthrough for "generalization" or reasoning or whatever, which would clearly be well anticipated, it won't come from grokking though.
    Also I think generalization has become too general a term in AI; the main thing we need to solve for generalization is simply abstraction. If a model can abstract down a situation into another one, that is already a huge generalization. Moreover, we could skip a ton of training, which would enable much smaller models (you don't need 20 different instances of the same thing with different wording to make the model robust).

  • @casualuser5527 · 3 months ago

    Fireship thumbnail bruh. Got baited 😂

  • @LucaCrisciOfficial · 3 days ago

    The benchmarks LLMs are tested on obviously are not perfect, but of course they are valid.

  • @BYZNIZ · 3 months ago

    Great video, shout out to Jerry M for recommending the channel

  • @captaindryvids6909 · 3 months ago

    Cool idea, not sure if it's feasible though when scaled up 🤔

  • @themultiverse5447 · 3 months ago

    This video is not for me but I wanted to comment to get your channel more views. The editing is on point :)

  • @DefaultFlame · 3 months ago

    Great video.

  • @telotawa · 3 months ago

    omg they put a low pass filter on it to make it grok faster? that's nuts

  • @OwenIngraham · 3 months ago

    such good content

  • @scrollop · 3 months ago

    Can you add transcripts so that we can use an llm to ask the transcript questions to understand the jargon and concepts? I'm serious. Great video, though for those who don't understand the various terms this would be very useful!

  • @paulinepauline3680 · 3 months ago

    The last part was too real to be satire, or even irony for that matter.

  • @keypey8256 · 3 months ago +1

    I think we need to do more adversarial training

  • @PotatoKaboom · 3 months ago

    Nice video, well done! But wasn't the grok paper about specific puzzles? In your statements it seems like grokking could work magically for any task... maybe I'm wrong, but I thought it was for a very specific subset of tasks like "addition and subtraction", where the weights could randomly "click" at some point and just get the whole task right. This would never happen for a general-use LLM, right?

    • @nyx211 · 3 months ago +1

      The authors of that paper gloss over the fact that they provide the models with 20% - 50% of *all* possible input/output combinations while training. Any less than that and the models fail to undergo the grokking phase transition.
      I don't know if it's even possible to create a grokked LLM. Maybe it'd work for a small language model and a very simple language (brainfuck?).

    • @PotatoKaboom · 3 months ago

      @@nyx211 yeah that's what I thought, thanks for the reply! It makes the claims of this video pretty unprofessional...

  • @aishni6851 · 3 months ago

    You are so funny 😂 great content

  • @ElaraArale · 3 months ago +1

    It's grokking time!

  • @envynoir · 3 months ago +1

    edging an ML model is crazy

  • @AlvinYap510 · 3 months ago

    "Alice and Daniel are siblings. Alice has 3 sisters and Daniel have 4 brothers.
    How many brothers does Alice has?"
    This question just f**ked Claude 3.5 and GPT-4o
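
    For what it's worth, a brute-force check of the intended answer (assuming Alice and Daniel share the same set of siblings, and "sisters"/"brothers" exclude oneself):
    ```
    # enumerate family compositions consistent with both statements
    for girls in range(1, 12):
        for boys in range(1, 12):
            if girls - 1 == 3 and boys - 1 == 4:  # Alice's 3 sisters; Daniel's 4 brothers
                print(f"{girls} girls, {boys} boys -> Alice has {boys} brothers")
    # prints: 4 girls, 5 boys -> Alice has 5 brothers
    ```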

  • @YashvirD · 3 months ago

    "All models are wrong but some are useful" but in the AI chatbots context

  • @mcombatti · 3 months ago

    There are libraries to invoke grokking from the first epoch onward now

  • @jondo7680 · 3 months ago

    I'm always under the impression that models are undertrained and more training = better models. Architectural changes and everything just make the training or inference more efficient. Even smaller models could be trained to be much smarter, but they would require much more training.

  • @Codewello · 3 months ago

    Don't trust any model unless you test it yourself. Benchmarks right now don't mean that much.

  • @SKGFindings · 3 months ago

    The real question is, when will we draft an artificial intelligence bill of rights? What will that consist of? And who will get to decide that?

  • @cefcephatus · 3 months ago

    I already gave up on catching up with AI. Knowing that someone treats feedback as a signal is impressive. Another bingo square ticked.

  • @Nicolas-xb8eu · 3 months ago +1

    It's me, the grokk lord

  • @OnigoroshiZero · 3 months ago

    And it just so happens that Meta has prepared 10x the compute to train Llama 4 compared to Llama 3...

  • @AnnonymousPrime-ks4uf · 3 months ago

    People don't even realize what they are aiming for, considering that they have already made it clear what they want to do with it, and they already have kill bots...

  • @jankram9408 · 3 months ago +3

    I am sorry, but "grokking" just sounds like a brain-rot term...

    • @Deagan · 3 months ago +2

      we goonin && grokkin

  • @krepxl · 3 months ago

    I'm so confused because there are so many technical terms here.
    bycloud, can you make a long video explaining these topics from scratch, or can somebody in the comments tell me how to learn these terms and concepts myself? (I have no CS experience, etc.)

  • @DivineMisterAdVentures · 3 months ago

    I see!! These companies are not altruistic, and the researchers and developers are quite aware of the fact that it is an ILLUSION that is the GOLDEN GOOSE. Elon himself said something to this effect last week, while arguing that Grok 3 could ("could") leapfrog all others - or not. He basically said AI is overrated, the fear is hype. How can a mastermind not really have one? And I believe the standardized test scores are easy to fake - and that these AIs resort to making shit up whenever they are confused internally, like HAL "shepherding" the crew of the spacecraft Discovery One in _2001_.

  • @dpan · 3 months ago

    “Why does that look familiar?” “OH, I WROTE THAT.” More research on “MMLU bad” coming soon :D

  • @BGP00 · 3 months ago

    No way, they used a Fourier transform to speed up gradient descent. Has this been used before? Sounds like it would be useful in all of ML.

  • @Ori-lp2fm · 3 months ago

    Humans can imagine images, and AI models predict the next letter.
    Meaning, we can imagine images and convert them into code / language / songs.

  • @mirek190 · 3 months ago

    I wonder how well llama 3.1 70b, gemma 2 27b, or opus 3.5 would pass that test...

  • @Macorelppa · 3 months ago +1

    Man your consistency is inhuman!

  • @robertputneydrake · 3 months ago +1

    THE NEEDLE IN THE HAYSTACK! THAT'S WHAT I SAID!

  • @timog7358 · 3 months ago

    very interesting