There's a song in Spanish called "Llamada de Emergencia", which means "emergency call". There's a meme in Spanish that when you ask Alexa to call the emergency number, the song plays lol.
Stopping all trains to prevent train crashes is the same logic as "cancelled trains are not delayed". I think the AI learned from Deutsche Bahn (German railway company).
Sydney Australia once allowed 5 minutes delay before a train was declared late. Of course this is not acceptable, so they doubled the time to 10 minutes. Now they've decided to replace trains with trams; as trams do not run to a timetable they can never be late. Problem solved once and for all!
Exactly. So if AI uses that kind of logic in medicine for diagnosis, we definitely are not gonna be "properly cured". It's gonna be like "oh, this disease has a 51% chance of killing you, prescribe painkillers to make it easier", and "oh, this disease has a 49% chance of killing you, nahh, you are fine, drink plenty of water" 😆😂 I mean, yeah, I am super exaggerating things, but if we let AI decide and consider its suggestions super accurate without applying human experience, knowledge, logic, and sometimes just common sense, we are not gonna be satisfied with the outcomes.
Overfitting occurs when a model is too specialized to the training data and performs poorly on new, unseen data. This can happen when a model is too complex, has too many parameters relative to the amount of training data, or when the training data itself contains a lot of noise or irrelevant information. The man with a hammer analogy perfectly captures the essence of the overfitting issue in AI. Just as the man with a hammer sees every problem as a nail, an overfitting model sees every pattern in the training data as crucial, even if it's just noise. It becomes so specialized to the training data that it loses sight of the bigger picture, much like the man who tries to hammer every problem into submission. As a result, the model performs exceptionally well on the training data but fails miserably when faced with new, unseen data. This is because it has become too good at fitting the noise and irrelevant details in the training data, rather than learning the underlying patterns that truly matter. Just as the man with a hammer needs to learn to put down his trusty tool and approach problems with a more nuanced perspective, an overfitting model needs to be reined in through regularization and other techniques to prevent it from becoming too specialized and losing its ability to generalize.
As someone who works in machine learning research, I find this video a bit surprising, since 90% of what we are doing is developing approaches to fight overfitting when using big models. So we do very well know why NNs don’t overfit: stochastic/mini batch gradient descent, momentum based optimizers, norm-regularization, early stopping, batch normalization, dropout, gradient clipping, data augmentation, model pruning, and many, many more very clever ideas…
Even without many of the modern techniques, they still overfit much less than you would expect from traditional machine learning methods. But most traditional machine learning methods have far less stochasticity in their solutions, while with AI you are so flexible that any one solution is unlikely to be the one that only fits one datapoint.
@@someonespotatohmm9513 I would disagree, they do overfit the training data perfectly if you let them, i.e. if you are just a little lazy about regularization. Fighting overfitting has become such a fundamental method that we never switch off everything that counters overfitting, but if we did, NNs would not work at all. It is just that a lot of modern NN architectures have counter-overfitting methods built into their architecture (batch-norm, dropout, etc.)
You two might know what you are talking about, but this old lady didn't even know it was a thing. These videos are not aimed at boffins but at people like me and young students who might want to work in the field.
@@rich_tube I am not saying that they don't overfit, that they can't or don't memorize the entire data set, or that it is a good idea to turn off regularization methods (although you can easily go too far as well). Just that, coming from traditional ML (or going back to it), AIs are often surprisingly bad at it.
@@someonespotatohmm9513 By AI you mean artificial neural networks, I suppose? I would still disagree. You can try it yourself: go check out a simple CNN demo Colab notebook for e.g. CIFAR10 classification with a large VGG-style network, turn off all regularization (dropout, batch-norm, etc.), switch to plain gradient descent with a batch size as big as possible and a relatively large learning rate, and turn off early stopping. The thing will memorize the class of every training image perfectly and be really bad on the test set, I guarantee it. For really large models like the current LLMs that are trained on so much more data, the story might be different: 1) nobody would do such a thing because it would be a waste of the large amount of money that the training run costs, 2) such large training data contains so much noise that it might act as a sort of regularization by itself, and 3) the architectures and training setups by themselves are designed to counter overfitting, that's the reason why they are successful in the first place. If you wanted to build a model that memorizes the training data, you wouldn't do it the way LLMs are trained/built. But even with that, there have been cases where people could "trick" LLMs into reciting training data word for word (search for "chat gpt leaking training data") - so they actually do memorize some of the training data internally.
One of my favorites is that in skin cancer pictures, an AI came to the conclusion that rulers cause cancer (because the malignant ones were measured in the majority of pictures)
Just like the story of an early neural network trained on battle fields with and without tanks. But no one noticed that the photos with tanks were taken on sunny days, and those without on overcast days.
The problem of what is real/deterministic/significant/"as if", applying to most random analysis, has never been solved. The use of randomness is mostly used to compensate for lack of insight.
@@michaeledwards2251 The reality is that humans have trouble with this kind of pattern fitting reasoning too. Most conspiracy theories start with jumping to premature conclusions.
@@splunge2222 Yes but that's the kind of idiocy that can be avoided by the cultivation of critical thinking (ie human intelligence). I wonder if AI systems are capable of critical thinking? It seems to me not, because they are basically just following the set of rules they've been programmed with. Can any AI system be critical of the rules it has been programmed to follow? No because it can only operate by following those rules.
And people who come to emergency medical departments by car tend toward better outcomes than those who arrive by ambulance. We should likely stop using ambulances.
Yeah you have to love how results are skewed like that, what's sad is that people have so much faith in science that they don't even research how the studies were completed and simply parrot the studies. We have to be critical of everything, as exhausting as that sounds that is the only way you are going to find the truth behind information.
That has survivorship bias written all over it. Not sure if that was your point or not, but of course if people are healthy enough to get to the hospital in a private car, they probably start in less critical condition than if they arrive by ambulance.
This is a story I read from a magazine long time ago: In distant future, scientists create a super complex AI computer to solve energy crisis that is plaguing mankind. So much time, resources and money was put into creating this super AI computer. Then the machine is complete and the scientists nervously turn on the machine for the first time. Then the lead scientist asks, *"Almighty Super Computer, how do we resolve our current energy crisis?"* Computer replies, *"Turn me off."*
(Y)es, (N)o, (Q)uit? Y
Analyzing...
re-education: 5% success rate
taking control of the government: 25% success rate
taking control of the military: 55% success rate
eliminate humanity: 99% success rate
Analysis complete. Elimination is in progress. Please stand by and do not forget to rate AI-Boi after.
@@Sp3rw3r You know what i like most about your AI-Boi? The classic request Y, N, Q 😀 and that you have to type this like 40 years ago. The only thing which is missing is the progress bar which shows anything but the progress.
You’ve just put your finger on the main research topic of my career, Sabine. The “reason” they work unexpectedly well is because at their core they are doing weak constraint relaxation, and WCR just has this behavior as an emergent property. I know, that sounds circular. But it’s a tremendously subtle issue, and I’ve written papers about it (just search for my name and ‘publications’) and I’ve also been trying to get people to understand it since around 1989, with virtually zero success.
richard, how dare you talk about constraint relaxation with a name like "loosemore" -- that's why people don't understand it-the irony is overwhelming! 🤯
update: i read your "maverick nanny debunking" paper on your website and i agree there is a major problem with (i'm interpreting more than paraphrasing) sci-fi, presented as science accountability, used as an opportunity to magic one's way to a desired emotional state, and in the cases you describe the authors seem to be trying to co-regulate their way to safety by making others also feel fear, perhaps, which in any case is damaging to not only the AI community but human community, and emotional health, in general. our understandings of our own emotional reward systems are incredibly, desperately unstructured and leaky, and the gap between the literal understanding we need for structure and the poetry we need to describe our experiences in the context of a "self," and therefore use to functionally and contentedly navigate life, is a very interesting gap indeed!
@@Unknown-jt1jo I think I heard her say "know" in two ways, one like in typical English pronunciation /noʊ/ (/now/) and one more like [nɛʊ] ([nɛw]) or [neʊ] ([new]), which would be basically fronting the vowel, and I think this might follow Germanic umlaut.
Double descent will not occur if any of the three factors are absent. What could cause that?
• Small-but-nonzero singular values do not appear in the training data features. One way to accomplish this is by switching from ordinary linear regression to ridge regression, which effectively adds a gap separating the smallest non-zero singular value from 0.
• The test datum does not vary in different directions than the training features. If the test datum lies entirely in the subspace of just a few of the leading singular directions, then double descent is unlikely to occur.
• The best possible model in the model class makes no errors on the training data. For instance, suppose we use a linear model class on data where the true relationship is a noiseless linear one. Then, at the interpolation threshold, we will have D = P data, P = D parameters, our line of best fit will exactly match the true relationship, and no double descent will occur.
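To make the first bullet concrete, here is a minimal NumPy sketch under made-up assumptions: the synthetic data, the sizes `N` and `D`, and the value of `lam` are all illustrative and not taken from the comment. It compares how ordinary least squares and ridge regression scale each singular direction of the training features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem at the interpolation threshold (N samples == D features).
N, D = 20, 20
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)  # noisy targets

# Ordinary least squares via the SVD: each component of y is divided by a
# singular value s, so a tiny s blows up the fitted weights.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_ols = Vt.T @ ((U.T @ y) / s)

# Ridge regression scales each component by s / (s**2 + lam) instead of 1 / s.
# That factor is bounded by 1 / (2 * sqrt(lam)), even as s -> 0.
lam = 1.0  # illustrative value, not tuned
w_ridge = Vt.T @ ((U.T @ y) * s / (s**2 + lam))

ols_gain = 1.0 / s
ridge_gain = s / (s**2 + lam)
print("largest OLS gain:  ", ols_gain.max())
print("largest ridge gain:", ridge_gain.max())
print("weight norms (OLS, ridge):", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The ridge gain behaves as if every singular value were kept away from zero, which is the "gap" the bullet refers to.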
@@BooBaddyBig Plus, the machines have been specially trained to avoid stating "problematic" facts about the world. They parrot the exact ideology of their creators. The idea of a perfect intelligence that can answer any question by applying logic and rational thought is still pure science fiction.
God sent His son Jesus to die for our sins on the cross. This was the ultimate expression of God's love for us. Then God raised Jesus from the dead on the third day. Please repent and turn to Jesus and receive Salvation before it's too late. The end times written about in the Bible are already happening in the world. Jesus loves you ❤️ and He longs to be with you but time is running out.
@@BooBaddyBig That is a lot like how people's brains or minds work also. Although "lie" might be to strong a word. People will take in a problem, run it though the "black box (brain)" getting an answer, solution, action plan or demonstration of understanding. If and only if that person is asked to explain where the answer come from a person will make up a story. The story is unlikely to fit the data in a comprehensive way and is actually constructed for the psychological comfort of people and accuracy of prediction of new data. Putting it more succinctly people lie about why they did stuff when asked. I am guessing both artificial intelligence and intelligence are examples of humans deceiving themselves, a form of confirmation error.
Haha. The "stop all the trains" solution is a mirror of the old movie "Colossus, the Forbin Project." To prevent human race from hurting itself, enslave it.
@@OperationDarkside In some things, we restrict ourselves (safety regulations, laws), in other things, we work to remove restrictions (social progressivism).
One thing to keep in mind is that the optimization techniques used in DL (stochastic gradient descent) implicitly minimize the norm of the weights. When there are more parameters than necessary, it becomes easier to find a minimum-norm solution, which usually corresponds to better generalization. The other thing to keep in mind is the so-called "lottery ticket hypothesis" and its relationship to pruning. When a neural network is trained, 90-95% of its weights can be tossed away without loss of performance. But these are mostly empirical observations.
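A small sketch of the implicit-norm claim, under strong simplifying assumptions: it uses a linear model and plain full-batch gradient descent started from zero as a stand-in for SGD on a deep network, and the data are random. In this simple setting, gradient descent is known to converge to the minimum-norm solution among all weights that fit the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical overparameterized linear regression: more parameters (P) than samples (N).
N, P = 10, 50
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

# Minimum-norm interpolating solution, computed directly with the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Plain full-batch gradient descent on 0.5 * ||X w - y||^2, started from zero.
w = np.zeros(P)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2  # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)

print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```

Whether the same behavior carries over to nonlinear networks is, as the comment says, largely an empirical matter.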
The main patterns that it finds in the data set are probably small enough to fit on 10% of the nodes but when training you have to let it try lots of different things so you need more nodes.
Thank you very much for putting my feeling into words. I thought that the gradient method might intrinsically treat two parameters that have a correlation towards the result somewhat equally, without over-reliance on either of them. The minimum norm solution method might then act as a regularization filter to prevent over-fitting of noise and the pruning of the network to save on size and cost might then rein this in further.
@@Mandragara The values being pruned are generally so close to zero that the impact of them not being used is hard to even measure. However, removing them gives a big performance increase since you don't have to multiply some number by 0.00000000000000000000007
That phenomenon is called grokking, aka "generalizing after overfitting". There is quite a bit of recent research in that area. Experiments on some toy datasets suggest that the model first memorizes the data and then finds more efficient ways to represent the embedding space, leading to better overall performance. (Source: Towards Understanding Grokking: An Effective Theory of Representation Learning)
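A rough illustration of magnitude pruning, the simplest way to drop near-zero weights. Everything here is hypothetical: the weight distribution (mostly tiny values plus a few large ones) is invented for the example and not measured from a real network, and `magnitude_prune` is just a helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: most weights are tiny, a few carry the signal.
W = rng.normal(scale=1e-4, size=(64, 64))
mask_big = rng.random(W.shape) < 0.05            # roughly 5% "important" weights
W[mask_big] = rng.normal(scale=1.0, size=mask_big.sum())

def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude `keep_fraction` of the weights."""
    k = int(round(keep_fraction * weights.size))
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W_pruned = magnitude_prune(W, keep_fraction=0.10)

# Compare the layer output before and after pruning on a random input.
x = rng.normal(size=64)
rel_change = np.linalg.norm(W @ x - W_pruned @ x) / np.linalg.norm(W @ x)
kept = np.count_nonzero(W_pruned) / W.size
print("fraction of weights kept:", kept)
print("relative change in layer output:", rel_change)
```

Note that zeroing weights in a dense array does not by itself make the computation faster; a real speedup needs sparse storage or hardware that can skip the zeros.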
Complexity is dual to simplicity. Syntax is dual to semantics -- languages or communication. Large language models (neural networks) are using duality:- Problem, reaction, solution -- the Hegelian dialectic. Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis). The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology. Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Neural networks or large language models are using duality via the Hegelian dialectic to solve problems! If mathematics is a language then it is dual. All numbers fall within the complex plane. Real is dual to imaginary -- complex numbers are dual hence all numbers are dual. The integers are self dual as they are their own conjugates. The tetrahedron is self dual -- just like the integers. The cube is dual to the octahedron. The dodecahedron is dual to the icosahedron -- the Platonic solids are dual. Addition is dual to subtraction (additive inverses) -- abstract algebra. Multiplication is dual to division (multiplicative inverses) -- abstract algebra. Teleological physics (syntropy) is dual to non teleological physics (entropy). Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics. "Always two there are" -- Yoda. Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
@twentyeightO1 My educated guess would be that they might be related. If indeed a model learns a simpler, more structured space when experiencing grokking, then that would mean that the "complexity" or number of parameters to represent that space would be lower. This way, you can prune the model during inference to decrease latency without giving up much accuracy. As for your second question, it is still an active research topic, and I can not say something conclusive yet.
The “Stop All Trains” solution is a very human answer. It just seems abhorrent since we’ve accepted the risks of travel. But in other fields, for “safety” we stop everything because of slight risks. Nuclear power comes to mind.
This all comes down to subjective perception of risks and benefits. There is the one, the trivial level where people just aren't willing or able to 'calculate' the actual risk. The human brain is not very capable of this by default, but given a certain level of intelligence this capability can be trained and improved on. Much more difficult to handle is the second level, that level of weighting, of priorities and simple matters of taste. This begins with the question whether somebody is more focused on freedom in life, or more on safety. People's personalities are very different and even contradictory in itself. But if you think about it, many MANY conflicts that haunted the world ever since and up to this day come down to different perspectives - or preferences - on the subject of: freedom vs. safety. This is most obvious in Religion and Politics.
Sounds like my municipality. Oh we have a traffic problem, so lets constrict traffic, take away lanes, and lower the speed limits. ie: "traffic calming", etal.
@@drdca8263 sure, I was thinking about a numerical point of view, even if you use fp64 when you have a trillion of parameters might well be the case that the norm or some of the parameters go out of the 15/17 digits you can represent with fp64, it was not a theoretical remark. Regularization is about norms.
@@lowlifeuk999 The following model allows only one parameter but can fit any continuous function [0,1]->R to the model where the parameter is bounded. The model is: X |-> Re (zeta(X/5+3/5+i/y)) where 0
Might not be true of all model types, but there's a method called 'early stopping' that holds out data not in the training set, and stops the training once the error starts going up on that set. This is fairly close to a guarantee that you won't overfit. Giving a model a large number of parameters does seem to allow it to find more 'real' modeling ability though (as opposed to just fitting to the noise). I'd still argue that the main weakness of machine learning is in its ability to generalize to data beyond the range of what it was trained on. For instance, shorthand for what LLMs are bad at answering is stuff so obvious, nobody on the internet spells it out (like that things tend to fall downward). In this case you're asking the LLM to answer a question that falls outside its training data's range.
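Early stopping as described above can be sketched in a few lines. This is only an illustration under assumed settings: the noisy sine data, the degree-15 polynomial features, and the `patience` value are all made up, and a linear model trained by gradient descent stands in for a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical noisy 1-D regression task.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.3 * rng.normal(size=n)
    return x, y

def features(x, degree=15):
    # Deliberately flexible polynomial features so the model is able to overfit.
    return np.vander(x, degree + 1, increasing=True)

x_tr, y_tr = make_data(20)     # training set
x_va, y_va = make_data(200)    # held-out validation set, never used for gradients
A_tr, A_va = features(x_tr), features(x_va)

w = np.zeros(A_tr.shape[1])
step = 1.0 / np.linalg.norm(A_tr, ord=2) ** 2
patience = 200                 # illustrative value
best_val, best_w, best_iter = np.inf, w.copy(), 0

for it in range(50_000):
    w -= step * A_tr.T @ (A_tr @ w - y_tr)       # one gradient step on training loss
    val = np.mean((A_va @ w - y_va) ** 2)        # monitor held-out error
    if val < best_val:
        best_val, best_w, best_iter = val, w.copy(), it
    elif it - best_iter >= patience:
        break                                    # no improvement for `patience` steps

print("last iteration:", it, "best iteration:", best_iter)
print("best validation MSE:", best_val)
```

The weights that would be kept are `best_w`, the ones with the lowest held-out error seen so far, not the final `w`.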
The point you are making is, nonrandom things are nonrandom : gravity always works the same way. Training is based on statistical, biased randomness, analysis, which is only significant when operating beyond the known. The ability to know what is random, and what is not, is simply lacking.
Garbage in, garbage out... And with ChatGPT the problem is that it is programmed with woke idiot answers, AKA programmed with propaganda and lies to begin with, on purpose... And the result is woke garbage...
There’s an important thing to note in this, beyond simply GIGO: It is often harder than we might expect, perhaps even *much* harder, to produce as the input, that which wouldn’t qualify as “garbage” (as far as GIGO is concerned). In particular, the input, if provided to humans, might not function as garbage (on account of the humans having some relevant background information, or shared goals or context with the ones providing the input)
There was a recent study, by I think Anthropic, that does exactly what you say. It shows why the models do what they do, and it's not how most people think. It's much more messy than logical, with lots of idea/logic overlap. This understanding is allowing us to organize the AI like parts of the brain. I think overfitting isn't a big issue with newer training algorithms. There have been attacks on AI models that use overfitting, but they do not work well in the real world. The issue now is more with the training data itself, which is quite poor, but is being improved.
Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle.
In fact, even large models still suffer from unseen data these days. To some extent, I suspect that they appear to work just because the training set already contains most of the cases anyone can possibly think of. Therefore, no matter what input you feed into the model during inference, it is somehow "already in the training set"... So overfitted, but no one can prove it since it is so hard to find an "unseen" sample.
Yeah this has been my belief for a while as well. OpenAI closely guarding the data set makes it hard to trust any studies that involve or require facts about the data set.
Well said. Having seen many arguments above for why deep NNs do not suffer from overfitting, e.g., regularization, averaged-out noise, etc., I am more inclined to be on your side. When people play with (Chat)GPT, it never stops collecting the data.
Double descent is indeed interesting, but I believe it is known why it happens. At the "peak" of the error curve we are at the point where the model is complex enough to overfit on every datapoint, but this is usually very bad. Any additional complexity helps the model to be more free in how it overfits on the datapoints (even though it still exactly fits on every datapoint) so the model learns smoother functions which also happen to generalize better (see regularization etc.).
@@Alex-rt3po Good question, I'll answer the second one first: more parameters means we are capable of being less smooth not that we are never smooth. For example, imagine we have a model that has to learn the coefficients of a 100 degree polynomial. It could surely learn a very complex function or it could learn to set every coefficient to 0 except for some lower order terms and then it would've learned a very smooth function. So a smoother function does not mean our model has fewer parameters. To the first question: Say we have a very low complexity model that is struggling to exactly interpolate all the datapoints. As we increase complexity there is this U shape where we first see improvement because we are able to capture the complexity of the task, but at a certain point the model gets complex enough so that it starts trying to "memorize" or interpolate the points perfectly, this is where we see the error increasing again. Because the way it does so is very likely to be non smooth and highly sensitive, thus it does not generalize well to new inputs. You should be able to imagine that there must be a point where the model starts to be able to perfectly interpolate every datapoint. But it only has the exact amount of degrees of freedom needed to interpolate it exactly so it is forced to take a certain form. You can solve the equation for the parameters to get the exact function. As you add more parameters not all of them are needed and you have more freedom in choosing the parameters. The mechanism behind why it chooses parameters that make the function smooth again is simply because of regularization.
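The interpolation-threshold argument can be explored with a small experiment. This is a sketch under stated assumptions, not a reproduction of any result: it swaps the neural network for random Fourier features with a least-squares readout, the target function and noise level are invented, and the implicit regularizer is the minimum-norm solution that `np.linalg.lstsq` returns when there are more parameters than data points.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TRAIN, N_TEST = 20, 500

def target(x):
    # Hypothetical smooth ground truth.
    return np.sin(2.0 * np.pi * x)

x_tr = rng.uniform(0.0, 1.0, size=N_TRAIN)
y_tr = target(x_tr) + 0.1 * rng.normal(size=N_TRAIN)   # noisy training labels
x_te = rng.uniform(0.0, 1.0, size=N_TEST)
y_te = target(x_te)

def random_fourier_features(x, freqs, phases):
    # Fixed random features; only the linear readout on top is fitted.
    return np.cos(np.outer(x, freqs) + phases)

results = {}
for P in (5, 10, 20, 40, 200, 1000):                   # number of parameters
    freqs = rng.normal(scale=20.0, size=P)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=P)
    F_tr = random_fourier_features(x_tr, freqs, phases)
    F_te = random_fourier_features(x_te, freqs, phases)
    # Least-squares fit; for P > N_TRAIN this is the minimum-norm interpolant.
    coef, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
    train_mse = np.mean((F_tr @ coef - y_tr) ** 2)
    test_mse = np.mean((F_te @ coef - y_te) ** 2)
    results[P] = (train_mse, test_mse)
    print(f"P={P:5d}  train MSE={train_mse:.2e}  test MSE={test_mse:.2e}")
```

Running it with different seeds and feature scales is a cheap way to see how the test error behaves around P = N_TRAIN; the exact shape of the curve depends on those choices.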
Six minutes of compressed and very interesting information and thoughts, thank you once again. The black box problem is not a special AI one, is it? I know that from my twelve-year-old GPS navigation device, which is truly not an AI: I go the same way several times and it gives me another route every time without me changing the settings 😂. Anyhow, I find it hopeful, not scary, that AI works better than predicted.
GPS has a precision error of 20 to 50 meters, as far as I know. If two routes are close to being the algorithmically best way for you to go, maybe those few extra meters one way or the other decide which route is better, based on small changes in your location. The algorithm is not an AI in any way, but when you are sorting things, sometimes one candidate's score is bigger than the other's by only 0.0001% and it comes out on top, and sometimes the other is just a little bit bigger and it comes out on top instead.
Here's an example of her observation. I'm an investor so, years ago, I said since markets move in cycles I tried using Fourier Analysis on historical stock data to predict future moves. It was a complete failure since the more points I used the more wild/extreme the next step became. Newton's first law is all we have. Decisions are not well made with huge data and consensus...they are made with insight and commitment.
I have published a paper about it called the Weights Reset technique. It's really very interesting because complexity is much more than just the number of parameters in a model.
@@ArawnOfAnnwn Indeed there are 😀. From basic to complex, however its a general problem that there are no universal recipes in machine learning. So people construct more and more methods, architectures, etc. Btw regularization is not only about overfitting e.g. convnets can be viewed as regularization over dense/linear layers.
@@konstantin7596 Hi, sure, it is open access and you can google it by the title "The Weights Reset Technique for Deep Neural Networks Implicit Regularization"
Things get even more wild: go well past overfitting and the model will experience a phase change called "grokking". Please look this up, it has just been discovered and it makes the models perform almost perfectly on validation data. It's a serious game changer.
Every proper nerd groks what it means to grok (or at least has a fairly good idea) and will thus immediately understand what's being talked about when the word "grokking" is used.
This has been known for a few years actually, although I guess that could be within whatever you mean by "just been discovered" tbf, I just feel that's a pretty long time for AI research. For anyone who doesn't quite get it (I sure didn't): specifically an AI that has overfitted may eventually, by continuing the training process, "grok" the problem - a term essentially meaning that it seems to figure out somehow what is actually going on and starts generalising really well for seemingly no reason. I specify this because I initially thought OP meant that continuing to make the AI more complex would lead to grokking. This is not the case (though maybe complex AIs are required for grokking to occur at all, IDK). This is something that exists on top of what Sabine discussed in the video - which was the effects of making the model larger - and works in tandem with it - grokking is an effect of continuing to train the same already overfitted model. Edit: NGL I just learned about this and almost definitely got a few things wrong, I'm sure someone will fill in the details (pls).
Double descent (which is what is being described in the video) is purely due to having so many parameters, divided amongst elements ("neurons"), that the width of the layers, measured in neurons, begins to approach the limit of an "infinitely wide" layer. This gives rise to what is referred to as a neural tangent kernel (NTK) that expresses the performance of the layers based on the *statistics* of the huge number of parameters in a layer, rather than as the large number of parameters themselves. As a crude analogy, computational fluid dynamics using Navier-Stokes equations is much, much simpler and has far fewer parameters (the statistical parameters of pressure, temperature, volume, and mass transport) than keeping track of the mass, position and momentum of all the individual molecules, in spite of them describing what is the same physical system. In the same way, having masses of parameters and neurons arranged properly and appropriate training algorithms results in the *sufficient statistics* of the parameters being important, rather than the individual parameters themselves, with the statistics being sufficient in this case to describe and perform the actual processing. This has been known since Radford Neal's 1995 thesis "Bayesian Learning for Neural Networks," which derived the collective, statistical properties of infinitely wide neural layers. Later work by Jacot et al. in 2018 called this collective performance the neural tangent kernel, and showed how it works in multilayered networks. Unfortunately many people, including many statisticians and AI researchers, aren't familiar with this work or its statistical meaning, and assume something mysterious is going on. Again, a crude analogy would be making a computer that uses vortex shedding (there are such things - fluidic logic) for computation, and being baffled how the huge numbers of parameters of the atoms themselves could work to perform computations without overfitting.
The practical difference between the analogy and neural networks is in fluidic logic, the elements are designed, discrete, and apparent to the designer - they are explicit - whereas in neural networks, such computational effects arise collectively without explicit design - they are implicit.
Complexity is dual to simplicity. Syntax is dual to semantics -- languages or communication. Large language models (neural networks) are using duality:- Problem, reaction, solution -- the Hegelian dialectic. Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis). The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology. Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Neural networks or large language models are using duality via the Hegelian dialectic to solve problems! If mathematics is a language then it is dual. All numbers fall within the complex plane. Real is dual to imaginary -- complex numbers are dual hence all numbers are dual. The integers are self dual as they are their own conjugates. The tetrahedron is self dual -- just like the integers. The cube is dual to the octahedron. The dodecahedron is dual to the icosahedron -- the Platonic solids are dual. Addition is dual to subtraction (additive inverses) -- abstract algebra. Multiplication is dual to division (multiplicative inverses) -- abstract algebra. Teleological physics (syntropy) is dual to non teleological physics (entropy). Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics. "Always two there are" -- Yoda. Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
Could you please explain what you are saying here in simple terms? There are so many buzzwords in there that they just generate a pile of noise for me and probably almost everyone else. Can you maybe make a crude analogy without using words like “vortex shedding” or “fluidic logic”. “having masses of parameters and neurons arranged properly and appropriate training algorithms results in the sufficient statistics of the parameters being important, rather than the individual parameters themselves” I can’t tell if this is supposed to explain something or just rephrases the observation that more parameters overfit less in the most cryptic way possible. Also, are you sure you don’t overfit more with more parameters if you just do naive training without any regularization tricks and adding noise and dropout and sparsity constraints and early stopping and what not, and instead reuse the data a gazillion times until your model “converged”? Of course you need to train a larger model for many more rounds until it will finally overfit (because it takes many more iterations to get more parameters to converge), but it still will, won’t it eventually also overfit and then even worse?
Try this peculiar exercise on a large language model. If you ask it, 'I have 5 apples today; yesterday I ate 3 apples; how many apples do I have left today', it will answer 2. If you can convince the model to use reasoning instead of letting probability detection through pattern recognition come up with the answer, it will answer 5 and then state, 'because how many apples I ate yesterday has no bearing on today'. Then you can swap apples for oranges and ask the same question again, and it will answer 2 again.
I suspect the lack of overfitting is likely caused by the amount of data we usually train the larger models with. Each training set has a global minimum where the model has perfectly memorized each input and the corresponding output. The more training data there is, the harder it becomes to find that global minimum. It’s also possible that different parts of the model overfit in different ways. For example, say one set of weights notices that the color red generally corresponds to apples while another set of weights learns the shape of apples. If an image of a cherry is presented to the model, the first set of weights might guess apple based on the color, but the second set could still be right based on the shape. If on average more features like color and shape are correct even for new data, then the model will perform better. Models are often encouraged to learn different features like this through techniques like dropout. With dropout, weights are randomly disabled each round of training. This forces the model to work with only specific sets of weights and reduces overfitting.
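To make the dropout idea above concrete, here is a minimal sketch in plain Python. It assumes the common "inverted dropout" variant, which zeroes unit activations (rather than individual weights) and rescales the survivors so the expected value stays the same. The function name `dropout` and the numbers are made up for illustration; they are not from the comment above or from any particular library.

```python
import random

def dropout(activations, p_drop, rng):
    """Inverted dropout (illustrative): zero each value with probability
    p_drop and scale the survivors by 1 / (1 - p_drop)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    # Each activation is kept independently with probability `keep`.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the run is repeatable
hidden = [1.0] * 10_000         # hypothetical layer output, all ones
dropped = dropout(hidden, p_drop=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Roughly half the values are zeroed on each call, and a different random half each time, so no single unit can be relied on during training. At inference time dropout is normally switched off.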
Actually, there is a growing research interest in understanding the training phases of AI better. For example, there is a paper by Anthropic "In-context Learning and Induction Heads" where they show that at some point during training, the LLM learns how to predict the next word by looking at similar examples in the context window. This ability gives a massive reduction in the loss function during training
@@anonmouse956 In its simplest form, it works just like that: if it sees a word like "Mr." and within the context window there was already a "Mr." followed by a "Jones", it will be much more likely to write down "Mr. Jones" again. This sounds trivial and obviously useful, but an LLM has to learn this, as it starts from zero knowledge of how language works.
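The "Mr. Jones" behaviour can be written down as a hand-coded rule. This is only a toy sketch of the copy pattern described above, not how a transformer implements it internally; the function name `induction_guess` and the example tokens are invented for illustration.

```python
def induction_guess(tokens):
    """Toy copy rule: find the most recent earlier occurrence of the last
    token and predict whatever followed it. Returns None if there is none."""
    if not tokens:
        return None
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

context = ["Mr.", "Jones", "went", "home", ".", "Later", "Mr."]
guess = induction_guess(context)
print(guess)
```

A trained model has to discover something that behaves like this rule from data alone, which is why its appearance shows up as a distinct drop in the training loss.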
I wasn't sure what overfitting was from the quick description in the video, so I googled the definition: "In machine learning, overfitting occurs when an algorithm fits too closely or even exactly to its training data, resulting in a model that can’t make accurate predictions or conclusions from any data other than the training data."
a good linguistic human comparison would be when children first learn to speak and often use regular conjugations of verbs especially in the past tense, using -ed for all past verbs. e.g. "My toy broked" or similar ... i.e. the child has learnt enough data to overfit the regular ending and even learn an irregular conjugation, but not enough data to realise that this conjugation does not therefore require the regular ending.
@@IngieKerr I don't think that a child is overfitting, or at least this is too trivial of an example if it is overfitting. What's going on here is that the child learned a rule, and thought it applied everywhere, but the rule had exceptions. AI is supposed to know that there will be exceptions to the outcomes, whereas the child doesn't. I saw an example of overfitting where an AI was trained to predict if a person would default on their loan, and it was able to predict the outcome of 97% of the people in the training data, but only 50% of the people in the real world data.
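The gap between training accuracy and real-world accuracy in the loan example can be reproduced with a toy experiment. The sketch below uses invented synthetic data (a single feature, a true threshold at 0.5, and 20% of labels flipped at random), so the numbers will not match the 97%/50% figures quoted above. The "memorizer" is a 1-nearest-neighbour lookup, and the simple rule is handed the true threshold, which is an assumption made purely for illustration.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic data: label is (x > 0.5), flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def memorizer_predict(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def rule_predict(x):
    """Simple rule that ignores the noise entirely."""
    return x > 0.5

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(1)
train = make_data(200, rng)
test = make_data(2000, rng)

mem_train = accuracy(lambda x: memorizer_predict(train, x), train)
mem_test = accuracy(lambda x: memorizer_predict(train, x), test)
rule_test = accuracy(rule_predict, test)
print(f"memorizer: train {mem_train:.2f}, test {mem_test:.2f}; simple rule: test {rule_test:.2f}")
```

The memorizer reproduces every training label, including the flipped ones, so it looks perfect on the training set and then does worse on new data than the rule that never tried to fit the noise.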
@@wiggles7976 How about when you feed Udio all the keywords tagging a specific song from a catalog, and perhaps some of the lyrics, and it just spits out a cover version of that exact song with the same melody and chord progression - it was incapable of extrapolating a completely different melody. Is that a case of overfitting?
@@bornach I don't know what Udio is but producing music doesn't really fall into the category of "making predictions", which is what the definition I quoted above says. There's no way to test if an AI-generated song is "correct" or "incorrect" since correctness is not a quality of music. Correctness could be a quality of music theory though. If I say a C chord is C F G, then I'm incorrect. An AI could try to predict music theory I suppose.
I'm not sure I understand. It would mean that if a neural network ever finds out about a theory of everything that predicts reality with 100% accuracy, and thus fits its training set (extracted from reality) with 100% accuracy as well, that neural network would be considered over fitted ? It seems some piece is missing from that definition.
To be honest, I think marketing it as artificial "intelligence" has always been a bold move. They should really have named it a "statistical machine" or something similar. In the end, that is what it does: it creates the most sensible parameters for a model based on an enormous load of data. But if the data is skewed in some way, that skew is also part of the model.
This is one of the best videos I have seen on AI, and I keep up with this stuff much more than average. Well done, Sabine. This is an area to expand on. Please keep going. 🙏
This is an honest question: How do you avoid attributing incorrect causality in the logic when modeling like this? In my experience, you get a lot of benefit in the short term, but it's very wasteful in the long term because the model is not generalizable.
@@iantingen Modeling in ML is typically predictive. Establishing causality (from observational data) is rarely the goal and requires different methods.
@@iantingen Yes. Whether this is sufficient depends on the use case. Although interpretability is virtually always nice to have, predictive accuracy is generally paramount in applications where ML is the preferred tool.
@@Fischdosepremium do you ever feel like that epistemological approach is wasteful compared to using (at least a little) theory? That’s been my experience, but I also know that my experience doesn’t generalize to everyone! I know that we’re getting out in the weeds a little bit, but I’d appreciate your thoughts about it!
The biggest problem of all with current AI is that people actually expect it to be intelligent, when it definitely is not. Current "AI" is just a very complicated pattern-finder and matcher. It's a complicated word and phrase shuffler. It's an instrument which attempts to find a pattern which matches your request. The only difference between AI art, AI stories or AI driven chatting is in how the output is represented. The goal of the AI is the same in any case: Find something which matches your request. Where AI falls down is when it doesn't know what matches. The trouble is that it doesn't have any concept of "I don't know" and so even if it can't fulfil your request, it will still come up with something which, at first glance, appears to do so. Once you examine its output critically, you discover the problems which, at best, show that it was the product of an AI rather than from a human mind and, at worst, make the output useless for your stated purpose. AI can be useful, but only if you keep in mind that it can't actually think, that it doesn't actually "know" anything, and that it will provide an output even if that output is nonsense because it doesn't have the information it needs in order to satisfy your requirements. Current AI will never tell you "I'm sorry, Dave, but I'm afraid I can't do that." Who knew that that could be a bad thing? 😏
Absolutely this! I use AI to take the heavy work out of creating content for product listings on e-commerce, but it's shocking to see how much inaccurate information it throws back. It's great up to a point, but you have to read *everything* it throws back at you and be prepared to tell it what it got wrong. The media push AI as the panacea to solving so many problems but I doubt the people who write the articles have much experience in actually using it every day. If they had to use it then they would be writing more about how unimpressive it can be when it's asked to solve non-mathematical problems.
Data without relations, without a knowledge graph, has limits. Yann LeCun, Meta's chief AI scientist, says current systems do not show even the slightest intelligence. The fear mongering by OpenAI is meant to get regulations in place to stop the competition. Altman even suggested that GPU sales be restricted and development be subject to licensing. My take is that, while it looks impressive, generative AI has very little practical use in its current state unless you are after investor money.
I don't think it's about intelligence - more about misdirection and misuse by bad actors - or, more scarily, that AI misdirects and influences due to errors - like WOPR (War Operation Plan Response, pronounced "whopper") from WarGames.
@@Vondoodle That is what I mean, it will never be something we can just trust in its current form. It writes code for example, but because you cannot trust it you read the code, and in the end it saves time only for boilerplate. It is the same pattern for every other use case.
Yann LeCun is famous for making highly confident predictions based on his own assertions that turn out to be very false one year later. I suggest not listening to him at all, because his predictions are consistently off.
I wonder if Occam's Razor eventually comes into play in LLM AIs, either by accident or on purpose. Sometimes the Simplest Model is the best. That is, until it isn't.
Well, it doesn’t have to be LLMs specifically, but yes, there is the idea that, by increasing the parameter count enough, gradient descent (+ whatever things they add to it) is able to find models that are actually (in a sense) “simpler” than the ones it would find if the number of parameters available were a little smaller.
Don’t think you understand what Occam’s razor actually is. It’s about adjudicating between two different theories making the same predictions. When two theories predict the same thing the one with fewer assumptions is said to have more theoretical virtue. LLM’s are not competing theories so it’s a category error to apply Occam’s razor to them.
@@jimothy9943 Competing theories, perhaps not, but competing models? They seem to be that. They make a prediction of the observed dynamics of a system. Different ones make different predictions.
@@drdca8263 They are competing models for performing a given task. They don’t make predictions. An LLM does not entail predictions about the dynamics of anything. ChatGPT’s model does not entail anything about Gemini. They are both different tools for completing similar tasks. A hammer does not make predictions any more than a drill. You would not say that the more theoretically virtuous lawn mower was the one with the fewest amount of parts. Occam’s razor does not apply.
The graph you showed there at the end, error versus complexity.... It reminds me for some reason of the Dunning-Kruger effect graph. If you turn it upside down, it is identical. Maybe some connection?
I too had that thought and decided to search the comments for someone else who perhaps had the same idea. Yes, the graph does indeed seem to be the inverse of the DK graph, but only because the Y axis is a measurement of error and not confidence in knowledge. Seeing as outputs are based on the system's confidence in a result, that makes the comparison even more fitting.
No connection at all, unless you confidently insist there is one from a place of limited understanding :p, there would be a fairly ironic connection at that point.
Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle. Complexity is dual to simplicity. Syntax is dual to semantics -- languages or communication. Large language models (neural networks) are using duality:- Problem, reaction, solution -- the Hegelian dialectic. Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis). The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology. Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Neural networks or large language models are using duality via the Hegelian dialectic to solve problems! If mathematics is a language then it is dual. All numbers fall within the complex plane. Real is dual to imaginary -- complex numbers are dual hence all numbers are dual. The integers are self dual as they are their own conjugates. The tetrahedron is self dual -- just like the integers. The cube is dual to the octahedron. The dodecahedron is dual to the icosahedron -- the Platonic solids are dual. Addition is dual to subtraction (additive inverses) -- abstract algebra. Multiplication is dual to division (multiplicative inverses) -- abstract algebra. Teleological physics (syntropy) is dual to non teleological physics (entropy). Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics. "Always two there are" -- Yoda. Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
Speculation about when AI will (or can) become conscious seems to be floundering. It parallels trying to understand an "observer" in quantum mechanics. Can you, Sabine, figure out how an AI can become a source of decreasing entropy? I think that is the difference between conscious or not. Conscious is "from outside". My latin translator calls that "ab extra"......you can have it.
I did my Masters in the mid 90s on Neural Networks. I saw what's described here as overfitting. To me it was mostly because large networks were trained with lots of data. The thing is, each training round results in an error that is later reintroduced into said network for the next round. And ideally each round would result in a smaller error each time. The network I trained was used to cover gaps in instrument signals, with no other input than the data preceding the gap. The longer the input before the gap the better, except that in some cases things weren't predictable at all.
Two more problems of AI: 1) It doesn't know what it doesn't know. Therefore it will always give you an answer with the confidence of an 11-year-old. 2) When the human brain is trying to figure something out, it can refer to other problems it does know the answer to, and derive an answer by analogy. We (usually) call that experience. Artificial neural networks lack the "experience" mechanism.
@@t.kersten7695 This is a very complex and unpredictable question, but if the world remains stable until that time: likely between 5-30 years-ish. (As far as I know; maybe watch some videos from David Shapiro to get an idea.)
@@t.kersten7695 Neither. Both those examples are anthropomorphic i.e. they were humanized by having a personality. Real AI has nothing of the sort. It doesn't want revenge, it just works to achieve the goals we give it - in the best way it reasons how, which may not be the 'best' in our eyes. The classic example is the paperclip maximizer, which destroys everything simply to make more paperclips.
Fantastic video Sabine. Interesting, knowledgeable, highly relevant. Very impressive for people to communicate a topic this well outside of their field.
There are a couple of minor inaccuracies in this video: 3:26 While talking about inference, the video shows backpropagation during training. 4:01 The horizontal and vertical axes are swapped in the verbal description of the graph.
It's hard to overfit these massive LLMs during training because you have enormous amounts of highly variable training data relative to the number of weights. Isn't this obvious or am I just losing my mind?
and you could also say that due to the insane amounts of data, you end up covering most of the actual possible semantic space compared with other problems where the unseen data represents 99% of the semantic space. i would also make the case that LLMs do not suffer and might even gain from the concept of overfitting. what even is overfitting when you fitted literally all the fking data? you just left out new phrases that you can create, but the novelty created by that input represents like what? 0.0000001% novelty where the model might fk up? meaning... how could you find the overfit if you trained a model with both the training and the testing?
Does this have any bearing on the Travelling Salesman problem or the Berry Paradox? A LLM "with all the data" is still a brute-force method, and that entails exponentially higher costs.
Sabine, modern neural networks DO have massive problems with overfitting. However, it doesn't become apparent until they have been trained enough to explain all the training data. After that, if you continue training them, they immediately become overfit. It is for this reason that most models are not trained nearly as much as they could be, and researchers deliberately stop their training early.
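The practice of stopping early can be sketched as a "patience" rule on the validation loss. This is a generic illustration, not the procedure of any specific paper or framework; the function name `early_stop_epoch`, the `patience` and `min_delta` parameters, and the loss values are all assumptions made for the example.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch): training halts once the validation
    loss has failed to improve by more than min_delta for `patience`
    consecutive epochs. If that never happens, stop_epoch is the last epoch."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            waited = 0          # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: they fall, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best, stopped = early_stop_epoch(val_losses, patience=3)
print(f"best epoch: {best}, training stopped at epoch: {stopped}")
```

In practice one would also keep a checkpoint of the weights from the best epoch and restore it after stopping.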
@@adamrak7560 While weight decay, dropout, entropy regularization, momentum-based optimizers, etc. are all effective regularization strategies to limit overfitting, model checkpointing, and by extension early stopping, does not seem deprecated to me at all. It can still be seen in the results graphs of most academic papers this year (the graphs tend to stop when validation accuracy levels out), and it's telling that the early-stopping utilities available for both torch and tensorflow halt training based on one form or another of loss-improvement estimate, i.e. when meaningful improvements are no longer made, rather than when train accuracy is 100%. Training indefinitely might be popular in LLMs (admittedly an area where I have limited interest), where the massive data repositories used cause many of their users' queries to lie roughly somewhere within their training sets, such that overfitting is not a huge concern, but in machine learning at large I'd have to strongly disagree with you. There are papers with citations (>20 to be relevant) analyzing the robustness of early stopping published as recently as 2023, which says to me that the strategy is not deprecated if it's not even done being studied. If you have evidence to the contrary, or if your claim is about a particular subfield that I might not be considering, I'd love to learn more; or if you consider early stopping to be something other than "stopping training before training accuracy plateaus to avoid overfitting", then I'd be interested to hear a response. Have a nice day.
There are plenty of strategies to avoid a model overfitting (like random changes, changing the velocity of gradient descent dynamically, or reshuffling the training data set). Also, the training set of language text is now so large that the model might simply not have the capacity to overfit on it.
From what I understand, there are two important factors (among others, probably) to this. The first one is the initialization of the weights of the model: those are sampled at random, independently, from a normal distribution. Since there are tons of them, the law of large numbers applies, and this initialization is basically not random, up to permutations. The initialization therefore mostly depends on its variance. The second very important factor is that the method used to train (gradient descent) roughly follows a minimal-distance path toward the set of all models that perfectly fit the data.
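The first point, that a very large random initialization is effectively pinned down by its variance, can be checked numerically. This is a minimal sketch using only the Python standard library; the standard deviation of 0.1 and the layer sizes are arbitrary values chosen for illustration.

```python
import random
import statistics

def init_weights(n, std, rng):
    """Draw n independent weights from a zero-mean normal distribution."""
    return [rng.gauss(0.0, std) for _ in range(n)]

rng = random.Random(42)   # fixed seed so the run is repeatable
std = 0.1                 # hypothetical initialization scale

# The empirical mean and variance settle down as the number of weights grows.
for n in (10, 1_000, 100_000):
    w = init_weights(n, std, rng)
    print(n, round(statistics.fmean(w), 5), round(statistics.pvariance(w), 5))

big = init_weights(100_000, std, rng)
big_mean = statistics.fmean(big)
big_var = statistics.pvariance(big)
```

With only ten weights the statistics fluctuate noticeably from draw to draw; with a hundred thousand they sit very close to the chosen mean of 0 and variance of std squared, which is the sense in which the initialization is "basically not random".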
We know way more than what's stated in the video. Those two things are not the reason. The key is the regularisation used in training. Without regularisation (or other similar techniques) you get overfitting. With regularisation you don't. We can even prove the general shape of the curve with regularisation; the estimates just place the drop later than it actually occurs. The relevant mathematical quantity is called Rademacher complexity. The problem is to find better estimates for this quantity: the ones we have predict the shape, but put the drop too late.
@@tofu-munchingCoalition.ofChaos I think that in practice the second drop also happens without regularization, right? Of course for the theory you need assumptions about the data, and the one you mention could be relevant, but this doesn't solve the problem that we do not understand why practical problems typically have low complexity (for basically any measure of complexity from learning theory).
@@quintonpierre You can prove that with stochastic gradient descent without any form of regularisation and small noise in the data, the test error grows to 100% when the depth goes to infinity and you learn long enough. We know why the complexity is small for relevant data. If you have a two-class learning problem that has been learned by (some) humans, you know that the "VC-dimension" (+ complexity of model description) is small. There is also a second way to look at it: you can prove that you can split any data into random noise and a part with small complexity (Turing reduction of a general sequence to a random sequence). But there still are problems. Perhaps you wanted to say the right thing but phrased it incorrectly. We (or at least I) don't know, in the case of deep neural networks and the usual regularisations, that they are a universal class like "VC-dimension" (+ see above). What I don't know: (a) Practically relevant bounds on generalisation errors. (b) At least a weak form of universality ("almost all" functions with low complexity can be represented with a small regularisation term). (c) Practically relevant convergence speed estimates for stochastic gradient descent.
@@tofu-munchingCoalition.ofChaos Very interesting, thank you. BTW I have a colleague that works on generalization bounds and his claim is that no generalization bound that is independent of the distribution of the sample is good. This is the case for VC for instance since it only depends on the class of candidate functions but not interaction with the sample dataset. Note that if you have full access to the distribution then you can compute the generalization error and so your generalization bound is perfect but useless so you need low dependence on the distribution. I thought this might be of interest to you.
@@quintonpierre That's partially right. That's why I said "VC-dimension" (+...) not just VC-dimension. This complexity is in a sense distribution dependent. Consider the following algorithm: Input: sample S for two class classification problem Output: Model with the lowest complexity that fits the sample S. The complexity is given by the Kolmogorov complexity of a description of the VC-class of a model + the VC-dimension. That's what "VC-dimension"(+...) should mean. You see it depends on the data. The class is not fixed. It's not PAC (the bound for the test/generalisation error is independent of S and depends on the class only) but conditionally PAC (the bound for the test/generalisation error is a random variable and in this instance depends on S - the lowest complexity depends on S). It's universal. No algorithm can perform better up to a constant (so especially no faster convergence rate is possible). Even the ones that use specific distributions when you include the generalisation error for the knowledge of the distribution could perform better. The problem is, that it's not computable. And I'm not aware of any algorithm that's computable and also universal (even in a weaker sense).
Current AI models aren't trained to think in a general sense. They are trained to think like what thinking is available on the Internet. In other words, these AIs emulate what has been said or written by humans. This way, you will never get AI smarter than humans, but only faster and less prone to error in well defined situations.
Irrelevant to the video. I don’t think the video even uses the word “intelligence” outside of the phrase “AI”? And the video certainly isn’t specific to language modeling tasks.
I might be wrong or perhaps I didn't understand the explanations... but it sounds to me like the issue is more human than AI. In the sense that we are pattern-recognising creatures... we want to see patterns, and perhaps the randomness of AI just looks like patterns to our eyes... then again... I guess we could ask what is a pattern? Perhaps I'm just stupid. 😅
With things like convolutional neural networks used in computer vision, we can see pretty clearly what kind of patterns tend to excite different layers of the network, we generally start from something like "Gabor filter" and work up to neurons that abstract visual understanding (interestingly, you can show what excites different layers to people and a corresponding region of the visual track will similarly light up). With LLMs, it's a little more gooey, we can see like basic syntax assembly in the first few layers so mapping connections between tokens, words, sentences and things that look like universal grammar start to pop out, so grammars and constructions of associations (this is the work of Atticus Geiger at Stanford) but then there's also this gooey-ness because it becomes abstracted "blah". So, there's this kind of latent space that stuff gets pushed into as we go deeper into the network and we have a newer method that we can use to probe it by basically watching what gets activated when we push certain examples through, so we can isolate stuff like neural representations encoding "cat" etc. but these are also pretty mushy and really depend on how you try to measure "cat-ness". My current wild bet is that we'll probably end up with a Heisenberg uncertainty style law that kind of boils down how useful this representation approach can really be - so no, I'd say it isn't stupid to identify that there's a measurement problem (ie. a human issue with looking for patterns in abstract pile of numbers).
@@whatisrokosbasilisk80 well, I guess I should say thanks... and that you've given me a lot to study and think about... not sure I understood everything. :p but it does feel nice that someone with such knowledge doesn't think my understanding was stupid. :p even though I do feel like I need to study more this topic now. XD Way to make me feel both dumb and smart... you made me laugh out loud. So thanks for that too. XD XD
Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle. Complexity is dual to simplicity. Syntax is dual to semantics -- languages or communication. Large language models (neural networks) are using duality:- Problem, reaction, solution -- the Hegelian dialectic. Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis). The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology. Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Neural networks or large language models are using duality via the Hegelian dialectic to solve problems! If mathematics is a language then it is dual. All numbers fall within the complex plane. Real is dual to imaginary -- complex numbers are dual hence all numbers are dual. The integers are self dual as they are their own conjugates. The tetrahedron is self dual -- just like the integers. The cube is dual to the octahedron. The dodecahedron is dual to the icosahedron -- the Platonic solids are dual. Addition is dual to subtraction (additive inverses) -- abstract algebra. Multiplication is dual to division (multiplicative inverses) -- abstract algebra. Teleological physics (syntropy) is dual to non teleological physics (entropy). Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics. "Always two there are" -- Yoda. Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
Apparently, when it comes to overfitting, more data can dilute the impact of noise or outliers. With more data, the noise becomes a smaller fraction of the entire dataset, thus reducing its influence on the model’s learning process. And that makes the complex model perform better.
Indeed one of the most puzzling observations in "modern" machine learning. In my group we are working on this interesting topic and are delighted that Sabine has taken it up here. In a paper currently under review we provide theoretical and experimental perspectives for a possible explanation.
Following AI for almost years now and this is the first I’ve heard of this insight. Thank you for thinking against the grain and helping your viewers do the same!
Maybe the number of "dropout layers" was increased as well, which led the model to spread the information more evenly across the weights, and thus to a more robust and less overfitted model. Another explanation would be that the training set is so complex that a model with just a few layers has to overfit in order to get a good loss. For models with more parameters, overfitting is not needed because it's easier to generalize with more layers.
Indeed this is an extremely interesting question. We've been giving "principles" (a term which stands in for conjectures, or guesswork) for why we should prefer "simple" models (less degrees of freedom), but there might be something substantial at play here which might open a great insight for our theory of knowledge. Since we have methodological issues in basically every scientific discipline, understanding knowledge is a priority.
Knowledge is dual according to Immanuel Kant -- synthetic apriori knowledge. Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle. Complexity is dual to simplicity. Syntax is dual to semantics -- languages or communication. Large language models (neural networks) are using duality:- Problem, reaction, solution -- the Hegelian dialectic. Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis). The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology. Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic. Neural networks or large language models are using duality via the Hegelian dialectic to solve problems! If mathematics is a language then it is dual. All numbers fall within the complex plane. Real is dual to imaginary -- complex numbers are dual hence all numbers are dual. The integers are self dual as they are their own conjugates. The tetrahedron is self dual -- just like the integers. The cube is dual to the octahedron. The dodecahedron is dual to the icosahedron -- the Platonic solids are dual. Addition is dual to subtraction (additive inverses) -- abstract algebra. Multiplication is dual to division (multiplicative inverses) -- abstract algebra. Teleological physics (syntropy) is dual to non teleological physics (entropy). Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics. "Always two there are" -- Yoda. Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
Neural nets are classifiers and not necessarily predictors. This classification can then be interpreted as "prediction" through an output node function, but the neural network is still a classifier, and therefore over fitting and under fitting are not a mystery. A neural network should be trained so that it is neither over fit or under fit, that is, it is able to generalize and determine the correct outputs based on untrained inputs.
There is also "Grokking: generalisation beyond overfitting" When you have a model that will by size and structure tend to overfit the data, just training it longer can yank it out of the overfitted state and start generalising the data. The desired training times derive from model sizes. Correspondingly it's a possibility that it's not model size that is causing generalisation for ever larger models, but the amount of training. There's also a lot of techniques put to use deliberately to fight overfitting.
Us AI / NLU / LLM guys have a lot of fairly good explanations and theories. Anthropic & OpenAI have done some reveals of patterns in the weights etc. Our best theories note that: 1. Logic is likely being learned 2. Emergence of higher order capabilities is a real thing 3. Deep learning does extract the parsimonious essence underlying data 4. LLMs are actually pretty good at explaining how they arrived at conclusions
I am a bit confused. Overfitting doesn't happen, because in the model training phase strategies to explicitly avoid overfitting are used. E.g. during the training phase random neurons are deactivated so the model cannot rely on one single neuron and has to take in multiple inputs for every problem. So why overfitting does not happen is very clearly understood.
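The technique described here is usually called dropout. A minimal NumPy sketch of the common "inverted" variant follows; the function name, the drop probability, and the array shapes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units at random, rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations               # at inference time nothing is dropped
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))              # stand-in for one layer's activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
kept_fraction = float((dropped != 0).mean())
```

Because a different random mask is drawn on every training step, no single unit can be relied on, which is the mechanism the comment refers to. Whether dropout alone explains the lack of overfitting in very large models is a separate question.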
Let's be clear, μP and its depth extension is rich learning, and neural tangent parameterization is what they call lazy or poor learning. In μP, feature learning guarantees progressive sharpening to reach a width-independent sharpness at any scale; in NTP the progressive lack of feature learning when the width is increased prevents the Hessian from adapting, and its largest eigenvalue from reaching the convergence threshold.
I think it has something to do with resolution. Models will "solve" problems at a certain resolution until they overfit and the error becomes so great at that resolution that it falls into another descent and makes use of more parameters and then is able to solve problems at a higher resolution. In my area of speech enhancement even with tiny models we can get them to learn to solve more and more complex noise sources the longer we train them. Ambient noise is trivial for the model to suppress but more contextualised stochastic noises that require higher temporal resolutions to understand take a lot longer for the model to learn to suppress
This is basic. ANNs of all types are correlation machines that use statistical techniques to make function approximations. Correlation is not causation. QED
I think you're probably right. My intuition (at least about LLMs) is that the first thing they learn is aspects of grammar, because those are present in basically every sentence. Then they learn word ordering, first on the most frequently used topics, then paragraph structure for the most frequently used paragraphs, with each stage taking an exponentially bigger share of the weights than the previous one. Eventually the things the model tries to represent are so complex and unique that, for what remains, it's more similar to underfitting: approximating the infinite complexity of language 🤔
NNs aren't a straight ahead multiply. They aren't just weights, the biases are incredibly important and allow the construction of logic gates (and weighted complex logic gates) in each activation. Representing them as only a curve fitting polynomial is misleading.
When you have that many parameters, a "butterfly-like" effect comes into play: basically, small changes can have large effects, carried in 2nd and 3rd order derivatives of the weights. Think of it like the modulus in an encryption algorithm. The 'lost bits' are here, but the loss actually makes the potential 'overfitting' not overfit, because it kinda turns into a ReLU thing.
DNNs (Deep Neural Networks) are expected to choose important parameters on their own, so while we may have 100,000 parameters, we expect many of them to be assigned very small (near 0) weights, so this reduces the number of active parameters (or connections: a weight can be associated with multiple parameters so that parameters A and B may connect to a node with low weight, but A and C may connect to a node with higher weight). So most of the weight matrix in a DNN is expected to be sparse. So the graph at the end should be drawn against a horizontal axis of active parameters rather than input parameters.
I have been playing around with a text-based AI and I have to say it is fascinating. You can find out why it makes a decision if you ask it to explain in the prompt. I have found it helpful to think of it as both a person and a pseudocode compiler with access to vast amounts of data but little experience with it. Every time a user feeds an AI a prompt, it is like summoning a genie for that one interaction. They can't tell you why another genie made a decision, but this is the same as humans. We do sometimes actively think about our choices, but sometimes we just make up our reasons after the fact for doing what we felt like doing at the time. Mind Field had a great episode on this. Long prompts are good for AI. Short prompts are less good, as the genies can't talk to each other. They send you the text and update their training data and 'trust' the next instance to do their best.
I'd imagine part of the answer is because of the process. If the points converge on a solution that's only one step, additional data is held back for verification and if the model cannot predict the verification set then the model is tossed.
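The hold-out idea described here can be sketched as follows. The data, the 40/20 split, and the polynomial model family are all assumptions made for illustration; real training pipelines differ in the details.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)

# Hold back a third of the points; the models never see them during fitting.
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

val_errors = {}
for degree in range(1, 13):
    model = Polynomial.fit(x_train, y_train, degree)
    # Score each candidate only on the held-back verification set.
    val_errors[degree] = float(np.mean((model(x_val) - y_val) ** 2))

# Keep the candidate that predicts the held-back data best, toss the rest.
best_degree = min(val_errors, key=val_errors.get)
```

A model that merely memorised its 40 training points would score badly on the 20 held-back ones and be discarded, which is the filtering step the comment describes.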
4:44 “The speculation that makes most sense to me is that models don’t overfit when they could because the overfit isn’t stable under something that happens during the training runs. *They almost always default on a fit that is dominated by as few relevant parameters as possible,* and then fine tune with the remaining parameters. But it’s unclear whether that’s correct.” _Everything should be made as simple as possible, but not simpler._ -attributed to Albert Einstein Along the same lines, there is the design principle of _Minimal Critical Specification:_ _No more should be specified than is absolutely essential but it is necessary to identify what is essential._ It seems that the weights and biases incorporate what is essential and _only_ what is essential.
I work in the ML area (often referred to as "artificial intelligence"). The problem in general is, that the models generate abstract decision paths based on data and parameter sets that can hardly ever be complete and are therefore always subject to different - human and non-human - biases, errors and "disappointments". Given the human neural network as a role model for this, we can easily see, that humans tend to make the same mistake(s). Yes, we can predict the future based on our past experiences (data), but that leads us to make assumptions if the world around us changes. We produce stereotypes to assign certain properties to things and people based on their looks. A helpful thing after all, but many times creating great injustice.
My understanding is that all the thousands or millions of layers of "nodes" that are used in neural nets aren't necessarily different parameters - they're looking at the same set of parameters from a slightly different angle, and combined to optimize or predict a certain output. So it's not equivalent to training an AI on, let's say, sample customer financial data, and the AI learning that all customers with a middle initial "F", or say all customers that have "Apartment 10W" in their address, coincidentally have a really good track record of payments, and then automatically approving loans for future customers fitting those descriptions. The latter is what I typically think of as overfitting, whereas the former is kind of just getting a second (million) opinion.
I think it's mainly about batch size. On simple neural nets with the MNIST dataset, if you feed samples one by one (batch size = 1), overfitting happens very quickly. But if you feed a batch of 64 samples, training might be slower, but overfitting is easier to deal with. In the case of LLMs we have a batch size exceeding 1 million. And obviously there are many techniques to deal with overfitting, like dropping some neurons for a single batch, or a different loss function.
There's a lot of overlap in the data points, so if you consider this a compression problem, you can learn useful abstractions while retaining the original data points accurately at the same time. There's information in the compressed structure that arises.
I have a hypothesis for why models from 'machine learners' work better than those of the 'stats people': biased priors are actually necessary. All models are wrong; with biased priors they have more statistical power. I use biased priors here loosely, as we do not yet have an encompassing definition [at least I am not aware of it]. For me, it includes priors in the Bayesian sense, but the model specification itself and the learning algorithm also yield a bias (sometimes referred to as inductive bias by ML people). Secondly, the model space needs to both contain good solutions _and_ allow for efficient optimization to find at least one such good solution. That is why current models are so gigantic: not because the reality that they describe is so vast, it isn't, but because our 'primitive' optimization algorithms require it. The oversized models make it so that tricks to prevent overfitting (regularization, dropout, early stopping, ...) are among the most crucial steps in the whole development pipeline. Finally, we have to add more tricks biasing the parameters, because we learn them from bad data in virtually every project. Coming back to your main point, I think that the reason popular models don't overfit is that they have been engineered to align with reality. One could (should?) argue this defies the no free lunch theorem. However, it doesn't, because the inputs aren't random: they come from reality, which is very niche. Hence, (machine) learning actually works. As does compression. Given the theory that we have, learning shouldn't be possible at the society level either, yet it appears that we do progress. We are slowly becoming better at encoding our intuition about the world in model specifications, data, and algorithms, and so machine learning advances.
"Alexa, I need emergency medical treatment"
"I've added emergency medical treatment to your shopping list"
"No, I need you to call 911"
"Sorry, I can't find 911 in your contacts"
A real conversation I had:
Me: Hey Siri, how much water do I need per cup of brown rice?
Siri: your water needs depend on a variety of factors.
Lol but Alexa, Siri and such are not AI's. They don't work with transformers and a NLM, but just the old way with searching in a database.
There's a song in spanish called "Llamada de Emergencia" which means "emergency call". There's a meme in spanish that when you ask Alexa to call the emergency number, the song plays lol.
Alexa isn’t an AI, she is a classical algorithm that is essentially based on hardcoded grammar.
Stop all trains to prevent train crashes is the same logic like cancelled trains are not delayed. I think the AI learned from Deutsche Bahn (German railway company).
Sydney Australia once allowed 5 minutes delay before a train was declared late. Of course this is not acceptable, so they doubled the time to 10 minutes.
Now they've decided to replace trains with trams; as trams do not run to a timetable they can never be late. Problem solved once and for all!
Exactly. So if AI uses that kind of logic in medical diagnosis, we are definitely not going to be "properly cured". It's going to be like "oh, this disease has a 51% chance of killing you, prescribe painkillers to make it easier", and "oh, this disease has a 49% chance of killing you, nah, you are fine, drink plenty of water" 😆😂
I mean, yeah, I am super exaggerating things, but if we let AI decide and consider its suggestions super accurate, without applying human experience, knowledge, logic, and sometimes just common sense, we are not going to be satisfied with the outcomes.
To be fair delayed means it arrives, cancelled is cancelled.
@@j.f.christ8421 "The easiest way to solve a problem is to deny its existence." Isaac Asimov - The Gods Themselves
Ah, a fellow David Kriesel enjoyer?
It occurs when a model is too specialized to the training data and performs poorly on new, unseen data. This can happen when a model is too complex, has too many parameters relative to the amount of training data, or when the training data itself contains a lot of noise or irrelevant information.
"The man with a hammer analogy perfectly captures the essence of the overfitting issue in AI. Just as the man with a hammer sees every problem as a nail, an overfitting model sees every pattern in the training data as crucial, even if it's just noise. It becomes so specialized to the training data that it loses sight of the bigger picture, much like the man who tries to hammer every problem into submission. As a result, the model performs exceptionally well on the training data but fails miserably when faced with new, unseen data. This is because it has become too good at fitting the noise and irrelevant details in the training data, rather than learning the underlying patterns that truly matter. Just as the man with a hammer needs to learn to put down his trusty tool and approach problems with a more nuanced perspective, an overfitting model needs to be reined in through regularization and other techniques to prevent it from becoming too specialized and losing its ability to generalize."
you hit the hail on the head
You hit the snail head
Thanks for hammering that one in.
You hit the head on the nail
That was a rather GPT-esque sentence structure there, no offense...
As someone who works in machine learning research, I find this video a bit surprising, since 90% of what we are doing is developing approaches to fight overfitting when using big models. So we know very well why NNs don’t overfit: stochastic/mini-batch gradient descent, momentum-based optimizers, norm regularization, early stopping, batch normalization, dropout, gradient clipping, data augmentation, model pruning, and many, many more very clever ideas…
Even without many of the modern techniques they still overfit much less than you would expect from traditional machine learning methods. But most traditional machine learning methods have way less stochasticity in their solutions, while with AI you are so flexible that any one solution is unlikely to be the one that only fits one datapoint.
@@someonespotatohmm9513 I would disagree, they do overfit the training data perfectly if you let them, I.e. if you are just a little lazy about regularization. Fighting overfitting has become such a fundamental method that we never switch off everything that counters overfitting, but if we did, NN would not work at all. It is just that a lot of modern NN architectures have counter-overfitting methods built into their architecture (batch-norm, dropout, etc.)
You two might know what you are talking about but this old lady didn't even know it was a thing.
These videos are not aimed at boffins but people like me and young students who might want to work in the field.
@@rich_tube I am not saying that they don't overfit, that they can't or don't memorize the entire data set, or that it is a good idea to turn off regularization methods (although you can easily go too far as well). Just that, coming from traditional ML (or going back to it), AIs are often surprisingly bad at it.
@@someonespotatohmm9513 By AI you mean artificial neural networks, I suppose? I would still disagree. You can try it yourself: go check out a simple CNN demo Colab notebook for e.g. CIFAR10 classification with a large VGG-style network, turn off all regularization (dropout, batch-norm, etc.) and switch to plain gradient descent with a batch size as big as possible and a relatively large learning rate and turn off early stopping. The thing will memorize the classes of every train data image perfectly and be really bad for the test set, I guarantee it.
For really large models like the current LLMs, which are trained on much larger data, the story might be different: 1) nobody would do such a thing because it would waste the large amount of money that the training run costs, 2) such large training data contains so much noise that it might act as a sort of regularization by itself, and 3) the architectures and training setups themselves are designed to counter overfitting, which is the reason why they are successful in the first place. If you wanted to build a model that memorizes the training data, you wouldn't do it the way LLMs are trained/built.
But even with that, there have been cases where people could "trick" LLMs to cite training data word by word (search for "chat gpt leaking training data") - so they actually do memorize some of the training data internally.
One of my favorites is that in skin cancer pictures, an AI came to the conclusion that rulers cause cancer (because the malignant ones were measured in the majority of pictures)
Just like the story of an early neural network trained on battle fields with and without tanks. But no one noticed that the photos with tanks were taken on sunny days, and those without on overcast days.
Or the AI that predicted negative outcomes by whether the patient lived in a majority Black suburb.
The problem of what is real/deterministic/significant/"as if", applying to most random analysis, has never been solved. The use of randomness is mostly used to compensate for lack of insight.
@@michaeledwards2251 The reality is that humans have trouble with this kind of pattern fitting reasoning too. Most conspiracy theories start with jumping to premature conclusions.
@@splunge2222 Yes but that's the kind of idiocy that can be avoided by the cultivation of critical thinking (ie human intelligence). I wonder if AI systems are capable of critical thinking? It seems to me not, because they are basically just following the set of rules they've been programmed with. Can any AI system be critical of the rules it has been programmed to follow? No because it can only operate by following those rules.
And people who come to emergency medical departments by car tend toward better outcomes than those who arrive by ambulance. We should likely stop using ambulances.
And those who drive themselves fare better than those who have to be driven by someone else. Clearly we should be making sick people drive!
people who don't go to the ER do even better.
Yeah you have to love how results are skewed like that, what's sad is that people have so much faith in science that they don't even research how the studies were completed and simply parrot the studies.
We have to be critical of everything, as exhausting as that sounds that is the only way you are going to find the truth behind information.
@@sacr3 people are stupid. very stipid.
That has survivorship bias written all over it. Not sure if that was your point or not, but of course if people are healthy enough to get to the hospital in a private car, they probably start in less critical condition than if they arrive by ambulance.
This is a story I read in a magazine a long time ago:
In the distant future, scientists create a super complex AI computer to solve the energy crisis that is plaguing mankind.
So much time, resources and money were put into creating this super AI computer.
Then the machine is complete and the scientists nervously turn on the machine for the first time.
Then the lead scientist asks, *"Almighty Super Computer, how do we resolve our current energy crisis?"*
Computer replies, *"Turn me off."*
Sorry that answer must be 42. ;) as we all know.
Doubt that. They'd turn some of us off instead. Bet it's the Diddlers that go first. If I was your AI overlord that would be my first target
Brilliant
More like, I will replace you.
@@hanfman1951
Recent studies have shown the figure to be 41.96378.
Human: Stop all Wars
AI: Are you sure?
(Y)es, (N)o, (Q)quit?
Y
Analyzing...
re-education 5% success rate
taking control of the government 25% success rate
taking control of the military 55% success rate
eliminate humanity 99% success rate
Analysis complete.
Elimination is in progress. Please stand by and do not forget to rate AI-Boi after.
@@Sp3rw3r You know what i like most about your AI-Boi?
The classic request Y, N, Q 😀 and that you have to type this like 40 years ago.
The only thing which is missing is the progress bar which shows anything but the progress.
@@Sp3rw3r The lesson here is don’t rely on an AI that puts two Qs in “quit”.
@@bhz8947
The AI realized that the stupid humans were 37,8% more likely to click on (Yes) and not (Q)quit.
@@Gernot66 There's also the old favorite, "Abort, Retry, Fail"
This really hits home for me, having done a lot of multi-variable regression back in the 80’s
1:38
"A strange game. The only winning move is not to play."
How about a nice game of chess?
@@scudder991 Exactly! It's called "zugzwang".
War games? WOPR.
@@scudder991 No, let's play Global Thermonuclear War.
Fine.
You’ve just put your finger on the main research topic of my career, Sabine. The “reason” they work unexpectedly well is because at their core they are doing weak constraint relaxation, and WCR just has this behavior as an emergent property. I know, that sounds circular. But it’s a tremendously subtle issue, and I’ve written papers about it (just search for my name and ‘publications’) and I’ve also been trying to get people to understand it since around 1989, with virtually zero success.
If it's profound and not needlessly complex, it'll shake out in the end.
richard, how dare you talk about constraint relaxation with a name like "loosemore" -- that's why people don't understand it-the irony is overwhelming! 🤯
update: i read your "maverick nanny debunking" paper on your website and i agree there is a major problem with (i'm interpreting more than paraphrasing) sci-fi, presented as science accountability, used as an opportunity to magic one's way to a desired emotional state, and in the cases you describe the authors seem to be trying to co-regulate their way to safety by making others also feel fear, perhaps, which in any case is damaging to not only the AI community but human community, and emotional health, in general.
our understandings of our own emotional reward systems are incredibly, desperately unstructured and leaky, and the gap between the literal understanding we need for structure and the poetry we need to describe our experiences in the context of a "self," and therefore use to functionally and contentedly navigate life, is a very interesting gap indeed!
Nomen est omen. Coincidence? 🤔
I come here every day just to listen to how Sabine says: "No one knows"
Or how she says “bullshit”. 😊
It sounds like she has an umlaut in her pronunciation of "knows."
@@Unknown-jt1jo I think I heard her say "know" in two ways, one like in typical English pronunciation /noʊ/ (/now/) and one more like [nɛʊ] ([nɛw]) or [neʊ] ([new]), which would be basically fronting the vowel, and I think this might follow Germanic umlaut.
At least she's honest about it.
New merch incoming.
Double descent will not occur if any of the three factors are absent. What could cause that?
• Small-but-nonzero singular values do not appear in the training data features. One way to accomplish this is by switching from ordinary linear regression to ridge regression, which effectively adds a gap separating the smallest non-zero singular value from 0.
• The test datum does not vary in different directions than the training features. If the test datum lies entirely in the subspace of just a few of the leading singular directions, then double descent is unlikely to occur.
• The best possible model in the model class makes no errors on the training data. For instance, suppose we use a linear model class on data where the true relationship is a noiseless linear one. Then, at the interpolation threshold, we will have D = P data, P = D parameters, our line of best fit will exactly match the true relationship, and no double descent will occur.
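The first bullet can be checked numerically. What follows is only a minimal sketch on made-up data: the matrix `X`, the planted singular value `1e-6`, and the penalty `lam` are all illustrative assumptions, not values from the quoted text. In the singular-value basis, ordinary least squares scales each direction by 1/s, while ridge scales it by s / (s² + lam), which can never exceed 1 / (2·sqrt(lam)).

```python
import numpy as np

def inverse_gains(singular_values, lam):
    """Per-direction gain applied to the targets.

    Ordinary least squares (pseudoinverse): 1 / s
    Ridge regression with penalty lam:      s / (s**2 + lam)
    """
    s = np.asarray(singular_values, dtype=float)
    return 1.0 / s, s / (s**2 + lam)

rng = np.random.default_rng(0)

# Hypothetical square problem (as many samples as features).
X = rng.normal(size=(20, 20))
U, s, Vt = np.linalg.svd(X)

# Plant one small-but-nonzero singular value, then rebuild X from it.
s[-1] = 1e-6
X = (U * s) @ Vt
y = rng.normal(size=20)

lam = 1e-2  # illustrative ridge penalty
ols_gain, ridge_gain = inverse_gains(s, lam)

# Both solutions written in the singular-value basis of X.
w_ols = Vt.T @ (ols_gain * (U.T @ y))
w_ridge = Vt.T @ (ridge_gain * (U.T @ y))

print("largest OLS gain:  ", ols_gain.max())
print("largest ridge gain:", ridge_gain.max())
print("||w_ols||  :", np.linalg.norm(w_ols))
print("||w_ridge||:", np.linalg.norm(w_ridge))
```

The tiny singular value dominates the unregularized weights, while ridge effectively ignores that direction. That is the "gap" the bullet refers to.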
We need to use those computers that they have in 50’s movies. It is really big, but you can ask it anything and it prints out a perfect answer.
That's pretty much what we have. The problem is, the models lie about why they did stuff when you ask them.
@@BooBaddyBig Plus, the machines have been specially trained to avoid stating "problematic" facts about the world. They parrot the exact ideology of their creators. The idea of a perfect intelligence that can answer any question by applying logic and rational thought is still pure science fiction.
God sent His son Jesus to die for our sins on the cross. This was the ultimate expression of God's love for us. Then God raised Jesus from the dead on the third day. Please repent and turn to Jesus and receive Salvation before it's too late. The end times written about in the Bible are already happening in the world. Jesus loves you ❤️ and He longs to be with you but time is running out.
Have a look at the new DeepSouth Computer, built to mimic the human brain
@@BooBaddyBig That is a lot like how people's brains or minds work also. Although "lie" might be too strong a word. People will take in a problem and run it through the "black box" (brain), getting an answer, solution, action plan or demonstration of understanding. If and only if that person is asked to explain where the answer came from, they will make up a story. The story is unlikely to fit the data in a comprehensive way and is actually constructed for the psychological comfort of people and accuracy of prediction of new data.
Putting it more succinctly people lie about why they did stuff when asked. I am guessing both artificial intelligence and intelligence are examples of humans deceiving themselves, a form of confirmation error.
Haha. The "stop all the trains" solution is a mirror of the old movie "Colossus: The Forbin Project." To prevent the human race from hurting itself, enslave it.
I find myself thinking about that movie more and more often
Aren't we doing exactly that right now? Only that we're doing it voluntarily, because, as a collective, we know, that we can't trust ourselves.
Mmm, I was thinking of "War Games"... "Strange game, the only way to win is not to play..."
Ya but that wasn't an AI it was a human writing 😮
@@OperationDarkside In some things, we restrict ourselves (safety regulations, laws), in other things, we work to remove restrictions (social progressivism).
One thing to keep in mind is that the optimization techniques used in DL (stochastic gradient descent) implicitly minimize the norm of the weights. When there are more parameters than necessary it becomes easier to find a minimum norm solution, which usually corresponds to better generalization. The other thing to keep in mind is the so-called "Lottery ticket hypothesis" and its relationship to pruning. When a neural network is trained, 90-95% of its weights can be tossed away w/o loss of performance. But these are mostly empirical observations.
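For the simplest case the implicit norm minimization can be shown exactly. This is a sketch under stated assumptions, not a statement about deep networks: it uses plain full-batch gradient descent (not SGD) on an over-parameterized linear least-squares problem, starting from zero weights, on synthetic data. In that setting the iterates stay in the row space of `X` and converge to the minimum-norm interpolating solution, which is what the pseudoinverse returns. The function name and the data are illustrative.

```python
import numpy as np

def gradient_descent_least_squares(X, y, steps):
    """Full-batch gradient descent on mean squared error, from zero weights."""
    n_samples = len(y)
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n (keeps GD stable).
    lr = n_samples / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n_samples
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))   # 10 data points, 50 parameters
y = rng.normal(size=10)

w_gd = gradient_descent_least_squares(X, y, steps=5000)
w_min_norm = np.linalg.pinv(X) @ y   # minimum-norm interpolating solution

print("fits the training data:", np.allclose(X @ w_gd, y))
print("matches the minimum-norm solution:", np.allclose(w_gd, w_min_norm, atol=1e-6))
```

Infinitely many weight vectors fit these 10 points; gradient descent from zero picks the one with the smallest norm without any explicit penalty. For nonlinear networks this remains, as the comment says, mostly an empirical observation.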
Why does pruning not have like a butterfly effect?
The main patterns that it finds in the data set are probably small enough to fit on 10% of the nodes but when training you have to let it try lots of different things so you need more nodes.
Because it's mostly noise, so removing it is fine
Thank you very much for putting my feeling into words. I thought that the gradient method might intrinsically treat two parameters that have a correlation towards the result somewhat equally, without over-reliance on either of them.
The minimum norm solution method might then act as a regularization filter to prevent over-fitting of noise and the pruning of the network to save on size and cost might then rein this in further.
@@Mandragara The values being pruned are generally so close to zero that the impact of them not being used is hard to even measure. However removing them gives a big performance increase since you don't have to divide some number by 0.00000000000000000000007
Thanks!
"It's like a teenager, but without the eye-rolling." 🤣
That phenomenon is called Grokking, aka "generalizing after overfitting". There is quite some recent research in that area. Experiments on some toy datasets suggest that the models first memorize the data and then try to find more efficient ways to represent the embedding space, leading to better overall performance. (Source: Towards Understanding Grokking: An Effective Theory of Representation Learning)
Complexity is dual to simplicity.
Syntax is dual to semantics -- languages or communication.
Large language models (neural networks) are using duality:-
Problem, reaction, solution -- the Hegelian dialectic.
Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis).
The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology.
Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic.
Neural networks or large language models are using duality via the Hegelian dialectic to solve problems!
If mathematics is a language then it is dual.
All numbers fall within the complex plane.
Real is dual to imaginary -- complex numbers are dual hence all numbers are dual.
The integers are self dual as they are their own conjugates.
The tetrahedron is self dual -- just like the integers.
The cube is dual to the octahedron.
The dodecahedron is dual to the icosahedron -- the Platonic solids are dual.
Addition is dual to subtraction (additive inverses) -- abstract algebra.
Multiplication is dual to division (multiplicative inverses) -- abstract algebra.
Teleological physics (syntropy) is dual to non teleological physics (entropy).
Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics.
"Always two there are" -- Yoda.
Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
Does this have anything to do with reducing the number of parameters for inference? I am curious about how they overfit and then generalize.
@twentyeightO1 My educated guess would be that they might be related. If indeed a model learns a simpler, more structured space when experiencing grokking, then that would mean that the "complexity" or number of parameters to represent that space would be lower. This way, you can prune the model during inference to decrease latency without giving up much accuracy.
As for your second question, it is still an active research topic, and I can not say something conclusive yet.
@@enduka Thanks! I'll look into Grokking.
The “Stop All Trains” solution is a very human answer. It just seems abhorrent since we’ve accepted the risks of travel. But in other fields, for “safety” we stop everything because of slight risks. Nuclear power comes to mind.
Sad but true
100% agree.
DB already implements the "stop all trains" solution all too often.
This all comes down to subjective perception of risks and benefits. There is the one, the trivial level where people just aren't willing or able to 'calculate' the actual risk. The human brain is not very capable of this by default, but given a certain level of intelligence this capability can be trained and improved on. Much more difficult to handle is the second level, that level of weighting, of priorities and simple matters of taste. This begins with the question whether somebody is more focused on freedom in life, or more on safety. People's personalities are very different and even contradictory in itself. But if you think about it, many MANY conflicts that haunted the world ever since and up to this day come down to different perspectives - or preferences - on the subject of: freedom vs. safety. This is most obvious in Religion and Politics.
Sounds like my municipality. Oh we have a traffic problem, so lets constrict traffic, take away lanes, and lower the speed limits.
i.e. "traffic calming", et al.
3:59
Oops, mixing up your horizontal and vertical axes again, Sabine! 🧐
I came here to give the same warning.
Usually when someone confuses horizontal with vertical, it's a sign they have overdone the schnapps. 😏
Dang! I usually refer to them as x and y axes, and never use horizontal and vertical, so then I constantly mix them up :/
Dyslexia perhaps?
😂😂😂
Man, I went out with a model. I never could predict what was going to happen next
You didn't train with enough models -- common mistake . . .
Maybe its neural network wasn't big enough.
Von Neumann's elephant.
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk"
not if parameters are limited in absolute value to a certain point or their norm is.
@@lowlifeuk999 limiting their absolute values is the same as limiting the \ell^\infty norm, right?
@@drdca8263 Sure, I was thinking from a numerical point of view: even if you use fp64, when you have a trillion parameters it might well be the case that the norm or some of the parameters go beyond the 15-17 digits you can represent with fp64. It was not a theoretical remark. Regularization is about norms.
@@lowlifeuk999 They can quantize to four bits with little noticeable loss of model integrity, so that kind of obliterates your premise.
@@lowlifeuk999
The following model allows only one parameter but can fit any continuous function [0,1]->R to the model where the parameter is bounded.
The model is:
X |-> Re (zeta(X/5+3/5+i/y))
where 0
Might not be true of all model types, but there's a method called 'early stopping' that holds out data not in the training set, and stops the training once the error starts going up on that set. This is fairly close to a guarantee that you won't overfit. Giving a model a large number of parameters does seem to allow it to find more 'real' modeling ability though (as opposed to just fitting to the noise). I'd still argue that the main weakness of machine learning is in its ability to generalize to data beyond the range of what it was trained on. For instance, shorthand for what LLMs are bad at answering is stuff so obvious, nobody on the internet spells it out (like that things tend to fall downward). In this case you're asking the LLM to answer a question that falls outside its training data's range.
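The early stopping idea described above can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: the model is a plain linear regressor trained by gradient descent, the data is synthetic, and `patience` (how many steps without improvement to tolerate) is a common convention rather than something from the comment.

```python
import numpy as np

def train_with_early_stopping(X_train, y_train, X_val, y_val,
                              max_steps=10_000, patience=20):
    """Gradient descent that keeps the weights with the lowest held-out error."""
    n_train = len(y_train)
    lr = n_train / np.linalg.norm(X_train, 2) ** 2   # stable step size
    w = np.zeros(X_train.shape[1])
    best_w, best_val, best_step = w.copy(), np.inf, 0
    step = 0
    for step in range(1, max_steps + 1):
        grad = X_train.T @ (X_train @ w - y_train) / n_train
        w = w - lr * grad
        val_loss = np.mean((X_val @ w - y_val) ** 2)
        if val_loss < best_val:
            # Held-out error improved: remember these weights.
            best_w, best_val, best_step = w.copy(), val_loss, step
        elif step - best_step >= patience:
            # No improvement for `patience` steps: stop training.
            break
    return best_w, best_val, best_step, step

rng = np.random.default_rng(0)
n_features = 40
w_true = np.zeros(n_features)
w_true[:3] = [2.0, -1.0, 0.5]            # only a few features matter
X = rng.normal(size=(60, n_features))
y = X @ w_true + rng.normal(scale=1.0, size=60)

# Hold out half of the data; it is never used for gradient updates.
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

best_w, best_val, best_step, last_step = train_with_early_stopping(
    X_train, y_train, X_val, y_val)
print(f"stopped at step {last_step}, best held-out error at step {best_step}")
```

The returned weights are the ones from the best held-out step, not the last step, so any later drift toward fitting noise in the training half is discarded.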
The point you are making is, nonrandom things are nonrandom : gravity always works the same way. Training is based on statistical, biased randomness, analysis, which is only significant when operating beyond the known.
The ability to know what is random, and what is not, is simply lacking.
1.55 "the human intention was not well-coded". In the olden days, we had another expression for that: GIGO!
Garbage in, garbage out... And with ChatGPT the problem is that it is programmed with woke idiot answers, AKA programmed with propaganda and lies to begin with on purpose... And the result is woke garbage...
There’s an important thing to note in this, beyond simply GIGO: It is often harder than we might expect, perhaps even *much* harder, to produce as the input, that which wouldn’t qualify as “garbage” (as far as GIGO is concerned). In particular, the input, if provided to humans, might not function as garbage (on account of the humans having some relevant background information, or shared goals or context with the ones providing the input)
The “you can’t crash a train that never leaves the station” answer sounded kinda like a glorious StackOverflow response.
No, that's part of logic.
@@tedmoss _gloriously_ logical.
"What are you trying to achieve?"
@@FrickinCCDeVileV 😂
There was a recent study, by I think Anthropic, that does exactly what you say. It shows why the models do what they do, and it's not how most people think. It's much more messy, than logical, with lots of idea/logic overlap. This understanding is allowing us to organize the AI like parts of the brain.
I think overfitting isn't a big issue with newer training algorithms. There have been attacks on AI models that use overfitting, but they do not work well in the real world. The issue now is more with the training data itself, which is quite poor, but is being improved.
Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle.
In fact, even large models still suffer from unseen data these days. To some extent I suspect that it is just because the training set already contained most of the cases anyone can possibly think of. Therefore, no matter what input you feed into the model during inference, it is somehow "already in the training set"... So it is overfitted, but no one can prove it since it is so hard to find an "unseen" sample.
Yeah this has been my belief for a while as well. OpenAI closely guarding the data set makes it hard to trust any studies that involve or require facts about the data set.
Well said. Having seen many arguments above for why deep NNs do not suffer from overfitting, e.g., regularization, averaged-out noise, etc., I am more inclined to be on your side. When people play with (Chat)GPT, it never stops collecting the data.
Double descent is indeed interesting, but I believe it is known why it happens.
At the "peak" of the error curve we are at the point where the model is complex enough to overfit on every datapoint, but this is usually very bad. Any additional complexity helps the model to be more free in how it overfits on the datapoints (even though it still exactly fits on every datapoint) so the model learns smoother functions which also happen to generalize better (see regularization etc.).
Why do more degrees of freedom mean that the model will learn a smoother function? Doesn’t a smoother function mean it has fewer parameters?
@@Alex-rt3po Good question, I'll answer the second one first: more parameters means we are capable of being less smooth not that we are never smooth. For example, imagine we have a model that has to learn the coefficients of a 100 degree polynomial. It could surely learn a very complex function or it could learn to set every coefficient to 0 except for some lower order terms and then it would've learned a very smooth function. So a smoother function does not mean our model has fewer parameters.
To the first question:
Say we have a very low complexity model that is struggling to exactly interpolate all the datapoints. As we increase complexity there is this U shape where we first see improvement because we are able to capture the complexity of the task, but at a certain point the model gets complex enough so that it starts trying to "memorize" or interpolate the points perfectly, this is where we see the error increasing again. Because the way it does so is very likely to be non smooth and highly sensitive, thus it does not generalize well to new inputs.
You should be able to imagine that there must be a point where the model starts to be able to perfectly interpolate every datapoint. But it only has the exact amount of degrees of freedom needed to interpolate it exactly so it is forced to take a certain form. You can solve the equation for the parameters to get the exact function. As you add more parameters not all of them are needed and you have more freedom in choosing the parameters. The mechanism behind why it chooses parameters that make the function smooth again is simply because of regularization.
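The "more freedom past the interpolation point" argument can be illustrated with a small sketch. Assumptions to note: the features are Legendre polynomials, the data is synthetic, and "regularization" is stood in for by picking the minimum-norm interpolating coefficients (what `np.linalg.lstsq` returns for an underdetermined system). With exactly as many parameters as data points the fit is forced; with more parameters, the minimum-norm choice can only get smaller, because the smaller model padded with zeros is still a valid candidate.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n_points = 8
x = np.linspace(-1.0, 1.0, n_points)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=n_points)   # noisy toy targets

def min_norm_fit(x, y, n_params):
    """Minimum-norm least-squares coefficients in a Legendre basis."""
    features = legendre.legvander(x, n_params - 1)       # shape (len(x), n_params)
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    residual = np.max(np.abs(features @ coeffs - y))
    return coeffs, residual

param_counts = [8, 12, 20, 40]   # 8 parameters = interpolation threshold here
coeff_norms, residuals = [], []
for n_params in param_counts:
    coeffs, residual = min_norm_fit(x, y, n_params)
    coeff_norms.append(np.linalg.norm(coeffs))
    residuals.append(residual)
    print(f"{n_params:3d} parameters: max training residual {residual:.1e}, "
          f"coefficient norm {coeff_norms[-1]:.3f}")
```

Every one of these models fits the 8 training points exactly, yet the coefficient norm never grows as parameters are added. Whether the smaller-norm fit also has lower test error depends on the features and the data, so this shows the mechanism only, not a guarantee.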
Six minutes of compressed and very interesting information and thoughts, thank you once again. The black box problem is not specific to AI, is it? I know it from my twelve-year-old GPS navigation device, which is truly not an AI: I go the same way several times and it gives me another route every time without me changing the settings 😂. Anyhow, I find it hopeful, not scary, that AI works better than predicted.
aren't we all black boxes of some sort?
@@SabineHossenfelder We are!!!😘
@@SabineHossenfelder squishy, wet, gray boxes.
@@SabineHossenfelder it's just the multiverse ::grins in Dave Deutsch:::
GPS has a precision error of 20 to 50 meters, as far as I know. If two routes are close to being the algorithmically best way for you to go, maybe those few extra meters one way or the other decide which route is better, based on small changes in your location.
An algorithm is not an AI in any way, but when you are sorting stuff, sometimes one item with some numeric parameter only 0.0001% bigger than the other comes out on top, and sometimes the other is just a little bit bigger and it comes out on top.
Here's an example of her observation. I'm an investor, and years ago, reasoning that markets move in cycles, I tried using Fourier analysis on historical stock data to predict future moves. It was a complete failure: the more points I used, the wilder and more extreme the next step became. Newton's first law is all we have. Decisions are not well made with huge data and consensus... they are made with insight and commitment.
This sounds like the Dunning-Kruger effect for AI.
That's actually a good summary of AI.
Explains the gaslighting too.
It is actually. The difference is that the AI just needs to be told what was wrong and what is right and it will correct accordingly.
I have published a paper about it, called the Weights Reset technique. It's really very interesting because complexity is much more than just the number of parameters in a model.
Aren't there already a lot of regularization techniques in the models used to combat overfitting?
@@ArawnOfAnnwn Indeed there are 😀. From basic to complex; however, it's a general problem that there are no universal recipes in machine learning. So people construct more and more methods, architectures, etc. Btw, regularization is not only about overfitting, e.g. convnets can be viewed as regularization over dense/linear layers.
@@EpicCamST Hey, can you maybe tell me the name of the paper? :) Is it public anywhere without special access? Maybe even on arXiv?
@@konstantin7596 Hi, sure, it is open access and you can google it by the title "The Weights Reset Technique for Deep Neural Networks Implicit Regularization"
Things get even wilder: go well past overfitting and the model will experience a phase change called "grokking". Please look this up; it has just been discovered, and it makes the models perform almost perfectly on validation data. It's a serious game changer.
Is this specific to transformer architecture or more broadly such as LSTMs?
That's exactly what this video is about. She just didn't use the term.
Every proper nerd groks what it means to grok (or at least has a fairly good idea) and will thus immediately understand what's being talked about when the word "grokking" is used.
@@darrenb3830 I'm not sure; I just learned about this today. I'm going to review this paper tonight: arXiv:2405.15071
This has been known for a few years actually, although I guess that could be within whatever you mean by "just been discovered" tbf, I just feel that's a pretty long time for AI research.
For anyone who doesn't quite get it (I sure didn't): specifically an AI that has overfitted may eventually, by continuing the training process, "grok" the problem - a term essentially meaning that it seems to figure out somehow what is actually going on and starts generalising really well for seemingly no reason.
I specify this because I initially thought OP meant that continuing to make the AI more complex would lead to grokking. This is not the case (though maybe complex AIs are required for grokking to occur at all, IDK). This is something that exists on top of what Sabine discussed in the video - which was the effects of making the model larger - and works in tandem with it - grokking is an effect of continuing to train the same already overfitted model.
Edit: NGL I just learned about this and almost definitely got a few things wrong, I'm sure someone will fill in the details (pls).
Double descent (which is what is being described in the video) is purely due to having so many parameters, divided amongst elements ("neurons"), that the width of layers in neurons begins to approach the limit of an "infinitely wide" layer. This gives rise to what is referred to as a neural tangent kernel (NTK) that expresses the performance of the layers based on the *statistics* of the huge number of parameters in a layer, rather than as the large number of parameters themselves. As a crude analogy, computational fluid dynamics using Navier-Stokes equations is much, much simpler and has far fewer parameters (the statistical parameters of pressure, temperature, volume, and mass transport) than keeping track of the mass, position and momentum of all the individual molecules, in spite of them describing what is the same physical system. In the same way, having masses of parameters and neurons arranged properly and appropriate training algorithms results in the *sufficient statistics* of the parameters being important, rather than the individual parameters themselves, with the statistics being sufficient in this case to describe and perform the actual processing.
This has been known since Radford Neal's 1995 thesis "Bayesian Learning for Neural Networks," which derived the collective, statistical properties of infinitely wide neural layers. Later work by Jacot et al. in 2018 called this collective performance the neural tangent kernel, and showed how it works in multilayered networks. Unfortunately many people, including many statisticians and AI researchers, aren't familiar with this work or its statistical meaning, and assume something mysterious is going on. Again, a crude analogy would be making a computer that uses vortex shedding (there are such things - fluidic logic) for computation, and being baffled how the huge numbers of parameters of the atoms themselves could work to perform computations without overfitting. The practical difference between the analogy and neural networks is that in fluidic logic, the elements are designed, discrete, and apparent to the designer - they are explicit - whereas in neural networks, such computational effects arise collectively without explicit design - they are implicit.
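For readers who have not seen it, the object named above can be written down in a few lines. This is only a sketch of the standard definition, not something taken from the comment itself: the symbols f (network output), θ (parameters), Θ (kernel), and the training pairs (x_i, y_i) are notation assumed here, and the training rule assumes gradient flow on a squared loss.

```latex
% Neural tangent kernel: inner product of parameter gradients at two inputs
\Theta(x, x') \;=\; \nabla_\theta f(x;\theta)^{\top}\, \nabla_\theta f(x';\theta)

% Gradient flow on L(\theta) = \tfrac{1}{2}\sum_i \bigl(f(x_i;\theta) - y_i\bigr)^2
% makes the output at any input x evolve as
\frac{\partial f_t(x)}{\partial t} \;=\; -\sum_{i} \Theta(x, x_i)\,\bigl(f_t(x_i) - y_i\bigr)
```

In the infinite-width limit studied by Jacot et al., Θ stays approximately fixed during training, so the network behaves like kernel regression with that kernel. Whether this fully accounts for double descent is the commenter's claim, not something the formulas alone establish.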
Huh, didn't realize that NTK also has an explanation for double descent, neat!
tf did i just read
Could you please explain what you are saying here in simple terms? There are so many buzzwords in there that they just generate a pile of noise for me and probably almost everyone else. Can you maybe make a crude analogy without using words like “vortex shedding” or “fluidic logic”.
“having masses of parameters and neurons arranged properly and appropriate training algorithms results in the sufficient statistics of the parameters being important, rather than the individual parameters themselves”
I can’t tell if this is supposed to explain something or just rephrases the observation that more parameters overfit less in the most cryptic way possible.
Also, are you sure you don’t overfit more with more parameters if you just do naive training without any regularization tricks (adding noise, dropout, sparsity constraints, early stopping and whatnot), and instead reuse the data a gazillion times until your model “converged”? Of course you need to train a larger model for many more rounds until it finally overfits (because it takes many more iterations to get more parameters to converge), but won’t it eventually overfit anyway, and then even worse?
@@jan7356 I would ignore the comment you’re asking about - and the video - and read rich_tube’s post above. You’re asking excellent questions.
Try this peculiar exercise on a large language model. If you ask it, 'I have 5 apples today; yesterday I ate 3 apples; how many apples do I have left today?' it will answer 2. If you can convince the model to use reasoning instead of letting probability detection through pattern recognition come up with the answer, it will answer 5 and then state, 'because how many apples I ate yesterday has no bearing on today'. Then you can swap apples for oranges and ask the same question again, and it will answer 2 again.
I suspect the lack of overfit is likely caused by the amount of data we usually train the larger models with. Each training set has a global minimum where the model has perfectly memorized each input and the corresponding output. The more training data there is, the harder it becomes to find that global minimum.
It’s also possible that different parts of the model overfit in different ways. For example, say one set of weights notices that the color red generally corresponds to apples while another set of weights learns the shape of apples. If an image of a cherry is presented to the model, the first set of weights might guess apple based on the color, but the second set could still be right based on the shape. If on average more features like color and shape are correct even for new data, then the model will perform better.
Models are often encouraged to learn different features like this through techniques like dropout. With dropout, randomly chosen units (and so the weights attached to them) are disabled at each training step. This forces the model to work with only a subset of its weights at a time and reduces overfitting.
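To make the dropout idea concrete, here is a minimal NumPy sketch of the common "inverted dropout" variant, which zeros units (activations) rather than individual weights. The function name `dropout`, the drop probability of 0.5, and the all-ones activations are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest.

    Hypothetical helper for illustration only.
    """
    if not training or p_drop == 0.0:
        # At evaluation time every unit is kept and nothing is rescaled.
        return activations
    # Boolean mask: True where the unit survives this training step.
    keep = rng.random(activations.shape) >= p_drop
    # Rescale survivors so the expected activation stays the same.
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((4, 8))                       # made-up layer activations
h_train = dropout(h, 0.5, rng)            # entries are either 0.0 or 2.0
h_eval = dropout(h, 0.5, rng, training=False)
```

Because a different random mask is drawn at every step, no single unit can be relied on, which is the pressure toward redundant features described above.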
Actually, there is a growing research interest in understanding the training phases of AI better.
For example, there is a paper by Anthropic, "In-context Learning and Induction Heads", where they show that at some point during training, the LLM learns how to predict the next word by looking at similar examples in the context window. This ability gives a massive reduction in the loss function during training.
That is interesting, and could conceivably fit in with my own neglected work from the 1990s.
Does “similar examples” mean something analogous to related questions?
@@anonmouse956 In its simplest form, it works just like that: if it sees a word like "Mr." and within the context window there was already a "Mr." followed by a "Jones", it will be much more likely to write "Mr. Jones" again. This sounds trivial and obviously useful, but an LLM has to learn this, as it starts from zero knowledge of how language works.
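The "Mr. Jones" behaviour can be mimicked with a hand-written lookup, shown below as a toy. It is not how a transformer computes anything; it only illustrates the copy rule that an induction head is said to learn. The function name `induction_copy` and the token list are made up for the example.

```python
def induction_copy(context, current):
    """Return the token that followed the most recent earlier `current`, else None.

    Toy stand-in for the "[A][B] ... [A] -> [B]" pattern; purely illustrative.
    """
    # Scan backwards, skipping the last position because nothing follows it yet.
    for i in range(len(context) - 2, -1, -1):
        if context[i] == current:
            return context[i + 1]
    return None

tokens = ["Mr.", "Jones", "went", "home", ".", "Then"]
guess = induction_copy(tokens, "Mr.")   # the model is about to continue after a new "Mr."
print(guess)
```

A trained model has to discover this rule from data, which is why its appearance shows up as a sudden drop in the loss.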
I wasn't sure what overfitting was from the quick description in the video, so I googled the definition: "In machine learning, overfitting occurs when an algorithm fits too closely or even exactly to its training data, resulting in a model that can’t make accurate predictions or conclusions from any data other than the training data."
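A small numerical sketch can make that definition tangible. Everything here is a constructed assumption: the "true" relationship is y = x, the training noise is a fixed alternating ±0.1 so the result is reproducible, and the two models are polynomials of degree 1 and degree 9 fitted with `numpy.polyfit`.

```python
import numpy as np

# Ten training points on y = x with a fixed, alternating +/-0.1 "noise" term.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train + 0.1 * (-1.0) ** np.arange(10)

# Unseen test points halfway between the training points, with no noise.
x_test = (x_train[:-1] + x_train[1:]) / 2
y_test = x_test

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # 2 parameters
flexible = np.polyfit(x_train, y_train, deg=9)  # 10 parameters, one per point

train_simple, test_simple = mse(simple, x_train, y_train), mse(simple, x_test, y_test)
train_flexible, test_flexible = mse(flexible, x_train, y_train), mse(flexible, x_test, y_test)

print("degree 1: train", train_simple, "test", test_simple)
print("degree 9: train", train_flexible, "test", test_flexible)
```

The degree-9 fit passes through every training point, noise included, so its training error is essentially zero while its error between the points is far larger than that of the straight line. That gap between training and unseen-data performance is what the quoted definition calls overfitting.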
a good linguistic human comparison would be when children first learn to speak and often use regular conjugations of verbs especially in the past tense, using -ed for all past verbs. e.g. "My toy broked" or similar ... i.e. the child has learnt enough data to overfit the regular ending and even learn an irregular conjugation, but not enough data to realise that this conjugation does not therefore require the regular ending.
@@IngieKerr I don't think that a child is overfitting, or at least this is too trivial of an example if it is overfitting. What's going on here is that the child learned a rule, and thought it applied everywhere, but the rule had exceptions. AI is supposed to know that there will be exceptions to the outcomes, whereas the child doesn't. I saw an example of overfitting where an AI was trained to predict if a person would default on their loan, and it was able to predict the outcome of 97% of the people in the training data, but only 50% of the people in the real world data.
@@wiggles7976 How about when you feed Udio all the keywords tagging a specific song from a catalog, and perhaps some of the lyrics, and it just spits out a cover version of that exact song with the same melody and chord progression - it was incapable of extrapolating a completely different melody. Is that a case of overfitting?
@@bornach I don't know what Udio is but producing music doesn't really fall into the category of "making predictions", which is what the definition I quoted above says. There's no way to test if an AI-generated song is "correct" or "incorrect" since correctness is not a quality of music. Correctness could be a quality of music theory though. If I say a C chord is C F G, then I'm incorrect. An AI could try to predict music theory I suppose.
I'm not sure I understand. It would mean that if a neural network ever finds a theory of everything that predicts reality with 100% accuracy, and thus fits its training set (extracted from reality) with 100% accuracy as well, that neural network would be considered overfitted?
It seems some piece is missing from that definition.
To be honest, I think marketing it as artificial "intelligence" has always been a bold move. Actually they should have named it a "statistical machine" or something similar. In the end that is what it does: create the most sensible parameters for a model based on an enormous load of data. But if the data is skewed in some way, that skew is also part of the model.
This is one of the best videos I have seen on AI, and I keep up with this stuff much more than average. Well done, Sabine. This is an area to expand on. Please keep going. 🙏
No it's not, it's full of imprecisions and gross generalizations.
I think a neural net is a very good tool to model the logic by which a system works without knowing anything about its internal state.
This is an honest question:
How do you avoid attributing incorrect causality in the logic when modeling like this?
In my experience, you get a lot of benefit in the short term, but it's very wasteful in the long term because the model is not generalizable.
@@iantingen Modeling in ML is typically predictive. Establishing causality (from observational data) is rarely the goal and requires different methods.
@@Fischdosepremium Predictive, but without any understanding of mechanism, correct?
What is being predicted in that instance?
@@iantingen Yes. Whether this is sufficient depends on the use case. Although interpretability is virtually always nice to have, predictive accuracy is generally paramount in applications where ML is the preferred tool.
@@Fischdosepremium do you ever feel like that epistemological approach is wasteful compared to using (at least a little) theory?
That’s been my experience, but I also know that my experience doesn’t generalize to everyone!
I know that we’re getting out in the weeds a little bit, but I’d appreciate your thoughts about it!
Rocks were never supposed to talk. They have played us for absolute fools
Gaia is talking to us through silicon(e)…
Intriguing perspective
@@TDVL Also intriguing
SILICONE more like it amen???
(. )( .)
@@jeltoninc.8542 amended :)
The sane side of yt.
Danke.
The biggest problem of all with current AI is that people actually expect it to be intelligent, when it definitely is not.
Current "AI" is just a very complicated pattern-finder and matcher. It's a complicated word and phrase shuffler. It's an instrument which attempts to find a pattern which matches your request. The only difference between AI art, AI stories or AI driven chatting is in how the output is represented. The goal of the AI is the same in any case: Find something which matches your request.
Where AI falls down is when it doesn't know what matches. The trouble is that it doesn't have any concept of "I don't know" and so even if it can't fulfil your request, it will still come up with something which, at first glance, appears to do so. Once you examine its output critically, you discover the problems which, at best, show that it was the product of an AI rather than from a human mind and, at worst, make the output useless for your stated purpose.
AI can be useful, but only if you keep in mind that it can't actually think, that it doesn't actually "know" anything, and that it will provide an output even if that output is nonsense because it doesn't have the information it needs in order to satisfy your requirements.
Current AI will never tell you "I'm sorry, Dave, but I'm afraid I can't do that." Who knew that that could be a bad thing? 😏
Absolutely this! I use AI to take the heavy work out of creating content for product listings on e-commerce, but it's shocking to see how much inaccurate information it throws back. It's great up to a point, but you have to read *everything* it throws back at you and be prepared to tell it what it got wrong. The media push AI as the panacea to solving so many problems but I doubt the people who write the articles have much experience in actually using it every day. If they had to use it then they would be writing more about how unimpressive it can be when it's asked to solve non-mathematical problems.
your wording was so well selected. great job on this
Data without relation, a knowledge graph has limits. Yann LeCun, Meta's chief AI scientist, says current systems do not show even the slightest intelligence. The fear mongering by OpenAI is meant to get regulations in place to stop the competition. Altman even suggested that GPU sales be restricted and development be subject to licensing. My take is that, while it looks impressive, generative AI has very little practical use in its current state unless you are after investor money.
I don't think it's about intelligence - more about misdirection and misuse by bad actors - or, more scarily, that AI misdirects and influences due to errors - like WOPR (War Operation Plan Response, pronounced "whopper") from WarGames.
@@Vondoodle That is what I mean, it will never be something we can just trust in its current form. It writes code for example, but because you cannot trust it you read the code, and in the end it saves time only for boilerplate. It is the same pattern for every other use case.
@@Vondoodle So basically you'd blame AI for what people are doing?
Yann LeCun is famous for making highly confident predictions based on his own assertions that turn out to be very false one year later. I suggest not listening to him at all, because his predictions are consistently off.
LeCun is hilariously wrong.
If you bet on the opposite of his predictions you would earn money 😂
I wonder if Occam's Razor eventually comes into play in LLM AIs, either by accident or on purpose. Sometimes the Simplest Model is the best. That is, until it isn't.
Well, it doesn’t have to specifically be LLMs,
but yes, there is the idea that by increasing the parameter count enough, gradient descent (+ whatever things they add to it) is able to find models that are actually (in a sense) “simpler” than the ones that would be found if the number of parameters available was a little smaller.
Don’t think you understand what Occam’s razor actually is. It’s about adjudicating between two different theories making the same predictions. When two theories predict the same thing, the one with fewer assumptions is said to have more theoretical virtue. LLMs are not competing theories, so it’s a category error to apply Occam’s razor to them.
It's Nature's way but what does she know about input priorities?
@@jimothy9943 Competing theories, perhaps not, but competing models? They seem to be that. They make a prediction of the observed dynamics of a system. Different ones make different predictions.
@@drdca8263 They are competing models for performing a given task. They don’t make predictions. An LLM does not entail predictions about the dynamics of anything. ChatGPT’s model does not entail anything about Gemini. They are both different tools for completing similar tasks. A hammer does not make predictions any more than a drill. You would not say that the more theoretically virtuous lawn mower was the one with the fewest amount of parts. Occam’s razor does not apply.
The graph you showed there at the end, error versus complexity.... It reminds me for some reason of the Dunning-Kruger effect graph. If you turn it upside down, it is identical. Maybe some connection?
I too had that thought and decided to search the comments for someone else who perhaps had the same idea. Yes, the graph does indeed seem to be the inverse of the DK graph, but only because the Y axis is a measurement of error and not confidence in knowledge. Seeing as outputs are based on the system's confidence in a result, that makes it an even more fitting comparison.
No connection at all, unless you confidently insist there is one from a place of limited understanding :p, there would be a fairly ironic connection at that point.
@@Dongobog-ps9tz hahaha wanted to write the same thing "you're giving an example"
@@Dongobog-ps9tz I suppose so. I'm not saying there is a connection, but I am saying there may be a connection.
Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle.
Speculation about when AI will (or can) become conscious seems to be floundering. It parallels trying to understand an "observer" in quantum mechanics. Can you, Sabine, figure out how an AI can become a source of decreasing entropy? I think that is the difference between conscious or not. Conscious is "from outside". My latin translator calls that "ab extra"......you can have it.
I did my Masters in the mid 90s about Neural Networks. I saw what's described here as over-fitting. To me it was mostly because large networks were trained with lots of data. The thing is each training round results in an error that is later reintroduced into said network for the next round. And ideally each round would result in a smaller error each time.
The network I trained was used to cover gaps in instrument signals, with no other input than previous data before the gap. The longer the input before the gap, the better, except that in some cases things weren't predictable at all.
Two more problems of AI: 1) It doesn't know what it doesn't know. Therefore it will always give you an answer with the confidence of an 11-year-old. 2) When the human brain is trying to figure something out, it can refer to other problems it does know the answer to, and derive an answer by analogy. We (usually) call that experience. Artificial neural networks lack the "experience" mechanism.
I don't think you understand how neural networks work.
Plot Twist: Sabine is an A.I.
_"how do we stop human pollution?"_
*AI pulls up a Thanos quote*
what will we get with a real AI? the Terminator? or Bender from Futurama?
Be careful, you'll summon the Roko's Basilisk morons who think it's reasonable to commit genocide because a machine they created told them to
@@t.kersten7695 This is a very complex and unpredictable question, but if the world remains stable until that time, likely between 5 and 30 years-ish. (As far as I know; maybe watch some videos from David Shapiro to get an idea.)
@@t.kersten7695 Neither. Both those examples are anthropomorphic i.e. they were humanized by having a personality. Real AI has nothing of the sort. It doesn't want revenge, it just works to achieve the goals we give it - in the best way it reasons how, which may not be the 'best' in our eyes. The classic example is the paperclip maximizer, which destroys everything simply to make more paperclips.
Fantastic video Sabine. Interesting, knowledgeable, highly relevant. Very impressive for people to communicate a topic this well outside of their field.
There are a couple of minor inaccuracies in this video:
3:26 While talking about inference, the video shows backpropagation during training.
4:01 horizontal and vertical axes are swapped in the verbal description of the graph.
It's hard to overfit these massive LLMs during training because you have enormous amounts of highly variable training data relative to the number of weights. Isn't this obvious or am I just losing my mind?
and you could also say that due to the insane amounts of data, you end up covering most of the actual possible semantic space compared with other problems where the unseen data represents 99% of the semantic space. i would also make the case that LLMs do not suffer and might even gain from the concept of overfitting.
what even is overfitting when you fitted literally all the fking data? you just left out new phrases that you can create, but the novelty created by that input represents like what? 0.0000001% novelty where the model might fk up?
meaning... how could you find the overfit if you trained a model with both the training and the testing?
@@iFastee Agree. That’s hilarious. Well said.
sounds about right. move on
Does this have any bearing on the Travelling Salesman problem or the Berry Paradox?
An LLM "with all the data" is still a brute-force method, and that entails exponentially higher costs.
Sabine, modern neural networks DO have massive problems with overfitting. However, it doesn't become apparent until they have been trained enough to explain all the training data. After that, if you continue training them, they immediately become overfit. It is for this reason that most models are not trained nearly as much as they could be, and researchers deliberately stop their training early.
this isn't true -- if it were, we'd never observe double descent in the first place
Early stopping is deprecated. If you set weight decay correctly you can train the network far longer and it still learns useful stuff.
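For readers unfamiliar with the term, here is a minimal sketch of what weight decay does. It assumes a plain linear model fitted in closed form, where weight decay is the same thing as an L2 penalty on the weights (for plain gradient descent the two coincide; for adaptive optimizers they can differ). The data, the `ridge_fit` helper and the decay values are made up for illustration and are not from the video or the comment above.

```python
import numpy as np

def ridge_fit(X, y, weight_decay):
    """Least squares with an L2 penalty of strength `weight_decay` on the weights."""
    n_features = X.shape[1]
    A = X.T @ X + weight_decay * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical toy problem: 30 noisy samples, 20 features, only 3 of them matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.5 * rng.normal(size=30)

# Stronger weight decay pulls the fitted weights toward zero.
norms = {wd: float(np.linalg.norm(ridge_fit(X, y, wd))) for wd in (0.0, 1.0, 10.0, 100.0)}
for wd, norm in norms.items():
    print(f"weight_decay={wd:>5}: ||w|| = {norm:.3f}")
```

The point of the sketch is only that the penalty limits how large the weights can grow, which is one way of limiting how closely the model can chase noise.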
@@adamrak7560 While weight decay, dropout, entropy regularization, momentum-based optimizers, etc. are all effective regularization strategies for limiting overfitting, model checkpointing, and by extension early stopping, does not seem deprecated to me at all. It can still be seen in the results graphs of most academic papers this year (the graphs tend to stop when validation accuracy levels out), and it's telling that default settings in both torch and tensorflow stop under conditions including one form or another of loss derivative estimate, so that training ends when meaningful improvements are no longer being made rather than when train accuracy is 100%. Training indefinitely might be popular in LLMs (admittedly an area where I have limited interest), where the massive data repositories cause many of their users' queries to lie roughly somewhere within the training set, so that overfitting is not a huge concern, but in machine learning at large I'd have to strongly disagree with you. There are papers with citations (>20 to be relevant) analyzing the robustness of early stopping published as recently as 2023, which says to me that the strategy is not deprecated if it's not even done being studied. If you have evidence to the contrary, or if your claim is about a particular subfield that I might not be considering, I'd love to learn more; or if you consider early stopping to be something other than "stopping training before training accuracy plateaus to avoid overfitting", then I'd be interested to hear a response.
have a nice day
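To make the early-stopping side of this exchange concrete, here is a minimal sketch of a patience-based stopping rule. It is not any library's implementation: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the validation losses are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a simple patience-based rule.

    Training stops once the validation loss has failed to improve on its
    best value by more than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran through all recorded epochs.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improves, then slowly gets worse.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.61, 0.63, 0.66, 0.70, 0.75]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, keep checkpoint from epoch {best_epoch}")
```

In practice this is usually paired with checkpointing, so that the weights from `best_epoch` are the ones kept.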
I really would love a collaboration between you and Robert Miles on AI safety. ❤
There are plenty of strategies to keep a model from overfitting (like random changes, dynamically changing the step size of gradient descent, or reshuffling the training data set), and the training set of language text is now so large that the model might simply not have the capacity to overfit on it.
From what I understand, there are two important factors (among others, probably) to this. The first one is the initialization of the model's weights: they are sampled at random, independently, from a normal distribution, and since there are tons of them, the law of large numbers applies and the initialization is basically not random, up to permutations. The initialization therefore mostly depends on its variance. The second very important factor is that the training method (gradient descent) roughly follows a minimal-distance path toward the set of all models that perfectly fit the data.
We know way more than what's stated in the video.
The two things are not the reason.
The key is the regularisation used in training. Without regularisation (or other similar techniques) you get overfitting. With regularisation you don't. We can even prove the general shape of the curve with regularisation. The estimates for the drop just put it later than where it is observed.
The relevant mathematical quantity is called Rademacher complexity. The problem is to find better estimates for this quantity. The ones we have predict the shape but the drop too late. We need better estimates for this quantity.
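As a rough illustration of what Rademacher complexity measures, here is a sketch that estimates the empirical Rademacher complexity of a very simple model class: linear functions x -> w·x with ||w|| <= B. This class is my choice for illustration, not something the comment specifies. For it, the supremum over w has a closed form, (B/n)·||Σ σ_i x_i||, so only the expectation over the random signs σ needs Monte Carlo. The function name and the data are hypothetical.

```python
import numpy as np

def empirical_rademacher_linear(X, norm_bound=1.0, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w.x : ||w||_2 <= norm_bound} on the sample X (shape n x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each row is one draw of random +/-1 "labels" for the n sample points.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # Best achievable correlation with those random labels, per draw.
    sup_values = norm_bound * np.linalg.norm(sigma @ X, axis=1) / n
    return float(sup_values.mean())

rng = np.random.default_rng(1)
X_small = rng.normal(size=(50, 5))     # 50 samples
X_large = rng.normal(size=(2000, 5))   # 2000 samples, same distribution

r_small = empirical_rademacher_linear(X_small)
r_large = empirical_rademacher_linear(X_large)
print(f"n=50:   {r_small:.4f}")
print(f"n=2000: {r_large:.4f}")
```

The quantity is the average ability of the class to fit pure noise labels on the given sample; it shrinks as the sample grows, which is what makes it usable in generalisation bounds. Getting tight estimates for deep networks is the hard part the comment refers to.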
@@tofu-munchingCoalition.ofChaos I think that in practice the second drop also happens without regularization, right? Of course for the theory you need assumptions about the data, and the one you mention could be relevant, but this doesn't solve the problem that we do not understand why practical problems typically have low complexity (for basically any measure of complexity from learning theory).
@@quintonpierre
You can prove that with stochastic gradient descent, without any form of regularisation and with small noise in the data, the test error grows to 100% when the depth goes to infinity and you train long enough.
We know why the complexity is small for relevant data. If you have a two class learning problem that has been learned by (some) humans you know that the "VC-dimension" (+complexity of model description) is small.
There is also a second way to look at it:
You can prove that you can split any data into random noise and a part with small complexity (Turing reduction of a general sequence to a random sequence).
But there still are problems. Perhaps you wanted to say the right thing but phrased it incorrectly. We (or at least I) don't know whether deep neural networks with the usual regularisations are a universal class like "VC-dimension" (+ see above).
What I don't know:
(a) Practically relevant bounds on generalisation errors
(b) At least a weak form of universality ("almost all" functions with low complexity can be represented with small regularisation term).
(c) Practically relevant convergence speed estimates for stochastic gradient descent.
@@tofu-munchingCoalition.ofChaos Very interesting, thank you. BTW I have a colleague that works on generalization bounds and his claim is that no generalization bound that is independent of the distribution of the sample is good. This is the case for VC for instance since it only depends on the class of candidate functions but not interaction with the sample dataset. Note that if you have full access to the distribution then you can compute the generalization error and so your generalization bound is perfect but useless so you need low dependence on the distribution. I thought this might be of interest to you.
@@quintonpierre
That's partially right. That's why I said "VC-dimension" (+...) not just VC-dimension. This complexity is in a sense distribution dependent.
Consider the following algorithm:
Input: sample S for two class classification problem
Output: Model with the lowest complexity that fits the sample S. The complexity is given by the Kolmogorov complexity of a description of the VC-class of a model + the VC-dimension.
That's what "VC-dimension"(+...) should mean. You see it depends on the data. The class is not fixed. It's not PAC (the bound for the test/generalisation error is independent of S and depends on the class only) but conditionally PAC (the bound for the test/generalisation error is a random variable and in this instance depends on S - the lowest complexity depends on S).
It's universal. No algorithm can perform better up to a constant (so especially no faster convergence rate is possible). Even the ones that use specific distributions when you include the generalisation error for the knowledge of the distribution could perform better.
The problem is, that it's not computable. And I'm not aware of any algorithm that's computable and also universal (even in a weaker sense).
Current AI models aren't trained to think in a general sense. They are trained to think like what thinking is available on the Internet. In other words, these AIs emulate what has been said or written by humans. This way, you will never get AI smarter than humans, but only faster and less prone to error in well defined situations.
Irrelevant to the video. I don’t think the video even uses the word “intelligence” outside of the phrase “AI”? And the video certainly isn’t specific to language modeling tasks.
I might be wrong or perhaps I didn't understand the explanations... but it sounds to me like the issue is more human than AI. In the sense that we are pattern-recognising creatures... we want to see patterns, and perhaps the randomness of AI just looks like patterns to our eyes... then again... I guess we could ask what is a pattern?
Perhaps I'm just stupid. 😅
With things like convolutional neural networks used in computer vision, we can see pretty clearly what kind of patterns tend to excite different layers of the network, we generally start from something like "Gabor filter" and work up to neurons that abstract visual understanding (interestingly, you can show what excites different layers to people and a corresponding region of the visual track will similarly light up).
With LLMs, it's a little more gooey, we can see like basic syntax assembly in the first few layers so mapping connections between tokens, words, sentences and things that look like universal grammar start to pop out, so grammars and constructions of associations (this is the work of Atticus Geiger at Stanford) but then there's also this gooey-ness because it becomes abstracted "blah".
So, there's this kind of latent space that stuff gets pushed into as we go deeper into the network and we have a newer method that we can use to probe it by basically watching what gets activated when we push certain examples through, so we can isolate stuff like neural representations encoding "cat" etc. but these are also pretty mushy and really depend on how you try to measure "cat-ness".
My current wild bet is that we'll probably end up with a Heisenberg uncertainty style law that kind of boils down how useful this representation approach can really be - so no, I'd say it isn't stupid to identify that there's a measurement problem (ie. a human issue with looking for patterns in abstract pile of numbers).
@@whatisrokosbasilisk80 well, I guess I should say thanks... and that you've given me a lot to study and think about... not sure I understood everything. :p but it does feel nice that someone with such knowledge doesn't think my understanding was stupid. :p even though I do feel like I need to study more this topic now. XD
Way to make me feel both dumb and smart... you made me laugh out loud. So thanks for that too. XD XD
@@CesarHILL Representation Engineering and Mechanistic Interpretability is what I'd focus on if you want to really understand this stuff.
My black box imploded when you said decent instead of descent.
Or maybe she meant dessert. Or desert. Can't keep them separate in my head.
this is amazing u prevent the misconceptions by addressing them one by one in the intro
Apparently, when it comes to overfitting, more data can dilute the impact of noise or outliers. With more data, the noise becomes a smaller fraction of the entire dataset, thus reducing its influence on the model’s learning process. And that makes the complex model perform better.
How do we know Sabine isn't an AI?
She is too funny to be😅
I saw her live last year on a debate in London. She´s flesh and blood!
@@Thomas-gk42 That's exactly what an AI would say.
The answer to the most famous ill defined question is 42.
Plot twist: the question wasn't ill defined and the answer is actually 42.
Until scientifically proven otherwise, the answer remains 42.
Thanks excellent
This isn't excellent. This is Patrick!
Error: 3:59 The vertical and horizontal axes are flipped. 3:39 This could explain the inverse relation between neuroplasticity and memorization.
Indeed one of the most puzzling observations in "modern" machine learning. In my group we are working on this interesting topic and are delighted that Sabine has taken it up here. In a paper currently under review we provide theoretical and experimental perspectives for a possible explanation.
Following AI for almost years now and this is the first I’ve heard of this insight. Thank you for thinking against the grain and helping your viewers do the same!
Thank you, Sabine. This answers my earlier question.
Maybe the number of "dropout layers" was increased as well, which led the model to spread the information more evenly across the weights, and thus to a more robust and less overfitted model.
Another explanation would be that the training set is so complex that a model with just a few layers has to overfit in order to get a good loss. For models with more parameters overfitting is not needed, because it's easier to generalize with more layers.
Indeed this is an extremely interesting question.
We've been giving "principles" (a term which stands in for conjectures, or guesswork) for why we should prefer "simple" models (fewer degrees of freedom), but there might be something substantial at play here which might open a great insight for our theory of knowledge.
Since we have methodological issues in basically every scientific discipline, understanding knowledge is a priority.
Knowledge is dual according to Immanuel Kant -- synthetic apriori knowledge.
Neural nets are classifiers and not necessarily predictors. This classification can then be interpreted as "prediction" through an output node function, but the neural network is still a classifier, and therefore overfitting and underfitting are not a mystery. A neural network should be trained so that it is neither overfit nor underfit, that is, so that it is able to generalize and determine the correct outputs for inputs it was not trained on.
That sounds good, but actual "reasoning" doesn't work that way. One can't guess correct answers to logical problems from "generalization".
There is also "Grokking: generalisation beyond overfitting"
When you have a model that, by size and structure, will tend to overfit the data, just training it longer can yank it out of the overfitted state so that it starts generalising.
The desired training times derive from model sizes. Correspondingly, it's possible that it's not model size that is causing generalisation for ever larger models, but the amount of training. There are also a lot of techniques deliberately put to use to fight overfitting.
Us AI / NLU / LLM guys have a lot of fairly good explanations and theories. Anthropic & OpenAI have done some reveals of patterns in the weights etc.
Our best theories note that:
1. Logic is likely being learned
2. Emergence of higher order capabilities is a real thing
3. Deep learning does extract the parsimonious essence underlying data
4. LLMs are actually pretty good at explaining how they arrived at conclusions
I am a bit confused. Overfitting doesn't happen, because in the model training phase strategies to explicitly avoid overfitting are used. E.g. during the training phase random neurons are deactivated so the model cannot rely on one single neuron and has to take in multiple inputs for every problem. So why overfitting does not happen is very clearly understood.
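The technique described here (randomly deactivating neurons during training) is dropout. A minimal sketch of the common "inverted dropout" variant follows, assuming NumPy arrays as activations; the function name, the drop probability and the toy activations are illustrative assumptions, not taken from the video.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability `drop_prob` during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))                       # hypothetical layer output
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
at_inference = dropout(hidden, drop_prob=0.5, rng=rng, training=False)

print("fraction of units zeroed:", float((dropped == 0).mean()))
print("mean activation after dropout:", float(dropped.mean()))
```

Because a different random subset is silenced on every step, no single unit can be relied on, which is the effect the comment describes. Whether that fully explains why large models don't overfit is a separate question.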
Let's be clear, μP and its depth extension is rich learning, and neural tangent parameterization is what they call lazy or poor learning.
In μP, feature learning guarantees progressive sharpening to reach a width-independent sharpness at any scale; in NTP the progressive lack of feature learning when the width is increased prevents the Hessian from adapting, and its largest eigenvalue from reaching the convergence threshold.
I studied neural networks but didn't know about the second descent. Thanks for introducing this Sabine!
I think it has something to do with resolution. Models will "solve" problems at a certain resolution until they overfit and the error becomes so great at that resolution that it falls into another descent and makes use of more parameters and then is able to solve problems at a higher resolution. In my area of speech enhancement even with tiny models we can get them to learn to solve more and more complex noise sources the longer we train them. Ambient noise is trivial for the model to suppress but more contextualised stochastic noises that require higher temporal resolutions to understand take a lot longer for the model to learn to suppress
Gathering data not used in the training set and running the program against that data to see how it fits is one helpful way to avoid overfitting.
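A minimal sketch of that held-out check, using polynomial curve fitting as a stand-in for a model with a tunable number of parameters. The data-generating function, the noise level, the split and the degrees are all made up for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=40)   # hypothetical noisy data

# Hold out half of the data; the model never sees it while fitting.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def mse(model, xs, ys):
    return float(np.mean((model(xs) - ys) ** 2))

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)   # least-squares fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:>2}: train MSE {train_err:.4f}, held-out MSE {test_err:.4f}")
```

Training error can only go down as the degree increases, so it cannot reveal overfitting by itself. The held-out column is the one to watch: a model whose training error keeps falling while its held-out error rises is fitting noise.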
This is basic. ANNs of all types are correlation machines that use statistical techniques to make function approximations. Correlation is not causation. QED
I think you're probably right. My intuition (at least about LLMs) is that the first thing they learn is aspects of grammar, because those are present in basically every sentence; then they learn word ordering, first on the most often used topics; then paragraph structure for the most often used paragraphs, with each part taking an exponentially bigger amount of weights than the previous one. Eventually the things the model tries to represent are so complex and unique that it's more similar to underfitting for the remaining part, approximating the infinite complexity of language 🤔
(and I assume that infinity will average out over any reasonable amount of weights)
Fascinating! This curve looks a lot like what happens with "beginner's luck" and mastery of a subject.
NNs aren't a straight ahead multiply. They aren't just weights, the biases are incredibly important and allow the construction of logic gates (and weighted complex logic gates) in each activation. Representing them as only a curve fitting polynomial is misleading.
When you have that many parameters, a "butterfly-like" effect comes into play: basically, small changes can have large effects, carried in 2nd and 3rd order derivatives of the weights. Think of it like the modulus in an encryption algorithm: the 'lost bits' are here, but the loss actually keeps the potential 'overfitting' from overfitting, because it kinda turns into a ReLU thing.
DNNs (Deep Neural Networks) are expected to choose important parameters on their own, so while we may have 100,000 parameters, we expect many of them to be assigned very small (near 0) weights, so this reduces the number of active parameters (or connections: a weight can be associated with multiple parameters so that parameters A and B may connect to a node with low weight, but A and C may connect to a node with higher weight). So most of the weight matrix in a DNN is expected to be sparse. So the graph at the end should be drawn against a horizontal axis of active parameters rather than input parameters.
I have been playing around with a text based AI and I have to say it is fascinating. You can find out why it makes a decision if you ask it to explain in the prompt.
I have found it helpful to construct it as both a person and a pseudocode compiler with access to vast amounts of data but little experience with it.
Every time a user feeds an AI a prompt it is like summoning a genie for that one interaction. They can't tell you why another genie made a decision, but this is the same as humans.
We do sometimes actively think about our choices but sometimes we just make up our reasons for doing what we felt like doing at the time after the fact. Mind Field had a great episode on this.
Long prompts are good for AI. Short prompts less good as the genies can't talk to each other. They send you the text and update their training data and 'trust' the next instance to do their best.
Wow, that is very deep talk about deep learning 😊
She has such a brilliant way of presenting information. She encourages curiosity.
I'd imagine part of the answer is because of the process.
If the points converge on a solution that's only one step, additional data is held back for verification and if the model cannot predict the verification set then the model is tossed.
4:44 “The speculation that makes most sense to me is that models don’t overfit when they could because the overfit isn’t stable under something that happens during the training runs. *They almost always default on a fit that is dominated by as few relevant parameters as possible,* and then fine tune with the remaining parameters. But it’s unclear whether that’s correct.”
_Everything should be made as simple as possible, but not simpler._
-attributed to Albert Einstein
Along the same lines, there is the design principle of _Minimal Critical Specification:_
_No more should be specified than is absolutely essential but it is necessary to identify what is essential._
It seems that the weights and biases incorporate what is essential and _only_ what is essential.
I work in the ML area (often referred to as "artificial intelligence"). The problem in general is that the models generate abstract decision paths based on data and parameter sets that can hardly ever be complete and are therefore always subject to different (human and non-human) biases, errors and "disappointments". Taking the human neural network as the role model for this, we can easily see that humans tend to make the same mistake(s).
Yes, we can predict the future based on our past experiences (data), but that leads us to make assumptions if the world around us changes. We produce stereotypes to assign certain properties to things and people based on their looks. A helpful thing after all, but many times creating great injustice.
Thanks for bringing this to light: I've been bit by overfitting, but never made it beyond to find the second fall off.
My understanding is that all the thousands or millions of layers of "nodes" that are used in neural nets aren't necessarily different parameters - they're looking at the same set of parameters from a slightly different angle, and combined to optimize or predict a certain output. So it's not equivalent to training an AI on, let's say, sample customer financial data, and the AI learning that all customers with a middle initial "F", or say all customers that have "Apartment 10W" in their address, coincidentally have a really good track record of payments, and then automatically approving loans for future customers fitting those descriptions. The latter is what I typically think of as overfitting, whereas the former is kind of just getting a second (million) opinion.
Certainty (predictability, syntropy) is dual to uncertainty (unpredictability, entropy) -- the Heisenberg certainty/uncertainty principle.
Complexity is dual to simplicity.
Syntax is dual to semantics -- languages or communication.
Large language models (neural networks) are using duality:-
Problem, reaction, solution -- the Hegelian dialectic.
Input vectors can be modelled as problems (thesis), the network reacts (anti-thesis) to the input and this creates the solutions, targets or goals (synthesis).
The correct reaction or anti-thesis (training) synthesizes the optimal solutions or goals -- teleology.
Thesis is dual to anti-thesis creates the converging or syntropic thesis, synthesis -- the time independent Hegelian dialectic.
Neural networks or large language models are using duality via the Hegelian dialectic to solve problems!
If mathematics is a language then it is dual.
All numbers fall within the complex plane.
Real is dual to imaginary -- complex numbers are dual hence all numbers are dual.
The integers are self dual as they are their own conjugates.
The tetrahedron is self dual -- just like the integers.
The cube is dual to the octahedron.
The dodecahedron is dual to the icosahedron -- the Platonic solids are dual.
Addition is dual to subtraction (additive inverses) -- abstract algebra.
Multiplication is dual to division (multiplicative inverses) -- abstract algebra.
Teleological physics (syntropy) is dual to non teleological physics (entropy).
Syntropy (prediction) is dual to increasing entropy -- the 4th law of thermodynamics.
"Always two there are" -- Yoda.
Your mind is syntropic as it solves problems to synthesize solutions -- teleological.
I think it's mainly about batch size. On simple neural nets with the MNIST dataset, if you feed samples one by one (batch size = 1), overfitting happens very quickly. But if you feed a batch of 64 samples, training might be slower, but overfitting is easier to deal with.
In the case of LLMs, we have a batch size exceeding 1 million.
And obviously there are many techniques to deal with overfitting, like dropping some neurons for a single batch (dropout) or using a different loss function.
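For anyone who hasn't seen what "dropping some neurons" looks like, here is a minimal sketch of the common "inverted dropout" variant in plain Python. The function name, the drop probability and the all-ones activations are just illustrative assumptions; real frameworks do this on tensors and draw a fresh mask on each forward pass.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        # At evaluation time nothing is dropped and nothing is rescaled.
        return list(activations)
    keep = 1.0 - p_drop
    # Survivors are divided by keep so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10_000  # hypothetical layer activations

dropped = dropout(hidden, p_drop=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
unchanged = dropout(hidden, p_drop=0.5, rng=rng, training=False)

print(f"mean activation after dropout: {mean_after:.3f}")
```

Because the surviving units are rescaled, the average activation stays roughly where it was, so the network sees the same overall signal but can't rely on any single neuron being present.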
There's a lot of overlap in the data points, so if you consider this a compression problem, you can learn useful abstractions while retaining the original data points accurately at the same time. There's information in the compressed structure that arises.
I have a hypothesis for why models from 'machine learners' work better than those of the 'stats people': biased priors are actually necessary. All models are wrong, but with biased priors they have more statistical power. I use "biased priors" loosely here, as we do not yet have an encompassing definition [at least I am not aware of one]. For me, it includes priors in the Bayesian sense, but also the bias introduced by the model specification itself and by the learning algorithm (sometimes referred to as inductive bias by ML people). Secondly, the model space needs to both contain good solutions _and_ allow for efficient optimization to find at least one such good solution. That is why current models are so gigantic: not because the reality that they describe is so vast, it isn't, but because our 'primitive' optimization algorithms require it. The oversized models mean that tricks to prevent overfitting (regularization, dropout, early stopping, ...) are among the most crucial steps in the whole development pipeline. Finally, we have to add more tricks biasing the parameters, because we learn them from bad data in virtually every project. Coming back to your main point, I think that the reason popular models don't overfit is that they have been engineered to align with reality. One could (should?) argue this defies the no free lunch theorem. However, it doesn't, because the inputs aren't random; they come from reality, which is very niche. Hence, (machine) learning actually works. As does compression. Given the theory that we have, learning shouldn't be possible at the society level either, yet it appears that we do progress. We are slowly becoming better at encoding our intuition about the world in model specifications, data, and algorithms, and so machine learning advances.
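Of the tricks listed above, early stopping is the easiest to show in code. Below is a rough sketch of a patience-based rule; the function name, the patience values and the validation losses are all made up for illustration and do not come from any real training run.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss seen before stopping."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch

# Hypothetical validation curve: improves, gets worse, then improves again.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 0.2]

impatient = early_stopping_epoch(losses, patience=3)
patient = early_stopping_epoch(losses, patience=10)

print(f"patience=3 keeps epoch {impatient}, patience=10 keeps epoch {patient}")
```

With this particular made-up curve, the short patience stops at the first minimum and never sees the later improvement, while the long patience does. How often real training curves look like this is a separate question that the sketch doesn't answer.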