Thank you, Andre! And thank you for your article. Apologies we couldn't recall your name on the fly; we did make sure to show your name in the video though ;-) I'm very curious, how did the discussion change your views?
Hey Andre, we really appreciate you dropping in here. Great article! For the benefit of folks -- here it is medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126
@@nomenec For sure - Twitter really isn't the best platform to exchange nuanced perspectives, so when the Twitter conversation began, I took the disagreement (i.e. between LeCun, Pinker, Marcus, Booch, etc.) to be a sign that it was one of those types of ambiguous problems that one can't really confidently make their mind up about. A lot of the Twitter thread content seemed pretty speculative or pulled willy-nilly without much organization. When I first read the paper on extrapolation, I was even more unsure of what to think - I was actually wondering many of the questions that you all asked in the interview, e.g. why choose the convex hull instead of another definition? Does this mean that neural networks are actually extrapolating? etc. After listening to LeCun and Balestriero's responses, I have a much more well-informed perspective of the paper's context and argument, and I think it's probably correct. Thanks guys for all the work you do arranging context and asking insightful questions!
This is incredible! Ms. Coffee Bean's dream came true: the extrapolation interpolation beef explained in a verbal discussion! 🤯 Cannot wait to watch this. So happy about a new episode from MLST. You kept us waiting.
Thank you, Letitia! We burned the midnight oil for weeks on this one; we are looking forward to the community enjoying (hopefully!) the effort. We are grateful to both Yann LeCun and Randall Balestriero for spending time with us!
Starting off the new year with a bang. Tim, Keith and Yannic - thank you so much for this quality work. You can clearly tell how much love and dedication goes into every episode. Also the intros just continue to amaze me - the level of understanding you approach the variety of topics with is extremely inspiring.
Couple of minutes into the video and you break some of the fundamental assumptions I had about deep learning/neural nets, jeez man. Excited for this 3-hour-long video. And as usual the production quality of the videos keeps getting better. Happy New Year, guys!
Happy New Year! Tim and I certainly walked away with a very different (upgraded, in my opinion) view on neural nets. Would love to learn how, if at all, your views change after watching.
The content in this channel is just mind blowing. But the main reason I come back is the thoughtful editing and introductions and reflections of the content by dr Tim. I cannot keep up yet in grasping all the content in real time but that is exactly why it's so awesome. Thanks!
Thank you guys, I've not been more amazed by anything in AI than this completely brand new revelation of neural networks' internal workings. Insanely interesting and beautiful.
Thanks for the shoutout at 38:01 Tim! The Discord channel rocks 😆 An additional note on extrapolation that people might find interesting: - In effect, the ReLU activation function prevents value extrapolation to the left. So when these are stacked, they serve as "extrapolation inhibitors". - This clipping could be applied to other activation functions to improve generalization (or forewarn excessive extrapolation)! - I.e., clipping the inputs to all activation functions within a neural network to be in the range seen at the end of training time will reduce large extrapolation errors at evaluation time (and counting the number of times an input point is clipped throughout the network could indicate how far "outside the relevant convex hull" it is). The clipping shouldn't be introduced until training is done (because we don't have a reason to assume the initialization vectors are "good" at identifying the relevant parts of the convex hull). But I'd be willing to bet that this "neuron input clipping" could improve generalization for many problems, is part of why ReLU works well for so many problems, and can prevent predictions from being made at all for adversarial inputs.
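To make the "neuron input clipping" idea above concrete, here is a minimal NumPy sketch. Everything in it is an assumption made for illustration: the tiny MLP uses fixed random weights as a stand-in for a trained model, and the names `forward` and `bounds` are hypothetical. It records the per-unit range of activation inputs once "training" is done, clips to that range at evaluation time, and counts how many units had to be clipped for each input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP with fixed random weights, standing in for a trained model.
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=1)

def forward(x, bounds=None):
    """Return (output, number of clipped activation inputs per row of x)."""
    z = x @ W1 + b1                      # inputs to the activation functions
    n_clipped = np.zeros(len(x), dtype=int)
    if bounds is not None:
        lo, hi = bounds
        n_clipped = ((z < lo) | (z > hi)).sum(axis=1)
        z = np.clip(z, lo, hi)           # project onto the range seen in training
    h = np.maximum(z, 0.0)               # ReLU
    return h @ W2 + b2, n_clipped

# Record the per-unit range only after training is finished.
x_train = rng.uniform(-1.0, 1.0, size=(1000, 2))
z_train = x_train @ W1 + b1
bounds = (z_train.min(axis=0), z_train.max(axis=0))

x_far = np.array([[50.0, -50.0]])        # far outside the training range
y_raw, _ = forward(x_far)                # unclipped: free to extrapolate linearly
y_clip, clipped_far = forward(x_far, bounds)
_, clipped_train = forward(x_train, bounds)
```

Training points are never clipped by construction, while `clipped_far` gives a rough count of how many units the far-away point pushed outside their observed range.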
"[Activation clipping] ... can prevent predictions from being made at all for adversarial inputs." Would love to hear more about this line of thinking! Both practical side and what this illuminates on the theory side about "what does it mean to be adversarial / robust / etc". You guys didn't get a chance to discuss adversarial stuff on your chat episode much at all but it seems to abut the topic of generalization quite often which in turn tends to come up with geometric interpretation.
@@oncedidactic happy to clarify. One way I like to think about it is that every basis function inside an MLP (the activations at a node after applying the nonlinearity) generates a distribution. If you have 10k points at training time, then for every internal 1D function you can plot the distribution of the 10k values at those points. That should give a pretty precise definition of the CDF (from the central limit theorem), and rather tight bounds of what is "in distribution" (/ likely given observations). The issue is that the generated distribution of values at internal nodes over training data is (obviously) not independent of the training process. So to get an accurate estimation of the distributions we withhold validation data, which provides a true estimation of the error function (the error of the model over the space covered by the validation data). Now when you apply the model to new data, you can look at the values produced at internal nodes relative to the distributions seen at training / validation time. If you observe that a single evaluation point produces "out-of-distribution" (extrapolative) values for a substantial number of nodes in the model, then we know for certain that the point is not "nearby" to our training data. Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱 One of the core mechanisms for making approximations outside the bounds of training data is projecting new points back into the region of space where you can make an approximation (usually onto the convex hull). So in practice we can project points onto the convex hull of the 1D basis functions by clipping all values to the minimum and maximum seen at training time. We would want to do this mainly because we have no reason to assume that the linear fit produced by one node (and its infinite linear extrapolation to the right) is correct! No training data justified that behavior.
If we let our basis functions extrapolate without bounds then our error *definitely* grows without bounds. If we prevent infinite extrapolation, then we *might* be bounding our error too. To tie it all together, the distributions of values seen at validation time (more validation data ➞ better distribution estimates) should *precisely* match the distributions for testing. If they do not, then you know that something about the data has changed (from training & validation time) and your error will change in a commensurate fashion (in an unknown way). This relates to another important fact: we can never modify a model based on validation error. If we make decisions based on validation error, then we entirely undo the (necessary) orthogonality of the validation set (and hence remove our ability to estimate error).
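The "compare internal node values against the validation distribution" step can be sketched roughly as follows. This is only an illustration under stated assumptions: the hidden layer is random rather than trained, the 0.1% / 99.9% quantile cut-offs are an arbitrary choice, and `ood_fraction` is a hypothetical helper name. For each new point it reports the fraction of hidden units whose activation falls outside the range seen on held-out validation data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hidden layer with random weights, standing in for a trained one.
W, b = rng.normal(size=(3, 64)), rng.normal(size=64)

def hidden(x):
    return np.maximum(x @ W + b, 0.0)    # ReLU activations at each node

# Empirical per-node range estimated from held-out validation data.
x_val = rng.normal(size=(10_000, 3))
h_val = hidden(x_val)
lo = np.quantile(h_val, 0.001, axis=0)
hi = np.quantile(h_val, 0.999, axis=0)

def ood_fraction(x):
    """Fraction of nodes whose activation lies outside the validation range."""
    h = hidden(x)
    return ((h < lo) | (h > hi)).mean(axis=1)

in_dist = ood_fraction(rng.normal(size=(1000, 3)))            # same distribution
shifted = ood_fraction(rng.normal(loc=6.0, size=(1000, 3)))   # shifted inputs
```

With ReLU, a unit pushed far to the left simply outputs zero, so only units pushed to the right get flagged; that lines up with the earlier remark that ReLU inhibits extrapolation on one side.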
@@tchlux Thanks for the detailed reply! Any further reading you can point to? It makes perfect sense to me you would want to use the clipping / projection to learned convex hull to prevent wild extrapolation that leaves you at the mercy of "out-of-distribution", be that natural or adversarial. I can't think of an example where this is implemented but my knowledge is *not* deep. I imagine this curtails the "magic" of kinda-sorta extrapolating well sometimes, but you win the tradeoff because the limitation of your model is predictable. Or in other words predictably dumb is better than undependably intelligent, as a system component. "Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱" This is so insightful yet simple and really reframes the whole issue for me. Not to pile on too much, but this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before."
@@oncedidactic > Any further reading you can point to? I mostly just think about things in terms of basic linear algebra. If you get super comfortable with matrix multiplication, linear operations, and think really hard about (or better, implement) a principal component analysis algorithm (any method), then you'll start to form the same intuitions I have (for better or worse 😜). I try to think of everything in terms of directions, distances, and derivatives (/ rates of change). I can't think of any "necessary" knowledge in machine learning that you can't draw a nice 2D or 3D picture of, or at least produce a really simple example. I suggest aggressively simplifying anything until you can either draw a picture or give a clear example with minimal information. If it seems too complicated, it probably is. Stephen Boyd's convex optimization work (YouTube or book) is great. And 3blue1brown is wonderful too. > this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before." Exactly. People will probably continue to talk about it forever, but it only makes sense to *extrapolate* in very specific scenarios with relatively strong assumptions. What we really want in most cases is a model that identifies a low dimensional subspace of the input where it can accurately interpolate.
Fantastic discussion and explanation of the thinking behind interpolation, extrapolation and linearisation. This has really helped shift the needle towards the ultimate problem we all face: helping decipher what input is relevant to the task. If possible, please do a V.2 covering some of the other concepts Prof LeCun was talking about. Could be a series on its own, as it's so good! Mike Nash - The AI finder
Came here from Lex Fridman video, and gotta say these make the perfect combination (especially now that Lex has arched into some topics outside AI). Keep delivering this fantastically specified content👍
Thank you, Jusso! I really appreciate that. Tim and I often struggle with finding the right balance while keeping it (hopefully) entertaining. It's not easy and we are also trying to brainstorm on ways to improve. So, it's great to hear from a satisfied viewer!
Great episode! These long deep dives are amazing, I get a lot of intuition from them and they are a great point to start reading more papers on the topic (who except Yann can keep up with arXiv these days...) Really appreciate the effort and have a great 2022 :)
I spent the past two years at uni and attended all the classes related to ML and AI to try to understand DNNs, because no one in the CS department could answer my questions in a way that let me intuitively understand what a DNN is doing, and how and why it does it. Tim, thanks for the enlightening explanation.
Thanks a lot, Michael! But don't thank us too much; most of this wisdom is coming directly from Chollet, Balestriero and LeCun. We are just digesting their fascinating ideas and presenting them in the best way we can.
Great episode, keep up the good work. Agree with the reasoning = optimization, at least the reasoning that we currently do with machine learning. There is also a well known result in optimization which states that separation = optimization, where separation means finding a separating hyperplane between a point and some convex hull. So in other words membership, or interpolation is optimization. Many of these concepts are well known in the optimization community for some time now. For instance linear vs nonlinear or discrete vs continuous are known to be of little difference, while convexity is the main concept that makes things tractable. Also the curse of dimensionality can be avoided if you formulate the problem combinatorially as a graph for instance which is dimensionless.
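As a small illustration of "membership is optimization": the distance from a query point to the convex hull of a point set is the minimum of a convex quadratic over the probability simplex, and the point is inside the hull exactly when that minimum is zero. The sketch below uses the Frank-Wolfe method, which is my own choice for a dependency-free example and not something named in the comment above; `distance_to_hull` is a hypothetical helper, and the result is approximate after a finite number of iterations.

```python
import numpy as np

def distance_to_hull(points, query, n_iter=2000):
    """Approximate Euclidean distance from `query` to conv(points) via Frank-Wolfe.

    Minimises 0.5 * ||points.T @ w - query||^2 over the probability simplex,
    where w holds the convex-combination weights of the points.
    """
    n = len(points)
    w = np.full(n, 1.0 / n)              # start at the centroid
    for t in range(n_iter):
        residual = points.T @ w - query  # current hull point minus the query
        grad = points @ residual         # gradient with respect to w
        k = int(np.argmin(grad))         # simplex vertex that decreases the objective most
        step = 2.0 / (t + 2.0)           # standard Frank-Wolfe step size
        w *= 1.0 - step
        w[k] += step
    return float(np.linalg.norm(points.T @ w - query))

square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d_inside = distance_to_hull(square, np.array([0.3, 0.6]))    # close to 0
d_outside = distance_to_hull(square, np.array([2.0, 0.5]))   # close to 1
```

A linear-programming feasibility check would give an exact yes/no answer; the iterative version is used here only because it makes the optimization view explicit.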
42:31 - 42:41 Back in 1993 there was an architecture called ANFIS that combines the interpretability and monotonicity of a fuzzy logic inference system with the adaptiveness of a neural network. ANFIS guarantees a smooth, gradual change in prediction in response to a slight modification of the input, because of the smoothness and monotonicity it inherits from fuzzy logic, while still being optimizable with a gradient-based optimizer if desired.
@1:36min: if we create labels for important stuff. These can be used again. Kind of 'meta propagation'. To be able to take something up. Building up a vocabulary. Note: IF we can have a tiny center where a lot can happen. This can be applied on say: a hand or a foot. If we have A and B connected, we do not need all that happens in between. I guess one wants to create something that is applicable everywhere. Teleportation. @1:46min: something differentiated, and molded together with related stuff (not yet known). Like velocity and acceleration together with the images related to it. Next normalize such information, into single principles (I guess normalization and making objects with what is normalized might be a way of creating : concepts). Note: IF the will can be defined as 'one or a couple of objects, taken together at once', then you must be able to work with such (like how to work in a database). Perhaps apply it as a regular expression? This can become very very aggressive, and thus interesting. Note: a language such that we can derive where the machine is about. Like: visualizing what happens. (disentangle). Normalization. To normalize a principle. Perhaps making a database of normalized principles. @1:56min: perhaps create classes, like : per dimension a way to go about. @02:00min: Match! Got the same idea somewhat. Note: a language that generates generation 5 programming languages (relational language). Then terms normalized, put in a dataset. So, with a proper 'calculus', one can create 'discrete' objects, like: if it repeats a pattern on itself again: one needs 2 circles. (example). You do not need to know everything, If you get a couple of dimensions you work in. Like: 1, 2 and 4. Then this can be called discrete because you solve it with (underneath), these. I label this will because you can let those 3 work together and learn like that.
Your build quality here is really high. Nice work. My only comment on this video was that I had to give parts of this video my full attention. That is probably a good thing.
Note@15 min: if you create hyperplanes, this, my guess, will partake into extra usable information per hyperplane. No proof though. Note@28 min: one OR at a time; not to give properties to objects such that you lose the 'single or instant'. Note@38min: Experience pays off. Note@41min: "math lump", creating simple datasets and putting those together. Like a sentence of 'objects'. You play with the : "semantics". Note@45min: Can one throw an object through all of the information present at hand and see what it does? Like an analysis: (one object at a time (no dogma)), and see, which manifold is strong and which is not.. (to entangle time as it were (@ 46.50 min)) Note@46min: I simply love this video! Note@53min: So if we have a ball (lot of density), we could encode only its traits we want to have and work with that. Note@1:02min: You need to build from certain objects, only a single spot. Not an object you need to redraw in each case. Such that it can be applied. -(question) IF you are inspired at 50 minutes and see something at 60 for more inspiration and add it to the 50th minute inspiration. Is this wrong? -Is it possible to let some data collect some data over time and notice as it were where it is going. Perhaps even creating objects that are good in this and adding these to one's data analytic toolkit. Having one such single object, is interesting simply in itself. Perhaps creating a vocabulary of some kind??? Term: "dataplatonic" mindset @on the curse: : "jackpot ;-)" ,, Note@51min: I guess it is utile to acquire virtualized versions of objects. Such that the data takes account of 'objects', i.e. : terms. Like a circle or a square as circle and square. So, if we have a term, like a concept, we should generalize(?) it into something that we can use. So getting rid of 'drawing' objects... I guess a 'vocabulary' of a dataset is a nice concept as well.. How to make a concept. Keep track of it. Like: a point drawn, becomes a sphere.
So if we create an animation, we re-encode this into data for data analysis... Perhaps even creating synesthesia for the sentences created. Such a 'gift', might parametrize for people watching. Current conclusion: Each thing you want to analyse needs to be built up itself, such that you do not take big objects but building block parameters.. Such that the result is not about objects but building blocks that might be like bigger objects, but without the crap (data intensive). One wants to get rid of .. and let the computer do it. Building the right concepts by the computer and by guidance of the hand. Note@01:33min: if you got a function where the energy is understood (being zero). You can grow and shrink it and add it to 'a sentence'. Next you should be able to adapt (add, subtract) these and using such functions in line and create a kind of word sequence.
The talk was beautifully presented... Thank you all. My first question is: why are we considering that the new sample (the test set) lies outside the convex hull of the training data, considering the dataset strictly represents a domain like pictures with or without cats? My second question is: in signal processing, the impulse contains all the frequency content, which is why we characterize any form of filter by its impulse response. Having said that, for a particular domain, can we have a training set that completely characterizes the problem, and hence the ML model, which would mean that any test data must then lie within the convex hull...???
Very very good episode guys - kudos, as always. I have a problem with LeCun's strong statement that "reasoning = optimization" (that most reasoning can be simulated by minimizing some cost). Inference/deduction is not optimization. That's not true at all.
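One way to get a feel for the first question is to simulate it. The sketch below is an illustration under assumptions, not the procedure from the paper: it uses i.i.d. Gaussian data and checks membership in the axis-aligned bounding box of the training set, which contains the convex hull, so the measured fraction is only an upper bound on the fraction of test points inside the hull. For n training points with independent continuous coordinates, a fresh coordinate lands between the observed min and max with probability (n-1)/(n+1), so the chance of being inside the box in all d coordinates is ((n-1)/(n+1))^d, which shrinks quickly as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_inside_box(dim, n_train=100, n_test=2000):
    """Fraction of test points inside the bounding box of the training set.

    The box contains the convex hull, so this is an upper bound on the
    fraction of test points that lie inside the hull.
    """
    train = rng.normal(size=(n_train, dim))
    test = rng.normal(size=(n_test, dim))
    lo, hi = train.min(axis=0), train.max(axis=0)
    inside = np.all((test >= lo) & (test <= hi), axis=1)
    return float(inside.mean())

frac_2d = fraction_inside_box(2)       # most test points are inside the box
frac_500d = fraction_inside_box(500)   # almost none are, even for the looser box test
```

Real image data is far from i.i.d. Gaussian, so this only shows why raw ambient dimension works against hull membership, not how a specific dataset behaves.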
40:51 doesn't this suggest that the input data should be transformed into a reduced dimension before training on it? Using MNIST digits, for example, the raw pixels could be transformed into the sequence of pen strokes that composed the written symbol. This might have dimensionality around a dozen rather than 784. Obviously, finding that transformation wouldn't be trivial. However, it could also allow generative models to create more realistic interpolations.
If you use relus, and simple ff networks yes they're tessellations but not non-linear act fns with inter-layer feedback connections. An example of the latter is the transformer hypothesis class.
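For the plain ReLU feed-forward case, the tessellation claim can be checked numerically: inside one cell (one fixed on/off pattern of the ReLUs) the whole network collapses to a single affine map. The sketch below is a minimal example with assumed random weights and hypothetical helper names (`relu_net`, `local_affine_map`); it says nothing about architectures with feedback or attention.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)
W3, b3 = rng.normal(size=(16, 3)), rng.normal(size=3)

def relu_net(x):
    h1 = np.maximum(x @ W1 + b1, 0.0)
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    return h2 @ W3 + b3

def local_affine_map(x):
    """Return (A, c) such that relu_net(y) == y @ A + c for y in the same cell as x."""
    z1 = x @ W1 + b1
    m1 = (z1 > 0).astype(float)          # on/off pattern of layer 1
    z2 = (z1 * m1) @ W2 + b2
    m2 = (z2 > 0).astype(float)          # on/off pattern of layer 2
    A = (W1 * m1) @ (W2 * m2) @ W3       # masks zero out the columns of inactive units
    c = ((b1 * m1) @ (W2 * m2) + b2 * m2) @ W3 + b3
    return A, c

x = rng.normal(size=4)
A, c = local_affine_map(x)
nearby = x + 1e-6 * rng.normal(size=4)   # tiny step, very likely in the same cell
```

Evaluating `nearby @ A + c` reproduces `relu_net(nearby)` as long as the step did not cross a cell boundary; a point far away will generally have a different pattern and therefore a different `(A, c)`.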
~2:55:00 I think discrete vs continuous dichotomy is not so absolute. Human brain seems to be an analog system, but it can emulate discrete reasoning. Computers are discrete machines, but with neural networks they can emulate continuous reasoning. The main problem seems to be efficiency: emulating one via another is extremely inefficient, that's why dr. Balestriero noted that a hybrid system would be the most efficient. EDIT: Yup, a little later Keith noted that, too.
Great video! Here's a question I have after reading the papers, if anybody can help me: Hypothetically, if, say, the MNIST digits *did* lie on a lower dimensional manifold, then by definition all new data points would fall on that manifold, right? So in the Extrapolation paper, when they show in Table 1 that the test set data doesn't even fall within the convex hull of the Resnet **latent space**, this must mean either 1) Resnet is doing a poor job of learning the true latent space, or 2) MNIST digits do not actually fall on a lower dimensional manifold. Is that right?
Great, great talk! My reaction is based on the first 37' first, but before I go to sleep and forget… two (very non-expert) cents. 1) around 15', you say that NN basically try to find boundaries and don't care about the internal structure of classes. How far does this hold? Loss functions do take into account how far the data point is from the boundary of the class (how dog-typical this dog is, etc.). For sure this is only one tiny part of what 'class structure' can encompass. 2) (I'm quite sure I will find the answer in the remaining part, but) ReLU are different from previous, e.g. logistic, activation functions, which were basically smoothed separators, smoothed piecewise constant functions. ReLU are not constant on the x>0 side :-) - which I found dangerous at first (how far will this climb? how much will a single ReLU influence the outcome, on out-of-distribution test points?) - but doesn't *that* add to the ability to extrapolate, i.e. to say things about what happens far from the convex hull of training points?
A discrete attraction chaoform. Convergence to attractor locations as solutions of time series. Then a disjunct split and fold to exceptional zones surrounding expected precursors to exception. Then train for drop errors triggering exceptional close zone to chaoform large split discreet?
This is fucking crazy, there's just no other way to put it. The idea of piecewise linearity of a neural network is the single biggest opening of the deep learning black box that I have ever seen
The surface dividing the training set in two? How many would there be and are some better to consider as AND with the "search term"? Multiple max entropy cosearch parallelism?
Dr. Randall is saying that even in the generative setting, in a GAN's latent space (which has a large number of dimensions), there is no interpolation (due to the curse of dimensionality of course). What is then the explanation for why these models even work, and how come they manage to generate new examples? I can't quite figure it out. Great video, enjoyed it!
Great interviews, with many abstract ideas, made simple; I want to wish you all great success, and I will wait for more interesting conversations to come. I am coming from a computational engineering background. In my field we are looking for models that can extrapolate for problems that can be categorized as a mix of differentiable and discrete in nature. Is there any possibility of seeing a video in the future that discusses the ideas of the current episode but geared more toward computational engineering and physics-oriented problems? Thanks and Happy New Year
For inputs that lie within the training data it's an ellipsoid. For inputs that lie outside of the training data I imagine more of a paraboloid. It seems like data could lie both inside of training data in some dimensions and outside in other dimensions, which makes it some kind of ellipsoid paraboloid hybrid. Is this a thing?
Extrapolation is interpolation where one endpoint is magnified by some potentiate of infinity controlled by end zone locking. The outer manifold potentiate?
The reflectome of the outer manifold into the morphology of the inner trained manifold to achieve greater formance from the IM. The focal of the reflectome as a filter to multibatch the stasis of the correct?
If high dimensional spaces only have varying gradient in 16 or fewer dimensions, doesn't that suggest that principal component analysis should always be run?
Do you mean to run PCA on the ambient space and then throw away all but the top-K eigen vectors? Or just run PCA and use the entire transformed vector as input data points instead of raw data points? If the former, I guess the fear (probably justified) is that we'd be subjecting the entire data set to a single linear transform and possibly throwing out factors that are only useful in smaller subsets of the data. Instead, NNs are able to chop up the space and use different transforms for different regions of the ambient data space. In a sense, they can defer and tweak decisions to throw out factors/linear-combinations. That chopping, ie piece-wise, capability seems an essential upgrade to using only a single transform for the entire data space. If the latter, we'd just be adding another matrix multiplication to a stack of such and it wouldn't change much beyond perhaps numerical stability or efficiency since NNs are of course capable of finding any single linear transform including a PCA projection. In a way, it's related to all the various efforts at improving learning algorithms by tweaking gradients, hessians, etc. In the end, in practice most found that doing something super simple at GPU scale was faster; I'm not sure about the state-of-the-art in numerical stability, though.
@@nomenec I mean throw away the input data that isn't significant. Among other things, it will make smaller faster models. I hadn't heard that for really high dimensional data only 16 or fewer dimensions matter. If I'm not misunderstanding this, which I may very well be, doing PCA first makes a lot of sense. It takes me time to wrap my head around anything, and I'm often far off the mark anyway. Still, this seems logical.
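For reference, the "keep only the top-K directions" option discussed in this thread looks roughly like the sketch below. The data is synthetic and deliberately favourable to PCA (3 latent factors mixed linearly into 50 dimensions plus a little noise, all assumed for illustration), and `pca_project` is a hypothetical helper. On data like this a single global linear projection loses almost nothing; the concern raised in the reply is about data where the useful directions differ from region to region, which this example does not capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 latent factors embedded linearly in 50 dimensions, plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

def pca_project(X, k):
    """Project X onto its top-k principal directions; return (codes, reconstruction)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                  # rows are the principal directions
    codes = Xc @ components.T            # k-dimensional representation
    return codes, codes @ components + mean

codes, X_hat = pca_project(X, k=3)
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)   # small for this data
```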
Enlightening episode! A bit long, but an exciting subject... I would have appreciated having François Chollet in this debate. Unfortunately, the elephant is not in the room...
A tremendous amount of work went into this show let alone the MLST channel as a whole. Good things take time. Thank you for your patience and continued viewership!
Great and very inspiring interviews. Thank you! I wonder how to explain, from the perspective of the spline tree theory, the fact that CNNs learn very practical features in their first layers, like edge detectors and texture detectors (I mention these because we know what they do and that they are present in NNs). Of course we know that they are used by NNs to split the latent space, but I think the fact that NNs are able to figure out such specific features at all is enough of a qualitative difference from decision trees to question whether an analogy to decision trees makes sense at all. Yann LeCun claims that in high dimensional spaces everything is extrapolation; I think it's valid to ask whether in high dimensional spaces everything is decision-tree-like hyperplane splitting.
28:10 This isn't how humans understand physics. Really, really good video though. 34:00 54:30 It's cool that humans still understand a lot though. The possibilities in the universe are massively constrained by the fact that nothing is generated outside of physical laws. 1:00:00 The limitation that deep learning can't extrapolate 1:03:30 Extrapolation = reasoning? So can they reason? 1:03:50 No 1:05:50 Supervised learning is the thing that sucks 1:06:50 Geoff Hinton thinks general unsupervised first, specialization after 1:12:00 RBF network 1:15:00 Different definitions of interpolation 1:22:50 latent contrastive predictive models 1:25:00 New architectures that aren't contrastive have come out 1:29:30 No, they will be able to reason 1:30:00 What would prove that neural networks can reason? 1:35:30 RNNs are the networks that can train with variable number of layers 1:37:28 Nobody can train a neural net to print the nth digit of pi (I can). Yeah, once we figure out basic things we might be able to try mathematical concepts. 1:45:00 System 1 and 2 in chess and driving 2:07:10 Convolution is nothing more than a performance optimization by giving the network pre-knowledge that interesting features are spatially local A lot of tearing down of not 100% correct analogies of neural networks and what might actually model them well - 2:30:30 It's impossible for a neural network to find the nth digit of pi 2:34:45 Discrete vs smooth... Have both systems? (Actions distill, Jordan Peterson) 2:36:30 (The real world is limited) is it because neural nets only use textures? No, resolution is low, or it would blow up (Man that accent was tough for me) 2:45:30 Summary of that last interview. 
Intuition is fine, but mathematical rigor doesn't apply well with that definition 2:47:30 We need a better definition of what kind of interpolation is happening, and that will help us progress 2:50:00 It's hard to figure out where researchers exactly disagree because of politeness 2:53:00 It's all about pushing back on the limitation that neural networks can't extrapolate 2:54:40 Digits of pi again. It's not what he's talking about actually, too advanced. He's talking about a cat jumping in a place it's never seen before (Tesla predicts paths of cars). He thinks eventually we'll get there, but I'm not as optimistic. 2:56:00 There's an article by Andre Ye that annoyed him because it invoked interpolation vs extrapolation to say they'll never do it, which is the real question 2:57:10 At the end of the day, neuron signals are continuous functions, but somehow they produce digital reasoning. But will it be efficient? 2:59:00 But there is no discrete thing (actions) 3:00:40 (There you go. Yes. It's going to be hard. But that's the only way for a neural network to do it, and calculators aren't going to discover profound truths.) 3:01:30 (Omg it feels like they're starting to think the way I do about it. System 1, system 2) It's insanely powerful to train a discrete algorithm on top of neural network. Longer term possibility. 3:05:00 Underexplored. Feature creep? (No! That's insane. Is general intelligence feature creep?) 3:06:30 Hard to train (it seems the opposite to me) Getting to do discrete stuff involves lots of hacks. 3:07:30 "TABLE 1" Attacks attack on paper 3:17:00 You can't initialize a neural net with zeros 3:18:00 We're comparing neural nets to the entire human race and its culture and inheritance
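On the 3:17:00 note about zero initialization, a quick numerical check for one specific case: a one-hidden-layer ReLU network with every parameter set to zero and a mean squared error loss (all assumed here for illustration, with a hand-written backward pass). In this setup the hidden activations are zero and the output weights are zero, so every weight gradient vanishes and only the output bias receives a signal.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))
y = rng.normal(size=(32, 1))

# All parameters start at exactly zero.
W1, b1 = np.zeros((5, 8)), np.zeros(8)
W2, b2 = np.zeros((8, 1)), np.zeros(1)

# Forward pass of a one-hidden-layer ReLU network.
z = x @ W1 + b1
h = np.maximum(z, 0.0)
pred = h @ W2 + b2

# Manual backward pass for the mean squared error.
d_pred = 2.0 * (pred - y) / len(x)
grad_W2 = h.T @ d_pred                   # zero because h is zero
grad_b2 = d_pred.sum(axis=0)             # the only non-zero gradient
d_h = d_pred @ W2.T                      # zero because W2 is zero
d_z = d_h * (z > 0)
grad_W1 = x.T @ d_z
grad_b1 = d_z.sum(axis=0)
```

After a gradient step only `b2` has moved, so the same thing happens on the next step and the network can only ever learn a constant. With other activations the details differ (hidden units typically stay identical to each other rather than dead), but the usual fix is the same: break the symmetry with random initial weights.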
I'm only 20 minutes in, so will remove this comment if it's answered in the discussion, but.... How do you think smooth activation functions (e.g. ELU) would affect the polyhedral covering of the feature space? If ReLU functions create hard boundaries between separate polyhedra, would smooth functions create smooth boundaries? Or perhaps weighted combinations of polyhedra?
On the one hand, sure, arguments which explicitly state repeatedly that they apply to piecewise linear functions do not immediately apply outside that assumption. On the other hand, that is not evidence against a more general interpretation of the conclusions, and there is soft evidence that in many problem spaces, NNs, regardless of their activation functions, are driven towards chopping up space into affine cells. Some examples of such soft evidence are 1a) the dominance of piecewise linear activation functions overall or otherwise 1b) the dominance of activation functions that are asymptotic (including your softplus example) and 2) the dominance of NN nodes structured as nonlinear functions of linear combinations as opposed to nonlinear combinations. The consequence of 2) is that softness still falls along a linear hyperplane boundary! And given 1b) there is a distance at which it effectively behaves as a piecewise linear function. It becomes a problem-specific empirical question as to how much an NN actually leverages the curvature versus the asymptotic behavior. My claim, and it's just a gut conjecture based on soft evidence at this point, is that for most problems and typical NNs, any such activation function curvature is incidental and/or sometimes useful for efficient training, and that's why ReLU and the like, which "abandon all pretense at smooth non-linearity", as I said in the video, are dominating.
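Point 1b) is easy to check for softplus specifically, since the gap to ReLU has a closed form: softplus(x) - relu(x) = log(1 + exp(-|x|)), which is largest at x = 0 (where it equals log 2) and decays exponentially on both sides. A minimal NumPy check, with the sample points chosen arbitrarily:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, x)

x = np.array([-20.0, -5.0, -1.0, 0.0, 1.0, 5.0, 20.0])
gap = softplus(x) - relu(x)              # matches log(1 + exp(-|x|))
closed_form = np.log1p(np.exp(-np.abs(x)))
```

So a few units away from zero the two activations are numerically almost indistinguishable; how much a trained network actually relies on the curved part near zero is the empirical question raised above.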
I believe most of this is just a byproduct of the fact that brain neurons operate in analog space while computer neural networks are digital, which is an approximation of analog data where sampling is always relevant. The other issue is the fact that all data in neural networks are collected together in a singular bucket with a singular answer for various learned scenarios, whereas in the brain things are much more decomposed into component pieces or dimensions which become inputs into higher order reasoning processes. And this is what leads to the human cognitive evolution that creates language from symbols, with embedded meanings and things like numbers and mathematics. An analogy for this is to say that each individual Arabic numeral has a distinct identity function (learned symbol pattern recognition) corresponding to a set of neurons in the brain. Separate from that you have another set of neurons that have learned the concept of numbers and can associate that with the symbol of a number. And separate from that there is a set of neurons that have learned the principle of counting associated with numbers. That is a network of networks that work together to produce a result. And as such the brain can learn and understand linear algebra and do calculations with it because of the preservation of low level atomic identity functions or logic functions that are not simple statistics problems. Meaning the brain is a network of networks where each dimension is a distinct network unto itself, as opposed to a singular statistical model.
I think that's a very nice way of looking at things. In a sense, NNs breaking up the space into polyhedra is like a simple hacked version of a network of networks. They are encoding little subunit networks, by virtue of the ReLU activations, that are then forced into shared latent value array representations. That introduces artifacts and isn't as flexible as networks of networks. The killer for trying to train networks of networks is the combinatorial blowup that happens when exploring the space of all possible connection configurations. And it's why so much of what makes NNs work today is actually the human engineering of certain network architectures that structurally hardcode useful priors. Great comment, thank you!
@@nomenec Thanks. It is definitely a much simpler quantification effort for dealing with probability calculation within a bounded context as defined by the algorithmic model, data provided for training and tuning of calculations. However, there is no reason not to investigate more open ended architectures, especially as a thought exercise of how such a thing would be possible.
More or less! We are going to drop another show with RandallB next month. We now think that MLPs are more extrapolative than previously thought (along the lines of Randall's/Yann's "everything is extrapolation" paper).
Okay I consider myself pretty reasonably intelligent but have no idea what this is about. Can someone tell me so I can research more and expand! I hear dimensions, neural networks and some geometry. Happy new year!
I think I understand the bimodal discussion of High dimension interpolation and extrapolation. Linear regression is a fitting of interpolated volume in 3 dimensions while extrapolation is any 3 dimensional values outside the interpolated value volume
I wonder if a machine's ability to find a non-linear function or to integrate one would be analogous to what Stephen Wolfram calls computational reducibility? Certainly, an agent can call a non-linear function rather than a piece-wise linear model.
There are certainly linear relationships at various levels of physical description: the first and second laws of thermodynamics, the Schwarzschild radius vs mass, photon energy vs frequency, etc. Whether or not any of these actually "exist in the Universe" is something philosophers have argued for at least millennia and probably will argue until heat death. From my perspective, they "exist" as epistemic descriptions of emergent phenomena.
@18:50 “machine learning hasn’t even advanced to the second order” Except many models don’t use piecewise linear activations like relu, for example most transformers use some version of the exponential linear unit.
It doesn't really make a material difference, ELUs just give you a smoother boundary between the polyhedra (the shift is the same as the bias parameter). NNs are still "chopping" up the input space with compositions of basis functions (which form partial keys of an epic hash table). It's just more understandable/intuitive with ReLUs. NNs are learning spatial boundaries.
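A tiny sketch of the "chopping up" picture, under my own assumptions: a one-hidden-layer ReLU net on a scalar input with made-up weights (`w1`, `b1`, `w2`, `b2` are hypothetical). Each hidden unit contributes one kink at x = -b/w, and between neighbouring kinks the network is exactly affine. In higher dimensions the kinks become hyperplanes and the intervals become polyhedra.

```python
def relu(z):
    return max(0.0, z)

def net(x, w1, b1, w2, b2):
    # One hidden ReLU layer, scalar input, scalar output.
    return sum(v * relu(w * x + b) for w, b, v in zip(w1, b1, w2)) + b2

def breakpoints(w1, b1):
    # A unit switches on/off where its pre-activation w*x + b crosses zero.
    return sorted({-b / w for w, b in zip(w1, b1) if w != 0.0})

def is_affine_on(f, a, b, tol=1e-12):
    # Midpoint test: an affine function hits the chord exactly at the midpoint.
    return abs(f((a + b) / 2.0) - (f(a) + f(b)) / 2.0) <= tol

# Hypothetical weights, chosen only for illustration.
w1, b1 = [1.0, -2.0, 0.5], [1.0, 1.0, -1.0]
w2, b2 = [1.0, 0.5, -3.0], 0.2
f = lambda x: net(x, w1, b1, w2, b2)

kinks = breakpoints(w1, b1)   # three kinks, so four affine cells on the line
print("kinks:", kinks, "cells:", len(kinks) + 1)
print("affine between neighbouring kinks:", is_affine_on(f, kinks[0], kinks[1]))
print("affine across a kink:", is_affine_on(f, 0.0, 1.0))
```

The midpoint test is only a quick check, not a proof; with several kinks inside an interval the deviations could in principle cancel.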
Imputation can make interpolation appear to be extrapolation. But more importantly people don't understand the relationship between interpolation extrapolation and the Chomsky hierarchy. You simply cannot do extrapolation with context-free grammars. Transformers are context-free grammar capable not more.
Thanks James! According to arxiv.org/pdf/2207.02098.pdf transformers map to finite state automata computational model with no augmented memory and can recognise finite languages only
Great stuff, extremely interesting topic and strong content. But is there a version without the background music - podcast or video? I'm probably just a grumpy old fart, but I find it really hard to concentrate, it's like trying to follow a conversation while somebody is simultaneously licking my ear.
drive.google.com/file/d/16bc7XJjKJzw4YdvL5rYdRZZB19dSzR70/view?usp=sharing here is the intro (first 60 mins) with no background music, you old fart! :)
I would like to see this "absolute dog". I think it's possible to generate one: just reward the GAN for both realism and activation of a particular neuron. I wonder what that doggest dog ever would look like. I would also like to see the doggest cat, and the cattest dog.
This obsession with piecewise linearity is a bit too much. We are in a highly distributed space, the nature of the unit does not give that much insight into the system. It would be like saying consciousness arose because real neurons are integrate and fire units. Piecewise linearity, like bayesian priors on the weights or on the latent space help because they make the model simpler, reduce the hypothesis space and stabilise learning, not because they hold the key to the mysteries of distributed computation and learning.
On the contrary, it gives wonderful insight. Namely that NNs are a storage mechanism, not a computation mechanism. They have massive representational power i.e. they partition euclidean space up to a reasonable dimensionality which would still confer statistical generalisation before the curse bites -- but, it's clearly not possible to "memorise infinity". R^N
@@TimScarfe I don’t see the depth that distinction brings. Writing down a function requires storage, the function is still performing computation. Is f(x) = 2x a storage mechanism ? The work of David Pfau on invariance makes much more sense to me to understand what those functions do. But I’ll read the paper, it might give me more insight into what you’re saying.
@@TimScarfe I would say that the difference between storage and computation may be very subtle (of course, in the trivial case of saving raw data it seems very different, but saving raw data is kind of like computing the identity function - not very interesting). Most of the smartest algorithms in classical algorithmics I have seen were usually focused on finding a clever data structure which allows some operations to be performed efficiently (e.g. Fibonacci heaps). Isn't that storage? Even if NNs are just databases, the fact that they are able to respond to queries in modalities like images and natural language seems to indicate that their query engine is non-trivial and allows an effective (not to be confused with optimal) form of compression, so the search space does not explode like it often does in combinatorial setups.
As a result of our seeing truth to be the thorough-going harmony of all the concepts we have at our command, the question forces itself upon us: Yes, but does thinking even have any content if you disregard all visible reality, if you disregard the sense-perceptible world of phenomena? Does there not remain a total void, a pure phantasm, if we think away all sense-perceptible content? That this is indeed the case could very well be a widespread opinion, so we must look at it a little more closely. As we have already noted above, many people think of the entire system of concepts as in fact only a photograph of the outer world. They do indeed hold onto the fact that our knowing develops in the form of thinking, but demand nevertheless that a "strictly objective science" take its content only from outside. According to them the outer world must provide the substance that flows into our concepts. Without the outer world, they maintain, these concepts are only empty schemata without any content. If this outer world fell away, concepts and ideas would no longer have any meaning, for they are there for the sake of the outer world. One could call this view the negation of the concept. For then the concept no longer has any significance at all for the objective world. It is something added onto the latter. The world would stand there in all its completeness even if there were no concepts. For they in fact bring nothing new to the world. They contain nothing that would not be there without them. They are there only because the knowing subject wants to make use of them in order to have, in a form appropriate to this subject, that which is otherwise already there. For this subject, they are only mediators of a content that is of a non-conceptual nature. This view - If it were justified, one of the following three presuppositions would have to be correct: 1. 
The world of concepts stands in a relationship to the outer world such that it only reproduces the entire content of this world in a different form. Here "outer world" means the sense world. If that were the case, one truly could not see why it would be necessary to lift oneself above the sense world at all. The entire whys and wherefores of knowing would after all already be given along with the sense world. 2. The world of concepts takes up, as its content, only a part of "what manifests to the senses." Picture the matter something like this. We make a series of observations. We meet there with the most varied objects. In doing so we notice that certain characteristics we discover in an object have already been observed by us before. Our eye scans a series of objects A, B, C, D, etc. A has the characteristics p, q, a, r; B: l, m, b, n; C: k, h, c, g; and D: p, u, a, v. In D we again meet the characteristics a and p, which we have already encountered in A. We designate these characteristics as essential. And insofar as A and D have the same essential characteristics, we say that they are of the same kind. Thus we bring A and D together by holding fast to their essential characteristics in thinking. There we have a thinking that does not entirely coincide with the sense world, a thinking that therefore cannot be accused of being superfluous as in the case of the first presupposition above; nevertheless it is still just as far from bringing anything new to the sense world. But one can certainly raise the objection to this that, in order to recognize which characteristics of a thing are essential, there must already be a certain norm making it possible to distinguish the essential from the inessential. This norm cannot lie in the object, for the object in fact contains both what is essential and inessential in undivided unity. Therefore this norm must after all be thinking's very own content. This objection, however, does not yet entirely overturn this view. 
One can say, namely, that it is an unjustified assumption to declare that this or that is more essential or less essential for a thing. We are also not concerned about this. It is merely a matter of our encountering certain characteristics that are the same in several things and of our then stating that these things are of the same kind. It is not at all a question of whether these characteristics, which are the same, are also essential. But this view presupposes something that absolutely does not fit the facts. Two things of the same kind really have nothing at all in common if a person remains only with sense experience. An example will make this clear. The simplest example is the best, because it is the most surveyable. Let us look at the following two triangles. [Figure: Two Triangles] What is really the same about them if we remain with sense experience? Nothing at all. What they have in common - namely, the law by which they are formed and which brings it about that both fall under the concept "triangle" - we can gain only when we go beyond sense experience. The concept "triangle" comprises all triangles. We do not arrive at it merely by looking at all the individual triangles. This concept always remains the same for me no matter how often I might picture it, whereas I will hardly ever view the same "triangle" twice. What makes an individual triangle into "this" particular one and no other has nothing whatsoever to do with the concept. A particular triangle is this particular one not through the fact that it corresponds to that concept but rather because of elements lying entirely outside the concept: the length of its sides, size of its angles, position, etc. But it is after all entirely inadmissible to maintain that the content of the concept "triangle" is drawn from the objective sense world, when one sees that its content is not contained at all in any sense-perceptible phenomenon. 3. Now there is yet a third possibility. 
The concept could in fact be the mediator for grasping entities that are not sense-perceptible but that still have a self-sustaining character. This latter would then be the non-conceptual content of the conceptual form of our thinking. Anyone who assumes such entities, existing beyond experience, and credits us with the possibility of knowing about them must then also necessarily see the concept as the interpreter of this knowing. We will demonstrate the inadequacy of this view more specifically later. Here we want only to note that it does not in any case speak against the fact that the world of concepts has content. For, if the objects about which one thinks lie beyond any experience and beyond thinking, then thinking would all the more have to have within itself the content upon which it finds its support. It could not, after all, think about objects for which no trace is to be found within the world of thoughts. (Rudolf Steiner, The Theory of Knowledge Implicit in Goethe's World Conception)
@@ultrasound1459 Nothing much in terms of AI :(, but it has been nice spending 3 months in Europe and soon a couple in Asia. My gray AWS hairs are gone ^^
Thanks for posting this episode! And as "that guy" at 2:08:19, I'm happy to say I found the discussion very interesting and it's changed my mind :)
Thank you, Andre! And thank you for your article. Apologies we couldn't recall your name on the fly; we did make sure to show your name in the video though ;-) I'm very curious, how did the discussion change your views?
Hey Andre, we really appreciate you dropping in here. Great article! For the benefit of folks -- here it is medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126
And this was the tweet where LeCun picked it up twitter.com/ylecun/status/1409940043951742981
@@nomenec For sure - Twitter really isn't the best platform to exchange nuanced perspectives, so when the Twitter conversation began, I took the disagreement (i.e. between LeCun, Pinker, Marcus, Booch, etc.) to be a sign that it was one of those ambiguous problems that one can't really confidently make their mind up about. A lot of the Twitter thread content seemed pretty speculative or pulled willy-nilly without much organization. When I first read the paper on extrapolation, I was even more unsure of what to think - I was actually wondering about many of the questions that you all asked in the interview, e.g. why choose the convex hull instead of another definition? Does this mean that neural networks are actually extrapolating? etc. After listening to LeCun and Balestriero's responses, I have a much better-informed perspective on the paper's context and argument, and I think it's probably correct.
Thanks guys for all the work you do arranging context and asking insightful questions!
@@MachineLearningStreetTalk p
Just spent 5 hours watching this 3-hour video. This is both dense and profound. Great job, best episode yet in my book!
Thank you for your time and commitment!
I spent 7h lmao
I'm still too new to machine learning
I love this episode
This is incredible! Ms. Coffee Bean's dream came true: the extrapolation interpolation beef explained in a verbal discussion! 🤯
Cannot wait to watch this. So happy about a new episode from MLST. You kept us waiting.
Thank you, Letitia! We burned the midnight oil for weeks on this one; we are looking forward to the community enjoying (hopefully!) the effort. We are grateful to both Yann LeCun and Randall Balestriero for spending time with us!
I wait for your videos in more excitement than I wait for my favorite tv shows' new seasons. Looks amazing!
Starting off the new year with a bang. Tim, Keith and Yannic - thank you so much for this quality work. You can clearly tell how much love and dedication goes into every episode. Also the intros just continue to amaze me - the level of understanding you approach the variety of topics with is extremely inspiring.
A couple of minutes into the video and you break some of the fundamental assumptions I had about deep learning/neural nets. Jeez, man. Excited for this 3-hour-long video.
And as usual the production quality of the videos keeps getting better. Happy New Year Guys
Happy New Year! Tim and I certainly walked away with a very different (upgraded, in my opinion) view on neural nets. Would love to learn how, if at all, your views change after watching.
The content in this channel is just mind blowing. But the main reason I come back is the thoughtful editing and introductions and reflections of the content by dr Tim. I cannot keep up yet in grasping all the content in real time but that is exactly why it's so awesome. Thanks!
Thank you guys, I've not been more amazed by anything in AI than this completely brand new revelation of neural network's internal working. Insanely interesting and beautiful.
I think I have seen this video over a dozen times, but every time I keep learning something new. Thx MLST!
Thanks for the shoutout at 38:01 Tim! The Discord channel rocks 😆
An additional note on extrapolation that people might find interesting:
- In effect, the ReLU activation function prevents value extrapolation to the left. So when these are stacked, they serve as "extrapolation inhibitors".
- This clipping could be applied to other activation functions to improve generalization (or forewarn excessive extrapolation)!
- I.e., clipping the inputs to all activation functions within a neural network to be in the range seen at the end of training time will reduce large extrapolation errors at evaluation time (and counting the number of times an input point is clipped throughout the network could indicate how far "outside the relevant convex hull" it is).
The clipping shouldn't be introduced until training is done (because we don't have a reason to assume the initialization vectors are "good" at identifying the relevant parts of the convex hull). But I'd be willing to bet that this "neuron input clipping" could improve generalization for many problems, is part of why ReLU works well for so many problems, and can prevent predictions from being made at all for adversarial inputs.
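A rough sketch of the "neuron input clipping" idea in the bullets above, written under my own assumptions: a hand-built ReLU MLP in plain Python with made-up weights, where `ClippedMLP`, `calibrate` and `forward` are hypothetical names and ReLU is applied at every layer for brevity. After "training" (here: just a calibration pass over the training points) each node remembers the min/max pre-activation it saw; at evaluation time inputs to the nonlinearity are clipped to that range and the clips are counted.

```python
import math

def relu(z):
    return max(0.0, z)

class ClippedMLP:
    """Tiny ReLU MLP that remembers the pre-activation range seen per node."""

    def __init__(self, layers):
        # layers: list of (weight_rows, biases); the values used below are made up.
        self.layers = layers
        self.lo = [[math.inf] * len(b) for _, b in layers]
        self.hi = [[-math.inf] * len(b) for _, b in layers]

    @staticmethod
    def _preact(rows, biases, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(rows, biases)]

    def calibrate(self, points):
        # Record min/max pre-activation per node over the training points.
        for x in points:
            h = x
            for k, (rows, biases) in enumerate(self.layers):
                z = self._preact(rows, biases, h)
                self.lo[k] = [min(a, v) for a, v in zip(self.lo[k], z)]
                self.hi[k] = [max(a, v) for a, v in zip(self.hi[k], z)]
                h = [relu(v) for v in z]

    def forward(self, x):
        # Returns (output, number of node inputs that had to be clipped).
        h, n_clipped = x, 0
        for k, (rows, biases) in enumerate(self.layers):
            z = self._preact(rows, biases, h)
            clipped = [min(max(v, lo), hi)
                       for v, lo, hi in zip(z, self.lo[k], self.hi[k])]
            n_clipped += sum(c != v for c, v in zip(clipped, z))
            h = [relu(c) for c in clipped]
        return h, n_clipped

model = ClippedMLP([([[1.0, -1.0], [0.5, 2.0]], [0.0, -0.5]),
                    ([[1.0, 1.0]], [0.1])])
train = [(i / 2.0, j / 2.0) for i in range(-2, 3) for j in range(-2, 3)]
model.calibrate(train)

_, n_in = model.forward((0.5, -0.5))       # a point from the training grid
_, n_far = model.forward((100.0, -100.0))  # far outside the training region
print("clips for a training point:", n_in, "| clips for a far point:", n_far)
```

The clip count is only a crude "how far outside" signal; whether clipping actually helps generalization is the bet stated above, not something this sketch demonstrates.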
"[Activation clipping] ... can prevent predictions from being made at all for adversarial inputs." Would love to hear more about this line of thinking! Both practical side and what this illuminates on the theory side about "what does it mean to be adversarial / robust / etc". You guys didn't get a chance to discuss adversarial stuff on your chat episode much at all but it seems to abut the topic of generalization quite often which in turn tends to come up with geometric interpretation.
@@oncedidactic happy to clarify. One way I like to think about it is that every basis function inside an MLP (the activations at a node after applying the nonlinearity) generates a distribution. If you have 10k points at training time, then for every internal 1D function you can plot the distribution of the 10k values at those points. That should give a pretty precise definition of the CDF (from central limit theorem), and rather tight bounds of what is "in distribution" (/ likely given observations). The issue is that the generated distribution of values at internal nodes over training data is (obviously) not independent of the training process. So to get an accurate estimation of the distributions we withhold validation data, which provides a true estimation of the error function (the error of the model over the space covered by the validation data).
Now when you apply the model to new data, you can look at the values produced at internal nodes relative to the distributions seen at training / validation time. If you observe that a single evaluation point produces "out-of-distribution" (extrapolative) values for a substantial number of nodes in the model, then we know for certain that the point is not "nearby" to our training data. Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱
One of the core mechanisms for making approximations outside the bounds of training data is projecting new points back into the region of space where you can make an approximation (usually onto the convex hull). So in practice we can project points onto the convex hull of the 1D basis functions by clipping all values to the minimum and maximum seen at training time. We would want to do this mainly because we have no reason to assume that the linear fit produced by one node (and its infinite linear extrapolation to the right) is correct! No training data justified that behavior. If we let our basis functions extrapolate without bounds then our error *definitely* grows without bounds. If we prevent infinite extrapolation, then we *might* be bounding our error too.
To tie it all together, the distributions of values seen at validation time (more validation data ➞ better distribution estimates) should *precisely* match the distributions for testing. If they do not, then you know that something about the data has changed (from training & validation time) and your error will change in a commensurate fashion (in an unknown way). This relates to another important fact: we can never modify a model based on validation error. If we make decisions based on validation error, then we entirely undo the (necessary) orthogonality of the validation set (and hence remove our ability to estimate error).
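A small sketch of the "distribution per internal node" idea, with a caveat on the statistics: as far as I know, the standard uniform guarantee for an empirical CDF is the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, P(sup|F_n - F| > eps) <= 2 exp(-2 n eps^2), rather than the central limit theorem, and it assumes i.i.d. held-out samples. The function names and the synthetic "node values" below are my own placeholders.

```python
import math
from bisect import bisect_right

def dkw_epsilon(n, delta=0.05):
    # Width of a uniform confidence band around an empirical CDF built from
    # n i.i.d. samples, valid with probability at least 1 - delta (DKW).
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def empirical_cdf(samples):
    ordered = sorted(samples)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n

def tail_score(cdf, value):
    # Two-sided tail mass: near 0 means the value sits at or beyond the edge
    # of what the held-out data produced at this node.
    p = cdf(value)
    return 2.0 * min(p, 1.0 - p)

# Placeholder for one node's values over 10k held-out points.
held_out = [i / 10000.0 for i in range(10000)]
cdf = empirical_cdf(held_out)

eps = dkw_epsilon(len(held_out))
print(f"CDF known to within +/-{eps:.4f} (95% band)")
print("typical value score:", tail_score(cdf, 0.5))
print("out-of-range value score:", tail_score(cdf, 5.0))
```

Counting how many nodes give a near-zero score for a new point is one way to turn the "out-of-distribution for a substantial number of nodes" test into a number.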
@@tchlux Thanks for the detailed reply! Any further reading you can point to? It makes perfect sense to me you would want to use the clipping / projection to learned convex hull to prevent wild extrapolation that leaves you at the mercy of "out-of-distribution", be that natural or adversarial. I can't think of an example where this is implemented but my knowledge is *not* deep. I imagine this curtails the "magic" of kinda-sorta extrapolating well sometimes, but you win the tradeoff because the limitation of your model is predictable. Or in other words predictably dumb is better than undependably intelligent, as a system component.
"Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱" This is so insightful yet simple and really reframes the whole issue for me. Not to pile on too much, but this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before."
@@oncedidactic
> Any further reading you can point to?
I mostly just think about things in terms of basic linear algebra. If you get super comfortable with matrix multiplication, linear operations, and think really hard about (or better, implement) a principal component analysis algorithm (any method), then you'll start to form the same intuitions I have (for better or worse 😜).
I try to think of everything in terms of directions, distances, and derivatives (/ rates of change). I can't think of any "necessary" knowledge in machine learning that you can't draw a nice 2D or 3D picture of, or at least produce a really simple example. I suggest aggressively simplifying anything until you can either draw a picture or clear example with minimal information. If it seems too complicated, it probably is.
Stephen Boyd's convex optimization work (YouTube or book) is great. And 3blue1brown is wonderful too.
> this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before."
Exactly. People will probably continue to talk about it forever, but it only makes sense to *extrapolate* in very specific scenarios with relatively strong assumptions. What we really want in most cases is a model that identifies a low dimensional subspace of the input where it can accurately interpolate.
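For anyone taking up the suggestion to implement PCA, here is about the smallest version I can think of: power iteration on the covariance matrix to get the first principal direction, in plain Python. This is just one possible method and the function name and toy data are my own; it finds the single direction along which the data varies most, which is the simplest case of "a low dimensional subspace of the input".

```python
import math

def first_principal_component(rows, iterations=100):
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # start vector; must not be orthogonal to the top direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance, or an unlucky start vector
        v = [x / norm for x in w]
    return v

# Toy data lying exactly on the line y = 2x.
data = [(float(t), 2.0 * t) for t in range(-5, 6)]
pc = first_principal_component(data)
print("first principal direction:", pc)
```

Projecting inputs onto the top few such directions gives the kind of low dimensional picture you can actually draw.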
I keep coming back to this. One of the best MLSTs.
Fantastic discussion and explanation of the thinking behind interpolation, extrapolation and linearisation. This has really helped shift the needle towards the ultimate problem we all face, helping decipher what input is relevant to the task. If possible, please do V.2 covering some of the other concepts Prof LeCun was talking about. Could be a series on its own as so good! Mike Nash - The AI finder
Sooo what are the odds we can get a conversation between LeCun and Chollet? Would love to watch them have a discussion on this.
You guys really kept us waiting.
Thank you! MLST for this one.
Thank you for creating this amazing channel. The amount of insight one can get sitting for three hours with the professionals is immense!
WOHOOOO! I'm so so stoked to see this video!
Time to drop everything and watch another epic interview by the MLST team!
Cheers! Just don't drop your Chai! ;-)
Time to write the afternoon off and make the most of an incredible opportunity in listening to this discussion.
Occam's razor always makes straight cuts (in reference to piecewise linear functions) was a great line!
Came here from Lex Fridman video, and gotta say these make the perfect combination (especially now that Lex has arched into some topics outside AI). Keep delivering this fantastically specified content👍
Thank you, Jusso! I really appreciate that. Tim and I often struggle with finding the right balance while keeping it (hopefully) entertaining. It's not easy and we are also trying to brainstorm on ways to improve. So, it's great to hear from a satisfied viewer!
Love these long form videos -- really appreciate the effort you guys are putting in!!
5 minutes in and it feels like extended Christmas :D
So glad to have the show back!
Great episode! These long deep dives are amazing, I get a lot of intuition from them and they are a great point to start reading more papers on the topic (who except Yann can keep up with arXiv these days...) Really appreciate the effort and have a great 2022 :)
imagine the balls to make a 1 hour intro before the main discussion :D
we need more of Prof. Yann lecun!
I spent the past two years in uni and attended all the classes related to ML and AI to try to understand DNNs, because no one in the CS department could answer my questions in a way that let me intuitively understand what a DNN is doing, how, and why.
Tim, thanks for the enlightening explanation.
Thanks a lot Michael! But don't thank us too much, most of this wisdom is coming directly from Chollet, Balestriero and LeCun we are just digesting their fascinating ideas and presenting them in the best way we can.
Well done guys, it's really a pleasure to be diving into this field
Interesting talk - I'm working on a pile of notes, amplifications, and critiques.
They are back ❤️ if only youtube decided to use that bell 🔔. Great talk - thank you very much for all your efforts!
Great episode, keep up the good work.
Agree with the reasoning = optimization, at least the reasoning that we currently do with machine learning. There is also a well known result in optimization which states that separation = optimization, where separation means finding a separating hyperplane between a point and some convex hull. So in other words membership, or interpolation is optimization.
Many of these concepts are well known in the optimization community for some time now. For instance linear vs nonlinear or discrete vs continuous are known to be of little difference, while convexity is the main concept that makes things tractable. Also the curse of dimensionality can be avoided if you formulate the problem combinatorially as a graph for instance which is dimensionless.
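To illustrate "membership is optimization" in code: the sketch below decides whether a query point lies in the convex hull of a point set by minimizing the distance from the query to the hull with Frank-Wolfe iterations. This is my own choice of method for a dependency-free example (a linear program is the more usual tool), and the function names are hypothetical. If the returned distance is positive, the vector from the query to the nearest hull point is the normal of a separating hyperplane.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def distance_to_hull(points, q, tol=1e-9, max_iter=10000):
    """Approximate distance from q to conv(points), plus the nearest hull point."""
    x = list(points[0])                            # any hull point is a valid start
    for _ in range(max_iter):
        g = [xi - qi for xi, qi in zip(x, q)]      # gradient of 0.5 * ||x - q||^2
        s = min(points, key=lambda p: dot(g, p))   # vertex that decreases it most
        d = [si - xi for si, xi in zip(s, x)]
        gap = -dot(g, d)                           # Frank-Wolfe duality gap, >= 0
        if gap <= tol:
            break                                  # x is (near) optimal
        step = min(1.0, gap / dot(d, d))           # exact line search on [x, s]
        x = [xi + step * di for xi, di in zip(x, d)]
    return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, q))), x

square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
d_in, _ = distance_to_hull(square, (0.5, 0.5))         # inside: "interpolation"
d_out, nearest = distance_to_hull(square, (2.0, 0.5))  # outside: "extrapolation"
print("inside distance:", d_in)
print("outside distance:", d_out, "nearest hull point:", nearest)
```

Frank-Wolfe can converge slowly when the nearest point sits on a face of the hull, so treat the result as approximate and the iteration cap as a safety net.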
Happy New Year's!!! I've missed you guys
42:31 - 42:41
In 1993 there was an architecture called ANFIS,
which combines the interpretability and monotonicity of a fuzzy logic inference system
with the adaptiveness of a neural network.
ANFIS guaranteed smooth gradual change of prediction caused by slight modification of input because of the smoothness and monotonicity aspect from fuzzy logic while still being able to be optimized using gradient based optimizer if desired
I love the analogy 'I feel like I'm standing on Pluto', nice :)
Just got done watching it. Grateful for the great work the team has done. Cheers :)
Tim, your statement about neural networks being analogous to classical decision trees absolutely hits home.
@1:36min: if we create labels for important stuff. These can be used again. Kind of 'meta propagation'. To be able to take something up. Building up a vocabulary.
Note: IF we can have a tiny center where a lot can happen. This can be applied on say: a hand or a foot. If we have A and B connected, we do not need all that happens in between. I guess one wants to create something that is applicable everywhere. Teleportation.
@1:46min: something differentiated, and molded together with related stuff (not yet known). Like velocity and acceleration together with the images related to it. Next normalize such information, into single principles (i guess normalization and making objects with what is normalized might be a way of creating : concepts).
Note: IF the will can be defined as 'one or a couple of objects, taken together at once', then you must be able to work with such (like how to work in a database). Perhaps apply it as a regular expression? This can become very very aggressive, and thus interesting.
Note: a language such that we can derive where the machine is about. Like: visualizing what happens. (disentangle). Normalization.
To normalize a principle. Perhaps making a database of normalized principles.
@1:56min: perhaps create classes, like : per dimension a way to go about.
@02:00min: Match! Got the same idea somewhat.
Note: a language that generates generation 5 programming languages (relational language). Then terms normalized, put in a dataset. So, with a proper 'calculus', one can create 'discrete' objects, like: if it repeats a pattern on itself again: one needs 2 circles. (example).
You do not need to know everything, If you get a couple of dimensions you work in. Like: 1, 2 and 4. Then this can be called discrete because you solve it with (underneath), these. I label this will because you can let those 3 work together and learn like that.
Your build quality here is really high. Nice work. My only comment on this video was that I had to give parts of this video my full attention. That is probably a good thing.
Lol, cheers and thank you!
Note@15 min: if you create hyperplanes, this, my guess, will partake into extra usable information per hyperplane. No proof though.
Note@28 min: one OR at a time; not to give properties to objects such that you lose the 'single or instant'.
Note@38min: Experience pays off.
Note@41min: "math lump", creating simple datasets and putting those together. Like a sentence of 'objects'. You play with the "semantics".
Note@45min: Can one throw an object through all of the information present at hand and see what it does? Like an analysis (one object at a time (no dogma)), and see which manifold is strong and which is not (to entangle time as it were (@ 46.50 min)).
Note@46min: I simply love this video!
Note@53min: So if we have a ball (lot of density), we could encode only its traits we want to have and work with that.
Note@1:02min: You need to build from certain objects, only a single spot. Not an object you need to redraw in each case. Such that it can be applied.
-(question) IF you are inspired at 50 minutes and see something at 60 for more inspiration and add it to the 50th minute inspiration. IS this wrong?
-IS it possible to let some data collect some data over time and notice, as it were, where it is going? Perhaps even creating objects that are good at this and adding these to one's data analytic toolkit. Having one such single object is interesting simply in itself. Perhaps creating a vocabulary of some kind???
Term: "dataplatonic" mindset
@on the curse: : "jackpot ;-)" ,,
Note@51min: i guess it is utile to acquire virtualized versions of objects. Such that the data takes account of 'objects', i.e. : terms. Like a circle or a square as circle and square.
So, if we have a term, like a concept, we should generalize(?) it into something that we can use. So getting rid of 'drawing' objects... I guess a 'vocabulary' of a dataset is a nice concept as well..
How to make a concept. Keep track of it. Like: a point drawn, becomes a sphere. So if we create an animation, we re-encode this into data for data analysis... Perhaps even creating synesthesia for the sentences created. Such a 'gift', might parametrize for people watching.
Current conclusion: Each thing you want to analyse needs to be built up itself, such that you do not take big objects but building block parameters. Such that the result is not about objects but building blocks that might be like bigger objects, but without the crap (data intensive). One wants to get rid of .. and let the computer do it. Building the right concepts by the computer and by guidance of the hand.
Note@01:33min: if you got a function where the energy is understood (being zero), you can grow and shrink it and add it to 'a sentence'. Next you should be able to adapt (add, subtract) these, use such functions in line, and create a kind of word sequence.
Wow! What an amazing video! Best one yet!
The talk was beautifully presented... Thank you all
My question is: why are we considering that the new sample (the test set) lies outside the convex hull of the training data, considering the dataset strictly represents a domain like pictures with or without cats?
My second question is: in signal processing, the impulse contains all the frequency content, which is why any filter can be characterized by its impulse response. Having said that, for a particular domain, can we have a training set that completely characterizes the problem, and hence the ML model, which would mean any test data must then lie within the convex hull?
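A small simulation can give a feel for the convex hull question raised above. This is only a sketch, not the paper's experiment: it assumes uniform random data, and instead of the convex hull it uses the axis-aligned bounding box of the training set. The box contains the hull, so the fraction of test points inside the box is an upper bound on the fraction inside the hull. The name `fraction_in_bounding_box` and all of its parameters are made up for illustration.

```python
import random


def fraction_in_bounding_box(dim, n_train=100, n_test=500, seed=0):
    """Fraction of test points inside the training set's bounding box.

    The bounding box contains the convex hull, so this fraction is an
    upper bound on the fraction of test points inside the hull.
    """
    rng = random.Random(seed)
    train = [[rng.random() for _ in range(dim)] for _ in range(n_train)]
    # Per-coordinate minimum and maximum over the training points.
    lows = [min(p[j] for p in train) for j in range(dim)]
    highs = [max(p[j] for p in train) for j in range(dim)]

    inside = 0
    for _ in range(n_test):
        x = [rng.random() for _ in range(dim)]
        if all(lo <= v <= hi for v, lo, hi in zip(x, lows, highs)):
            inside += 1
    return inside / n_test


if __name__ == "__main__":
    for dim in (2, 20, 200):
        print(dim, fraction_in_bounding_box(dim))
```

With 100 training points, a fresh sample's coordinate lands between the training minimum and maximum with probability 99/101, so under this independence assumption the in-box fraction shrinks like (99/101)^d as the dimension d grows. Real image data is far from uniform, so this only illustrates the trend, not the numbers in the paper.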
Very very good episode guys - kudos, as always
I have a problem with LeCun's strong statement that "reasoning = optimization" (that most reasoning can be simulated by minimizing some cost). Inference/deduction is not optimization. That's not true at all.
Why is it not true?
40:51 doesn't this suggest that the input data should be transformed into a reduced dimension before training on it? Using MNIST digits, for example, the raw pixels could be transformed into the sequence of pen strokes that composed the written symbol. This might have dimensionality around a dozen rather than 784. Obviously, finding that transformation wouldn't be trivial. However, it could also allow generative models to create more realistic interpolations.
Great video - excellent conceptual discussion
Great stuff guys, LeCun is next level!:)
If you use relus, and simple ff networks yes they're tessellations but not non-linear act fns with inter-layer feedback connections. An example of the latter is the transformer hypothesis class.
~2:55:00
I think discrete vs continuous dichotomy is not so absolute. Human brain seems to be an analog system, but it can emulate discrete reasoning. Computers are discrete machines, but with neural networks they can emulate continuous reasoning. The main problem seems to be efficiency: emulating one via another is extremely inefficient, that's why dr. Balestriero noted that a hybrid system would be the most efficient.
EDIT: Yup, a little later Keith noted that, too.
Loved this one. Again
Amazing, congratulations!
Thanks Connor! We couldn't have done it without you!
Great video! Here's a question I have after reading the papers, if anybody can help me:
Hypothetically, if, say, the MNIST digits *did* lie on a lower dimensional manifold, then by definition all new data points would fall on that manifold, right? So in the Extrapolation paper, when they show in Table 1 that the test set data doesn't even fall within the convex hull of the ResNet *latent space*, this must mean either 1) ResNet is doing a poor job of learning the true latent space, or 2) MNIST digits do not actually fall on a lower dimensional manifold.
Is that right?
Reflected ReLU > ReLU 😎
I want a neural network from you Tim ❤
does GELU not smooth these polyhedra from a geodesic structure into a continuous smooth manifold?
Great, great talk! My reaction is based on the first 37' first, but before I go to sleep and forget… two (very non-expert) cents. 1) around 15', you say that NN basically try to find boundaries and don't care about the internal structure of classes. How far does this hold? Loss functions do take into account how far the data point is from the boundary of the class (how dog-typical this dog is, etc.). For sure this is only one tiny part of what 'class structure' can encompass. 2) (I'm quite sure I will find the answer in the remaining part, but) ReLU are different from previous, e.g. logistic, activation functions, which were basically smoothed separators, smoothed piecewise constant functions. ReLU are not constant on the x>0 side :-) - which I found dangerous at first (how far will this climb? how much will a single ReLU influence the outcome, on out-of-distribution test points?) - but doesn't *that* add to the ability to extrapolate, i.e. to say things about what happens far from the convex hull of training points?
Amazing video, the bass is super high though! Wish it was a little lower as it requires manual EQ
The converdivergence of x^n at x=1 saddle unstable point even implies input scaling has a convergence implication on a polynomial fit.
Awesome! Thanks again for arranging this :) !
A discrete attraction chaoform. Convergence to attractor locations as solutions of time series. Then a disjunct split and fold to exceptional zones surrounding expected precursors to exception. Then train for drop errors triggering exceptional close zone to chaoform large split discreet?
This is fucking crazy, there's just no other way to put it.
The idea of piecewise linearity of a neural network is the single biggest opening of the deep learning black box that I have ever seen
Cheers, Federico! I share your opinion as well; for me it was an eye opening view point.
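For anyone who wants to poke at the piecewise linearity idea from the comments above, here is a minimal sketch in plain Python. The weights are made-up numbers for a tiny one-hidden-layer ReLU network, not anything from the episode or the papers. It checks that when two inputs switch on the same hidden units (the same "cell"), the network is affine between them, so the output at the midpoint equals the average of the two outputs, and that this breaks once the inputs sit in different cells.

```python
def relu(v):
    return max(v, 0.0)


# Hypothetical fixed weights: 2 inputs, 3 hidden ReLU units, 1 output.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]
B1 = [0.0, -1.0, 0.5]
W2 = [1.0, -2.0, 0.5]
B2 = 0.1


def pre_activations(x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, B1)]


def activation_pattern(x):
    """Which hidden units are on; each pattern labels one polyhedral cell."""
    return tuple(z > 0 for z in pre_activations(x))


def net(x):
    hidden = [relu(z) for z in pre_activations(x)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2


def midpoint(a, b):
    return [(ai + bi) / 2 for ai, bi in zip(a, b)]


def midpoint_is_average(a, b, tol=1e-9):
    """True if net(midpoint) equals the average of net(a) and net(b)."""
    return abs(net(midpoint(a, b)) - (net(a) + net(b)) / 2) < tol


if __name__ == "__main__":
    a, b, c = [2.0, 1.0], [3.0, 1.5], [-1.0, 1.0]
    print(activation_pattern(a), activation_pattern(b), activation_pattern(c))
    print(midpoint_is_average(a, b), midpoint_is_average(a, c))
```

The midpoint check is a necessary condition for being affine on the segment, not a full proof, but it is enough to see the cell structure: `a` and `b` share a pattern, `c` does not.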
The surface dividing the training set in two? How many would there be and are some better to consider as AND with the "search term"? Multiple max entropy cosearch parallelism?
Dr. Randall is saying that even in the generative setting, in a GAN's latent space (which has a large number of dimensions), there is no interpolation (due to the curse of dimensionality of course). What is then the explanation for why these models even work, and how come they manage to generate new examples? I can't quite figure it out. Great video, enjoyed it!
Well done! 👍
I didn't catch the new non-contrastive method Yann mentions after BYOL and Barlow Twins. Does anyone know?
Great interviews, with many abstract ideas made simple; I want to wish you all great success, and I will wait for more interesting conversations to come. I am coming from a computational engineering background. In my field we are looking for models that can extrapolate for problems that can be categorized as a mix of differentiable and discrete in nature. Is there any possibility of seeing a video in the future that discusses the ideas of the current episode but oriented more toward computational engineering and physics problems? Thanks and Happy New Year
For inputs that lie within the training data it's an ellipsoid. For inputs that lie outside of the training data I imagine more of a paraboloid. It seems like data could lie both inside of training data in some dimensions and outside in other dimensions, which makes it some kind of ellipsoid paraboloid hybrid. Is this a thing?
Extrapolation is interpolation where one endpoint is magnified by some potentiate of infinity controlled by end zone locking. The outer manifold potentiate?
The reflectome of the outer manifold into the morphology of the inner trained manifold to achieve greater formance from the IM. The focal of the reflectome as a filter to multibatch the stasis of the correct?
If high dimensional spaces only have varying gradient in 16 or fewer dimensions, doesn't that suggest that principal component analysis should always be run?
Do you mean to run PCA on the ambient space and then throw away all but the top-K eigenvectors? Or just run PCA and use the entire transformed vector as input data points instead of raw data points?
If the former, I guess the fear (probably justified) is that we'd be subjecting the entire data set to a single linear transform and possibly throwing out factors that are only useful in smaller subsets of the data. Instead, NNs are able to chop up the space and use different transforms for different regions of the ambient data space. In a sense, they can defer and tweak decisions to throw out factors/linear-combinations. That chopping, i.e. piece-wise, capability seems an essential upgrade to using only a single transform for the entire data space.
If the latter, we'd just be adding another matrix multiplication to a stack of such and it wouldn't change much beyond perhaps numerical stability or efficiency since NNs are of course capable of finding any single linear transform including a PCA projection. In a way, it's related to all the various efforts at improving learning algorithms by tweaking gradients, hessians, etc. In the end, in practice most found that doing something super simple at GPU scale was faster; I'm not sure about the state-of-the-art in numerical stability, though.
@@nomenec I mean throw away the input data that isn't significant. Among other things, it will make smaller faster models. I hadn't heard that for really high dimensional data only 16 or fewer dimensions matter. If I'm not misunderstanding this, which I may very well be, doing PCA first makes a lot of sense. It takes me time to wrap my head around anything, and I'm often far off the mark anyway. Still, this seems logical.
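On the "latter" case in the exchange above (keeping the entire PCA-transformed vector), the claim that it just adds another matrix multiplication to the stack can be checked with a few lines of plain Python. This is a sketch with made-up numbers: `P` merely stands in for a full PCA rotation and `W` for a first-layer weight matrix; no actual PCA is computed here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# Hypothetical numbers: P stands in for a full PCA rotation (no
# components dropped), W for the first-layer weights of a network.
P = [[0.6, 0.8], [-0.8, 0.6]]
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

# Folding the rotation into the layer gives one equivalent weight matrix.
W_FOLDED = matmul(W, P)


def layer_after_rotation(x):
    """Rotate the input first, then apply the layer's weights."""
    return matvec(W, matvec(P, x))


def folded_layer(x):
    """Apply the single pre-multiplied weight matrix to the raw input."""
    return matvec(W_FOLDED, x)


if __name__ == "__main__":
    x = [1.5, -2.0]
    print(layer_after_rotation(x))
    print(folded_layer(x))
```

Both functions give the same pre-activations, which is why a network trained on raw inputs could in principle learn `W_FOLDED` directly. The "former" case is different: dropping components makes `P` non-invertible, and whatever was in the discarded directions cannot be recovered by any later layer.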
Enlightening episode! A bit long, but an exciting subject... I would have appreciated having François Chollet in this debate. Unfortunately, the elephant is not in the room...
Part 3 on the curse of dimensionality 🤯
Where have you been?
A tremendous amount of work went into this show let alone the MLST channel as a whole. Good things take time. Thank you for your patience and continued viewership!
Thanks guys
Just wonderful thank you!
hmm...what word do we use for an interpolation between interpolation and extrapolation 🤪
Can anyone discuss or comment on extrapolation in context of projection volume? Or more than 3D
An effective interpolation? Do all interpolations have to be effective? Is it still not an interpolation between even if inaccurate?
Great and very inspiring interviews. Thank you!
I wonder how to explain, from the perspective of the spline trees theory, the fact that CNNs learn very practical features in their first layers, like edge detectors and texture detectors (I mention these because we know what they do and that they are present in NNs). Of course we know that NNs use them to split the latent space, but I think the fact that NNs are able to figure out such specific features at all is enough of a qualitative difference from decision trees to question whether the analogy to decision trees makes sense at all. Yann LeCun claims that in high dimensional spaces everything is extrapolation; I think it's valid to ask whether in high dimensional spaces everything is decision-tree-like hyperplane splitting.
I understood 0.04% of these buzzwords.
28:10 This isn't how humans understand physics. Really, really good video though.
34:00
54:30 It's cool that humans still understand a lot though. The possibilities in the universe are massively constrained by the fact that nothing is generated outside of physical laws.
1:00:00 The limitation that deep learning can't extrapolate
1:03:30 Extrapolation = reasoning? So can they reason?
1:03:50 No
1:05:50 Supervised learning is the thing that sucks
1:06:50 Geoff Hinton thinks general unsupervised first, specialization after
1:12:00 RBF network
1:15:00 Different definitions of interpolation
1:22:50 latent contrastive predictive models
1:25:00 New architectures that aren't contrastive have come out
1:29:30 No, they will be able to reason
1:30:00 What would prove that neural networks can reason?
1:35:30 RNNs are the networks that can train with variable number of layers
1:37:28 Nobody can train a neural net to print the nth digit of pi (I can). Yeah, once we figure out basic things we might be able to try mathematical concepts.
1:45:00 System 1 and 2 in chess and driving
2:07:10 Convolution is nothing more than a performance optimization by giving the network pre-knowledge that interesting features are spatially local
A lot of tearing down of not 100% correct analogies of neural networks and what might actually model them well
- 2:30:30 It's impossible for a neural network to find the nth digit of pi
2:34:45 Discrete vs smooth... Have both systems? (Actions distill, Jordan Peterson)
2:36:30 (The real world is limited) is it because neural nets only use textures? No, resolution is low, or it would blow up
(Man that accent was tough for me)
2:45:30 Summary of that last interview. Intuition is fine, but mathematical rigor doesn't apply well with that definition
2:47:30 We need a better definition of what kind of interpolation is happening, and that will help us progress
2:50:00 It's hard to figure out where researchers exactly disagree because of politeness
2:53:00 It's all about pushing back on the limitation that neural networks can't extrapolate
2:54:40 Digits of pi again. It's not what he's talking about actually, too advanced. He's talking about a cat jumping in a place it's never seen before (Tesla predicts paths of cars). He thinks eventually we'll get there, but I'm not as optimistic.
2:56:00 There's an article by Andre Ye that annoyed him because it invoked interpolation vs extrapolation to say they'll never do it, which is the real question
2:57:10 At the end of the day, neuron signals are continuous functions, but somehow they produce digital reasoning. But will it be efficient?
2:59:00 But there is no discrete thing (actions)
3:00:40 (There you go. Yes. It's going to be hard. But that's the only way for a neural network to do it, and calculators aren't going to discover profound truths.)
3:01:30 (Omg it feels like they're starting to think the way I do about it. System 1, system 2) It's insanely powerful to train a discrete algorithm on top of neural network. Longer term possibility.
3:05:00 Underexplored.
Feature creep? (No! That's insane. Is general intelligence feature creep?)
3:06:30 Hard to train (it seems the opposite to me) Getting to do discrete stuff involves lots of hacks.
3:07:30 "TABLE 1" Attacks attack on paper
3:17:00 You can't initialize a neural net with zeros
3:18:00 We're comparing neural nets to the entire human race and its culture and inheritance
I'm only 20 minutes in, so will remove this comment if it's answered in the discussion, but....
How do you think smooth activation functions (e.g. ELU) would affect the polyhedral covering of the feature space? If ReLU functions create hard boundaries between separate polyhedra, would smooth functions create smooth boundaries? Or perhaps weighted combinations of polyhedra?
So if I replace ReLU with Softplus, does that break their arguments?
On the one hand, sure, arguments which explicitly state repeatedly that they apply to piecewise linear functions do not immediately apply outside that assumption. On the other hand, that is not evidence against a more general interpretation of the conclusions, and there is soft evidence that in many problem spaces, NNs, regardless of their activation functions, are driven towards chopping up space into affine cells. Some examples of such soft evidence are 1a) the dominance of piecewise linear activation functions overall or otherwise 1b) the dominance of activation functions that are asymptotic (including your softplus example) and 2) the dominance of NN nodes structured as nonlinear functions of linear combinations as opposed to nonlinear combinations.
The consequence of 2) is that softness still falls along a linear hyperplane boundary! And given 1b) there is a distance at which it effectively behaves as a piecewise linear function. It becomes a problem specific empirical question as to how much an NN actually leverages the curvature versus the asymptotic behavior. My claim, and it's just a gut conjecture based on soft evidence at this point, is that for most problems and typical NNs, any such activation function curvature is incidental and/or sometimes useful for efficient training, and that's why ReLU and the like, which "abandon all pretense at smooth non-linearity", as I said in the video, are dominating.
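The "there is a distance at which it effectively behaves as a piecewise linear function" point in 1b) is easy to check numerically for the softplus example. A minimal sketch in plain Python (the helper names are just for illustration): softplus(x) - ReLU(x) works out to log(1 + exp(-|x|)), which is largest at x = 0 and decays exponentially on both sides.

```python
import math


def relu(x):
    return max(x, 0.0)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gap(x):
    """How far softplus sits above ReLU at x."""
    return softplus(x) - relu(x)


if __name__ == "__main__":
    for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
        print(x, gap(x))
```

The gap is log(2) at the kink and already below 1e-4 ten units away, so how much the curvature matters depends on how the pre-activations are scaled relative to that narrow band, which is the problem-specific empirical question mentioned above.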
When will Jürgen follow? :)
I believe most of this just is a byproduct of the fact that brain neurons operate in analog space while computer neural networks are digital which is an approximation of analog data where sampling is always relevant. The other issue is the fact that all data in neural networks are collected together in a singular bucket with a singular answer for various learned scenarios. Whereas in the brain things are much more decomposed into component pieces or dimensions which become inputs into higher order reasoning processes. And this is what leads to the human cognitive evolution that creates language from symbols, with embedded meanings and things like numbers and mathematics.
An analogy for this is to say that each individual arabic numeral has a distinct identity function (learned symbol pattern recognition) corresponding to a set of neurons in the brain. Separate from that you have another set of neurons that have learned the concept of numbers and can associate that with the symbol of a number. And separate from that there is a set of neurons that have learned the principle of counting associated with numbers. That is a network of networks that work together to produce a result. And as such the brain can learn and understand linear algebra and do calculations with it because of the preservation of low level atomic identity functions or logic functions that are not simple statistics problems. Meaning the brain is a network of networks where each dimension is a distinct network unto itself as opposed to a singular statistical model.
I think that's a very nice way of looking at things. In a sense, NNs breaking up the space into polyhedra is like a simple hacked version of a network of networks. They are encoding little subunit networks, by virtue of the ReLU activations, that are then forced into shared latent value array representations. That introduces artifacts and isn't as flexible as networks of networks. The killer for trying to train networks of networks is the combinatorial blowup that happens when exploring the space of all possible connection configurations. And it's why so much of what makes NNs work today is actually the human engineering of certain network architectures that structurally hardcode useful priors. Great comment, thank you!
@@nomenec Thanks. It is definitely a much simpler quantification effort for dealing with probability calculation within a bounded context as defined by the algorithmic model, data provided for training and tuning of calculations. However, there is no reason not to investigate more open ended architectures, especially as a thought exercise of how such a thing would be possible.
Does the guy with the Sun glasses have eye problems?
at 3:04:08 Yannic foreshadowing "active inference" 😁
@MLST: Do you guys still follow this line of thinking; or did you end up settling on yet another interpretation recently? 😊
More or less! We are going to drop another show with RandallB next month; we now think that MLPs are more extrapolative than previously thought (along the lines of Randall's/Yann's everything is extrapolation paper)
@@MachineLearningStreetTalk Thank you very much for taking the time to answer my question. 😌👍
Okay I consider myself pretty reasonably intelligent but have no idea what this is about. Can someone tell me so I can research more and expand! I hear dimensions, neural networks and some geometry. Happy new year!
Discussions from hallowed halls of Academia brought to YouTube. Or better? You 3 are setting very high standards!
The order of the equations matters is what they have established. This was already a basic tenet of symbolism.
I still haven’t really absorbed the interpolation vs extrapolation argument.
I think I understand the bimodal discussion of High dimension interpolation and extrapolation. Linear regression is a fitting of interpolated volume in 3 dimensions while extrapolation is any 3 dimensional values outside the interpolated value volume
I still don’t understand his argument.
How would this information change the direction of systematic AGI
I wonder if a machine's ability to find a non-linear function or to integrate one would be analogous to what Stephen Wolfram calls computational reducibility? Certainly, an agent can call a non-linear function rather than a piece-wise linear model.
Great! Why not consider inviting Prof. Jerome Darbon from Brown Univ.? He always has bright views on that topic!
I wonder, do linear functions exist in the universe?
There are certainly linear relationships at various levels of physical description: the first and second laws of thermodynamics, the Schwarzschild radius vs mass, photon energy vs frequency, etc. Whether or not any of these actually "exist in the Universe" is something philosophers have argued for at least millennia and probably will argue until heat death. From my perspective, they "exist" as epistemic descriptions of emergent phenomena.
E = mc^2, F = ma, KE = 1/2mv^2, tons in thermodynamics, notably W = -P (delta V)...basically, often the most fundamental ones are linear
@18:50
“machine learning hasn’t even advanced to the second order”
Except many models don’t use piecewise linear activations like relu, for example most transformers use some version of the exponential linear unit.
It’s also worth mentioning work like SIRENs which have achieved cool results using periodic activation functions.
It doesn't really make a material difference, ELUs just give you a smoother boundary between the polyhedra (the shift is the same as the bias parameter). NNs are still "chopping" up the input space with compositions of basis functions (which form partial keys of an epic hash table). It's just more understandable/intuitive with ReLUs. NNs are learning spatial boundaries.
Imputation can make interpolation appear to be extrapolation. But more importantly people don't understand the relationship between interpolation extrapolation and the Chomsky hierarchy. You simply cannot do extrapolation with context-free grammars. Transformers are context-free grammar capable not more.
Thanks James! According to arxiv.org/pdf/2207.02098.pdf transformers map to finite state automata computational model with no augmented memory and can recognise finite languages only
Great stuff, extremely interesting topic and strong content. But is there a version without the background music - podcast or video? I'm probably just a grumpy old fart, but I find it really hard to concentrate, it's like trying to follow a conversation while somebody is simultaneously licking my ear.
drive.google.com/file/d/16bc7XJjKJzw4YdvL5rYdRZZB19dSzR70/view?usp=sharing here is the intro (first 60 mins) with no background music, you old fart! :)
Woah you’re a great teacher. Better than you are as an interviewer
I would like to see this "absolute dog". I think it's possible to generate one, just reward the GAN to react both to realism and activation of particular neuron. I wonder how that doggest dog ever would look like. I would also like to see the doggest cat, and the cattest dog.
You should invite Guy Emerson from Department of Computer Science and Technology University of Cambridge.
He looks great, we would love to have him on!
Here we go! First!
This obsession with piecewise linearity is a bit too much. We are in a highly distributed space, the nature of the unit does not give that much insight into the system. It would be like saying consciousness arose because real neurons are integrate and fire units. Piecewise linearity, like bayesian priors on the weights or on the latent space help because they make the model simpler, reduce the hypothesis space and stabilise learning, not because they hold the key to the mysteries of distributed computation and learning.
On the contrary, it gives wonderful insight. Namely that NNs are a storage mechanism, not a computation mechanism. They have massive representational power i.e. they partition euclidean space up to a reasonable dimensionality which would still confer statistical generalisation before the curse bites -- but, it's clearly not possible to "memorise infinity". R^N
@@TimScarfe I don’t see the depth that distinction brings. Writing down a function requires storage, the function is still performing computation. Is f(x) = 2x a storage mechanism ?
The work of David Pfau on invariance makes much more sense to me to understand what those functions do. But I’ll read the paper, it might give me more insight into what you’re saying.
@@TimScarfe I would say that the difference between storage and computation may be very subtle (of course, in the trivial case of saving raw data it seems very different, but saving raw data is kind of like computing the identity function - not very interesting). Most of the smartest algorithms in classical algorithmics I have seen were focused on finding a clever data structure which allows some operations to be performed efficiently (e.g. Fibonacci heaps). Isn't that storage? Even if NNs are just databases, the fact that they are able to respond to queries in modalities like images and natural language seems to indicate that their query engine is non-trivial and allows an effective (not to be confused with optimal) way of compression, so the search space does not explode like it often does in combinatorial setups.
wahoo! it's amazing
As a result of our seeing truth to be the thorough-going harmony of all the concepts we have at our command, the question forces itself upon us: Yes, but does thinking even have any content if you disregard all visible reality, if you disregard the sense-perceptible world of phenomena? Does there not remain a total void, a pure phantasm, if we think away all sense-perceptible content?
That this is indeed the case could very well be a widespread opinion, so we must look at it a little more closely. As we have already noted above, many people think of the entire system of concepts as in fact only a photograph of the outer world. They do indeed hold onto the fact that our knowing develops in the form of thinking, but demand nevertheless that a "strictly objective science" take its content only from outside. According to them the outer world must provide the substance that flows into our concepts. Without the outer world, they maintain, these concepts are only empty schemata without any content. If this outer world fell away, concepts and ideas would no longer have any meaning, for they are there for the sake of the outer world. One could call this view the negation of the concept. For then the concept no longer has any significance at all for the objective world. It is something added onto the latter. The world would stand there in all its completeness even if there were no concepts. For they in fact bring nothing new to the world. They contain nothing that would not be there without them. They are there only because the knowing subject wants to make use of them in order to have, in a form appropriate to this subject, that which is otherwise already there. For this subject, they are only mediators of a content that is of a non-conceptual nature. This view - If it were justified, one of the following three presuppositions would have to be correct:
1. The world of concepts stands in a relationship to the outer world such that it only reproduces the entire content of this world in a different form. Here "outer world" means the sense world. If that were the case, one truly could not see why it would be necessary to lift oneself above the sense world at all. The entire whys and wherefores of knowing would after all already be given along with the sense world.
2. The world of concepts takes up, as its content, only a part of "what manifests to the senses." Picture the matter something like this. We make a series of observations. We meet there with the most varied objects. In doing so we notice that certain characteristics we discover in an object have already been observed by us before. Our eye scans a series of objects A, B, C, D, etc. A has the characteristics p, q, a, r; B: l, m, b, n; C: k, h, c, g; and D: p, u, a, v. In D we again meet the characteristics a and p, which we have already encountered in A. We designate these characteristics as essential. And insofar as A and D have the same essential characteristics, we say that they are of the same kind. Thus we bring A and D together by holding fast to their essential characteristics in thinking. There we have a thinking that does not entirely coincide with the sense world, a thinking that therefore cannot be accused of being superfluous as in the case of the first presupposition above; nevertheless it is still just as far from bringing anything new to the sense world. But one can certainly raise the objection to this that, in order to recognize which characteristics of a thing are essential, there must already be a certain norm making it possible to distinguish the essential from the inessential. This norm cannot lie in the object, for the object in fact contains both what is essential and inessential in undivided unity. Therefore this norm must after all be thinking's very own content.
This objection, however, does not yet entirely overturn this view. One can say, namely, that it is an unjustified assumption to declare that this or that is more essential or less essential for a thing. We are also not concerned about this. It is merely a matter of our encountering certain characteristics that are the same in several things and of our then stating that these things are of the same kind. It is not at all a question of whether these characteristics, which are the same, are also essential. But this view presupposes something that absolutely does not fit the facts. Two things of the same kind really have nothing at all in common if a person remains only with sense experience. An example will make this clear. The simplest example is the best, because it is the most surveyable. Let us look at the following two triangles. [Figure: Two Triangles]
What is really the same about them if we remain with sense experience? Nothing at all. What they have in common - namely, the law by which they are formed and which brings it about that both fall under the concept "triangle" - we can gain only when we go beyond sense experience. The concept "triangle" comprises all triangles. We do not arrive at it merely by looking at all the individual triangles. This concept always remains the same for me no matter how often I might picture it, whereas I will hardly ever view the same "triangle" twice. What makes an individual triangle into "this" particular one and no other has nothing whatsoever to do with the concept. A particular triangle is this particular one not through the fact that it corresponds to that concept but rather because of elements lying entirely outside the concept: the length of its sides, size of its angles, position, etc. But it is after all entirely inadmissible to maintain that the content of the concept "triangle" is drawn from the objective sense world, when one sees that its content is not contained at all in any sense-perceptible phenomenon.
3. Now there is yet a third possibility. The concept could in fact be the mediator for grasping entities that are not sense-perceptible but that still have a self-sustaining character. This latter would then be the non-conceptual content of the conceptual form of our thinking. Anyone who assumes such entities, existing beyond experience, and credits us with the possibility of knowing about them must then also necessarily see the concept as the interpreter of this knowing.
We will demonstrate the inadequacy of this view more specifically later. Here we want only to note that it does not in any case speak against the fact that the world of concepts has content. For, if the objects about which one thinks lie beyond any experience and beyond thinking, then thinking would all the more have to have within itself the content upon which it finds its support. It could not, after all, think about objects for which no trace is to be found within the world of thoughts.
(Rudolf Steiner, Theory of Knowledge Implicit in Goethe's World Conception)
I love Yann LeCun
Oddly enough, he's my inspiration to leave the AWS $$$ to pursue AI research.
@Georgesbarsukov how is it going so far for u? 🙌
@@ultrasound1459 Nothing much in terms of AI :(, but it has been nice spending 3 months in Europe and soon a couple in Asia. My gray AWS hairs are gone ^^