Watched the entire video. It's very well-put. I also like that you're still showing parts of the papers so we can pause the video to read the rest, instead of just the highlighted parts. It actually interested me to read the entire paper with a more vivid view of it as you go through it.
I heard about the Forward-Forward algo on the podcast Eye on AI. There was so much technical detail that I felt lost. But you did a great job!
I just want to note that there have been many recent alternatives to traditional neural networks that do not rely on backpropagation or storing the results of the forward pass. Two examples are Neural Ordinary Differential Equations [Chen, 2018] and Deep Equilibrium Models (DEQ) [Bai, 2019]. The first performs both the forward pass and the backward pass by solving differential equations, while the latter does so by solving fixed-point equations. DEQs have been particularly successful. They are theoretically well-understood, use significantly less memory than regular neural networks because they do not need to store the results of the forward pass, and have achieved comparable or SOTA results in many different tasks. There has also been some work in creating differentiable machine learning models based on optimizers such as Quadratic Programs [Amos, 2017], Semidefinite Programs [Wang, 2019] and even Integer Programs [Paulus, 2021]. These models can be trained using gradient-based optimization without storing intermediate results or using backpropagation, although their results are overall less impressive. Interestingly, Zico Kolter (who has been involved with DEQs) mentioned a few years ago that specialized hardware could be developed to solve fixed-point equations. This could greatly favor Deep Equilibrium Models by providing some of the benefits outlined in Hinton's paper. Personally, I think there has been some overreaction to Hinton's paper.
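To make the "solving fixed-point equations" idea above a bit more concrete, here is a minimal NumPy sketch. It is not the DEQ authors' implementation: it assumes a single layer z = tanh(W z + U x), rescales W so the map is a contraction, and uses plain iteration, whereas real DEQs use faster root-finders and implicit differentiation for the backward pass. The names `deq_forward`, `W` and `U` are made up for this example.

```python
import numpy as np

def deq_forward(W, U, x, tol=1e-8, max_iter=500):
    """Iterate z <- tanh(W z + U x) until it stops changing (a fixed point)."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W *= 0.5 / np.linalg.norm(W, 2)   # assumed: spectral norm 0.5 makes the map a contraction
U = rng.standard_normal((8, 4))
x = rng.standard_normal(4)

z_star = deq_forward(W, U, x)
# How far z_star is from satisfying z = tanh(W z + U x)
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ x))
```

Only `z_star` needs to be kept, not the intermediate iterates, which is where the memory saving mentioned above comes from.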
True, I have also been interested in the papers that you mentioned, and what's great about them, and backprop in particular, is that they rely on very solid and very well understood math, where models are just a form of notation for expressing calculations that a computer can execute. However, Hinton's "Forward-Forward" relies on some heavily intuitive engineering. I'm not saying that to discredit Hinton's research, but even in this overview it is mentioned that too many broad terms are used, like "goodness" and "positive and negative" data. It would be really cool if an attempt were made to boil these vague terms down to statistical interpretations, like in-distribution and out-of-distribution data, and provide some theoretical output, but I guess that's out of scope for the paper.
@@kuretaxyz Well, they are sort of not that relevant today, as the attention mechanism really rules in most of today's architectures - and it pretty much learns what to learn, while capsules tried to capture specific information through manual engineering of the network. They can be more efficient - but hardware is not an issue today (yet).
@@hr3nk aren't capsules more expensive to compute, but have the advantage of being orientation-independent? Which is probably one or a few clever hacks away from becoming practical, like how YOLO supplanted U-net for classification/detection and segmentation at once
@@randalllionelkharkrang4047 guessing Field-Programmable Gate Array. I vaguely remember goofing around with one in the mid '80s and I too was thinking that variations on that theme might pop up in this context. And that the weights of neural networks in solutions inspired by FPGAs might be at least partially digital, solving some of the bootstrapping issues. Then again the '80s are a while back and I haven't kept up on the hardware side of things.
34:36 exactly, staying in the digital domain means that if you do something on the morning of February 16th, 2023 in Sweden where it's snowing, or in the Sahara desert at noon where it's scorching hot, you'll get the same answer. With an analog computer you won't, because resistors are affected by temperature. The copper traces on the PCB that carry all the analog voltages will be susceptible to noise: any electromagnetic wave that gets absorbed by the circuit, small or large, goes into the circuit and affects the output. With a digital circuit you have a bit, and the gap between the voltage levels for a logic 1 and a logic 0 is so vast that they will never be misinterpreted (sometimes they are, but that depends on your made-up scenario).
Great dog! I love the pose with the legs and tail in motion, it really captures the forward direction of the dog. And the tilt of the head looks so curious and expressive!
I have tried using the Forward-Forward algorithm as a way to learn universal probability distribution approximators that could then be used to detect anomalies and generate new data, but it didn't work very well. Perhaps if I used the more complex variant that has feedback it would have worked better, but it is very complicated for a small gain imo. But far be it from me to throw out forward propagation; I think it is the way to go for future AIs. Instead of Forward-Forward, there is this preprint called "Signal Propagation: A Framework for Learning and Inference In a Forward Pass", which is a very similar framework that actually has the Forward-Forward algorithm as a special case, in a way. Sigprop also has a mechanism for feedback, which seems much more efficient. I hope you will make a video about this paper also; this one was very well explained.
I have yet to see an implementation of the more complex feedback version. I would really like to see it, but it's not the easiest thing to implement. I don't even think the performance will be great, but I would guess/hope that it's naturally resistant to adversarial attacks. Probably depends on the way negative data is generated though, which is kind of a major weakness overall.
@@JeffHykin You could look at it as a weakness or a benefit, the benefit being that figuring out how to generate good negative data could potentially make them even more effective than current backpropagated NNs
@@seanoconnor1984 direct random feedback alignment works much better on multilayer perceptrons; I don't think the method will work if you use it on CNNs or transformers. You also have to correctly scale your random feedback matrices and make sure they stay constant throughout the training. You also have to start from a randomly initialized model and the last layer should be learned normally. Lastly, you should prevent gradients from flowing from a layer to the previous one. If you do all of these things, the model should train, but probably not to the level of an equivalent model with regular backprop.
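The recipe in the comment above can be sketched roughly as follows. This is a toy NumPy illustration under stated assumptions, not a reference implementation: the layer sizes, the tanh nonlinearity, the 1/sqrt(n_out) feedback scaling, and the names `init_dfa` and `dfa_step` are all choices made for this example.

```python
import numpy as np

def init_dfa(n_in, n_h1, n_h2, n_out, rng):
    """Random forward weights plus fixed random feedback matrices."""
    return {
        "W1": rng.standard_normal((n_h1, n_in)) * 0.1,
        "W2": rng.standard_normal((n_h2, n_h1)) * 0.1,
        "W3": rng.standard_normal((n_out, n_h2)) * 0.1,
        # Feedback matrices: scaled once (1/sqrt(n_out) is an assumed choice)
        # and never updated afterwards.
        "B1": rng.standard_normal((n_h1, n_out)) / np.sqrt(n_out),
        "B2": rng.standard_normal((n_h2, n_out)) / np.sqrt(n_out),
    }

def dfa_step(p, x, y, lr=0.01):
    """One in-place DFA update on a single example; returns the squared-error loss."""
    h1 = np.tanh(p["W1"] @ x)
    h2 = np.tanh(p["W2"] @ h1)
    e = p["W3"] @ h2 - y                      # output error
    # Hidden layers receive the output error directly through B1 and B2;
    # no gradient flows from layer 2 back into layer 1.
    d1 = (p["B1"] @ e) * (1.0 - h1 ** 2)
    d2 = (p["B2"] @ e) * (1.0 - h2 ** 2)
    p["W3"] -= lr * np.outer(e, h2)           # last layer: ordinary gradient step
    p["W2"] -= lr * np.outer(d2, h1)
    p["W1"] -= lr * np.outer(d1, x)
    return 0.5 * float(e @ e)

rng = np.random.default_rng(0)
params = init_dfa(4, 16, 16, 2, rng)
x, y = rng.standard_normal(4), np.array([1.0, -1.0])
losses = [dfa_step(params, x, y) for _ in range(5)]
```

Note that `B1` and `B2` are only ever read, never written, which is the "stay constant throughout the training" condition.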
As someone with an original background in electrical engineering I can immediately see the potential in this. There are some other benefits as well to standalone analog AI chips. Lower latency, perhaps even the theoretical limit of the speed of light with photonic analog. Higher precision and divisibility for finer control. The downside would be state control and retention. There has to be a backup in case power runs out. Otherwise you'd have AIs that "die" when the chip loses power. I think it's important for computer scientists to recognize that traditional distinctions like von Neumann architectures, the software-hardware divide and digitization of data are all at the end of the day just human inventions and tools developed for a reason. If this reason changes then the tools and abstractions can also fade away. There's a reason evolution favored analog thinking machines. It's just way more power efficient and I think this is the obvious future of computing in the far future.
I know hardly any electrical engineering, so it’s interesting to hear from others who know a lot more about this. I wish I knew more so I could dive deeper into these things in my videos.
I do acknowledge the benefits but it's much harder to research neuromorphic hardware, let alone get a company to invest in neuromorphic chips when it could just use cloud computing or buy already easily scalable general-purpose computers. I admit the kind of world we live in is really frustrating. I believe it'll happen one day, but very slowly.
I'm not sure I'm convinced that analog is always superior to digital. When it comes to procedural processes, I would assume digital is more efficient. I am happy to be corrected. The benefits of analog hardware for things like neural nets seem obvious, but I would assume the future of computing would involve a combination of analog and digital.
@@sb_dunk You're getting the wrong idea. Analogs are not superior, but when it comes to computations on narrow tasks, they're much faster as they don't require cycles to perform computations. You could just let the laws of physics do the computations for you. Just like how quantum computers work in some narrow-purposed tasks.
@@nullbeyondo I understand that, but the original comment said "There's a reason evolution favored analog thinking machines. It's just way more power efficient and I think this is the obvious future of computing in the far future." I was challenging the idea that analog is the future. For AI and similar systems, maybe, but I doubt digital computing will be phased out (pun intended).
Regarding the analog to digital converters: They are only expensive and take a lot of power because you need them for all the weights. If you just want to interface the input and output of the network with a digital system, the number of bits and speed required is much more manageable and so is the power. So it seems doable to have an AI accelerator using an analog chip for the Forward-Forward algorithm connected to regular digital stuff that is as high or low power as required for non AI functions.
33:20 Multiplying two analog voltages is much more complicated than you'd expect. There are many many many many different ways of doing analog analog multiplication, but all of them are either more complicated than the binary multiplication, slower than the binary multiplication or less correct than the binary multiplication (5x3 is 15, not 14.3 or 18.01 which certain analog analog multiplication methods could spit out). Multiplying an analog voltage by a constant is easy peasy though, and luckily this is what the weights usually are.
With adjustable weights it could correct for manufacturing tolerance issues. It could have an initial set of weights that are “about right” for its purpose, then it could learn from actual performance to adjust them to account for those inconsistencies and even adjust for issues of age or local EM noise environments, which would play hell with analog chips.
I love how radical the ideas in the end of the paper are. Kind of vague still, but sometimes one has to challenge everything to get something better. To get out of a local maximum.
Nice spider-dog! Been a long time since I touched NNs but, having taken a course of his some years back, I must say: Hinton is not only a great researcher but also a great teacher. Thanks for doing this rundown of the paper! Sounds like FF being more power efficient and more easily parallelized, while being slower to learn, might also make it usable for battery-powered online learning for a start, and then BP, or some other algorithm, comes in when charging in sleep mode. Speaking of dreams, it's a great example of how the brain settles over the years. With the dreams initially being muddy mixtures of mostly negative learning data as a child; really bringing to mind that masking method for generating negative learning data shown in the MNIST example. Some decades ago there were analogue FPGAs called EPAC. That could be a good way to copy these NNs. Although it might be more viable to have a settling mode over a set range of paired I/O.
2:50 It doesn't sound unreasonable that the human brain would buffer backpropagation and then have most of it happen during sleep. It'd explain the biological necessity of sleep and why we appear to recall information more easily after having slept on it.
Also, haven't researchers found neurons that propagate signals towards "the reverse direction" (e.g. from visual cortex towards the eyes)? So "backprop" could happen through them triggering certain activations, even though internal mechanisms might be insufficient?
There’s also research that shows that if you take a small break where you sit there and do literally nothing for 30 seconds every few minutes during active learning, you learn at such a faster rate that it exceeds margin of error. Which could be interpreted as forcing the backprop buffer to be cleared.
@@revimfadli4666 there are several neural layers behind each retina that, at the very least, seem to work as a sort of edge-detection. If that's what you mean with "internal mechanisms"?
@@revimfadli4666 Sorry, I should've been more specific: My question was in regard to: What sort of "internal mechanism" do you consider to be insufficient for backprop to be functioning? The layers seem to work as a sort of edge-detection, among other filters, just like how the AI focuses on hard edges when learning to identify numbers in MNIST. By that parallel the signals found in "the reverse direction" would then be a form of backprop that trains these filters. It could also explain why, in some cases, myopia & hyperopia can be, so to speak, "learned away". But I digress…
35:40. I actually think this is the wrong takeaway. The reason we don't use analog computers (we tried to in the beginning) is because of how difficult differentiation between signal levels is. When you have a range of voltages for different values, such as ten for base 10, it's difficult to identify if the voltage is really a "1" and not, say, a "0" or a "2". Whereas with digital logic, you have a voltage? 1. You don't? 0. That's not too complicated: you don't need to measure how high it is, don't need to compare it against a discriminator, don't have to implement complex error detection and correction for simple register states, and don't have to worry about how finely tuned the hardware is and how much it's drifting over time (and all components in the system are given to drift, and not necessarily in step with one another). Neural nets, on the other hand, are far more forgiving about voltages (weights) drifting a little, and besides that, implementing NNs in some sort of analog system would be ideal. Especially if you're using a spike-driven neuromorphic chip. But that doesn't mean that the weights have to die with the hardware. There is nothing stopping us from designing the chip to be able to import and export its weights. After all, the chip itself has to be able to tune these weights, which means the chip has the ability to "read and write" them. You may need converters (a DAC to import and an ADC to export, assuming you want digital conversion) present for these operations, but that doesn't mean they would have to be part of the regular duty cycle. They would be reserved for only those times when the weights needed to be imported or exported; otherwise, there would be no power going to them, and they wouldn't be in the loop. I really have no idea why Hinton thinks that a hardware/software intermingling necessitates the NN dying with the hardware. It really doesn't make any sense, for the reasons explained above. Moreover, what are you going to do, train every one of these intermingled systems from scratch?
You're going to want to establish the foundational weights before you deploy to the field (you know, where power budgets really matter). You're going to want to be able to mass produce these systems. And I really see no technical reason why we can't read/write analog weights from/to the system. I question how strong Hinton's knowledge is on this particular aspect of the subject.
This is so wonderful and interesting to me. And thank you! I am trying to find the Hinton talk where he discusses dream activity you mentioned. Can you provide a link? Please?
25:45, Nah, I'd say that big thing in the middle is just taking a copy of the results from the previous layer and identifying how closely the combination of them matches the specific thing it is supposed to match. The result can only be between 0% & 100%, and it passes the merged data into the next node, which is also receiving from other local nodes. If no node in the chain matches more than 50%, then a new node is created to identify a new pattern with. In other words, every node set is a growable array, never a fixed static array.
23:03 this is neat, if you model it that way, you don't need massive amounts of memory, actually, you can design electrical circuits that will do it with very little cell memory
If the forward pass consists of a black box, then you can't do a single level of gradient descent. How do you encourage that black box to return higher values for positive data and lower values for negative data? Hinton mentions how you do not need to know the precise structure of the forward pass calculation, so you can save on computational resources and/or power, but that takes me straight back to my original question.
Reminds me of echo state networks (ESNs), where the earlier layers are just "whatever a random network happens to do" and the last layer linearly learns to "make sense" of it, but this time the "random layers" learn too.
the last part reminds me strongly of Conway's idea that you could do addition and other operations on a random bundle of electrical circuitry, just learning how to input and read output, although it's kind of reversed, because here we would adjust the weights (change "circuitry", not how we input and interpret) so that it gives good output
22:40 oooh. Looks like a cellular automaton! Beautiful idea and a great explanation. I am following AI research a bit in my free time and I feel that just watching your videos gives me more understanding than if I'd read the paper myself, for much less effort. Thank you
On second thought, it's weird that early layers are updated directly on whether the data is positive or negative (hot dog vs not hot dog). It's a high-level piece of information.
I think the reason you want a high output on the first layer (sum of squares) is because that means that it has identified some characteristics somewhere. You don’t care which neurons activate at that stage, so long as at least some of them do. If none of them do, then that means that whatever it’s looking at, it’s not recognizing it at all.
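A tiny sketch of that intuition, assuming the sum-of-squares goodness and the logistic "is this positive?" probability with threshold theta described in the paper. The function names and the example activations are made up for illustration.

```python
import numpy as np

def goodness(activations):
    """Sum of squared activations of one layer."""
    return float(np.sum(np.square(activations)))

def prob_positive(activations, theta):
    """Logistic of (goodness - theta): close to 1 when the layer is strongly active."""
    return 1.0 / (1.0 + np.exp(-(goodness(activations) - theta)))

active_layer = np.array([0.0, 2.0, 0.0, 1.0])   # a couple of units fired
silent_layer = np.zeros(4)                      # nothing recognised

p_active = prob_positive(active_layer, theta=2.0)
p_silent = prob_positive(silent_layer, theta=2.0)
```

Here the goodness does not depend on which units fired, only on how strongly: permuting `active_layer` gives the same value of 5.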
Reminded me of the work of psychologist Robyn Dawes, who was keen on improper linear models (e.g. weights of +1 or -1), which performed similarly to experts if the right feature set was being input. Each row is a bit like a set of improper linear models scoring the previous row. His work was often in areas where the between-expert and even repeat-expert agreement wasn’t perfect, e.g. medical tasks. The improper linear models approached between-expert reliability. Simple linear scoring models are often used in medical diagnostics, again with the importance being which features are chosen. Sounds very promising.
I don't think the title is accurate. The new method only affects training, not inference. Even if it turns out to be useful for LLMs, it won't make them easy to run on low-powered hardware.
My question is, how necessary is it really to do learning on an analog chip? Most of the energy cost is in inference, and my initial impression from "chips that dynamically learn" is that I expect they would require a ton of maintenance to perform well. The human brain has a constant stimulus and energy source, and a very important feature of a toaster or any other electrical device is that I would like to be able to unplug it or not use it for a while. To me, it seems a much better way of going about this is to have a more hardcoded analog chip that is somehow updatable. Development still happens on digital computers so you don't need to incur all the logistical and manufacturing cost of training individual analog chips, and we find a way of transferring from a digitally trained model to an analog chip for the sake of inference. This does feel like such a simple idea though that I assume there are big issues with what I'm saying; I'm obviously no expert in analog computing. I know the transfer from digital to analog is expensive, but (as I understand) this is primarily a problem in backprop because of how often you need to update w/backprop, whereas if I had to update "my toaster" only like once every few years or whatever, that just seems significantly preferable. Analog is certainly one of the directions we could go to greatly reduce energy cost, but my immediate feeling is that we're gonna crack more efficient use of CPUs w/large NNs a lot sooner.
It certainly comes with many cons as you mention, and for that reason I’m sure we will always have digital, but specifically for a continual learning on the fly setting (as opposed to an update once every few years) I would say this direction seems more promising. I would imagine the future will be a mix of both, using them where they make the most sense
@@revimfadli4666 I mean basic computations carried out by the design of a circuit. The rule of thumb is to do a lot of simple operations to make a complicated operation. More complex instructions take longer, like * and / taking longer than addition. Having a bunch of little simple components doing simple operations that result in a complex output is "better" than a big complex component doing a big operation when doing common operations. Having a customized circuit for just a neural network is a cool idea but I would rather design a more general computing machine where the heavy AI instructions are digital and the orders are relayed to the hardware. I was taught that having many identical circuits all doing a little bit of the computation is better design than a custom design. Maybe there is a research institute looking into this design.
could you (theoretically) have a "2 layer" chip where 1 layer is the actual chip and the second one is just many sensors and "writers" that can output the weights and biases in digital form, so that you can later "copy" to the writers that write the bias to the local cell? this would be like taking a human brain and saying: I want this neuron to connect to this neuron in this way, basically copying the whole chip?
The description of the algorithm itself was way too fast. We have gradient descent between layers and in the end everything is dumped into a linear classifier? Or something? How is this more biologically plausible? If there is an error signal between layers then it might well be passed "further back", right? So error backprop is still possible under that assumption. Also I don't think the brain has extra linear classification modules. And this example was only for classification, correct? How would this work for regression?
The problem with LLMs is not only resources for training (backprop), but also in inference itself (forward pass). It takes a lot of gpu flops to get a token through the system
Good question, there actually are many goodness functions that would in theory work. I’m not sure if they tested the one you mention, but just thinking about it now, I would imagine that the interference from top-down layers in the RNN is more interesting when you use the squared variant, though really I’m just guessing here.
L2 norm is a regularization that promotes sparseness in the hidden layer latents. This usually results in better generality, aka more robust learned representations (avoids overfitting). [edit] I hope this helps, it is my intuition from what I’ve learned on the topic. Used all the buzzwords so you have some handles to google 😅 these ideas go by different words in various contexts. For example, I recommend looking up Steve Brunton on the L2 norm.
There are two answers. 1) The original equation in Yi et theta = mu was intended to approximate goodness in a population. That is, the emphasis in the original equations was on the shape of the function as a population, i.e., as the average or median over all pairs within the population, and the deceptive appearance of adding the two triangles no matter the order does nothing to change that. 2) However, if you wanted to study goodness of fit between a single data point and a population, that bend in function at theta = Pi is a serious issue, of which many models have been developed (including lognormal dance moves). To compute goodness of fit with respect to a single point location, you need to know precisely where it is in the population. Note that for one particular individual, location will be given by i = x / (2sigma), where x is the data value and Y is the location; hence in this formula,newsum((i-theta)^2 I moved the average of theta to screen out this incongruity.
the sum of squares is the more natural way in my mind. It is analogous to taking the squared length of a vector; for a 3d vector the length is sqrt(x^2 + y^2 + z^2). And in computation you often use the squared distance, since the sqrt is slow to compute and it is mathematically equivalent if you make theta the square of the threshold you would use if you took a sqrt.
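The equivalence described above is easy to check numerically. A small sketch, assuming a non-negative threshold (the function names are made up for this example):

```python
import numpy as np

def passes_length(y, threshold):
    """Compare the vector length (needs a sqrt) against the threshold."""
    return bool(np.sqrt(np.sum(y ** 2)) > threshold)

def passes_goodness(y, threshold):
    """Compare the sum of squares against theta = threshold**2 (no sqrt)."""
    theta = threshold ** 2
    return bool(np.sum(y ** 2) > theta)

y = np.array([3.0, 4.0, 12.0])   # length 13, sum of squares 169
same_decision = [passes_length(y, t) == passes_goodness(y, t) for t in (10.0, 13.0, 20.0)]
```

Both tests agree for any non-negative threshold because squaring is monotonic on non-negative numbers.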
First because it penalizes large deviations, but just as importantly because it's not linear. Otherwise it would just simplify away and you'd get a null result. Same reason why every artificial neuron needs some sort of non-linearity, be it a fancy sigmoid or a simple max aka. ReLU
The ability to use black box data, such as inputs from other nets, seems like a fantastic way to do sympathetic multimodal networks. For example, audio and video of a person talking, catching lip-reading and body language.
With a 3D printer with the right properties you could "program" an analog computer on the fly (this is currently feasible, but cutting edge). You could also use one expensive computer that simulates the hardware computers for quick iterations and once you are happy you can print out cheap physical computers with the tested design.
"My roomba was lucky in the chip binning process. It has the unique ability to execute a perfect kick flip down the stairs. It seems immensely proud of itself. What it doesn't know is that next year's model is supposed to be smart enough to climb back up the stairs again with no additional hardware"
so this seems like it would work really well for small data sets because you can make more negative data from less positive data. in the example in the video a 7 image is mixed with a 6 image to make a negative 7 image, but you can use the same 7 image to make a new negative image from a 5 image, so you have a (data set) worth of positive images, and you have (data set)*(data set) worth of negative images. over-fitting might be a problem.
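A rough sketch of that mixing step, loosely following the paper's description (a random bit image blurred repeatedly with a [1/4, 1/2, 1/4] filter, then thresholded at 0.5). The wrap-around blur via `np.roll`, the number of blur passes, and the function names are assumptions made for this example.

```python
import numpy as np

def random_mask(shape, rng, n_blur=10):
    """Binary mask with fairly large blobs of ones and zeros."""
    m = rng.integers(0, 2, size=shape).astype(float)
    for _ in range(n_blur):
        # Separable [1/4, 1/2, 1/4] blur along each axis, wrapping at the edges.
        m = 0.25 * np.roll(m, 1, axis=0) + 0.5 * m + 0.25 * np.roll(m, -1, axis=0)
        m = 0.25 * np.roll(m, 1, axis=1) + 0.5 * m + 0.25 * np.roll(m, -1, axis=1)
    return (m > 0.5).astype(float)

def make_negative(img_a, img_b, rng):
    """Hybrid image: pixels of img_a where the mask is 1, pixels of img_b elsewhere."""
    mask = random_mask(img_a.shape, rng)
    return mask * img_a + (1.0 - mask) * img_b

rng = np.random.default_rng(0)
seven = rng.random((28, 28))   # stand-ins for two MNIST digits
six = rng.random((28, 28))
negative = make_negative(seven, six, rng)

n_positive = 1000
n_ordered_pairs = n_positive * (n_positive - 1)   # distinct (a, b) pairs to mix
```

With a fresh random mask per pair the count of possible negatives is even larger than the pair count, so the supply of negative data is effectively unlimited compared to the positives.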
Thanks so much for such an informative video on this - best explanation of FF that I have seen and found the visual examples super useful! I have a question regarding adding black boxes - I can understand how this algorithm overcomes the problem, but what are the use cases? I'd love some concrete examples where adding a black box to the forward pass is useful/advantageous.
Chaotic behavior. Suppose a single forward pass has drastically different results from slightly perturbed inputs. This is a black box function as you have no idea how to predict what it will do unless you simulate it. This FF model would be forced to learn around that chaotic behavior, and it could be useful for modeling potentially related functions such as prime factor decomposition.
I presume you've seen the Veritasium video from a few months ago called "Future Computers Will Be Radically Different (Analog Computing)" - if not, it's relevant and interesting
Why have only one? Why not have both? Both a chip in a toaster (being efficient) and a Home Hub computer (heavy work) being able to update these chips when necessary.
Does the increase in energy efficiency compensate for the added energy cost of raising each individual AI from "birth" instead of copy-pasting pre-trained AIs into new units?
If you want to mass produce an AI that does the exact same thing, who knows (though I would guess yes). If you want a bunch of AIs that learn individually to adjust to their scenarios on the fly, then yes, absolutely.
@@JorgetePanete It is practically impossible to make perfect copies with analog systems: each copy will come out slightly different, and if you make copies of copies, errors will add up over each iteration. That is if you can even read individual components like that in the first place without tearing up the chip.
@@tiagotiagot Memristors allow reading it without/barely changing it, and you can base the following instances always on the same source, having slightly different prints from the same copy isn't ideal but in this case it may be good enough
I'm not sure digital computers are inherently more specific than analogue circuits for this kind of processing, since even in DSP we're working with floating-point numbers, heuristics, and approximations. Also, surely there would be a way to export the weights from a neural network hardware chip? Like you can save an image of an FPGA.
Forward-forward just strikes me as "dreaming" as far as one understanding of it goes. The negative data seems like the way the brain (allegedly) hallucinates implausible but familiar episodes to better anchor the concept of reality (positive data)
I'm gonna cry. For so long I believed that robot characters dying in science fiction was unrealistic because "Well they are robots! They are immortal! You can just send the data to another body!" AND NOW YOU'RE TELLING ME IT'S REALISTIC?? ***ROBOTS WILL DIE!??!? 😭*** EDIT: AND YOU'RE GONNA PUT GENERAL AI IN A FREAKING TOASTER toaster: "what is my purpose" human: "you toast bread" toaster: "oh my god"
I don't understand how the "Recurrent network on the same input multiple times" is not just a fancy way to do approximately the same thing as backpropagation. You can imagine it as not storing the full forward pass activations like regular backprop and instead just calculating the activations multiple times based on the number of layers. The only difference is that each layer gets updated multiple times instead of once, but even that is approximately close to an activity regularizer.
It is essentially trying to do something similar: the purpose of the algorithm is to pass information backwards (and forwards) through multiple layers of the network. The difference is in the information needed to update. Backprop requires storing activations at every layer until the whole forward pass is done, and then doing an update for each layer that is dependent on all layers that come after it. While yes, FF also does backpropagate through multiple layers, those updates are derived from only local layers (the layer in front and behind), so it can be done on the fly (doesn't have to wait for the forward pass to finish to at least start updating) and without having to store activations.
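To illustrate what "derived from only local layers" can look like, here is a minimal sketch of the simpler feed-forward case (no top-down recurrence): a single layer updates its own weights from its own input and output, then hands a length-normalised output to the next layer. The ReLU units, the log-loss on sigmoid(goodness - theta), the learning rate and the function names are assumptions for this example, not a transcription of the paper's code.

```python
import numpy as np

def layer_goodness(W, x):
    """Sum of squared ReLU activations of one layer."""
    h = np.maximum(0.0, W @ x)
    return float(np.sum(h ** 2))

def ff_layer_step(W, x, is_positive, theta=1.0, lr=0.1):
    """One local update: raise goodness for positive data, lower it for negative data."""
    h = np.maximum(0.0, W @ x)
    g = np.sum(h ** 2)
    p = 1.0 / (1.0 + np.exp(-(g - theta)))          # P(input is positive)
    # Gradient of -log(p) for positive data or -log(1 - p) for negative data, w.r.t. g.
    dloss_dg = -(1.0 - p) if is_positive else p
    # d(goodness)/dW = 2 * outer(h, x); inactive units have h == 0 and get no update.
    W_new = W - lr * dloss_dg * 2.0 * np.outer(h, x)
    h_out = h / (np.linalg.norm(h) + 1e-8)          # normalised output for the next layer
    return W_new, h_out

W = np.full((3, 2), 0.1)
x = np.array([1.0, 2.0])
g_before = layer_goodness(W, x)

W_pos, h_next = ff_layer_step(W, x, is_positive=True)
W_neg, _ = ff_layer_step(W, x, is_positive=False)
g_after_pos = layer_goodness(W_pos, x)   # goes up
g_after_neg = layer_goodness(W_neg, x)   # goes down
```

Nothing in `ff_layer_step` needs activations or errors from any later layer, so the update can run as soon as this layer has produced its output.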
love the video, seriously! but you do it a disservice by not at least briefly explaining backprop. relative to how high level the rest of it was, it feels like something you should already know, but even then I think it would be the bow on top of a great video! keep it up
I'm coming back to say that I think this algorithm is exactly Hebbian learning with positive data and anti-Hebbian learning with negative data. You can derive the non-linear Hebbian learning rule for the weights if you take the L2 loss as the local loss. I may have made a mistake though, don't take my word for it.
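A short version of that derivation, as a sketch under assumptions: one layer with pre-synaptic inputs x_i, weights w_{ji}, activation function f (e.g. ReLU), and the sum-of-squares goodness G as the local objective. The sigmoid factor that scales the step size is left out for brevity.

```latex
\begin{aligned}
a_j &= \sum_i w_{ji}\, x_i, \qquad y_j = f(a_j), \qquad G = \sum_j y_j^2 \\
\frac{\partial G}{\partial w_{ji}} &= 2\, y_j\, f'(a_j)\, x_i \\
\Delta w_{ji} &\propto
\begin{cases}
+\, y_j\, f'(a_j)\, x_i & \text{positive data (raise } G\text{)} \\
-\, y_j\, f'(a_j)\, x_i & \text{negative data (lower } G\text{)}
\end{cases}
\end{aligned}
```

For ReLU, f'(a_j) is 1 for active units and 0 otherwise, so the update reduces to post-synaptic activity times pre-synaptic activity: Hebbian in sign for positive data and anti-Hebbian for negative data.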
The only prof I know that is definitely doing MBRL is Mike Bowling, though most of the profs in RLAI are generally interested in RL, generally including MBRL. Though I do not know who is taking post docs. Best of luck!
Honestly, the description of this training method reminds me of phase kickback and the irrelevance of global phase vs. relative phase in quantum computing, but I may be misconstruing the algorithm here.
Its all being developed in a way that i didnt forsee, but having said that the trajectories of all attempts seem to be on a similar track . It was supposed to be at least 4 distinct layers. But it wont matter as the realisation will have already occurred to others. Also the digtial analogue machines can be clusters or stand alone units. But needs a new language, and that would need a new understanding of old techniques.
About back prop being implausible - because it needs to take a break... You know, sleep. I very much doubt we backprop in our sleep but we do something akin and the activity rest cycle is well established.
@@nocturn9x the definition of a parallel network is if it seems like layer 1 calculates loss for layer 2's output, then layer 2 has to calculate the loss for layer 3's output, and so on. This isn't true because the real loss is just the loss from the previous layer, and we just add back in the biases you would likely need to solve for if you were in a sequential network.
"Imaging you're blacking out constantly, because brain needs to stop to back propagate" You're basically describing sleep, it's just delayed blacking out 😂
This method only makes training potentially possible with less or more analog computational resources. It has no direct relation with the complexity of the model itself such as number of parameters. It has no implication on the inference hardware that the model is deployed on as far as I can tell so I don’t see where the title of the video comes from…
I’ve always had a hunch that sleep was just data training for living beings. Probably more complicated than that but it would make sense. Gather all the experiences of a day, wait for sleep, and then use the data to retrain at night, then clear the “dream storage” for another day of collection.
“Imagine your brain blacking out every couple seconds” I mean we do sleep for 8 hours a night
seconds = hours? Please watch the video, he talks about sleep too.
@@nullbeyondo when the joke is taken seriously
@@ThompYT Bono found the pub ya
ENOUGH IS ENOUGH.
kidding.
And I'd hazard a guess that dreams are mostly our brains doing some sort of data processing on the day's events.
Doesn't the brain black out every 100ms already ?
Watched the entire video. It's very well-put. I also like that you're still showing parts of the papers so we can pause the video to read the rest, instead of just the highlighted parts. It actually interested me to read the entire paper with a more vivid view of it as you go through it.
u must be really bored
I heard about the Forward-Forward algo on the podcast Eye on AI. There was so much technical detail that I felt like I was lost. But you did a great job!
I just want to note that there have been many recent alternatives to traditional neural networks that do not rely on backpropagation or storing the results of the forward pass.
Two examples are Neural Ordinary Differential Equations [Chen, 2017] and Deep Equilibrium Models (DEQ) [Bai, 2019]. The first performs both the forward pass and the backward pass by solving differential equations, while the latter does so by solving fixed-point equations.
DEQs have been particularly successful. They are theoretically well-understood, use significantly less memory than regular neural networks because they do not need to store the results of the forward pass, and have achieved comparable or SOTA results in many different tasks.
There has also been some work in creating differentiable machine learning models based on optimizers such as Quadratic Programs [Amos, 2017], Semidefinite Programs [Wang, 2019] and even Integer Programs [Paulus, 2021].
These models can be trained using gradient-based optimization without storing intermediate results or using backpropagation, although their results are overall less impressive.
Interestingly, Zico Kolter (who has been involved with DEQs) has mentioned a few years ago that specialized hardware could be developed to solve fixed-point equations. This could greatly favor Deep Equilibrium Models by providing some of the benefits outlined in Hinton's paper.
Personally, I think there has been some overreaction to Hinton's paper.
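As a toy illustration of what "solving a fixed-point equation" in the forward pass means, here is a scalar sketch: instead of stacking layers, the same map is applied until its output stops changing. The map z = tanh(w * z + x), the function name `solve_fixed_point`, and all parameter values are assumptions picked for the demo, not the actual DEQ architecture or solver.

```python
import math

def solve_fixed_point(x, w=0.5, tol=1e-12, max_iter=1000):
    """Find z with z = tanh(w * z + x) by plain fixed-point iteration.

    With |w| < 1 the map is a contraction, so the iteration converges.
    """
    z = 0.0
    for _ in range(max_iter):
        z_next = math.tanh(w * z + x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

z_star = solve_fixed_point(x=1.0)
# How far z_star is from satisfying the equation exactly.
residual = abs(z_star - math.tanh(0.5 * z_star + 1.0))
```

Only the final `z_star` is kept; the intermediate iterates are thrown away, which is the intuition behind the memory savings mentioned above.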
True, I also have been interested in the papers that you mentioned, and what's great about them and backprop in particular is that they rely on very solid and very well understood math, where models are just a form of notation for expressing calculations that a computer can execute. However, Hinton's "Forward forward" relies on some heavily intuitive engineering. Not saying that to discredit Hinton's research, but even in this overview it is mentioned that there are too many broad terms used, like "goodness" and "positive and negative" data. It would be really cool if an attempt had been made to boil down these vague terms to statistical interpretations, like in-distribution and out-of-distribution data, and provide some theoretical output, but I guess that's out of scope for the paper.
I have also seen bayesian inference neural networks which are really interesting!
By the way, what happened to capsule networks?
@@kuretaxyz Well, they sort of are not that relevant today, as the attention mechanism really rules in most of today's architectures - and it pretty much learns what to learn, while capsules tried to capture specific information through manual engineering of the network. They can be more efficient - but hardware is not an issue today (yet).
@@hr3nk aren't capsules more expensive to compute, but have the advantage of being orientation-independent? Which is probably one or a few clever hacks away from becoming practical, like how YOLO supplanted U-net for classification/detection and segmentation at once
"What if we abandoned computers altogether?"
Bah gawd, that's the Butlerian Jihad's music
Self-organizing FPGA chips will someday be a thing, very exciting stuff
I'm curious to see more on this - any good references? I love FPGAs.
Hi, what do you mean by FPGA? I am aware of Self-Organizing Maps, but what are FPGAs?
@@randalllionelkharkrang4047 guessing Field Programmable Gate Array.
I vaguely remember goofing around with one in the mid '80s and I too was thinking that variations on that theme might pop up in this context.
And that the weights of neural networks in solutions inspired by FPGAs might be at least partially digital, solving some of the bootstrapping issues.
Then again, the '80s are a while back and I haven't kept up on the hardware side of things.
Been fantasizing about this for years now
34:36 exactly, staying in the digital domain means that if you do something in the morning of 2023, february 16th in Sweden where it's snowing. Or if you do something in the Sahara desert at noon where it's scorching hot, you'll get the same answer. With an analog computer you won't, because resistors are affected by temperature. The copper traces on the PCB that contains all the analog voltages will be susceptible to noise, any electromagnetic wave that gets absorbed by the circuit... small or large... goes into the circuit and affects the output. With a digital circuit you have a bit, the voltage levels between a logic 1 and a logic 0 are so vast that they will never be misinterpreted (sometimes they do, but that depends on your made-up scenario).
Great dog! I love the pose with the legs and tail in motion, it really captures the forward direction of the dog. And the tilt of the head looks so curious and expressive!
I have tried using the Forward-Forward algorithm as a way to learn universal probability distribution approximators, that could then be used to detect anomalies and generate new data, but it didn't work very well. Perhaps if I used the more complex variant that has feedback it would have worked better, but it is very complicated for a small gain imo.
But far from me the idea of throwing out forward propagation, I think it is the way to go for future AIs.
Instead of Forward-Forward, there is this preprint called "Signal Propagation: A Framework for Learning and Inference In a Forward Pass", which is a very similar framework that actually has the Forward-Forward algorithm as a special case, in a way. Sigprop also has a mechanism for feedback, which seems much more efficient.
I hope you will make a video about this paper also, this one was very well explained.
I still have yet to see an implementation of the more complex feedback version. I would really like to see it, but it's not the easiest thing to implement. I don't even think the performance will be great, but I would guess/hope that it's naturally resistant to adversarial attacks. Probably depends on the way negative data is generated though, which is kind of a major weakness overall.
@@JeffHykin You could look at it as a weakness or a benefit, the benefit being that figuring out how to generate good negative data could potentially make them even more effective than current backpropagated NNs
@@seanoconnor1984 direct random feedback alignment works much better on multilayer perceptrons; I don't think the method will work if you use it on CNNs or transformers. You also have to correctly scale your random feedback matrices and make sure they stay constant throughout the training. You also have to start from a randomly initialized model and the last layer should be learned normally. Lastly, you should prevent gradients from flowing from a layer to the previous one.
If you do all of these things, the model should train, but probably not to the level of an equivalent model with regular backprop.
"I toast, therefore I am"
Red Dwarf
As someone with an original background in electrical engineering I can immediately see the potential in this. There are some other benefits as well to stand alone analog AI chips. Lower latency, perhaps even the theoretical limit of the speed of light with photonic analog. Higher precision and divisibility for a more fine control. The downside would be state control and retention. There has to be a backup in case power runs out. Otherwise you'd have AI's that "Die" when the chip loses power.
I think it's important for computer scientists to recognize that traditional distinctions like von Neumann architectures, the software-hardware divide and digitization of data are all at the end of the day just human inventions and tools developed for a reason. If this reason changes then the tools and abstractions can also fade away. There's a reason evolution favored analog thinking machines. It's just way more power efficient and I think this is the obvious future of computing in the far future.
I know hardly any electrical engineering, so it’s interesting to hear from others that know a lot more about this, I wish I knew more so I could dive deeper into these things in my videos
I do acknowledge the benefits but it's much harder to research neuromorphic hardware, let alone get a company to invest in neuromorphic chips when it could just use cloud computing or just buy already easily scalable general computers. I admit it's really frustrating, the kind of world we live in. It'll happen one day, I believe, but very slowly.
I'm not sure I'm convinced that analog is always superior to digital. When it comes to procedural processes, I would assume digital is more efficient. I am happy to be corrected.
The benefits of analog hardware for things like neural nets seem obvious, but I would assume the future of computing would involve a combination of analog and digital.
@@sb_dunk You're getting the wrong idea. Analogs are not superior, but when it comes to computations on narrow tasks, they're much faster as they don't require cycles to perform computations. You could just let the laws of physics do the computations for you. Just like how quantum computers work in some narrow-purposed tasks.
@@nullbeyondo I understand that, but the original comment said "There's a reason evolution favored analog thinking machines. It's just way more power efficient and I think this is the obvious future of computing in the far future."
I was challenging the idea that analog is the future. For AI and similar systems, maybe, but I doubt digital computing will be phased out (pun intended).
Regarding the analog to digital converters: They are only expensive and take a lot of power because you need them for all the weights. If you just want to interface the input and output of the network with a digital system, the number of bits and speed required is much more manageable and so is the power. So it seems doable to have an AI accelerator using an analog chip for the Forward-Forward algorithm connected to regular digital stuff that is as high or low power as required for non AI functions.
33:20 Multiplying two analog voltages is much more complicated than you'd expect. There are many many many many different ways of doing analog analog multiplication, but all of them are either more complicated than the binary multiplication, slower than the binary multiplication or less correct than the binary multiplication (5x3 is 15, not 14.3 or 18.01 which certain analog analog multiplication methods could spit out).
Multiplying an analog voltage by a constant is easy peasy though, and luckily this is what the weights usually are.
With adjustable weights it could correct for manufacturing tolerance issues. It could have an initial set of weights that are “about right” for its purpose, then it could learn from actual performance to adjust them to account for those inconsistencies and even adjust for issues of age or local EM noise environments, which would play hell with analog chips.
@@DeruwynArchmage And how would these "adjustable weights" work?
The "sleep is unlearning thing by negative examples" notion is decades old. I read it in a book about NNs about 30 years ago. :-)
I love how radical the ideas in the end of the paper are. Kind of vague still, but sometimes one has to challenge everything to get something better. To get out of a local maximum.
7:23 I love your pretty doggo picture, I subscribed instantly and will hope for many more doggo pictures to come! 😍
very RL like in its local kind of update rules! thanks for bringing this up Dr Meyer!
Hi Edan!
This is very interesting! Lovely overview. I'm excited!
Nice spider-dog!
Been a long time since I touched NNs but, having taken a course of his some years back, I must say: Hinton is not only a great researcher but also a great teacher.
Thanks for doing this rundown of the paper!
Sounds like FF being more power efficient, and more easily parallelized, while being slower to learn might also make it usable for battery powered online learning for a start and then BP, or some other algorithm, comes in when charging in sleep mode.
Speaking of dreams, it's a great example of how the brain settles over the years.
With the dreams initially being muddy mixtures of mostly negative learning data as a child; really bringing to mind that masking method for generating negative learning data shown in the MNIST example.
Some decades ago there were analogue FPGAs called EPAC. That could be a good way to copy these NNs. Although it might be more viable to have a settling mode over a set range of paired I/O.
Great Video!
Kind of a bold move to be the only author and only mention people in the acknowledgements
24:11 I noticed that hidden layers combine top-down inputs and bottom-up inputs, but why are there blue arrows from the hidden layers to the top layers?
That idea about dreams at 29:00 is really interesting and makes me feel like I should probably get more sleep 😂
2:50 It doesn't sound unreasonable that the Human brain would buffer backpropagation and then have most of it happen during Sleep, it'd explain the biological necessity of sleep and why we appear to recall information easier after having slept on it.
Also, haven't researchers found neurons that propagate signals towards "the reverse direction"(e.g. from visual cortex towards the eyes)? So "backprop" could happen through them triggering certain activations, even though internal mechanisms might be insufficient?
There’s also research that shows that if you take a small break where you sit there and do literally nothing for 30 seconds every few minutes during active learning you learn at such a faster rate that it exceeds margin of error
Which could be interpreted as forcing the backprop buffer to be cleared
@@revimfadli4666 there are several neural layers behind each retina that, at the very least, seem to work as a sort of edge-detection. If that's what you mean with "internal mechanisms"?
@@顔boom no not edge detection, intracellular backprop
@@revimfadli4666 Sorry, I should've been more specific: My question was: What sort of "internal mechanism" do you consider to be insufficient for backprop to be functioning?
The layers seem to work as a sort of edge-detection, among other filters, just like how the AI focuses on hard edges when learning to identify numbers in MNIST.
By that parallel the signals found in "the reverse direction" would then be a form of backprop that trains these filters.
It could also explain why, in some cases, myopia & hyperopia can be, so to speak, "learned away". But I digress…
The potential consequences of this algorithm are really cool! Thank you so much for sharing all these great papers and topics.
You do actually black out for 1/3 of the day every day and let your brain do the backprop and organize the data. It's called sleep.
27:41
My biggest issue with this is that FF normally stands for Feed Forward
it's sorta the same thing.
It's not a big deal, it happens often. There's only so many letters with words that can make sense
Paper should call it FFA
Or Final Fantasy....
Fast Forward
35:40. I actually think this is the wrong takeaway. The reason we don't use analog computers (we tried to in the beginning) is because of how difficult differentiation is. When you have a range of voltages for different values, such as ten for base10, it's difficult to identify if the voltage is really a "1" and not, say, a "0" or a "2". Whereas with digital logic, you have a voltage? 1. You don't? 0. That's not too complicated: you don't need to measure how high it is, don't need to compare it against a discriminator, you don't have to implement complex error detection and correction for simple register states, you don't have to worry about how finely tuned the hardware is, and how much it's drifting over time (and all components in the system are given to drift, and not necessarily in step with one another).
Neural nets, on the other hand, are far more forgiving about voltages (weights) drifting a little, and besides that, implementing NNs in some sort of analog system would be ideal. Especially if you're using a spike-driven neuromorphic chip. But that doesn't mean that the weights have to die with the hardware. There is nothing stopping us from designing the chip to be able to import and export its weights. After all, the chip itself has to be able to tune these weights, which means the chip has the ability to "read and write" them. You may need to have an ADC and a DAC (assuming you want digital conversion) present for these operations, but that doesn't mean they would have to be a part of its regular duty cycle; they would be reserved for only those times when the weights needed to be imported or exported. Otherwise, there would be no power going to them, and they wouldn't be in the loop.
I really have no idea why Hinton thinks that a hardware/software intermingling necessitates the NN dying with the hardware. Really doesn't make any sense, for the reasons explained above. Moreover, what are you going to do, train every one of these intermingled systems from scratch? You're going to want to establish the foundational weights before you deploy to the field (you know, where power budgets really matter). You're going to want to be able to mass produce these systems. And I really see no technical reason why we can't read/write analog weights from/to the system. I question how strong Hinton's knowledge is on this particular aspect of the subject.
This is so wonderful and interesting to me. And thank you!
I am trying to find the Hinton talk where he discusses dream activity you mentioned.
Can you provide a link? Please?
25:45, Nah I'd say that big thing in the middle is just taking a copy of the results from the previous layer and identifying how closely the combination of them matches the specific thing it is supposed to match, can only be between 0% & 100% result and pass the merged data into the next node which is receiving from other local nodes, if no node in the chain matches more than 50% then a new node is created to identify a new pattern with, in other words every node set is a growable array, never a fixed static array
23:03 this is neat, if you model it that way, you don't need massive amounts of memory, actually, you can design electrical circuits that will do it with very little cell memory
Great video! What annotation software do you use?
This naming convention is gonna give us things like FoFo Vs FeFo
And don’t even get me started on FeeFi vs FoFum
And don't translate the first one from Portuguese to English
@@Wertsir feerless fidelity?
@@revimfadli4666 I smell the blood of an englishman. Be he living, or be he dead, I'll grind his bones to make my bread.
If the forward pass consists of a black box, then you can't do a single level of gradient descent. How do you encourage that black box to return higher values for positive data and lower values for negative data? Hinton mentions how you do not need to know the precise structure of the forward pass calculation, so you can save on computational resources and/or power, but that takes me straight back to my original question.
Amazing video holy cow
Reminds me of echo state networks(ESNs) where the earlier layers are just "whatever a random network happens to do" and the last layer linearly learns to "make sense" of it, but this time the "random layers" also learn too
Great paper digest and comments, thanks!
i was just wondering about this today. great video
the last part reminds me strongly of Conway's idea that you could do addition and other operations on a random bundle of electrical circuitry, just learning how to input and read output
although it's kind of reversed, because here we would adjust the weights (change "circuitry", not how we input and interpret) so that it gives good output
Your description of backprop makes me think of somebody sleeping/dreaming
22:40 oooh. Looks like a cellular automaton!
Beautiful idea and a great explanation. I am following AI research a bit in my free time and I feel that just watching your videos gives me more understanding than if I'd read the paper myself, for much less effort. Thank you
On second thought, it's weird that early layers are updated directly on whether the data is positive or negative (hot dog vs not hot dog). It's a high level piece of information.
@@alengm 🥺
I think the reason you want a high output on the first layer (sum of squares) is because that means that it has identified some characteristics somewhere. You don’t care which neurons activate at that stage, so long as at least some of them do. If none of them do, then that means that what ever it’s looking at, it’s not recognizing at all.
Reminded me of the work of psychologist Robyn Dawes who was keen on improper linear models (eg weights +1 or -1) which performed similarly to experts if the right feature set was being input. Each row is a bit like a set of improper linear models scoring the previous row. His work was often in areas where the between expert and even repeat expert agreement wasn’t perfect eg medical tasks. The improper linear models approached between expert reliability. Simple linear scoring models are often used in medical diagnostics again with the importance being which features are chosen. Sounds very promising.
This is better explained than the Hinton interview with Eye on AI.
I don't think the title is accurate. The new method only affects training , not inference. Even if it turns out to be useful for LLMs, it won't make them easy to run on low powered hardware.
New Hinton paper lets go 😁
That's amazing popularization, thanks! Some parts were not easy to follow though, like the "6 video": I didn't get what the x-axis was for
I think the x-axis was time and the y-axis was how far through the model the information was
My question is, how necessary is it really to do learning on an analog chip? Most of the energy cost is in inference, and my initial impression from "chips that dynamically learn" is that I expect they would require a ton of maintenance to perform well. The human brain has a constant stimulus and energy source, and a very important feature of a toaster or any other electrical device is that I would like to be able to unplug it or not use it for awhile. To me, it seems a much better way of going about this is to have a more hardcoded analog chip that is somehow update-able. Development still happens on digital computers so you don't need to incur all the logistical and manufacturing cost of training individual analog chips, and we find a way of transferring from a digitally trained model to an analog chip for the sake of inference.
This does feel like such a simple idea though that I assume there are big issues with what I'm saying, I'm obviously no expert in analog computing. I know the transfer from digital to analog is expensive, but (as I understand) this is primarily a problem in backprop because of how often you need to update w/backprop, whereas if I had to update "my toaster" only like, once every few years or whatever, that just seems significantly more preferable.
Analog is certainly one of the directions we could go to greatly reduce energy cost, but my immediate feeling is that we're gonna crack more efficient use of CPUs w/large NNs a lot sooner.
It certainly comes with many cons as you mention, and for that reason I’m sure we will always have digital, but specifically for a continual learning on the fly setting (as opposed to an update once every few years) I would say this direction seems more promising. I would imagine the future will be a mix of both, using them where they make the most sense
@@EdanMeyer It seems cool, but designing analog hardware is hard from my experience
it's really hard to download an application if it's hard coded on to hardware
There will be a way to train the chip, if I understood correctly?
aren't instruction sets hardcoded onto hardware too? yet programs can still be installed
@@revimfadli4666 I mean basic computations carried out by the design of a circuit. The rule of thumb is to do a lot of simple operations to make a complicated operation. More complex instructions take longer; for example, multiplication and division take longer than addition. Having a bunch of little simple components doing simple operations that result in a complex output is "better" than a big complex component doing a big operation when doing common operations. Having a customized circuit for just a neural network is a cool idea but I would rather design a more general computing machine where the heavy AI instructions are digital and the orders are relayed to the hardware. I was taught that having many identical circuits all doing a little bit of the computation is better design than a custom design. Maybe there is a research institute looking into this design.
35:17 that's called an FPGA. That's when you describe hardware with software and you get absurd, like, really absurd, throughput. Massive parallelism.
could you (theoretically) have a "2 layer" chip where 1 layer is the actual chip and the second one is just many sensors and "writers" that can output the weights and biases in digital form, so that you can later "copy" to the writers that write the bias to the local cell? this would be like taking a human brain and saying: i want this neuron to connect to this neuron in this way
basically copying the whole chip?
The description of the algorithm itself was way too fast. We have gradient descent between layers and in the end everything is dumped into a linear classifier? Or something?
How is this more biologically plausible? If there is an error signal between layers then it might well be passed "further back", right? So error backprop is still possible under that assumption.
Also I don't think the brain has extra linear classification modules.
And this example was only for classification, correct? How would this work for regression?
Maybe what's more biologically plausible is local vs backprop gradient descent/ascent?
The problem with LLMs is not only resources for training (backprop), but also in inference itself (forward pass). It takes a lot of gpu flops to get a token through the system
What is the talk about forward forward algorithm referenced in the video, did not find in description 😢
Can someone please explain to me why we take the sum of the squares of yi, minus theta, and not just the sum of yi minus theta, in the goodness function?
Good question, there actually are many goodness functions that would in theory work. I’m not sure if they tested the one you mention, but just thinking about it now, I would imagine that the interference from top down layers in the RNN is more interesting when you use the squared variant, though really I’m just guessing here.
L2 norm is a regularization that promotes sparseness in the hidden layer latents. This usually results in better generality aka more robust learned representations (avoids overfitting).
[edit] I hope this helps, it is my intuition from what I’ve learned on the topic. Used all the buzzwords so you have some handles to google 😅 these ideas go by different words in various contexts. For example, I recommend looking up Steve Brunton on the L2 norm
There are two answers. 1) The original equation in Yi et theta = mu was intended to approximate goodness in a population. That is, the emphasis in the original equations was on the shape of the function as a population, i.e., as the average or median over all pairs within the population, and the deceptive appearance of adding the two triangles no matter the order does nothing to change that. 2) However, if you wanted to study goodness of fit between a single data point and a population, that bend in function at theta = Pi is a serious issue, of which many models have been developed (including lognormal dance moves). To compute goodness of fit with respect to a single point location, you need to know precisely where it is in the population. Note that for one particular individual, location will be given by i = x / (2sigma), where x is the data value and Y is the location; hence in this formula,newsum((i-theta)^2 I moved the average of theta to screen out this incongruity.
the sum of squares is the more natural way in my mind. It is analogous to taking the squared length of a vector; for a 3d vector the length is sqrt(x^2 + y^2 + z^2). And in computation you often use the squared distance, since the sqrt is slow to compute and the comparison is mathematically equivalent if you make theta the square of the threshold you would use if you took a sqrt.
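The equivalence described above can be checked with a small sketch. The function names and test vectors are made up for illustration, and it assumes a non-negative threshold.

```python
import math

def exceeds_by_length(v, threshold):
    """Compare the Euclidean length of v against the threshold (needs a sqrt)."""
    return math.sqrt(sum(a * a for a in v)) > threshold

def exceeds_by_squared_length(v, threshold):
    """Same decision without the sqrt: compare against the squared threshold."""
    return sum(a * a for a in v) > threshold ** 2

vectors = [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
thresholds = [0.5, 1.5, 4.9, 5.1]

# Both tests should agree for every vector and every non-negative threshold.
agreement = all(
    exceeds_by_length(v, t) == exceeds_by_squared_length(v, t)
    for v in vectors
    for t in thresholds
)
```

So thresholding the sum of squares at theta makes the same decision as thresholding the vector length at sqrt(theta), just without computing a square root.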
First because it penalizes large deviations, but just as importantly because it's not linear. Otherwise it would just simplify away and you'd get a null result. Same reason why every artificial neuron needs some sort of non-linearity, be it a fancy sigmoid or a simple max aka. ReLU
The ability to use black box data, such as inputs from other nets, seems like a fantastic way to do sympathetic multimodel networks. For example, audio and video of a person talking and catching lip-reading and body language.
With a 3D printer with the right properties you could "program" an analog computer on the fly (this is currently feasible, but cutting edge). You could also use one expensive computer that simulates the hardware computers for quick iterations and once you are happy you can print out cheap physical computers with the tested design.
3:10 - I always thought of a second process backproping and updating also we get sleep
Sounds a bit like "Contrastive Divergence", which is the traditional training algorithm of Restricted Boltzmann Machines (RBMs).
FF sounds like it is amenable to also going up in scale, with the local interactions split up between different cloud clusters.
Great drawing of the dog.
"My roomba was lucky in the chip binning process. It has the unique ability to execute a perfect kick flip down the stairs. It seems immensely proud of itself. What it doesn't know is that next year's model is supposed to be smart enough to climb back up the stairs again with no additional hardware"
22:48 Discord?
I stopped the video to check my messages.
Your doggo is impeccable
(Until ~13:00) - Is this not (a close variant of) Direct Feedback Alignment?
That was an amazing doggo. 7:24
So this seems like it would work really well for small data sets, because you can make more negative data from less positive data. In the example in the video, a 7 image is mixed with a 6 image to make a negative 7 image, but you can use the same 7 image to make a new negative image from a 5 image. So you have a (data set) worth of positive images and a (data set)*(data set) worth of negative images. Over-fitting might be a problem.
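A rough sketch of that counting argument in Python with NumPy. The random-mask mixing in `make_negative` is a simplification assumed here for illustration (it is not claimed to be the exact masking used in the paper), and the five random arrays stand in for real digit images.

```python
import numpy as np

def make_negative(img_a, img_b, rng):
    """Mix two images pixel by pixel using a random binary mask (assumed scheme)."""
    mask = rng.random(img_a.shape) < 0.5
    return np.where(mask, img_a, img_b)

rng = np.random.default_rng(0)
positives = rng.random((5, 28, 28))  # stand-in for 5 real positive images
n = len(positives)

# Every ordered pair of distinct images can be mixed into a negative example.
pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
negatives = [make_negative(positives[i], positives[j], rng) for i, j in pairs]

print(n, "positives ->", len(negatives), "negatives")
```

With n positives this gives n*(n-1) mixed images, which is the roughly quadratic growth the comment describes. In practice you would probably only mix images of different classes, which lowers the count somewhat.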
Thanks so much for such an informative video on this - best explanation of FF that I have seen and found the visual examples super useful! I have a question regarding adding black boxes - I can understand how this algorithm overcomes the problem, but what are the use cases? I'd love some concrete examples where adding a black box to the forward pass is useful/advantageous.
Chaotic behavior. Suppose a single forward pass has drastically different results from slightly perturbed inputs. This is a black box function as you have no idea how to predict what it will do unless you simulate it. This FF model would be forced to learn around that chaotic behavior, and it could be useful for modeling potentially related functions such as prime factor decomposition.
So this is the last component for the singularity?
I presume you've seen the Veritasium video from a few months ago called "Future Computers Will Be Radically Different (Analog Computing)" - if not, it's relevant and interesting
Why have only one? Why not have Both?
Both a chip in a toaster (being efficient) and a Home Hub computer (heavy work) being able to update these chips when necessary.
Incredible drawing 😍
Does the increase in energy efficiency compensate for the added energy cost of raising each individual AI from "birth" instead of copy-pasting pre-trained AIs into new units?
If you want to mass produce an AI that does the exact same thing, who knows (though I would guess yes)
If you want a bunch of AI that learn individually to adjust to their scenarios on the fly, then yes absolutely
@@EdanMeyer What sort of use would there be for untested AI that may not have learned the right lesson yet?
I don't get the point of mortal software, can't you save the resistance of memristors and use it on a new instance?
@@JorgetePanete It is practically impossible to make perfect copies with analog systems. Each copy will come out slightly different, and if you make copies of copies, errors will add up over each iteration. That is if you can even read individual components like that in the first place without tearing up the chip.
@@tiagotiagot Memristors allow reading it without/barely changing it, and you can base the following instances always on the same source, having slightly different prints from the same copy isn't ideal but in this case it may be good enough
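The two copying strategies in this exchange can be compared with a toy simulation. Everything here is an assumption made for illustration: the weights are random numbers, each copy adds independent Gaussian noise of size `noise_std`, and `noisy_copy` is a hypothetical stand-in for reading and re-printing an analog chip.

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.normal(size=10_000)  # hypothetical "master" weights
noise_std = 0.01                  # assumed error added by each copy
generations = 20

def noisy_copy(weights):
    """Return an imperfect copy: the original plus independent Gaussian noise."""
    return weights + rng.normal(scale=noise_std, size=weights.shape)

# Copies of copies: the error compounds with every generation.
chain = source
for _ in range(generations):
    chain = noisy_copy(chain)

# Always copying from the same source: only one round of error.
direct = noisy_copy(source)

def rms_error(copy):
    """Root-mean-square deviation of a copy from the original source weights."""
    return float(np.sqrt(np.mean((copy - source) ** 2)))

chain_error = rms_error(chain)
direct_error = rms_error(direct)
print(f"copy of copy x{generations}: {chain_error:.4f}, direct copy: {direct_error:.4f}")
```

Under these assumptions the chained copy drifts by roughly sqrt(generations) times the single-copy error, while copies made from the same source stay at the single-copy error, which is the point made in the reply above.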
3:08 sleeping is just backprop in real life
no
Great dog drawing ❤
I'm not sure digital computers are inherently more specific than analogue circuits for this kind of processing, since even in DSP we're working with floating-point numbers, heuristics, and approximations.
Also, surely there would be a way to export the weights from a neural network hardware chip? Like you can save an image of an FPGA.
It really sounds like the whole statistical learning field, e.g. Boltzmann machines, reservoir learning, the Hebbian rule, etc.
Forward-forward just strikes me as "dreaming" as far as one understanding of it goes. The negative data seems like the way the brain (allegedly) hallucinates implausible but familiar episodes to better anchor the concept of reality (positive data).
I'm gonna cry
for so long I believed that robot characters dying in science fiction was unrealistic because "Well they are robots! They are immortal! You can just send the data to another body!"
AND NOW YOU'RE TELLING ME IT'S REALISTIC??
***ROBOTS WILL DIE!??!? 😭***
EDIT: AND YOU'RE GONNA PUT GENERAL AI IN A FREAKING TOASTER
toaster: "what is my purpose"
human: "you toast bread"
toaster: "oh my god"
Weights & Biases changed their name to Clear ML?
I don't understand how the "Recurrent network on the same input multiple times" is not just a fancy way to do approximately the same thing as backpropagation.
You can imagine it as not storing the full forward pass activations like regular backprop and instead just calculating the activations multiple times, based on the number of layers.
The only difference is that each layer gets updated multiple times instead of once, but even that is approximately close to an activity regularizer.
It is essentially trying to do something similar: the purpose of the algorithm is to pass information backwards (and forwards) through multiple layers of the network. The difference is in the information needed to update. Backprop requires storing activations at every layer until the whole forward pass is done, and then doing an update for each layer that is dependent on all layers that come after it. While yes, FF also does propagate information backwards through multiple layers, those updates are derived from only local layers (the layer in front and behind), so it can be done on the fly (it doesn't have to wait for the forward pass to finish to at least start updating) and without having to store activations.
Impeccable drawing sir. Eh hem. Exemplary.
Love the video, seriously! But you do it a disservice by not at least briefly explaining backprop. Relative to how high-level the rest of it was, it feels like you should know, but even then I think it's the bow on top of a great video! Keep it up.
I'm coming back to say that I think this algorithm is exactly Hebbian learning with positive data and anti-Hebbian learning with negative data. You can derive the non-linear Hebbian learning rule of the weights if you take the L2 loss as the local loss.
I may have made a mistake though, don't take my word for it.
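One way to sketch the derivation this comment points at is below. It rests on assumptions not stated in the comment: a single ReLU layer, goodness defined as the sum of squared activities, and a logistic local loss rather than an L2 loss. The symbols $W$, $x$, $y$, $g$, $\theta$, $p$ and $\eta$ are introduced here for illustration.

```latex
% Assumptions: y_j = \max(0, \sum_k W_{jk} x_k), goodness g = \sum_j y_j^2,
% threshold \theta, p = \sigma(g - \theta), learning rate \eta.
\begin{align}
\frac{\partial g}{\partial W_{jk}} &= 2\, y_j\, x_k \\
L_{\text{pos}} = -\log p
  \;&\Rightarrow\;
  \Delta W_{jk} = -\eta\, \frac{\partial L_{\text{pos}}}{\partial W_{jk}}
                = +2\eta\,(1 - p)\, y_j\, x_k \\
L_{\text{neg}} = -\log(1 - p)
  \;&\Rightarrow\;
  \Delta W_{jk} = -\eta\, \frac{\partial L_{\text{neg}}}{\partial W_{jk}}
                = -2\eta\, p\, y_j\, x_k
\end{align}
```

Both updates are a product of post-synaptic activity $y_j$ and pre-synaptic activity $x_k$, with a positive sign for positive data (Hebbian) and a negative sign for negative data (anti-Hebbian). Any other local loss that depends on the weights only through $g$, including an L2 loss on $g$, changes only the scalar factor in front.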
Hi Edan, I'm looking for potential postgraduate degrees in Alberta doing model based RL, do you have any recommendations for supervisors?
The only prof I know that is definitely doing MBRL is Mike Bowling, though most of the profs in RLAI are generally interested in RL, generally including MBRL. Though I do not know who is taking post docs. Best of luck!
@@EdanMeyer thanks!
Honestly, the description of this training methods reminds me of phase kickback and the irrelevance of global phase vs. relative phase in quantum computing, but I may be misconstruing the algorithm here.
It's all being developed in a way that I didn't foresee, but having said that, the trajectories of all attempts seem to be on a similar track.
It was supposed to be at least 4 distinct layers. But it won't matter, as the realisation will have already occurred to others.
Also the digital analogue machines can be clusters or stand-alone units. But it needs a new language, and that would need a new understanding of old techniques.
Thank you so much
About back prop being implausible - because it needs to take a break... You know, sleep. I very much doubt we backprop in our sleep but we do something akin and the activity rest cycle is well established.
thanks a million, perfect
Weird question: what software are you using to highlight the paper and draw on it?
Microsoft paint
PDF opened in Microsoft Edge can highlight and annotate
Wait, people don't black out multiple times a day to dissociate?
It's like an integrated neural network ASIC, an NNASIC. Great video!
I wonder if this can be combined with "fast weights"
Well, time to start some Neurals from scratch.
7:27 nice looking dawg, my dawg.
I like your Drawing. I'mma sub
How does it actually apply adjustments to the weights/biases?
Gradient descent on each layer in regards to the goodness function
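A minimal NumPy sketch of what that could look like for one layer. It assumes a ReLU layer without biases, goodness as the sum of squared activities, and a logistic loss on goodness minus a threshold. The function name `ff_layer_step` and the values of `theta` and `lr` are made up for illustration, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ff_layer_step(W, x, positive, theta=2.0, lr=0.03):
    """One local gradient step on a single layer; returns (new_W, goodness)."""
    y = np.maximum(0.0, W @ x)   # layer activities (ReLU)
    g = float(np.sum(y ** 2))    # goodness = sum of squared activities
    p = sigmoid(g - theta)       # probability that the input is "positive"
    # Loss is -log(p) for positive data and -log(1 - p) for negative data.
    dloss_dg = -(1.0 - p) if positive else p
    # dg/dW = 2 * outer(y, x); only this layer's input and output are used.
    grad_W = dloss_dg * 2.0 * np.outer(y, x)
    return W - lr * grad_W, g

rng = np.random.default_rng(0)
W0 = rng.random((4, 3))          # hypothetical initial weights
x = np.array([0.5, 1.0, 0.2])    # hypothetical input to this layer

W_pos, g_before = ff_layer_step(W0, x, positive=True)
_, g_after_pos = ff_layer_step(W_pos, x, positive=True)

W_neg, _ = ff_layer_step(W0, x, positive=False)
_, g_after_neg = ff_layer_step(W_neg, x, positive=False)

print(g_before, g_after_pos, g_after_neg)
```

The gradient here never passes through any other layer: each layer has its own loss and only needs its own input and output, which is the sense in which this differs from end-to-end backprop. A positive step pushes the goodness for that input up, and a negative step pushes it down.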
@@angrymurloc7626 isn't that just backprop
@@nocturn9x No
@@phoneticalballsack care to elaborate?
@@nocturn9x the definition of a parallel network is if it seems like layer 1 calculates loss for layer 2's output, then layer 2 has to calculate the loss for layer 3's output, and so on. This isn't true because the real loss is just the loss from the previous layer, and we just add back in the biases you would likely need to solve for if you were in a sequential network.
I wonder if mortal computation was the inspiration behind Overwatch's omnic irreplaceability
"Imaging you're blacking out constantly, because brain needs to stop to back propagate"
You're basically describing sleep, it's just delayed blacking out 😂
If that was the only backpropagation then you would only be able to learn in your sleep
This method only makes training potentially possible with less or more analog computational resources. It has no direct relation with the complexity of the model itself such as number of parameters. It has no implication on the inference hardware that the model is deployed on as far as I can tell so I don’t see where the title of the video comes from…
I’ve always had a hunch that sleep was just data training for living beings. Probably more complicated than that but it would make sense. Gather all the experiences of a day, wait for sleep, and then use the data to retrain at night, then clear the “dream storage” for another day of collection.
There's research that supports this pretty strongly so yeah. I think even Deepmind did some on it, or at least cited it at some point.