I've never read any papers about this - just my personal experience and talking to colleagues. If I had to guess, I'd say it's related to the fact that LSTMs are really tough to train. Which is not surprising if you think of them as incredibly deep networks (depth = sequence length) where the weights are re-used at every layer. Those few parameters get re-used for a lot of things. Transfer learning necessarily means being able to _quickly_ retrain a network on a new task. But training is never fast with an LSTM. That's just my speculation though.
Question regarding 26:27: if I plan on analysing time series sensor data, should I stick to LSTMs, or is the transformer model a good choice for time series data?
You need to use an LSTM for time series. Because in transformers, it's all about attention and positional information, which has to be learnt. Whereas in time series, it's all about trends and patterns, which requires the model to remember a complete sequence of data points.
The primary advantages of the transformer are attention and positional encoding, which are quite useful for translation because grammar differences between languages can reorder the input and output words. But time series sensor data is not reordered (comparing output with input)! An RNN such as the LSTM is a suitable choice for analysing such data.
This is such a rich talk. He should definitely change the title. I've searched far and wide for a lucid explanation of LSTMs; this is the best online, but it doesn't seem so because of the odd title.
Thanks so much for the video. Can I ask if anyone knows where I can find a pre-trained model to identify numbers in an image, from 0 to 100? Not specifically handwritten, and they can be at any position in the image. Thanks in advance.
Almost always, these videos on YouTube are a waste of time: just talking, no real examples or practical material. All the same, too much talk and nothing concrete. If it were solid theory, fine, but that's not what they offer.
The presentation is good, but the presenter makes too many unnecessary jokes and murmurs too much. It is difficult to follow without pausing, because attention is all I need, and this kind of presenting disturbs it.
While I appreciate the association, what did I say to imply you can't retrain on a large corpus? In the summary "Key Advantages of Transformers" I wrote "Can be trained on unsupervised text; all the world's text data is now valid training data."
(Leo here - sorry if you see this twice, but YT is blocking comments from my account for some reason.) Yes! My favorite paper on this topic is from Bengio's group, which uses Unitary weight matrices: complex-valued, but constrained to have eigenvalues of magnitude exactly 1. arxiv.org/abs/1511.06464 A simpler approach is to just initialize the weight matrices with real-valued orthonormal matrices; a good summary is at smerity.com/articles/2016/orthogonal_init.html But overall I think the key thing is that not long after these ideas were being explored, Transformers came along, which are simpler, more robust, and have plenty of other advantages. Critically IMHO, the training depth doesn't scale with the sequence length, which makes convergence much simpler.
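For the simpler orthonormal-initialization route mentioned above, here is a minimal sketch using Gram-Schmidt (in practice you'd take the Q factor of a QR decomposition from a linear algebra library; the matrix size here is an arbitrary illustration):

```python
import math
import random

def orthonormalize(rows):
    """Gram-Schmidt: turn a list of row vectors into an orthonormal set."""
    basis = []
    for r in rows:
        # Subtract the projections onto all earlier basis vectors.
        for b in basis:
            p = sum(x * y for x, y in zip(r, b))
            r = [x - p * y for x, y in zip(r, b)]
        # Normalize to unit length.
        n = math.sqrt(sum(x * x for x in r))
        basis.append([x / n for x in r])
    return basis

random.seed(0)
W = orthonormalize([[random.gauss(0, 1) for _ in range(4)] for _ in range(4)])
# Rows now have unit norm and are mutually orthogonal, so repeated
# multiplication by W neither explodes nor vanishes the signal.
```

The point of the orthonormal constraint is exactly the comment's gradient argument: an orthonormal matrix preserves vector norms, so stacking many such layers (or time steps) keeps activations and gradients at a stable scale.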
Supposing I am using a net to approximate a real-world physics ODE with time series data, is the Transformer still the best choice in this case?
I'm not sure. I have barely read any papers on this kind of modeling. I will say that a wonderful property of transformers is that they can learn to analyze arbitrary dimensional inputs - it's easy to create positional encodings for 1D inputs (sequence), or 2D (image), or 3D, 4D, 5D, etc. Some physics modeling scenarios will want this kind of input. If your inputs are purely 1D, you could use older NN architectures, but in 2023 there are very few situations where I'd choose an LSTM over a transformer. (e.g. if you need an extremely long time horizon.) -Leo
I had a tutorial a few hours ago on how to build an LSTM network using TF only, and it left me feeling completely stupid. Thank you for showing there is a better way.
Does multi-headed attention + positional encoding work as well as or better than a plain vanilla LSTM on numeric input (float or integer) vectors/tensors? Your input is highly appreciated.
Not an expert here, but the way attention works is closely tied to the way nearby words are relevant to each other: for example, a pronoun and its relevant noun. Multi-headed attention would identify more such abstract relationships between words in a window. So if the numeric input sequence has a set of consistent relationships among its members, then attention would help embed more relational info in the input data, so that processing it becomes easier when honouring this relational info.
I've never understood the use of sin and cos for positional encoding. Just giving it a linear function would also have carried positional information: 0.2 > 0.1, so it must come after 0.1.
You are correct - a simple linear function would give the neural net all the positional encoding it needs, and it could figure out all the subsequent relationships from there. But many of those useful relationships would require several/many layers of nonlinear transformations for the NN to figure out -- e.g. if you need to learn a detector like "0.2 < x-y < 0.25" that's necessarily going to take at least two layers simply because each ReLU can only do so much work. Instead, the sin & cos encode more information that we're pretty sure is going to be useful, and thus save the NN the effort of figuring this stuff out itself. That is, the sin/cos encoding make arbitrary-distance positional comparison relationships instantly linearly separable in a single layer, and thus in a sense it "pre-trains" the net for what it would have to learn itself if you just gave it a simple linear positional encoding. HTH.
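To make the sin/cos encoding described above concrete, here is a minimal sketch (d_model=8 is an arbitrary illustrative size; real models use hundreds of dimensions):

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal positional encoding: even dims use sin, odd dims use cos,
    with geometrically spaced wavelengths from 2*pi up to 10000*2*pi."""
    enc = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

# Nearby positions get nearby encodings, and a fixed positional offset
# corresponds to a fixed rotation of each (sin, cos) pair, which is what
# makes relative-distance comparisons available to a single linear layer.
e0, e1 = positional_encoding(0), positional_encoding(1)
```

With a single scalar position instead, the network would have to learn nonlinear combinations over several layers to recover the same relative-offset information.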
20:00 If I multiply the output by a small scaling factor λ₁ (e.g. 0.01) before feeding it to the activation function, the sigmoid will be sensitive to the difference between, say, 5 and 50. Similarly, if I multiply the sigmoid output by another scaling factor λ₂ (e.g. 100), I can get an activated output ranging between 0 and 100. Is that a better solution than ReLU, which has no cap at all?
The problem with that approach is that in the very middle of the range the sigmoid is almost entirely linear - for input near zero, the output is 0.5 + x/4. And neural networks need nonlinearity in the activation to achieve their expressiveness. Linear algebra tells us that if you have a series of linear layers they can always and exactly be compressed down to a single linear layer, which we know isn't a very powerful neural net.
@@xruan6582 Right! That's the funny thing about ReLU - it either "does nothing" (leaves the input the same) or it "outputs nothing" (zero). But by sometimes doing one and sometimes doing the other, it is effectively making a logic decision for every neuron based on the input value, and that's enough computational power to build arbitrarily complex functions. If you want to follow the biological analogy, you can fairly accurately say that each neuron in a ReLU net is firing or not, depending on whether the weighted sum of its inputs exceeds some threshold (either zero, or the bias if your layer has bias). And then a cool thing about ReLU is that they can fire weakly or strongly.
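The fire-or-not behaviour described in this reply is small enough to state directly in code:

```python
def relu(x):
    """ReLU: pass the pre-activation through unchanged, or output zero."""
    return max(0.0, x)

# The neuron "fires" (weakly or strongly) when its weighted-sum input is
# positive, and is silent otherwise. That per-neuron on/off decision is the
# nonlinearity that lets stacked layers build arbitrarily complex functions.
assert relu(3.7) == 3.7   # fires, and preserves the firing strength
assert relu(0.1) == 0.1   # fires weakly
assert relu(-2.0) == 0.0  # does not fire
```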
It depends on what the task is, but basically, yeah. The biggest problem for AI at the moment is doing new stuff; it's terrible at doing things it hasn't done/seen almost exactly before.
Totally! I always listen to people talking at 1.25x to 1.5x if I can. Humans are much better at parsing language quickly than generating it. And I was umming and awwing a lot which lowers the information density.
Good to see Adam Driver working on transformers 😁
That's one of the best deep learning related presentations I've seen in a while! Not only introduced transformers but also gave an overview of other NLP strategies, activation functions and also best practices when using optimizers. Thank you!!
I second this! The talk was such a joy to listen to
in about 30 minutes!!!!
Agree -- I've watched half a dozen videos on transformers in the past 2 days, I wish I'd started with Leo's.
For anyone feeling overwhelmed: it is completely reasonable, as this video is just a 28-minute recap for experienced machine learning practitioners, and a lot of them are just spamming the top comments with "This is by far the best video", "Everything is clear with this single video" and so on.
Sounds like it is my lucky day then, for me to jump from noob to semi-non-noob by gathering thinking patterns from more-advanced individuals. I will fill in the swiss cheese holes of crystallized intelligence later by extrapolating out from my current fluid intelligence level... or something like that. Sorry I'll see myself out.
I was about to make a remark about the presenter speaking like a machine gun at the start. I can't even follow such a pace even in my native language, on a lazy Sunday afternoon with a drink in my hand. Who cares what you say if no one manages to understand it??? Easy, easy boy... slow down, no one cares how fast you can speak, what matters is what you are able to explain. (so the others understand it).
@@svily0 >I can't even follow such a pace even in my native language
maybe that's the issue?
@@ВиталийБуланенков Well, could as well be, but on the fringe side I have a masters degree. Could not be just that. ;)
This is by far the best comment, Everything is clear after reading this single comment! Thank you all
Transformers seem overly prone to recency bias.
How?
Thank you for this concise and well-rounded talk! The pseudocode example was awesome!
It's hard to overstate just how much this topic has(is) transformed the industry. As others have said, understanding it is not easy because there are a bunch of components that don't seem to align with one another, and overall the architecture is such a departure from the most traditional things you are taught. I myself have wrangled with it for a while and it's still difficult to fully grasp. Like any hard problem, you have to bang your head against it for a while before it clicks.
"has(is)"??
Great talk. It's always thrilling to see someone who actually knows what they're supposedly presenting.
12:56, the review of the pseudocode of the attention mechanism, was what finally helped me understand it (specifically the meaning of the Q, K, V vectors); that's what other videos were lacking. In the second outer for loop, I still don't fully understand why it loops over the length of the input sequence. The output can be of a different length, no? Maybe this is an error. Also, I think he didn't mention the masking of the remaining output at each step so the model doesn't "cheat".
For every word we compute its query, key, and value vectors, so we need to loop through our sequence.
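As a sketch of that loop structure in pure Python (the learned Q/K/V projection matrices are omitted here for brevity; the vectors are passed in directly):

```python
import math

def attention(seq_q, seq_k, seq_v):
    """Toy scaled dot-product attention over lists of vectors.
    Outer loop: one pass per query position. Inner loop: score that
    query against every key, softmax, then blend the value vectors."""
    d = len(seq_k[0])
    out = []
    for q in seq_q:
        # Relevance of this query to every key (scaled dot products).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq_k]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output = relevance-weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq_v))
                    for j in range(len(seq_v[0]))])
    return out
```

This also answers the output-length question above: in encoder self-attention, seq_q, seq_k, and seq_v all come from the same input, so both loops run over the input length; in the decoder's cross-attention, seq_q comes from the output side and can have a different length from seq_k/seq_v.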
1. At 10:17, the speaker says all we need is the encoder part for a classification problem. Is this true? How about BERT: when we use BERT encodings for classification, say sentiment analysis, is it only the encoder part that has done the work?
2. At 12:25, the slide is really clear in explaining relevance[i,j], but the example is translation, so clearly it is not only the "encoder part". In the encoder part, how is relevance[i,j] computed? What is the difference between key and value? It seems they are all values of the input vector. Aren't they the same in the encoder part?
Thank you!
Good question... Key and Value seem symmetric. I was expecting symmetry in a self-attention model, but I can't quite understand how this works with the key/value analogy.
are there really a half million of you out there that understand this?
Wonderfully clear and precise presentation. One thing that tripped me up, though, is this formula at 4 minutes in:
Hi+1 = A(Hi, xi)
Seems this should rather be:
Hi+1 = A(Hi,xi+1)
which might be more intuitively written as:
Hi = A(Hi-1,xi)
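The corrected recurrence H_i = A(H_{i-1}, x_i) can be sketched as a loop, with a toy stand-in for the learned transition A (the 0.5 decay is purely illustrative):

```python
def A(h, x):
    """Stand-in for the learned RNN transition function."""
    return 0.5 * h + x

def run_rnn(xs, h0=0.0):
    """Unroll the recurrence: each new state depends on the previous
    state and the *current* input, which is the point of the correction."""
    h = h0
    states = []
    for x in xs:
        h = A(h, x)
        states.append(h)
    return states

print(run_rnn([1.0, 2.0]))  # [1.0, 2.5]
```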
This is like 90% of what I remember from my NLP course with all the uncertainty cleared up, thanks!
You folks need to look into asymptotics and Padé approximant methods, or, for functions of many variables as ANNs are, the generalized Canterbury Approximants. There is not yet a rigorous development in information-theoretic terms, but Padé summations (essentially continued-fraction representations) are known to yield rapid convergence to correct limits for divergent Taylor series in non-converging regions of the complex plane. What this boils down to is that you only need a fairly small number of iterations to get very accurate results if you only require approximations. To my knowledge this sort of method is not being used in deep learning, but it has been used by physicists in perturbation theory. I think you will find it extremely powerful in deep learning. Padé (or Canterbury) summation methods, when generalized, are a way of extracting information from incomplete data. So if you use a neural net to get the first few approximants, and assume they are modelling an analytically continued function, then you have a series (the node activation summation) you can Padé-sum to extract more information than you'd be able to otherwise.
RIP LSTM 2019, she/he/it/they would be remembered by....
Not everyone will get this
Still, LSTMs work better with long texts. They have their own use cases.
@@dineshnagumothu5792 you obviously didn't get it. it is "DEAD", lol. RIP LSTM.
Interesting: looks a lot like my signals class, on how to implement various filters on a DSP.
@10:30 - Attention is all you need -- Multi Head Attention Mechanism --
Keep the microphone to your mouth so hbpffpfpfph because that's really annoying right in pfpppbhbhffff of a sentence.
Presentation: perfect
Explanation: perfect
me (every 10 mins): " but that belt tho... ehh PERFECT!"
All I want is his level of humbleness and knowledge.
Don't just want it, make it happen then. You could literally do this.
Find the humility to get your head down and acquire the knowledge. Let the universe do the rest.
I was trying to use a similar super-low-frequency sine trick for audio sample classification (to give the network more clues about attack/sustain/release positioning). Never did I know that one could use several of those in different phases. Such a simple and beautiful trick!
The presentation is awesome
Leo is an excellent professor. He explains difficult concepts in an easy-to-understand way.
When using acronyms, it is not good to LRTD. And do not ever GLERD. People won't understand the SMARG. It does help if you ETFM (Explain the Fu*king Meaning) as you write.
I love this presentation.
It doesn't assume that the audience knows far more than is necessary, goes through explanations of the relevant parts of Transformers, notes shortcomings, etc.
Best slideshow I've seen this year, and it's from over 3 years ago.
I've always wondered how standard ReLUs can provide non-trivial learning if they are essentially linear for positive values. I know that with standard linear activation functions any deep network can be reduced to a single-layer transformation. Is it the discontinuity at zero that stops this being the case for ReLU?
Exactly. Think of it like this. A matrix-vector multiplication is a linear transformation. That means it rotates and scales its input vector. That is why you can write two of these operations as a single one (A_matrix * B_matrix * C_vec = D_matrix * C_vec) and also why you can add scalar multiplications in between (which is what a linear activation would do, and is just a scaling operation on the vector). But if you only scale some of the entries of the vector (ReLU), that does not work anymore.
If you take a pen, rotating and scaling it preserves your pen, but if you want to scale only parts of it, you have to break it.
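That collapse of stacked linear layers, and how a ReLU in between breaks it, can be checked directly in a few lines of pure Python (the 2x2 matrices here are arbitrary examples):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    """Apply matrix A to column vector v."""
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
v = [1.0, -1.0]

# Two linear layers collapse into one: A(Bv) == (AB)v.
assert matvec(A, matvec(B, v)) == matvec(matmul(A, B), v)

# A ReLU between the layers breaks the collapse: no single matrix D
# satisfies Dv == A(relu(Bv)) for all v.
relu = lambda u: [max(0.0, x) for x in u]
assert matvec(A, relu(matvec(B, v))) != matvec(matmul(A, B), v)
```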
@@lucast2212 Cheers! good explanation, thanks.
Four years ago! Shocking.
At 4:15 shouldn't it be
H[i+1] = A(H[i] ; x[i+1]) ?
YES! Ooops, sorry about that. Good catch.
The world deserves more lectures like this one. I don't need examples of how to tune a U-net, but rather this overview of a huge research space and the ideas underneath each group.
(Sorry for the lack of technical terms.) I did not completely get how transformers work with regard to positional information. Isn't X_in the information from the previous hidden layer? That is not enough for the network, because the input embeddings lack any temporal/positional information, right? But why not just add one new linear temporal value to the embeddings instead of many sine waves at different scales?
This video is incredibly good: short and clear enough. May I add a Chinese translation?
If you have translated it into Chinese, please let me know and give me the link, thank you
That would be great! I don't know of any YouTube feature to delegate that permission, but if there is one, let us know how. 谢谢你的帮助!
LSTMs dead? Yet the simple parity-bit problem cannot be solved by transformers, but can be by LSTMs :)
Nice video, but I was forced to watch at 2x speed trying not to fall asleep.
9:07, could you please explain what you mean by "Needs specific labelled dataset for every task"?
I literally just trained an LSTM network (Char RNN based on Karpathy's github.com/sherjilozair/char-rnn-tensorflow) by just giving it unlabelled text.
That sentence isn't accurate, especially out of context. You can always train LSTMs unsupervised like you did. But the point I'm explaining is that "transfer learning never really worked", which is to say you usually can't reuse a pre-trained model on a new problem.
This finally made it clear to me why RNNs were introduced! Thanks for sharing.
The Eng:French matrix/diagram from 11:35 shows attention between an English and a French vector. But that would involve both the ENCODing and DECODing, and how they interact.
Whereas the speaker is discussing *only* the internals of the ATTENTION mechanism in the Encoder at this point.
I'd really like to see a similar matrix/diagram illustrating the use of attention WITHIN the ENCODing pass. It wouldn't involve French at all at this point, because the ENCODER hasn't even reached the shared representation yet: the machine's version of the input that comes AFTER the ENCODE but BEFORE the DECODE.
==> And you're not alone; I see this same vagueness elsewhere in other accounts of Transformer processing...
==> But then, most likely I just misunderstand...
7:25, can you explain what he means by two hidden states?
Literally it means that at each time step, there are two different state vectors passed from one LSTM cell to the next in the time sequence. What they each do or how they are distinct is not entirely clear to me. But structurally, the top one in the diagram (usually called C) acts like a ResNet in that new information is only added to it at each time step, making the gradient path simpler, and training easier. The bottom one (usually called h) is more like a vanilla RNN, responding quickly and directly to the input at that time step. So it's probably reasonable to think of them as representing slower & faster moving changes in the state - capturing interactions that are either closer together in the inputs or stretch over longer ranges.
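A toy scalar sketch of one LSTM step, with a single made-up shared weight just to show the two state paths (a real cell has separate learned weight matrices for each gate):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(c_prev, h_prev, x, w=0.5):
    """One toy LSTM step. c is the slow, ResNet-like path: it is only
    updated additively. h is the fast path: recomputed from scratch
    each step from the current input and the squashed cell state."""
    f = sigmoid(w * (h_prev + x))         # forget gate
    i = sigmoid(w * (h_prev + x))         # input gate
    g = math.tanh(w * (h_prev + x))       # candidate new information
    c = f * c_prev + i * g                # additive update: easy gradient path
    o = sigmoid(w * (h_prev + x))         # output gate
    h = o * math.tanh(c)                  # fast state, rebuilt every step
    return c, h
```

The `c = f * c_prev + i * g` line is the ResNet-like behaviour the comment describes: gradients flow through it mostly unchanged, which is what makes training easier than in a vanilla RNN.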
Ha ha, is the constitution spam. Subtle.
📺💬 Yui krub I give you ice cream 🍦 when we plot sine wave in the word sentiment we still see some relationship that can be converted into word sequences in the sentence.
🐑💬 It is possible and what you to do with the time domain when input is in bunches of frequencies with the time-related relationship.
🥺💬 I hope they can mixed together with embedding or shuffling but remain the information within the same set of the inputs.
🐑💬 You plot the Sigmoid function, Tanh and reLU and yes you can do a direct compares the estimated values within the same time domain.
📺💬 Now give me some see what me dress like ⁉️ 👧💬 There are many points one significant see is low precisions network machine when execution with less precision but high accuracy.
📺💬 Words CNN it can do some tasks better for di-grams tri-grams tasks it is working as CNN layer. 🐑💬 That is meaning we can add label or additional data into it ⁉️
👧💬 Do you mean the scores, good, bad or some properties you earn from other networks or training with concatenated layers ⁉️
🧸💬 You cannot copies and separated each parts when they are working.
My greatest successes are blending traditional time series modeling with Transformer like Wavelet Denoised ARTFIMA + TFT
Relevance is just how often a word appears in the input?
NM on this. I looked it up.
The answer is similarity of tokens in the embedding - ones with higher similarity get more relevance.
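The similarity-based relevance described above is exactly what scaled dot-product attention computes; a minimal NumPy sketch (not the talk's pseudocode):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: scores are dot-product similarities
    between query and key vectors, softmaxed into relevance weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # higher similarity -> more relevance
    return weights @ V                  # relevance-weighted mix of the values
```

A query that points in the same direction as a key gets the largest weight on that key's value, which is the "higher similarity gets more relevance" behavior in code.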
Typical example of a bad lecture. He only shows stuff without introducing or explaining what he is showing (what is that graph about? what do the axes or the arrows mean?), talks about it, and goes on to the next slide.
The only useful part was the self-attention, which you ruined... I couldn't understand anything from your descriptions.
I came here for the "Transformers movie" and ended up watching something I didn't understand s*t from;
the whole video was like an alien language to me.
Well, the problem is that you are using AI to make it learn, wasting time and resources, rather than using machine learning as an optimizer, which is a better use of neural networks!
What I mean is that most don't get when you are supposed to use a neural network as an AI that learns from data, and when to use a neural network as a machine-learning optimizer!!
You need an engineer for that, not a PhD IT professor xD. Stop wasting your time and hire more engineers!!!
When I want to use transformers for time series analysis and the dataset includes individual-specific effects, what do I do? In this case would the only possibility be to match the batch size to the length of each individual's data? Right?
No, batch and time will be different tensor dimensions. If your dataset has 17 features, and the length is 100 time steps, then your input tensor might be 32x100x17 with a batch size of 32.
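A shape sketch of the reply above (NumPy as a stand-in for the actual framework tensors; the 32/100/17 numbers are just the example from the reply):

```python
import numpy as np

batch_size, seq_len, n_features = 32, 100, 17

# Batch and time are separate dimensions of one input tensor:
# (batch, time, features) -- no need to tie batch size to sequence length.
x = np.zeros((batch_size, seq_len, n_features))
print(x.shape)  # (32, 100, 17)
```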
If only he'd discovered this before thinking WeWork up.
This aged like fine wine.
Very nice recap of Transformers and what sets them apart from RNNs! Just one little remark: you are not doing things in N² for the transformer, since you fixed your N to be at most some maximum sequence length.
You can now set this N to be a much bigger number as GPUs have been highly optimized to do the according multiplications. However, for long sequence lengths, the quadratic nature of an all-to-all comparison is going to be an issue nonetheless.
Does anybody know why transfer learning never really worked with LSTMs? Any links or papers on that?
I've never read any papers about this - just my personal experience and talking to colleagues. If I had to guess, I'd say it's related to the fact that LSTMs are really tough to train. Which is not surprising if you think about them as incredibly deep networks (depth = sequence length) but the weights are re-used at every layer. Those few parameters get re-used for a lot of things. Transfer learning necessarily means being able to _quickly_ retrain a network on a new task. But training is never fast with an LSTM. That's just my speculation though.
"LSTM is just like ResNet" ..... MAYBE... Just maybe.... it's the other way around :-D
Maybe people who designed ResNet can tell if they thought of LSTM when designing the network.
This is brilliant.
Dude, can you share your PPT or PDF? Thanks in advance!
question regarding 26:27
So if I plan on analyzing time series sensor data, should I stick to LSTM, or is the transformer model a good choice for time series data?
I could use an answer to this question as well
@@isaacgroen3692 Damn I have the same question
You need to use LSTM for time series.
Because in transformers, it's all about attention, or positional intelligence, which has to be learnt.
Whereas in time series, it's all about the trend and patterns, which requires the model to remember a complete sequence of data points.
@@abdulazeez7971 thanks for the info :)
The primary advantages and benefits from the transformer are the attention and positional encoding, which are quite useful for translation because the grammar differences in different languages may cause the disorder of the input and output words. But for time series sensor data, they are not disordered (comparing output with input)! RNN, such as LSTM is a suitable choice to perform analysis for such data.
Uploaded a month ago but has just 150 views and just 24 subs? WTH?
@@vothka205 But ML uses cats and dogs too!
I need that belt.
Best transformer presentation I’ve seen hands down. Nice job!
Thanks for this! It gets to the heart of the matter quickly and in an easy to grasp way. Excellent.
This is such a rich talk. He should definitely change the title. I've searched far and wide for a lucid explanation of LSTM - this is the best online, but it doesn't seem so because of the odd title.
Thanks so much for the video. Can I ask if anyone knows where I can find a pre-trained model to identify numbers in images, from 0 to 100? Not handwritten specifically, and they can be anywhere in the image?
Thanks in advance.
Almost always, these videos on YouTube are a waste of time: just talking and no real examples or practical stuff. All the same, too much talk, nothing real. If they had some theory, OK, but that's not even what they have.
Presentation is good, but the presenter makes too many unnecessary jokes and murmurs too much. It is difficult to follow without pausing, because attention is all I need and this kind of presenting disturbs it.
What a nutcase
ooooh I so want to see a documentary about this ==> @25:20
Leo Dirac: Can't pretrain on large corpus
Sam Altman: Hold my beer...
While I appreciate the association, what did I say to imply you can't retrain on a large corpus? In the summary "Key Advantages of Transformers" I wrote "Can be trained on unsupervised text; all the world's text data is now valid training data."
This beautiful speech is before OpenAI GPT, the world badly needs an update
Unfortunately OpenAI is closed-source by now; people cannot openly talk about its internal structure anymore.
I'll train my transformer with the comments below ^^
Amazing talk. It would be of great help if you can post link to the documents.
I was curious about machine learning and feel like I'm getting a lesson in how to speak in hieroglyphics.
Good summary of the RNN models. This video is not for newbies, though.
11:29 was that French? Nice explanation tho!
Yeah he's reading the translation on the left side.
@@NkThor *badly* reading the translation
@@LeoDirac thanks for giving us a facet on which not to feel inferior :)
No, don’t care about them.
None of that could have made sense and I wouldn't know.
wrong, not at all like word2vec.
How do I implement transformers and LSTM in C++?
Wow... that was a quick summarization of all the NN research things in past many decades...
Did anyone try scaling the matrices so that the eigenvalue is exactly 1?
(Leo here - sorry if you see this twice, but YT is blocking comments from my account for some reason.)
Yes! My favorite paper on this topic is from Bengio's group which uses Unitary weight matrices, which are complex-valued, but constrained to have their eigenvalues exactly as 1. arxiv.org/abs/1511.06464 A simpler approach is to just initialize the weight-matrices with real-valued orthonormal matrices, a good summary at smerity.com/articles/2016/orthogonal_init.html
But overall I think the key thing is that not long after these ideas were being explored, Transformers came along, which are simpler, more robust, and have plenty of other advantages. Critically IMHO, the training depth doesn't scale by the sequence length, which makes convergence much simpler.
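The simpler approach mentioned above (orthonormal initialization of the recurrent weights) can be sketched in a few lines of NumPy; this is my own illustrative sketch, not code from either linked reference:

```python
import numpy as np

def orthogonal_init(n, rng=None):
    """Initialize a square recurrent weight matrix to be orthonormal,
    so all of its eigenvalue magnitudes are exactly 1 and repeated
    multiplication neither explodes nor vanishes."""
    rng = rng or np.random.default_rng()
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix column signs so the result is uniformly distributed
    # over orthogonal matrices, not biased by QR's sign convention.
    return q * np.sign(np.diag(r))
```

Since `W @ W.T == I` for such a matrix, applying it repeatedly preserves vector norms, which is exactly the property that tames the recurrent gradient path.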
@@seattleapplieddeeplearning, Thanks.
I stopped after about 5 minutes.. Too much for me as this stage!
Supposing I am using a net to approximate a real-world physics ODE with time series data, is the Transformer still the best choice in this case?
I'm not sure. I have barely read any papers on this kind of modeling. I will say that a wonderful property of transformers is that they can learn to analyze arbitrary dimensional inputs - it's easy to create positional encodings for 1D inputs (sequence), or 2D (image), or 3D, 4D, 5D, etc. Some physics modeling scenarios will want this kind of input. If your inputs are purely 1D, you could use older NN architectures, but in 2023 there are very few situations where I'd choose an LSTM over a transformer. (e.g. if you need an extremely long time horizon.) -Leo
@@seattleapplieddeeplearning Thanks for your reply, this realy helps me.
Schmidhuba comin to get ya !
Any chance this guy is related to Paul Dirac?
High probability
"Then you call fit and that's it"
I had a tutorial a few hours ago on how to build an LSTM network using TF only; it left me feeling completely stupid. Thank you for showing there is a better way.
Great video. Thanks
Excellent presentation! Perfect!
Do multi-headed attention + positional encoding work equally well as, or better than, a plain vanilla LSTM, but on numeric input (float or integer) vectors/tensors?
Your input is highly appreciated
Not an expert here, but the way attention works is closely tied to the way nearby words are relevant to each other: for example, a pronoun and its relevant noun. Multi-headed attention would identify more such abstract relationships between words in a window. So if the numeric input sequence has a set of consistent relationships among all its members, then attention would help embed more relational info into the input data, so that processing it becomes easier when honouring this relational info.
very impressive presentation. thank you.
I didn't know that Zlatan also teaches deep learning.
Me, writing my bachelor's thesis partly on LSTMs: FUCK
LSTMs still have a very important place in deep learning. Just not for NLP.
This was more than meets the eye
I've never understood the use of sin and cos for positional encoding. Just giving it a linear function would have also carried positional information: 0.2 > 0.1, so it must come after 0.1.
You are correct - a simple linear function would give the neural net all the positional encoding it needs, and it could figure out all the subsequent relationships from there. But many of those useful relationships would require several/many layers of nonlinear transformations for the NN to figure out -- e.g. if you need to learn a detector like "0.2 < x-y < 0.25" that's necessarily going to take at least two layers simply because each ReLU can only do so much work. Instead, the sin & cos encode more information that we're pretty sure is going to be useful, and thus save the NN the effort of figuring this stuff out itself. That is, the sin/cos encoding make arbitrary-distance positional comparison relationships instantly linearly separable in a single layer, and thus in a sense it "pre-trains" the net for what it would have to learn itself if you just gave it a simple linear positional encoding. HTH.
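The sin/cos scheme discussed above is the sinusoidal positional encoding from "Attention Is All You Need"; a minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (assumes even d_model):
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / 10000 ** (i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe
```

Because sin and cos of a shifted angle are a fixed linear combination of the unshifted pair, PE(pos + k) is a linear function of PE(pos) for any fixed offset k, which is what makes arbitrary-distance comparisons linearly separable in a single layer.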
20:00 If I multiply the output by a small scaling factor λ₁ (e.g. 0.01) before feeding it to the activation function, the sigmoid will be sensitive to the difference between, say, 5 and 50. Similarly, if I multiply the sigmoid output by another scaling factor λ₂ (e.g. 100), I can get an activated output ranging between 0 and 100. Is that a better solution than ReLU, which has no cap at all?
The problem with that approach is that in the very middle of the range the sigmoid is almost entirely linear - for input near zero, the output is 0.5 + x/4. And neural networks need nonlinearity in the activation to achieve their expressiveness. Linear algebra tells us that if you have a series of linear layers they can always and exactly be compressed down to a single linear layer, which we know isn't a very powerful neural net.
@@LeoDirac Relu is linear from 0 to ∞
@@xruan6582 Right! That's the funny thing about ReLU - it either "does nothing" (leaves the input the same) or it "outputs nothing" (zero). But by sometimes doing one and sometimes doing the other, it is effectively making a logic decision for every neuron based on the input value, and that's enough computational power to build arbitrarily complex functions. If you want to follow the biological analogy, you can fairly accurately say that each neuron in a ReLU net is firing or not, depending on whether the weighted sum of its inputs exceeds some threshold (either zero, or the bias if your layer has bias). And then a cool thing about ReLU is that they can fire weakly or strongly.
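The near-linearity of the sigmoid around zero claimed a couple of replies up (sigmoid(x) ≈ 0.5 + x/4) is easy to check numerically; a quick sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Near zero the sigmoid is almost exactly its tangent line 0.5 + x/4,
# so stacking small-input sigmoid layers adds essentially no nonlinearity.
for x in (0.01, 0.05, 0.1):
    linear_approx = 0.5 + x / 4
    assert abs(sigmoid(x) - linear_approx) < 1e-4
```

This is why squashing inputs with a tiny λ₁ defeats the purpose: the activation then operates only in its linear region, and a stack of linear layers collapses to a single linear layer.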
At 5:40, I think determinant is the right word instead of eigenvalue...
Still trying to do what a 10-year-old can do. AGI is safe for now.
Last time I checked, 10-year-olds can't beat world champions in chess, Go, or StarCraft.
It depends on what the task is, but basically, yeah. The biggest problem for AI at the moment is doing new stuff; it's terrible at doing things it hasn't done/seen almost exactly before.
Linear algebra of variable dimensions? Fock Spaces. Known for 90 years. en.wikipedia.org/wiki/Fock_space
Best Transformer explanation ever.
where can I find the presentation doc of this talk amigos? thanks
Great talk, had to watch at 1.25x though.
He already talks as if he's on steroids :D Can't imagine I'd understand anything he says at 1.25x lol
Totally! I always listen to people talking at 1.25x to 1.5x if I can. Humans are much better at parsing language quickly than generating it. And I was umming and awwing a lot which lowers the information density.
Thanks! Really good compare/contrasting.
6:41 hahaha this is GODLIKE! The fact that Schmidhuber is on there makes the joke even better!
I use Python as "pseudocode" in presentations too. Much more intuitive than the ALGOL style that has been standard for so long.
This helped me a ton to understand the basics. Thanks!