LSTM is dead. Long Live Transformers!

  • Published: 21 Nov 2024

Comments • 296

  • @FernandoWittmann
    @FernandoWittmann 4 years ago +359

    That's one of the best deep learning related presentations I've seen in a while! Not only introduced transformers but also gave an overview of other NLP strategies, activation functions and also best practices when using optimizers. Thank you!!

    • @ahmadmoussa3771
      @ahmadmoussa3771 4 years ago +5

      I second this! The talk was such a joy to listen to

    • @aashnavaid6918
      @aashnavaid6918 4 years ago +2

      in about 30 minutes!!!!

    • @jackholloway7516
      @jackholloway7516 2 years ago

      :¥£€€’

    • @jbnunn
      @jbnunn 1 year ago

      Agree -- I've watched half a dozen videos on transformers in the past 2 days, I wish I'd started with Leo's.

  • @vamseesriharsha2312
    @vamseesriharsha2312 4 years ago +67

    Good to see Adam Driver working on transformers 😁

  • @richardosuala9739
    @richardosuala9739 4 years ago +45

    Thank you for this concise and well-rounded talk! The pseudocode example was awesome!

  • @lmao4982
    @lmao4982 1 year ago +5

    This is like 90% of what I remember from my NLP course with all the uncertainty cleared up, thanks!

  • @_RMSG_
    @_RMSG_ 1 year ago +2

    I love this presentation
    Doesn't assume that the audience knows far more than is necessary, goes through explanations of relevant parts of Transformers, notes shortcomings, etc;
    Best slideshow I've seen this year, and it's from over 3 years ago

  • @monikathornton8790
    @monikathornton8790 4 years ago +53

    Great talk. It's always thrilling to see someone who actually knows what they're supposedly presenting.

  • @BartoszBielecki
    @BartoszBielecki 2 years ago +3

    The world deserves more lectures like this one. I don't need examples of how to tune U-Net, but an overview of this huge research space and the ideas underneath each group.

  • @ajitkirpekar4251
    @ajitkirpekar4251 4 years ago +25

    It's hard to overstate just how much this topic has transformed (and is transforming) the industry. As others have said, understanding it is not easy because there are a bunch of components that don't seem to align with one another, and overall the architecture is such a departure from the more traditional things you are taught. I myself have wrangled with it for a while and it's still difficult to fully grasp. Like any hard problem, you have to bang your head against it for a while before it clicks.

  • @cliffrosen5180
    @cliffrosen5180 1 year ago +1

    Wonderfully clear and precise presentation. One thing that tripped me up, though, is this formula at 4 minutes in:
    H_{i+1} = A(H_i, x_i)
    Seems this should rather be:
    H_{i+1} = A(H_i, x_{i+1})
    which might be more intuitively written as:
    H_i = A(H_{i-1}, x_i)
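
A minimal sketch of the recurrence in its last form, where each input is paired with the previous hidden state, may make the indexing concrete. The scalar state, the tanh cell, and the weights `w_h` and `w_x` are illustrative assumptions, not taken from the talk.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """One recurrence step H_i = A(H_{i-1}, x_i), using a scalar state for brevity."""
    return math.tanh(w_h * h_prev + w_x * x)

def run_rnn(xs, h0=0.0):
    """Fold the step function over the inputs and return every hidden state."""
    states = []
    h = h0
    for x in xs:  # x_i is consumed together with the previous state H_{i-1}
        h = rnn_step(h, x)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, -1.0])
```

Each hidden state depends on every earlier input through the chain of previous states, which is why the steps cannot be computed in parallel.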

  • @evennot
    @evennot 4 years ago +13

    I was trying to use a similar super-low-frequency sine trick for audio sample classification (to give the network more clues about attack/sustain/release positioning). I never knew that one could use several of those in different phases. Such a simple and beautiful trick.
    The presentation is awesome.

  • @BcomingHIM
    @BcomingHIM 4 years ago +52

    All I want is his level of humbleness and knowledge

    • @pazmiki77
      @pazmiki77 4 years ago +4

      Don't just want it, make it happen then. You could literally do this

    • @pi5549
      @pi5549 4 years ago

      Find the humility to get your head down and acquire the knowledge. Let the universe do the rest.

  • @Scranny
    @Scranny 4 years ago +14

    12:56 the review of the pseudocode of the attention mechanism was what finally helped me understand it (specifically the meaning of the Q, K, V vectors), which other videos were lacking. In the second outer for loop, I still don't fully understand why it loops over the length of the input sequence. The output can be of a different length, no? Maybe this is an error. Also, I think he didn't mention the masking of the remaining output at each step so the model doesn't "cheat".

    • @Splish_Splash
      @Splish_Splash 1 year ago

      For every word we compute its query, key and value vectors, so we need to loop through our sequence.
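
The loop structure discussed in this thread can be sketched in plain Python. This is an illustration under assumptions, not the pseudocode from the slide: the function names are hypothetical, and the `causal` flag stands in for the decoder-side masking raised in the question. The outer loop runs over queries and the inner loop over keys. In encoder self-attention both come from the same sequence, so they share a length; in encoder-decoder attention the queries come from the output side and may have a different length.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention written with explicit loops."""
    outputs = []
    for i, q in enumerate(queries):            # one output per query
        # With causal masking, position i may only look at positions 0..i.
        limit = i + 1 if causal else len(keys)
        scores = [dot(q, keys[j]) / math.sqrt(len(q)) for j in range(limit)]
        weights = softmax(scores)              # relevance of each key to this query
        out = [sum(w * values[j][d] for j, w in enumerate(weights))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = attention(x, x, x, causal=True)       # self-attention with masking
cross = attention(x[:2], x, x)                 # 2 queries attending over 3 keys
```

The number of outputs always equals the number of queries, regardless of how many keys and values there are.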

  • @ProfessionalTycoons
    @ProfessionalTycoons 4 years ago +25

    RIP LSTM 2019, she/he/it/they would be remembered by....

    • @mohammaduzair608
      @mohammaduzair608 4 years ago +3

      Not everyone will get this

    • @dineshnagumothu5792
      @dineshnagumothu5792 4 years ago +4

      Still, LSTM works better with long texts. It has its own use cases.

    • @mateuszanuszewski69
      @mateuszanuszewski69 4 years ago

      @@dineshnagumothu5792 you obviously didn't get it. it is "DEAD", lol. RIP LSTM.

  • @Achrononmaster
    @Achrononmaster 4 years ago +2

    You folks need to look into asymptotics and Padé approximant methods or, for functions of many variables such as ANNs, the generalized Canterbury approximants. There is not yet a rigorous development in information-theoretic terms, but Padé summations (essentially repeated fraction representations) are known to yield rapid convergence to correct limits for divergent Taylor series in non-converging regions of the complex plane. What this boils down to is that you only need a fairly small number of iterations to get very accurate results if you only require approximations. To my knowledge this sort of method is not being used in deep learning, but it has been used by physicists in perturbation theory. I think you will find it extremely powerful in deep learning. Padé (or Canterbury) summation methods, when generalized, are a way of extracting information from incomplete data. So if you use a neural net to get the first few approximants, and assume they are modelling an analytically continued function, then you have a series (the node activation summation) that you can Padé sum to extract more information than you'd be able to otherwise.

  • @maciej2320
    @maciej2320 10 months ago +1

    Four years ago! Shocking.

  • @Johnathanaa7
    @Johnathanaa7 4 years ago +13

    Best transformer presentation I’ve seen hands down. Nice job!

  • @briancase6180
    @briancase6180 3 years ago +2

    Thanks for this! It gets to the heart of the matter quickly and in an easy to grasp way. Excellent.

  • @ismaila3347
    @ismaila3347 4 years ago +8

    This finally made it clear to me why RNNs were introduced! Thanks for sharing

  • @riesler3041
    @riesler3041 3 years ago +2

    Presentation: perfect
    Explanation: perfect
    me (every 10 mins): " but that belt tho... ehh PERFECT!"

  • @SanataniAryavrat
    @SanataniAryavrat 4 years ago

    Wow... that was a quick summary of NN research over the past several decades...

  • @timharris72
    @timharris72 4 years ago +2

    This is hands down the best presentation on LSTMs and Transformers I have ever seen. The speaker is really good. He knows his stuff.

  • @sanjivgautam9063
    @sanjivgautam9063 4 years ago +232

    For anyone feeling overwhelmed, that is completely reasonable, as this video is just a 28-minute recap for experienced machine learning practitioners, and a lot of them are just spamming the top comments with "This is by far the best video", "Everything is clear with this single video" and all.

    • @adamgm84
      @adamgm84 4 years ago +23

      Sounds like it is my lucky day then, for me to jump from noob to semi-non-noob by gathering thinking patterns from more-advanced individuals. I will fill in the swiss cheese holes of crystallized intelligence later by extrapolating out from my current fluid intelligence level... or something like that. Sorry I'll see myself out.

    • @svily0
      @svily0 4 years ago +7

      I was about to make a remark about the presenter speaking like a machine gun at the start. I can't even follow such a pace even in my native language, on a lazy Sunday afternoon with a drink in my hand. Who cares what you say if no one manages to understand it??? Easy, easy boy... slow down, no one cares how fast you can speak, what matters is what you are able to explain. (so the others understand it).

    • @ВиталийБуланенков
      @ВиталийБуланенков 4 years ago +12

      @@svily0 >I can't even follow such a pace even in my native language
      maybe that's the issue?

    • @svily0
      @svily0 4 years ago +2

      @@ВиталийБуланенков Well, could as well be, but on the fringe side I have a masters degree. Could not be just that. ;)

    • @Nathan0A
      @Nathan0A 4 years ago +6

      This is by far the best comment, Everything is clear after reading this single comment! Thank you all

  • @rohitdhankar360
    @rohitdhankar360 1 year ago +1

    @10:30 - Attention is all you need -- Multi Head Attention Mechanism --

  • @8chronos
    @8chronos 4 years ago

    The best presentation/explanation to the topic I have seen so far. Thanks a lot :)

  • @dhariri
    @dhariri 3 years ago

    Excellent talk. Thank you @leopd !

  • @sarab9644
    @sarab9644 3 years ago +1

    Excellent presentation! Perfect!

  • @DeltonMyalil
    @DeltonMyalil 10 months ago +1

    This aged like fine wine.

  • @rp88imxoimxo27
    @rp88imxoimxo27 4 years ago +1

    Nice video but forced to watch on 2x speed trying not to fall asleep

  • @MrDudugeda2
    @MrDudugeda2 4 years ago

    this is easily the best NLP talk ive heard this year

  • @stevelamprou
    @stevelamprou 1 year ago

    I had a tutorial a few hours ago on how to build an LSTM network using only TF, and it left me feeling completely stupid. Thank you for showing there is a better way.

  • @DavidWhite679
    @DavidWhite679 4 years ago +9

    This helped me a ton to understand the basics. Thanks!

  • @ehza
    @ehza 3 years ago

    This is beautiful. Clear and concise!

  • @ax5344
    @ax5344 3 years ago +2

    1. @10:17, the speaker says all we need is the encoder part for classification problems. Is this true? What about BERT: when we use BERT encodings for classification, say sentiment analysis, is the encoder part all that is at work?
    2. @12:25, the slide is really clear in explaining relevance[i,j], but the example is translation, so clearly it is not in the "encoder part". In the encoder part, how is relevance[i,j] computed? What is the difference between key and value? It seems they are all values of the input vector. Aren't they the same in the encoder part?
    Thank you!

    • @trevorclark2186
      @trevorclark2186 2 years ago

      Good question... Key and Value seem symmetric. I was expecting symmetry in a self-attention model, but I can't quite understand how this works with the key/value analogy.
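
One way to see the distinction raised in this thread: in encoder self-attention, queries, keys and values are all computed from the same input tokens, but through three different learned projection matrices, so keys and values generally differ and the relevance matrix need not be symmetric. The sketch below assumes tiny hand-picked matrices `W_q`, `W_k`, `W_v` purely for illustration; in a real model they are learned.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def project(tokens, W_q, W_k, W_v):
    """Derive queries, keys and values from the SAME tokens via different projections."""
    queries = [matvec(W_q, t) for t in tokens]
    keys = [matvec(W_k, t) for t in tokens]
    values = [matvec(W_v, t) for t in tokens]
    return queries, keys, values

def relevance(queries, keys):
    """relevance[i][j] = dot product of query i with key j (before softmax)."""
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]

tokens = [[1.0, 0.0], [0.0, 1.0]]    # two toy token embeddings
W_q = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical projection matrices
W_k = [[1.0, 1.0], [0.0, 1.0]]
W_v = [[2.0, 0.0], [0.0, 2.0]]

queries, keys, values = project(tokens, W_q, W_k, W_v)
rel = relevance(queries, keys)
```

Keys decide how much attention a token receives; values decide what information is passed along once it is attended to.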

  • @driziiD
    @driziiD 1 year ago

    very impressive presentation. thank you.

  • @georgejo7905
    @georgejo7905 4 years ago +17

    Interesting, this looks a lot like my signals class: how to implement various filters on a DSP.

  • @asnaeb2
    @asnaeb2 4 years ago

    More vids please, this was really informative on what the actual SOTA is

  • @a_sun5941
    @a_sun5941 3 years ago

    Great Presentation!

  • @joneskiller8
    @joneskiller8 10 months ago +1

    I need that belt.

  • @shivapriyakatta4885
    @shivapriyakatta4885 4 years ago +1

    One of the best talks on Deep Learning!...thank you

  • @dgabri3le
    @dgabri3le 3 years ago

    Thanks! Really good compare/contrasting.

  • @MoltarTheGreat
    @MoltarTheGreat 4 years ago +1

    Amazing video, I feel like I actually have a more concrete grasp on how transformers work now. The only thing I didn't understand was the Positional Encoding but that's because I'm unfamiliar with signal processing.

  • @jung-akim9157
    @jung-akim9157 4 years ago

    This is one of the clearest and most informative presentations about NLP models and how they compare. Thank you so much.

  • @cafeinomano_
    @cafeinomano_ 1 year ago

    Best Transformer explanation ever.

  • @johnnyBrwn
    @johnnyBrwn 1 year ago

    This is such a rich talk. He should definitely change the title. I've searched far and wide for a lucid explanation of LSTM - this is the best online but doesn't seem as such due to odd title.

  • @tastyw0rm
    @tastyw0rm 1 year ago

    This was more than meets the eye

  • @jeffg4686
    @jeffg4686 8 months ago

    Relevance is just how often a word appears in the input?
    Never mind on this. I looked it up.
    The answer is similarity of tokens in the embedding: the ones with higher similarity get more relevance.

  • @GoogleUser-ee8ro
    @GoogleUser-ee8ro 1 year ago

    This beautiful speech is before OpenAI GPT, the world badly needs an update

    • @JohnNy-ni9np
      @JohnNy-ni9np 1 year ago

      Unfortunately OpenAI is closed source by now, so people cannot openly talk about its internal structure anymore.

  • @ooio78
    @ooio78 4 years ago +1

    Wonderful and educational, value to those who need it!

  • @Rhannmah
    @Rhannmah 4 years ago +2

    6:41 hahaha this is GODLIKE! The fact that Schmidhuber is on there makes the joke even better!

  • @amortalbeing
    @amortalbeing 4 years ago +1

    This was fantastic. really well presented.

  • @leromerom
    @leromerom 4 years ago +3

    Clear, precise, fluid thank you!

  • @gauravkantrod1205
    @gauravkantrod1205 4 years ago +1

    Amazing talk. It would be of great help if you could post a link to the documents.

  • @sainissunil
    @sainissunil 2 years ago

    This talk is awesome!

  • @lukebitton3694
    @lukebitton3694 4 years ago +4

    I've always wondered how standard ReLUs can provide non-trivial learning if they are essentially linear for positive values. I know that with standard linear activation functions any deep network can be reduced to a single-layer transformation. Is it the kink at zero that stops this being the case for ReLU?

    • @lucast2212
      @lucast2212 4 years ago +9

      Exactly. Think of it like this: a matrix-vector multiplication is a linear transformation, meaning it rotates and scales its input vector. That is why you can write two of these operations as a single one (A_matrix * B_matrix * C_vec = D_matrix * C_vec), and also why you can add scalar multiplications in between (which is what a linear activation would do; it is just a scaling operation on the vector). But if you only scale some of the entries of the vector (ReLU), that does not work anymore.
      If you take a pen, rotating and scaling it preserves your pen, but if you want to scale only parts of it, you have to break it.

    • @lukebitton3694
      @lukebitton3694 4 years ago +1

      @@lucast2212 Cheers! good explanation, thanks.
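
The point made in this thread can be checked with a toy example. The sketch below uses scalar "layers" with made-up weights (assumptions for illustration only): two stacked linear maps collapse into one, while a two-unit ReLU layer can compute the absolute value, which no single linear map can.

```python
def relu(x):
    return max(0.0, x)

def linear_stack(x, w1=2.0, w2=-3.0):
    """Two linear layers with no activation in between."""
    return w2 * (w1 * x)

def collapsed(x, w=-6.0):
    """The single linear layer equivalent to linear_stack (w = w1 * w2)."""
    return w * x

def relu_net(x):
    """Hidden units relu(+x) and relu(-x), summed by the output layer: computes |x|."""
    return relu(1.0 * x) + relu(-1.0 * x)

# A linear function f must satisfy f(a) + f(b) == f(a + b); relu_net does not.
additivity_gap = relu_net(1.0) + relu_net(-1.0) - relu_net(0.0)
```

Because the ReLU network violates additivity, it cannot be rewritten as one matrix multiplication, which is what gives depth its expressive power.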

  • @JeffCaplan313
    @JeffCaplan313 1 year ago +11

    Transformers seem overly prone to recency bias.

  • @mongojrttv
    @mongojrttv 3 years ago

    Was curious about machine learning and feel like I'm getting a lesson on how to speak in hieroglyphs.

  • @FrancescoCapuano-ll1md
    @FrancescoCapuano-ll1md 1 year ago

    This is outstanding!

  • @literallyjustsomegirl
    @literallyjustsomegirl 4 years ago +2

    Such a useful talk! TYSM 🤗

  • @oleksiinag3150
    @oleksiinag3150 4 years ago

    He is incredible
    One of the best presenters

  • @BlockDesignz
    @BlockDesignz 4 years ago +1

    This is brilliant.

  • @anewmanvs
    @anewmanvs 3 years ago

    Very good presentation

  • @Davourflave
    @Davourflave 4 years ago +5

    Very nice recap of Transformers and what sets them apart from RNNs! Just one little remark: you are not doing things in N^2 for the transformer, since you fixed your N to be at most some maximum sequence length.
    You can now set this N to be a much bigger number, as GPUs have been highly optimized to do the corresponding multiplications. However, for long sequence lengths, the quadratic nature of an all-to-all comparison is going to be an issue nonetheless.
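
To make the quadratic growth concrete, here is a small counting sketch. It only counts query-key comparisons versus sequential recurrent steps; it ignores constant factors, hidden sizes, and parallelism, and the function names are hypothetical.

```python
def attention_pairs(seq_len):
    """Number of query-key scores computed by full all-to-all self-attention."""
    return seq_len * seq_len

def recurrent_steps(seq_len):
    """Number of sequential state updates an RNN performs over the same input."""
    return seq_len

for n in (128, 256, 512):
    print(n, attention_pairs(n), recurrent_steps(n))
```

Doubling the sequence length quadruples the number of attention scores but only doubles the number of recurrent steps; the attention scores can be computed in parallel, whereas the recurrent steps cannot.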

  • @matthewarnold3922
    @matthewarnold3922 4 years ago

    Excellent talk. Kudos!

  • @axe863
    @axe863 1 year ago

    My greatest successes are blending traditional time series modeling with Transformer like Wavelet Denoised ARTFIMA + TFT

  • @dalissonfigueiredo
    @dalissonfigueiredo 4 years ago +2

    What a great explanation, thank you.

    • @yangl1849
      @yangl1849 3 years ago

      hkj678aTY656S\]paxz dAESAZ RS

  • @ThingEngineer
    @ThingEngineer 4 years ago

    This is by far the best video. Ever.

  • @lucyairapetian407
    @lucyairapetian407 4 years ago +1

    Great talk, had to watch at 1.25x though.

    • @thusi87
      @thusi87 4 years ago

      He already talks as if he is on steroids :D Can't imagine I'd understand anything he says at 1.25x lol

    • @LeoDirac
      @LeoDirac 3 years ago

      Totally! I always listen to people talking at 1.25x to 1.5x if I can. Humans are much better at parsing language quickly than generating it. And I was umming and awwing a lot which lowers the information density.

  • @swe_fun
    @swe_fun 1 year ago

    This was amazing.

  • @ChrisHalden007
    @ChrisHalden007 1 year ago

    Great video. Thanks

  • @jaypark7417
    @jaypark7417 4 years ago

    Thank you for sharing it. Really helpful!!

  • @thusi87
    @thusi87 4 years ago

    Great summary! Wonder if you have a collection of talks you give on similar topics ?

  • @Lumcoin
    @Lumcoin 3 years ago

    (Sorry for the lack of technical terms.) I did not completely get how transformers handle positional information: isn't X_in the information from the previous hidden layer? That is not enough for the network, because the input embeddings lack any temporal/positional information, right? But why not just add one new linear temporal value to the embeddings instead of many sine waves at different scales?
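
For reference, the sinusoidal encoding from the "Attention Is All You Need" paper can be sketched as below. One commonly cited reason for preferring it over a single linear position value is that every component stays within [-1, 1] however long the sequence is, whereas a raw position counter grows without bound; treat that as a hedged summary rather than the talk's own argument. The function name is an assumption.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced wavelengths."""
    encoding = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent i / d_model equals 2k / d_model.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

first = positional_encoding(0, 4)
far = positional_encoding(10000, 8)
```

The low-index dimensions oscillate quickly and distinguish neighbouring positions, while the high-index dimensions change slowly and distinguish positions that are far apart.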

  • @randomcandy1000
    @randomcandy1000 2 years ago

    this is awesome!!! thank you

  •  4 years ago

    Amazing presentation

  • @damienbegon9547
    @damienbegon9547 3 years ago

    Sigmoid / tanh activation saturation is actually not a problem since they don't get involved in backpropagation (needing to calculate gradient) regarding LSTM.

    • @LeoDirac
      @LeoDirac 3 years ago

      How so? They are literally involved: they're calculated, and they affect the gradients and weight updates. Do you mean the updates don't matter for some reason? Why would that be? I suppose if they are saturated then the gradients will be zero, so they won't get updated, but I'd say therein lies the problem. Curious to understand where you're coming from.

    • @damienbegon9547
      @damienbegon9547 3 years ago

      @@LeoDirac I got into Deep Learning from the Stanford University courses that I found on YouTube.
      Here's the link to the explanation of the gradient flow for LSTM: ruclips.net/video/6niqTuYFZLQ/видео.html
      I might have misunderstood something, so feel free to explain more about it to me :)
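
A quick numerical check of what "saturation" means here: the sigmoid's derivative is sigma(x) * (1 - sigma(x)), which peaks at 0.25 and shrinks towards zero for large |x|. The sketch below only illustrates that property; it does not model the full LSTM gradient path discussed in the thread.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_zero = sigmoid_grad(0.0)    # the largest value the derivative can take
grad_saturated = sigmoid_grad(10.0) # deep in the saturated region
```

Any gradient that has to pass through a saturated sigmoid is multiplied by a value close to zero, which is the update-stalling effect described in the reply above.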

  • @thomaskwok8389
    @thomaskwok8389 4 years ago +3

    Clear and concise👍

  • @kjpcs123
    @kjpcs123 4 years ago

    A great introduction to transformers.

  • @felipevaldes7679
    @felipevaldes7679 1 year ago

    Leo Dirac: Can't pretrain on large corpus
    Sam Altman: Hold my beer...

    • @LeoDirac
      @LeoDirac 9 months ago

      While I appreciate the association, what did I say to imply you can't pretrain on a large corpus? In the summary "Key Advantages of Transformers" I wrote "Can be trained on unsupervised text; all the world's text data is now valid training data."

  • @ajayram198
    @ajayram198 4 years ago

    Traditional ML guy here, completed Masters in Machine Learning in 2012 when MOOCs and Deep Learning weren't "Pop Culture" . Have heard of Schmidhuber and LSTM in passing though. Gotta check this out ~~|

  • @謝其宏-p3z
    @謝其宏-p3z 4 years ago +24

    This video is incredibly good. It is short and clear. Would you allow me to add a Chinese translation?

    • @kamisama3099
      @kamisama3099 4 years ago +1

      If you have translated it into Chinese, please let me know and give me the link, thank you

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning  4 years ago

      That would be great! I don't know of any YouTube feature to delegate that permission, but if there is one, let us know how. 谢谢你的帮助!

  • @DrummerBoyGames
    @DrummerBoyGames 4 years ago

    Excellent vid, am wondering about a point made around 22:00 about SGD being "slow but gives great results."
    I was under the impression that SGD was generally considered pretty OK w/r/t speed, especially compared to full gradient descent? Maybe it's slow compared to Adam I guess, or in this specific use-case it's slow? Perhaps I'm wrong. Anyways, thanks for the vid!

    • @LeoDirac
      @LeoDirac 4 years ago +2

      I was really just comparing SGD vs Adam there. Adam is usually much faster than SGD to converge. SGD is the standard and so a lot of optimization research has tried to produce a faster optimizer.
      Full batch gradient descent is almost never practical in deep learning settings. That would require a "minibatch size" equal to your entire dataset, which would require vast amounts of GPU RAM unless your dataset is tiny. FWIW, full batch techniques can actually converge fairly quickly, but it's mostly studied for convex optimization problems, which neural networks are not. The "noise" introduced by the random samples in SGD is thought to be very important to help deal with the non-convexity of the NN loss surface.
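
For readers comparing the two optimizers mentioned here, the update rules can be sketched for a single scalar parameter. The Adam step follows the standard published algorithm with its usual default hyperparameters; the function names and the `state` dictionary are assumptions made for this illustration, not any library's API.

```python
import math

def sgd_step(theta, grad, lr=0.01):
    """Plain SGD: step size is proportional to the raw gradient."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and running moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
after_sgd = sgd_step(5.0, 1000.0)
after_adam = adam_step(5.0, 1000.0, state)
```

Adam divides by a running estimate of the gradient's magnitude, so its first step is roughly `lr` in size whatever the gradient scale, while the SGD step scales directly with the gradient. That per-parameter normalisation is one reason Adam often needs less learning-rate tuning.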

  • @vijayabhaskar-j
    @vijayabhaskar-j 4 years ago +4

    Uploaded a month ago but has just 150 views and just 24 subs? WTH?

    • @vsiegel
      @vsiegel 4 years ago +2

      @@vothka205 But ML uses cats and dogs too!

  • @beire1569
    @beire1569 1 year ago

    ooooh I so want to see a documentary about this ==> @25:20

  • @snehotoshbanerjee1938
    @snehotoshbanerjee1938 4 years ago

    Simply Wow!

  • @BoersenCrashKurs
    @BoersenCrashKurs 3 years ago +1

    What do I do when I want to use transformers for time series analysis and the dataset includes individual-specific effects? Would the only possibility be to match the batch size to the length of each individual's data?

    • @LeoDirac
      @LeoDirac 3 years ago

      No, batch and time will be different tensor dimensions. If your dataset has 17 features, and the length is 100 time steps, then your input tensor might be 32x100x17 with a batch size of 32.
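
The layout described in this reply can be sketched with nested lists, keeping batch, time and feature as three separate axes. The helper names are hypothetical, and a real pipeline would use a tensor library rather than lists.

```python
def make_batch(batch_size, time_steps, num_features):
    """Build a zero-filled batch laid out as [batch][time][feature]."""
    return [[[0.0] * num_features for _ in range(time_steps)]
            for _ in range(batch_size)]

def shape(batch):
    """Return (batch, time, features) for a rectangular nested list."""
    return (len(batch), len(batch[0]), len(batch[0][0]))

batch = make_batch(batch_size=32, time_steps=100, num_features=17)
batch_shape = shape(batch)
```

Changing the batch size only changes how many independent sequences are processed together; it has no effect on the sequence length or on the number of features.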

  • @arparwan
    @arparwan 3 years ago

    Good summary of the RNN models. This video is not for newbies though

  • @bgundogdu16
    @bgundogdu16 4 years ago

    Great presentation!

  • @BonkersGameplay
    @BonkersGameplay 1 year ago +2

    are there really a half million of you out there that understand this?

  • @BiranchiNarayanNayak
    @BiranchiNarayanNayak 4 years ago +1

    Very well explained... Love it.

  • @g3kc
    @g3kc 3 years ago

    great talk!

  • @23232323rdurian
    @23232323rdurian 1 year ago

    The Eng:French matrix/diagram from 11:35 shows attention between an English and a French vector. But that would involve both the ENCODing and DECODing... how they interact.
    Whereas the speaker is discussing *only* the internals of the ATTENTION mechanism in the Encoder at this point.
    I'd really like to see a similar matrix/diagram illustrating use of attention WITHIN the ENCODing session... it wouldn't involve French at all at this point, because the ENCODER hasn't even got to the shared representation yet... the machine version of the input that comes AFTER the ENCODE, but BEFORE the DECODE...
    ==> and you're not alone, I see this same vagueness elsewhere in other explanations of Transformer processing...
    ==> but then, most likely I just misunderstand...

  • @ramibishara5887
    @ramibishara5887 4 years ago +1

    where can I find the presentation doc of this talk amigos? thanks

  • @LC-lj5kd
    @LC-lj5kd 4 years ago

    Every sentence he said hit the right spot.

  • @SuilujChannel
    @SuilujChannel 4 years ago +5

    question regarding 26:27
    so if i plan on analysing time series sensor data should i stick to LSTM or is the transformers model a good choice for time series data?

    • @isaacgroen3692
      @isaacgroen3692 4 years ago +4

      I could use an answer to this question as well

    • @akhileshrai4176
      @akhileshrai4176 4 years ago

      @@isaacgroen3692 Damn I have the same question

    • @abdulazeez7971
      @abdulazeez7971 4 years ago +8

      U need to use LSTM for time series.
      Bcos in transformers, it's all about attention or positional intelligence which has to be learnt.
      Whereas in time series, it's all about the trend and patterns which requires the model to remember a complete sequence of data points.

    • @SuilujChannel
      @SuilujChannel 4 years ago +1

      @@abdulazeez7971 thanks for the info :)

    • @Jason-jk1zo
      @Jason-jk1zo 4 years ago +8

      The primary advantages of the transformer are attention and positional encoding, which are quite useful for translation because grammar differences between languages can reorder the output words relative to the input words. But time series sensor data is not reordered between input and output! An RNN such as an LSTM is a suitable choice for analysing such data.

  • @육태경-n5r
    @육태경-n5r 2 years ago

    Awesome!

  • @aj-tg
    @aj-tg 4 years ago +1

    Great stuff.

  • @srikantachaitanya6561
    @srikantachaitanya6561 4 years ago

    Thank you...

  • @thetruereality2
    @thetruereality2 3 years ago +1

    7:25 Can you explain what he means by two hidden states?

    • @LeoDirac
      @LeoDirac 3 years ago

      Literally it means that at each time step, there are two different state vectors passed from one LSTM cell to the next in the time sequence. What they each do or how they are distinct is not entirely clear to me. But structurally, the top one in the diagram (usually called C) acts like a ResNet in that new information is only added to it at each time step, making the gradient path simpler, and training easier. The bottom one (usually called h) is more like a vanilla RNN, responding quickly and directly to the input at that time step. So it's probably reasonable to think of them as representing slower & faster moving changes in the state - capturing interactions that are either closer together in the inputs or stretch over longer ranges.
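
The two state vectors described in this reply appear explicitly in the standard LSTM equations. Below is a scalar sketch of one step, assuming a hypothetical parameter dictionary `p` with keys such as `w_f`, `u_f`, `b_f`; real cells use weight matrices and vectors rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar states; returns the new (h, c) pair."""
    def pre(name):
        # Every gate sees the current input and the previous h state.
        return p["w_" + name] * x + p["u_" + name] * h_prev + p["b_" + name]
    f = sigmoid(pre("f"))        # forget gate: how much of c_prev to keep
    i = sigmoid(pre("i"))        # input gate: how much new content to add
    o = sigmoid(pre("o"))        # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))      # candidate content
    c = f * c_prev + i * g       # cell state C: additive update along the "top" path
    h = o * math.tanh(c)         # hidden state h: gated view of the cell state
    return h, c

zero_params = {prefix + gate: 0.0 for prefix in ("w_", "u_", "b_") for gate in "fiog"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

The cell state `c` is changed only by scaling and addition, which keeps the gradient path along it simple, while `h` is recomputed through gates at every step and reacts directly to the current input.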

  • @maloukemallouke9735
    @maloukemallouke9735 3 years ago

    Thanks so much for the video. Can I ask if anyone knows where I can find a pre-trained model to identify numbers from 0 to 100 in images? Specifically not handwritten ones, and they can be at any position in the image.
    Thanks in advance.

  • @davr9724
    @davr9724 3 years ago

    Amazing!!!