[Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality

  • Published: Jan 23, 2025

Comments • 56

  • @sg785
    @sg785 4 years ago +65

    Classic papers are maybe the best addition to this kind of content. I find it really useful and important to come back to old papers sometimes and look at them from the perspective of the modern state of DL.

    • @kappadistributive
      @kappadistributive 4 years ago +4

      +1 As my history teacher in high school used to say: You must know where you came from to know where you are going.

  • @robertlucente657
    @robertlucente657 1 month ago

    It is refreshing to have all these classics worked through - They are helpful to mid-tier people - The experts don't need help - For beginners it is too much - And so mid-tier is helpful

  • @oostopitre
    @oostopitre 4 years ago +1

    There is so much value in the videos just from the core content itself. However, anecdotes like how the 'Hierarchical softmax' was a distraction in the paper add much more context and hence understanding. Thank you for these videos :)

  • @Scranny
    @Scranny 4 years ago +5

    Wow. Just wow. This was a fantastic overview of word2vec. Your explanations of the minute details and the vague and harder to grasp concepts of their paper were exceptional. Your comments on their unconventional authorship and writing style issues were also on point. I felt like I learned and re-learned how word2vec really works. Yes, please cover more classic papers, because understanding the foundations is important. Way to go Yannic!

  • @ShivaramKR
    @ShivaramKR 4 years ago +2

    Thanks Yannic for the [Classic] videos! These videos are more useful than many of the papers which do small incremental improvements.

  • @MrjbushM
    @MrjbushM 4 years ago +2

    Thanks for this classic papers series. For those of us who are learning deep learning, it is important to cover the classic and main old ideas in the field.

  • @doyourealise
    @doyourealise 4 years ago +2

    Wow, I have been learning word2vec since yesterday and was struggling to grasp the concept, and here you uploaded the video explaining the paper!

  • @DiegoJimenez-ic8by
    @DiegoJimenez-ic8by 4 years ago +2

    Thanks for visiting such an important paper!!! Awesome content!!

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +2

    Welcome to Yannic's paper museum :)
    Very nice to look at older papers as well!

  • @leapdaniel8058
    @leapdaniel8058 4 years ago +1

    I would definitely be into a playlist of "classical" data science videos like this. There is so much content to absorb, being able to focus on the ones that have been proven historically and vetted would be awesome.
    It also gives you a chance to reference how things have improved since then, which is nice to know.

  • @ironic_bond
    @ironic_bond 4 years ago +1

    Really enjoying watching these videos. You did a great job explaining them!

  • @fotisj321
    @fotisj321 4 years ago +2

    Great explanation of a paper as usual. And this paper (or the three of them) changed so much. Even if token-based embeddings are usually preferable, for some applications type-based word embeddings are probably still the better choice, for example if you are interested in the history of concepts and want to track their semantic change.

  • @thearianrobben
    @thearianrobben 4 years ago

    Always good to look back at classic papers

  • @michaelfrost6437
    @michaelfrost6437 3 years ago +3

    My browser crashed along with my 50,000 tabs. I restored them and suddenly Yannic is telling me about 5 papers simultaneously.

  • @zd676
    @zd676 4 years ago

    Please keep going with the amazing content! Love it!

  • @adriandip8448
    @adriandip8448 2 years ago

    Thank you!!! So much better than the Stanford class.

  • @kappadistributive
    @kappadistributive 4 years ago

    To provide another argument for the case of classical papers: It is very difficult to anticipate which ideas will stand the test of time in the moment of their creation. By visiting ‘classical’ papers we allow ourselves the benefit of hindsight - examining those ideas that time proved to be invaluable.

  • @aflah7572
    @aflah7572 3 years ago

    Love this series, looking forward to more such videos

  • @joseiglesias330
    @joseiglesias330 4 years ago

    Yes, more historical papers!!

  • @harshpoddar2113
    @harshpoddar2113 3 years ago

    Really loved your explanation. Thank You.

  • @sonOfLiberty100
    @sonOfLiberty100 4 years ago +1

    Love it, more of old papers :)

  • @francoisdupont2108
    @francoisdupont2108 4 years ago

    Classic papers are a great idea. It's really helpful for those like me who are new to ML. I often try to read some papers that are extensions of algorithms introduced in the classic ones and I struggle to understand them since I don't have the prerequisites.

  • @carlossegura403
    @carlossegura403 4 years ago +1

    This is awesome!

  • @wizardOfRobots
    @wizardOfRobots 3 years ago

    Thank you. I couldn't understand word2vec from Prof. Andrew Ng's video, but you explained it clearly!

  • @spaceisawesome1
    @spaceisawesome1 4 years ago +20

    Wait you're supposed to be having a break! This is your second video in two days. 😅

    • @tech4028
      @tech4028 4 years ago +3

      The videos are pre-recorded! He's amazing, man.

    • @spaceisawesome1
      @spaceisawesome1 4 years ago +2

      Indeed what a guy. I think he's doing some good things with this channel!

  • @TechVizTheDataScienceGuy
    @TechVizTheDataScienceGuy 4 years ago

    Classic series 🔥

  • @thepaulozip
    @thepaulozip 4 years ago

    Wow that's nice! Please do more about classical papers!

  • @binjianxin7830
    @binjianxin7830 3 years ago

    OMG I’m revisiting this clip for negative sampling because I was confused by it while trying to understand random-walk node embeddings in GNNs.

  • @thntk
    @thntk 4 years ago +1

    Can you please give references for your claim at 5:20? You said that Queen is just one of the closest words to King and the computation -man+woman is irrelevant. That makes sense in this case, but I don't see how it can explain more complicated analogies such as plural-form analogies. I would like to read more about this.
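
For readers following the analogy discussion, here is a minimal sketch of how such a query is usually computed: form vec(king) - vec(man) + vec(woman) and return the nearest word by cosine similarity, optionally excluding the query words themselves. The three-dimensional vectors and the `analogy` helper are hypothetical toy values made up for illustration, not trained embeddings, so the sketch shows the mechanics only and says nothing about how real embeddings behave.

```python
import numpy as np

# Hypothetical toy vectors, chosen by hand for illustration only.
VECTORS = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude_inputs=True):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = VECTORS[a] - VECTORS[b] + VECTORS[c]
    candidates = [
        w for w in VECTORS
        if not (exclude_inputs and w in (a, b, c))
    ]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

result = analogy("king", "man", "woman")
```

On real embeddings, comparing this result with the plain nearest neighbours of vec(king) is one way to check how much the offset actually contributes.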

  • @aa-xn5hc
    @aa-xn5hc 4 years ago

    Really Great!

  • @ativjoshi1049
    @ativjoshi1049 4 years ago +1

    More videos like this please....

  • @vladimirradenkovic9119
    @vladimirradenkovic9119 2 years ago

    I love you man!

  • @aa-xn5hc
    @aa-xn5hc 4 years ago

    Yes, I love historical papers

  • @danberm1755
    @danberm1755 1 year ago

    Thanks! 👍

  • @herp_derpingson
    @herp_derpingson 4 years ago +3

    5:00 That's news to me. I remember trying it out myself: the king/queen thing worked while a lot of other analogies didn't. I didn't put much thought into it back then.
    .
    25:13 3/4 is 75%, which is very close to 80%, which makes me think it has something to do with the Pareto Principle. Maybe 4/5 didn't do better because we truncated the tail of the distribution.
    .
    27:40 Heuristics = Wild ass guess. Computer Science 101 :D
    .
    30:30 I think they didn't do that because back in 2013 they didn't have the option :) TensorFlow was made public in late 2015. Back in 2013 there was no TensorFlow, no TPUs, and GPU clusters were super niche.

    • @kappadistributive
      @kappadistributive 4 years ago +2

      Regarding your second comment: 80% doesn't magically translate to an exponent here in the way you seem to suggest. To see this, consider the extreme case in which 1 contributor causes 19% of the effect. This contributor would receive the same exponent in its probability mass function that it would receive in a much less extreme power-law scenario. It would seem, however, that the 19% contributor should be sampled *way* less frequently than that.

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      Yes you're probably right with there not being GPUs, but they had their whole MapReduce infrastructure etc, it would have been easy for them to just keep it at that scale.
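
The 3/4 exponent discussed in this thread refers to the noise distribution used for negative sampling: unigram counts are raised to the power 3/4 and renormalized. A minimal sketch, assuming made-up word counts (the `counts` values and the `noise_distribution` helper are hypothetical):

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Raise unigram counts to `power` and renormalize to a probability vector."""
    counts = np.asarray(counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

# Hypothetical counts for a frequent, a medium, and a rare word.
counts = [900, 90, 10]

unigram = noise_distribution(counts, power=1.0)    # plain unigram distribution
smoothed = noise_distribution(counts, power=0.75)  # 3/4-power variant
```

With these counts the frequent word's sampling probability drops and the rare word's rises relative to the plain unigram distribution, which is the flattening effect the exponent is meant to have.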

  • @saswatnanda3481
    @saswatnanda3481 2 years ago

    One video on Efficient Estimation of Word Representations in Vector Space, please

  • @Notshife
    @Notshife 4 years ago

    Yannic, are you also taking a break from such regular reading of papers in your personal time as well? And if not, do you think you could provide a "this is interesting" list in your discord channel when you happen to come across interesting papers?

  • @scottmiller2591
    @scottmiller2591 4 years ago +1

    Don't forget that Word2Vec is part of the encoding in the front end of a transformer, so w2v is still plenty relevant!

    • @YuenHsienTseng
      @YuenHsienTseng 4 years ago +1

      As far as I know, Transformers or the like (especially BERT) use Byte Pair Encoding to tackle the out-of-vocabulary problem. The vocabulary size is often reduced to within 30,000, rather than 10^5 to 10^7. Therefore, there are no Word2Vec embeddings there (but there is still an input embedding layer whose weights are learned when the Transformer is trained). Despite this change, the concept of Word2Vec has strongly influenced how we apply deep learning in natural language processing.

  • @simba2702
    @simba2702 4 years ago +1

    I love your videos. Just a side note, when you try to explain things with notes make them readable so that if I jump to a random section I can understand what you are trying to explain.

  • @kurianbenoy1459
    @kurianbenoy1459 4 years ago +1

    Obviously like this

  • @M0481
    @M0481 4 years ago

    A comment: At 3:33 you mention that with PCA these are the first 2 dimensions that are portrayed. I don't think this is true, right? PCA allows you to map a certain percentage of the expressiveness of the data into a lower dimensional space. This is not the same as simply taking the first two dimensions.

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Correct, I meant the first two PCA dimensions, not data dimensions
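
The distinction in this exchange can be sketched in a few lines: projecting onto the first two principal components is not the same as slicing out the first two data columns. This is a minimal illustration using an SVD-based PCA; the random `embeddings` matrix is a hypothetical stand-in for real word vectors, and `pca_2d` is an illustrative helper, not the plotting code used for the paper's figure.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=np.float64)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 10))  # hypothetical stand-in for word vectors

projected = pca_2d(embeddings)  # first two PCA dimensions
sliced = embeddings[:, :2]      # first two raw data dimensions
```

Each PCA dimension is a linear combination of all ten original dimensions, so `projected` and `sliced` generally differ.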

  • @ikopysitsky
    @ikopysitsky 1 year ago

    I may be mistaken here but if you're maximizing the objective function for negative sampling your negative and positive signs for the WO vs Wi should be reversed, so it should be minimizing instead of maximizing.
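
As a sanity check on the sign question, here is a small sketch of the negative sampling objective as I read it in the paper: log σ(v'_wO · v_wI) plus the sum over negatives of log σ(-v'_wi · v_wI), a quantity to be maximized. The vectors and the `neg_sampling_objective` helper are hypothetical and chosen only to make the sign behaviour visible.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def neg_sampling_objective(v_in, v_out, v_neg):
    """Objective for one (input, output) pair with k negative samples.

    v_in:  input word vector, shape (d,)
    v_out: output (context) word vector, shape (d,)
    v_neg: negative sample vectors, shape (k, d)
    """
    positive = log_sigmoid(v_out @ v_in)            # pull the true pair together
    negative = log_sigmoid(-(v_neg @ v_in)).sum()   # push noise words away
    return float(positive + negative)

# Hypothetical 2-d vectors.
v_in = np.array([1.0, 0.0])
v_neg = np.array([[0.0, 1.0]])

aligned = neg_sampling_objective(v_in, np.array([2.0, 0.0]), v_neg)
opposed = neg_sampling_objective(v_in, np.array([-2.0, 0.0]), v_neg)
```

Under this convention the objective is larger when the true context vector points the same way as the input vector. Minimizing the negated expression as a loss is the same optimization, so whether the signs look reversed depends on which of the two forms is written down.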

  • @jeremykothe2847
    @jeremykothe2847 4 years ago +1

    0 0 0 0 0.05 0.95 st!

  • @maxdoner4528
    @maxdoner4528 3 years ago

    Gj

  • @GuilhermeOliveira-kx4mz
    @GuilhermeOliveira-kx4mz 1 year ago

    To All my students. Let me know personally if you find my comment. Cheers!