Introduction to NLP | How to Train Custom Word Vectors

  • Published: 5 Apr 2020
  • #nlp #word2vec #python
    In the last video, we learned how to use pre-trained word vectors. Here, I show how to train your own word vectors using the gensim library (a short sketch follows the description below).
    For more videos please subscribe -
    bit.ly/normalizedNERD
    Support me if you can ❤️
    www.paypal.com/paypalme2/suji04
    www.buymeacoffee.com/normaliz...
    NLP playlist -
    • Introduction to NLP
    Source code -
    github.com/Suji04/NormalizedN...
    Data source -
    megagonlabs.github.io/HappyDB/
    Facebook page -
    / nerdywits
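    A rough sketch of the gensim training step covered in the video (the toy sentences and hyperparameters below are illustrative, not the notebook's exact code; argument names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; in the video the corpus comes from HappyDB.
sentences = [
    ["i", "went", "for", "a", "walk", "with", "my", "dog"],
    ["my", "dog", "makes", "me", "happy"],
]

# vector_size / window / min_count are illustrative values (gensim 4.x names;
# older gensim 3.x uses `size` instead of `vector_size`).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# The trained vectors live in model.wv
print(model.wv["dog"][:5])
print(model.wv.most_similar("dog", topn=3))
```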

Comments • 21

  • @shrutiiyyer2783
    @shrutiiyyer2783 2 years ago

    More such videos please! This is much better than the Udemy courses, even the paid ones.

  • @ARSHABBIR100
    @ARSHABBIR100 4 years ago

    Excellent. Thanks for uploading. Kindly make more videos on building a chatbot.

    • @NormalizedNerd
      @NormalizedNerd  4 years ago

      It's on my wish-list too! Keep supporting.

  • @vishnuprabhaviswanathan546
    @vishnuprabhaviswanathan546 2 years ago

    Can you show how to calculate the similarity of 2 words using a custom-trained word2vec?

  • @Lotof_Mazey
    @Lotof_Mazey 1 year ago +1

    Sir, kindly guide: how can I use pre-trained word embedding models for local languages (or languages written in Roman script) that aren't covered by the pre-trained models? Do I have to use an (untrained) embedding layer to create the embedding matrices for a local language? How can I benefit from pre-trained models for a local language?

    • @NormalizedNerd
      @NormalizedNerd  1 year ago +1

      Hi, unfortunately there aren't many pre-trained word embeddings for romanized non-English languages. You can search, and if you find something you can fine-tune it on your data. But I don't think there's an easy way to use English models on romanized non-English languages.
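      If a full gensim Word2Vec model (not just exported vectors) exists for that language, one way to fine-tune it on your own data looks roughly like this (the file name and sentences below are placeholders; gensim 4.x API):

```python
from gensim.models import Word2Vec

# Hypothetical pre-trained model file; fine-tuning needs the full model, not just KeyedVectors.
model = Word2Vec.load("pretrained_word2vec.model")

# Toy romanized local-language sentences (placeholders).
new_sentences = [
    ["aaj", "ka", "din", "acha", "tha"],
    ["mujhe", "yeh", "gaana", "pasand", "hai"],
]

model.build_vocab(new_sentences, update=True)  # add new words to the existing vocabulary
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
```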

  • @MrStudent1978
    @MrStudent1978 4 years ago +1

    Very nice explanation! I have a question: in cell no. 50, what is the sense behind "trainable = False"? The video is about training a custom word2vec, so why False?

    • @NormalizedNerd
      @NormalizedNerd  4 years ago +1

      @Gurpreet Singh
      I understand your confusion.
      We actually train our word vectors in cell 46. In cell 50, we are building the embedding layer that will be placed just before the LSTM units. Remember that the embedding layer is nothing but the learned word vectors (in matrix form)!
      So if we set trainable = True on the embedding layer, then Keras will train the embedding layer (i.e. the word vectors) again while performing backprop on the LSTM. We don't want that.
      I hope it's clear to you now.
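      A minimal sketch of that idea (variable names and shapes are illustrative, Keras 2.x-style API, not the notebook's exact code):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Rows of embedding_matrix would be the word2vec vectors trained earlier,
# ordered by the tokenizer's word_index (random placeholder here).
vocab_size, vector_dim = 10000, 100
embedding_matrix = np.random.rand(vocab_size, vector_dim)

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=vector_dim,
    weights=[embedding_matrix],  # initialise with the trained word2vec matrix
    trainable=False,             # freeze it so backprop on the LSTM doesn't overwrite the vectors
)
```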

    • @MrStudent1978
      @MrStudent1978 4 years ago

      @NormalizedNerd Thanks for your response! I got it now.

  • @vishnuprabhaviswanathan546
    @vishnuprabhaviswanathan546 2 years ago

    Please show how to custom-train BERT embeddings.

  • @rushikeshkulkarni7758
    @rushikeshkulkarni7758 1 year ago

    Why didn't we use sklearn's train_test_split?

  • @hanjes4793
    @hanjes4793 3 years ago +1

    Hello, I've got a question: in the train/test split cell, where does 'word_index' come from? Thanks.

    • @NormalizedNerd
      @NormalizedNerd  3 years ago

      It's the Keras Tokenizer that gives us the 'word_index'.
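      For reference, a tiny illustration of where word_index comes from (toy sentences, not the notebook's data):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["i felt happy today", "my dog made me happy"]  # toy corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)  # e.g. {'happy': 1, 'i': 2, 'felt': 3, ...}
```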

  • @WhatsAI
    @WhatsAI 4 years ago

    Great video my friend!

  • @s.m.saifulislambadhon2654
    @s.m.saifulislambadhon2654 4 years ago +1

    Bro, in cell no. 44, what is the purpose of the tokenizer when we already tokenized the sentences into words in the preprocessing part?

    • @s.m.saifulislambadhon2654
      @s.m.saifulislambadhon2654 4 years ago

      Would you please explain cell no. 44 a bit more? I think this is the most important part that I'm missing.

    • @NormalizedNerd
      @NormalizedNerd  4 years ago +1

      Great point! In NLP preprocessing, tokenization makes it easier to clean the text; for that I generally use the nltk library.
      In cell 44, I did the tokenization with the Keras Tokenizer, which gives us two handy tools: word_index & texts_to_sequences. These help us create the tensors easily. So yes, the tokenization is redundant here, but I did it anyway to make our life easier :D
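      A small sketch of those two in action (toy data, not the notebook's exact variables):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["i felt happy today", "my dog made me happy"]  # toy corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)          # words -> integer ids via word_index
padded = pad_sequences(sequences, maxlen=6, padding="post")  # equal-length input for the LSTM
print(padded.shape)  # (2, 6)
```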

    • @s.m.saifulislambadhon2654
      @s.m.saifulislambadhon2654 4 years ago +1

      Thanks for the explanation

  • @coxixx
    @coxixx 4 years ago +1

    Would you show how to train custom word vectors with GloVe using Python?

    • @NormalizedNerd
      @NormalizedNerd  4 years ago

      That's actually very easy. Just make your corpus (a .txt file), then use the official repo to train a GloVe model on it: github.com/stanfordnlp/GloVe
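      Once the repo has produced its vectors.txt output, the vectors can be loaded back in Python, for example with gensim 4.x (a hedged sketch, not the video's code; the file name follows the repo's demo script):

```python
from gensim.models import KeyedVectors

# GloVe's text output has no header line, hence no_header=True (available in gensim >= 4.0).
glove_vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True)
print(glove_vectors.most_similar("happy", topn=5))
```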