Introduction to NLP | How to Train Custom Word Vectors
- Published: 5 Apr 2020
- #nlp #word2vec #python
In the last video, we learned how to use pre-trained word vectors. Here, I show how to train your own word vectors using the gensim library.
For more videos please subscribe -
bit.ly/normalizedNERD
Support me if you can ❤️
www.paypal.com/paypalme2/suji04
www.buymeacoffee.com/normaliz...
NLP playlist -
• Introduction to NLP
Source code -
github.com/Suji04/NormalizedN...
Data source -
megagonlabs.github.io/HappyDB/
Facebook page -
/ nerdywits
More such videos please, this is much better than the Udemy courses, even the paid ones!
Excellent. Thanks for uploading. Kindly make more videos on building a chatbot.
It's on my wish list too! Keep supporting.
Can you show how to calculate the similarity of two words using a custom-trained word2vec model?
Sir, kindly guide: how can I use pre-trained word embedding models for local languages (or languages written in Roman script) that are not available/trained in the pre-trained model? Do I have to use an embedding layer (not pre-trained) for creating embedding matrices for any local language? How can I benefit from pre-trained models for a local language?
Hi, unfortunately there aren't many pre-trained word embeddings for romanized non-English languages. You can search, and if you find something you can fine-tune it on your data. But I don't think there's an easy way to use English models on romanized non-English languages.
Very nice explanation! I have a question... in cell 50, what is the sense behind "trainable = False"? The video is about training custom word2vec, so why False?
@Gurpreet Singh
I understand your confusion.
We actually train our word vectors in cell 46. In cell 50, we are building the embedding layer that sits just before the LSTM units. Remember that the embedding layer is nothing but the learned word vectors (in matrix form)!
So if we set trainable = True on the embedding layer, Keras will train the embedding layer (i.e. the word vectors) again while performing backprop through the LSTM. We don't want that.
I hope it's clear now.
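The pattern described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the vocabulary size, dimensions, and the random matrix standing in for the trained word2vec vectors are all assumptions.

```python
import numpy as np
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Stand-in for the learned word2vec matrix (rows = words, columns = vector dims)
vocab_size, embedding_dim = 100, 10
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

model = Sequential([
    # trainable=False freezes the word vectors during backprop through the LSTM
    Embedding(vocab_size, embedding_dim, trainable=False),
    LSTM(16),
    Dense(1, activation="sigmoid"),
])

# Build the model with a dummy batch, then load the word2vec matrix as weights
_ = model(np.array([[1, 2, 3]]))
model.layers[0].set_weights([embedding_matrix])

model.compile(optimizer="adam", loss="binary_crossentropy")
```

With `trainable = True` instead, gradient updates from the downstream task would overwrite the word2vec vectors, which is exactly what the reply above says we want to avoid.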
@NormalizedNerd thanks for your response! I got it now.
Please show how to custom-train BERT embeddings.
Why didn't we use sklearn's train_test_split?
Hello, I've got a question: in the train-test split cell, where does 'word_index' come from? Thanks.
It's the Keras Tokenizer that gives us 'word_index'.
Great video my friend!
Thank you pal :)
Bro, in cell 44, what is the purpose of the tokenizer when we already tokenized the sentences into words in the preprocessing part?
Would you please explain cell 44 in a bit more detail? I think this is the most important part that I'm missing...
Great point! In NLP preprocessing, tokenization makes it easier to clean the text; for that I generally use the nltk library.
In cell 44, I did the tokenization with the Keras Tokenizer, which gives us two handy functions: word_index and texts_to_sequences. These help us create the tensors easily. So yes, the tokenization is redundant here, but I did it anyway to make our lives easier :D
Thanks for the explanation
Could you show how to train custom word vectors with GloVe using Python?
That's actually very easy. Just make your corpus (a .txt file), then use the official repo to train a GloVe model on it. github.com/stanfordnlp/GloVe
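The repo workflow roughly follows its bundled demo.sh. This is a sketch: `corpus.txt` is a hypothetical plain-text corpus (one or more whitespace-tokenized sentences per line), and the flag values are illustrative defaults, not recommendations.

```shell
# Clone and build the GloVe tools
git clone https://github.com/stanfordnlp/GloVe
cd GloVe && make

# Build the vocabulary and co-occurrence statistics from corpus.txt
build/vocab_count -min-count 5 < corpus.txt > vocab.txt
build/cooccur -vocab-file vocab.txt -window-size 15 < corpus.txt > cooccurrence.bin
build/shuffle < cooccurrence.bin > cooccurrence.shuf.bin

# Train the GloVe model; vectors land in vectors.txt
build/glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
            -save-file vectors -vector-size 50 -iter 15
```

The resulting vectors.txt can then be loaded in Python, e.g. with gensim's KeyedVectors.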