Natural Language Processing - Tokenization (NLP Zero to Hero - Part 1)
- Published: Feb 9, 2025
- Welcome to Zero to Hero for Natural Language Processing using TensorFlow! If you’re not an expert on AI or ML, don’t worry -- we’re taking the concepts of NLP and teaching them from first principles with our host Laurence Moroney (@lmoroney).
In this first lesson we’ll talk about how to represent words in a way that a computer can process them, with a view to later training a neural network to understand their meaning.
Hands-on Colab → goo.gle/2uO6Gee
NLP Zero to Hero playlist → goo.gle/nlp-z2h
Subscribe to the TensorFlow channel → goo.gle/Tensor...
Such a great lecture on NLP, wow. I wish I had found it when it was uploaded; it would have saved me two years.
What did you do for those two years? I mean, which course did you take?
Simple and straight to the point, I love it!
Best explanation ever. The best teacher I've ever listened to.
Wow! Thank you for breaking this down in such an easy way.
Thank you so much! This is so informative, so quickly, in well structured lessons. I'm using a TensorFlow package for R and this helps me understand my project so much better!
Why are you using R instead of Python?
Thanks! Laurence Moroney is a blessing for us! Awesome information.
Lol, you explained this so well that it made me want to implement my own library for tokenization
loooool I too had this feeling :D :D
I love your videos! They are very professional and concepts are very clearly explained.
I've always been discouraged from learning NLP... but you've just made it a whole lot easier.
It's a huge field, and I'm just scratching the surface. I hope it's useful! :)
Thanks for making it clear waiting for the next one
Thanks!
This is a godsend.
No other definition is possible.
Thanks, you made it so easy for me to understand NLP 🙏
We're happy to hear that the video was helpful. If you'd like to learn more about NLP, check out the NLP Zero to Hero playlist → goo.gle/nlp-z2h
@@TensorFlow I have checked it. But I have one request: can we build a model like ChatGPT using TensorFlow? 🤔
Thank you so much 🙏🏻 Such great information.
Welcome!
Thanks for making it clear.
Fantastic video! Very informative. Thank you for sharing TensorFlow!
Thanks, Matty!
Laurence Moroney, I have a specific TensorFlow question regarding Beautiful Soup, specifically gathering text from an HTML output. Is there any way we could start a dialogue?
Incredibly amazing!
Thank you so much, this is very well explained.
Great explanation, thanks a lot!!!
It's great. Waiting for the next one.
Glad you enjoyed!
He's a TensorFlow guru!
The legend is back
But you got me instead ;)
Excellent presentation.
Amazingly well said
Yeah, Zero to Hero is back.
For 3 episodes, and I'm working on another 3 for text generation to come out in the not-too-distant future. I hope!
Thank you so much, sir, for the great videos.
🤩😍😍🤩 Very informative, waiting for the rest
Thanks, Mohamed!
quite informative. thanks.
Glad it was helpful!
Great introduction
For NLP freshers: this video is more about encoding than about tokenization itself. Read about both topics separately before going through this video to understand it better.
If you are confused, like I was, about why 'love' receives index 1, go to the end of that video, where it's explained:
Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing
Great, thanks for the info!
Amazing presentation. Thanks for the info!
We can also use tokenization for converting sentences to words.
@lmoroney I have come across chatbot deployments recently. It is said that there is a problem with continued conversation in the case of chatbots. But I have a query: why can't we add an LSTM on top of an LSTM model? I mean, if we are able to provide memory over sentences as well as memory within a particular sentence, then it may be able to store the essentials of previous conversations. Please help me with this query; I am new to NLP and excited to learn more.
Thank you so much!
Thank you very much
A better explanation is impossible.
THANKS 😇
Thanks
thank you, it was helpful :)
great video
Amazing presentation
So, it's like Markov/Lempel compression?
Good one. Sir, I want to know: for nlp("I have III years of exp"), checking the token's IS_NUM attribute is not working. Do we have any workaround for this? Will Roman numerals not be detected?
I need a little more help; can you please mention the books or research papers you have followed?
Basically, I am asking for references, so I can read them myself.
Nice video
What an amazing and simple way of explaining. Thank you!
Tokenizer is deprecated now
what's used now?
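For anyone finding this thread later: in recent TensorFlow releases the recommended replacement is the tf.keras.layers.TextVectorization layer, which folds tokenization and indexing into a single layer. A minimal sketch of the same word-index workflow (details such as the reserved tokens may vary across TF versions):
import tensorflow as tf
sentences = [
'i love my dog',
'I, love my cat',
'You love my dog!'
]
# The layer lowercases and strips punctuation by default, then learns an
# integer index for each word from the data passed to adapt().
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)
# Index 0 is reserved for padding and 1 for out-of-vocabulary words.
print(vectorize_layer.get_vocabulary())
print(vectorize_layer(sentences))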
I have been working with Marathi (an Indian regional language) for the last 20-odd years. For the last 8 years I have been working as a writer-translator. If I learn NLP, will I be able to combine Marathi linguistic skills and NLP skills in practical use? If yes, how, and where can I use them?
Perhaps this is coming in a later video, but is there any rhyme or reason to the integers that get assigned to the words? Or is it PURELY arbitrary?
Ope -- looks like the more frequent a word is, the smaller its assigned integer. Correct?
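That's correct: fit_on_texts counts occurrences and hands out indices in descending order of frequency (ties broken by order of first appearance). A quick way to verify with the Tokenizer itself:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(['i love my dog', 'i love my cat', 'you love my dog!'])
# word_counts holds the raw frequencies; word_index the assigned integers.
print(tokenizer.word_counts)  # love: 3, my: 3, i: 2, dog: 2, cat: 1, you: 1
print(tokenizer.word_index)   # the most frequent words get the smallest indices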
Awesome
This is awesome
Hello everyone
So, I am new to the ML/NLP world and need some tips. My team is working on a project in which we want to convert text (especially Hindi or Sanskrit) to a set of specific images. Which algorithm or model should we go for, and where should we start? We have made the dataset, but now what?
Very nice
Thanks!
Excellent! How do I leverage k-means clustering to find similarities or segment sentences from one another?
Thank you for the video. Sometimes an exclamation mark can be informative for tasks such as sentiment classification, but the tokenizer filters it out. Is there a way to prevent this?
Did you get an answer for this from any other source?
@@vishnurajyadav8917 Yes, we can control it by changing the filters parameter: www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
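To make the thread self-contained, here is a sketch of that fix: the default filters string strips most ASCII punctuation, so passing a copy with '!' removed keeps exclamation marks (note the Tokenizer still splits on spaces, so the '!' stays attached to its word):
from tensorflow.keras.preprocessing.text import Tokenizer
# The default filters start with '!'; this copy omits it.
keep_bang = '"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
tokenizer = Tokenizer(filters=keep_bang)
tokenizer.fit_on_texts(['I love my dog', 'You love my dog!'])
print(tokenizer.word_index)  # 'dog' and 'dog!' end up as separate tokens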
Can this be called a bug? Or are there reasons that Keras doesn't handle this? (Notice: in "'you're" the left quote is still there, "'" gets recognized as a word with index 11, and num_words=4 doesn't really limit the word count down to 4.)
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
'i love my dog',
'I, love my cat',
'You love my dog!',
"Jack said, 'You're gonna love my cat!'"
]
tokenizer = Tokenizer(num_words = 4)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6, 'jack': 7, 'said': 8, "'you're": 9, 'gonna': 10, "'": 11}
I'm also wondering why this is the case; even when I set num_words to 1 or even 0, it still tokenizes all the provided words. Have you got the answer for this?
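For anyone else puzzled by this: it is documented (if surprising) behavior, not a bug. num_words does not limit word_index at all; the full vocabulary is always kept there. The limit is only applied later, in texts_to_sequences and texts_to_matrix, which drop any word whose index is num_words or higher. A quick demonstration:
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(sentences)
# word_index still contains every word...
print(tokenizer.word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
# ...but the sequences only keep words with index < num_words (the top 3 here).
print(tokenizer.texts_to_sequences(sentences))
# [[3, 1, 2], [3, 1, 2], [1, 2]]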
useful
This code has apache license, so can it be reused?
I need your advice on finding text similarity.
sentences = [
'كم سعر الراجحي',
'ما هي قيمة الراجحي؟',
'هل تعرف سعر أرامكو؟'
]
{'سعر': 1, 'كم': 2, 'الراجحي': 3, 'ما': 4, 'هي': 5, 'قيمة': 6, 'الراجحي؟': 7, 'هل': 8, 'تعرف': 9, 'أرامكو؟': 10}
It's putting الراجحي and الراجحي؟ as two tokens. Is that because of Arabic?
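It's not Arabic itself; it's the default filters string, which strips the ASCII '?' but not the Arabic question mark '؟' (U+061F), so the mark stays glued to the word. Appending it to the filters merges the two spellings (a sketch):
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = ['كم سعر الراجحي', 'ما هي قيمة الراجحي؟', 'هل تعرف سعر أرامكو؟']
# Default filter set plus the Arabic question mark.
arabic_filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' + '؟'
tokenizer = Tokenizer(filters=arabic_filters)
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)  # 'الراجحي' now appears only once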
Too good
Should we keep only nouns when topic modelling? I am quite new to NLP and it seems there is no clear universal rule of thumb for extracting topic information. What would you advise?
You can probably do decently well using just nouns, but you will probably also lose a lot of information if you filter out non-nouns at the Tokenization or pre-processing step. For example, if you only use nouns, you could very well pick up on a topic like "machine learning" in your dataset, but you might miss separate discussions of "deep learning," because "deep" is an adjective that would get filtered out and you would be left with just general "learning." An ultra-crude way you might augment this a bit is to instead do topic modeling on n-grams and keep only those n-grams that contain at least one noun, but I haven't tried this, so I can't assert it will actually work.
@@Aoitetsugakusha +1 Great answer
It depends on what your objective is. If the end result is only centered around identifying entities, then keeping NN/NNP may make sense (given your POS tagger is not making errors). It all depends on the objective. For my use case, I extracted chunks of SVO (Subject-Verb-Object) phrases and then performed topic modeling; that worked well for me, but I had to adjust my POS tagger to do this task well.
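Here's a rough sketch of the noun-filtered n-gram idea mentioned above, using NLTK's off-the-shelf POS tagger as a stand-in (an assumption, since the thread doesn't name a tagger; tag sets and accuracy will vary):
import nltk
from nltk import pos_tag, word_tokenize
from nltk.util import ngrams
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = 'deep learning models can outperform classical machine learning'
tagged = pos_tag(word_tokenize(text))
# Keep only bigrams containing at least one noun (tags starting with NN).
noun_bigrams = [
    ' '.join(word for word, tag in bigram)
    for bigram in ngrams(tagged, 2)
    if any(tag.startswith('NN') for word, tag in bigram)
]
print(noun_bigrams)  # e.g. 'deep learning', 'learning models', 'machine learning'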
Suppose you have 30 text files in one folder; how do you tokenize the words?
Great introduction which is easy to understand. Can't wait for the next videos of this series!
But is there any way to group words while ignoring some grammar? Like "He plays piano" vs. "I play piano", where "plays" != "play" but it is basically the same word and tense.
The part of ignoring the "!" in "dog!" is fascinating.
Yeah...that's a little more difficult in preprocessing text. I won't be covering that...sorry!
Those are sub-words, and different tools can be used for obtaining them, such as SentencePiece (github.com/google/sentencepiece). In this case the model searches for common sub-words such as "play", and in the case of "plays" it tokenizes it as "play" + "s". It is also possible to tokenize "s" as a character and "play" as a sub-word.
Yup, of course. You can lemmatize or stem these words.
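A minimal sketch of both, using NLTK (one of several libraries that provide them; spaCy's lemmatizer works too):
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # one-time download needed by the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming chops suffixes heuristically; lemmatizing maps to dictionary forms.
print(stemmer.stem('plays'))                   # play
print(lemmatizer.lemmatize('plays', pos='v'))  # play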
Could you help me with a Python 3.8.2-compatible version of TensorFlow and Keras?
Hi. What happens if I set num_words to 0? I tried, and it still prints all the words.
What are the advantages of using the TF framework instead of other preprocessing methods, such as those spaCy or NLTK provide, for example? :) Thank you.
I can't compare with the others... but this way they're in a unified framework, which means less code when I get around to training a NN with them (in episode 3).
Thank you so much, but I have a question: how can I use words in a language other than English? For example, building NLP in Azerbaijani.
You need to find and download an Azerbaijani corpus from the Internet. You can then prepare the word index using TensorFlow. The rest of the steps should be the same as the English example shown in the video. I don't know about the Azerbaijani language, but some languages, like Tamil, don't have separate grammatical words like English; you need to do heaps of preprocessing before you prepare the word index. This is something you need to be aware of. Also, if you can't find a corpus for your language, use something called the "hashing trick" (or "feature hashing") to hash the individual words in your language. Luckily, TensorFlow supports the hashing trick.
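To illustrate that last point: tf.keras.preprocessing.text.hashing_trick maps each word to a bucket in a fixed-size index space without fitting a vocabulary first, so it works for any language. The trade-off is that unrelated words can collide in the same bucket.
from tensorflow.keras.preprocessing.text import hashing_trick
# Hash each word into one of 50 buckets; no corpus or fit step required.
print(hashing_trick('i love my dog', n=50, hash_function='md5'))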
Is the link to Part 2: Sequencing - Turning Sentences into Data available?
Yep, came out yesterday, check yt.com/tf for details
Does NLP only process English? Could it do another language? My question is really whether it could be used to learn a different language as a basis and go from there.
It can be used for any language:)
print(Hashing == Tokenization )
What's the output??
Sir, my name is Mohith and I am a final-year BE student. Can you help me with some doubts on NLP? I am working on data generalization and data sanitization; our task is identifying whether given text is sanitized or not, and generalized or not. How does it work in Python? Can you help me out, sir, please? It would be helpful to me.
The colab is labeled as Course 3 - Week 1 - Lesson 1.ipynb - where can I sign up for the full course? Thank you!
The colab was adapted from one I wrote at Coursera, where it was course 3 in the TensorFlow: In Practice specialization. There's more there. Otherwise, this is a 3-part series, with part 2 now on the YT channel :)
Thanks. Where is the next episode?
next week!
How many languages are supported? Or is only English supported?
I've only tried English, but this technique should work with most languages. Try the notebook linked, and change the language and see what happens?
Love Tokenizers ❤️
:)
thank you
How can I run the same example on my Raspberry Pi?
Medogb Medo In exactly the same way, once you've installed Python and TensorFlow.
Can anyone tell me what the 'first principles' teaching method is?
"From first principles" means teaching with zero (or at least very few) assumptions
@@LaurenceMoroney Thank you, I hope I can figure it out.
'From first principles' could also mean from the smallest to the biggest, from the known to the unknown; basically, it's a way of breaking concepts down to their simplest form.
I want the TensorFlow track jacket that you're wearing.
can 1679 15223 2 153692 be a word?
04:15 is really misleading for anyone watching this as their entry to NLP. There are too many steps missing that need to be talked about in a 'Zero to Hero' tutorial series after this point, instead of jumping into sequencing. Even steps before this point.
I see why these aren't included (because they are not included in TensorFlow). But at the same time, this is just setting an unrealistic standard.
In machine learning terms, I'd say... this video is just mislabeled.
...and what are these steps? With these videos and the codelabs, we'll have everything we need to build a simple text classifier, the beginnings of NLP.
@@laurencemoroney655 Maybe he's referring to the cleanup required for the grammar (like someone pointed out: play vs. plays)? However, TensorFlow cannot include that in the library as he's suggesting, because TensorFlow is not an English-only library but a more generic one.
Cool
Thanks!
A question about ignoring the "!": it seems the Tokenizer doesn't include "!" because it is filtered out as punctuation. Let's assume we want to keep punctuation and set `filters=''` for the Tokenizer. In this case, the Tokenizer is not smart enough to separate the token "dog" from the token "!".
Here's the example in Colab colab.research.google.com/drive/1M6Nf-WQxorf_X9z2jFnCSJ_QjrY3i5BJ
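One possible workaround (a sketch, not the video's approach): pad the punctuation with spaces during pre-processing, then disable filtering, so '!' becomes a token of its own.
import re
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = ['I love my dog', 'You love my dog!']
# Surround punctuation with spaces first so it splits into its own token...
padded = [re.sub(r'([!?.,])', r' \1 ', s) for s in sentences]
# ...then keep everything by disabling the filters.
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts(padded)
print(tokenizer.word_index)  # '!' now appears as a standalone token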
How to detect the difference between "I love my dog" and "I love not my dog"?
Beyond the obvious "it has one more word", there are several approaches.
One that is fairly easy is to have a list of all words in a language with their connotations (this can be found online), one possible connotation being negation. Then, you can write code that inverts the connotation of a word if there is a word that implies negation near it.
If you have lots of sentences that are similar except for the word 'not', label them accordingly and then train a classifier like we do here; the 'not' would become a really strong signal towards the negative. Give it a try, instead of using the sarcasm dataset. The code would be very similar to this video.
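A toy sketch of the lexicon idea from the first reply (the word lists and scores here are made up for illustration; real connotation lexicons are available online):
# Tiny made-up connotation lexicon and negator list.
CONNOTATION = {'love': 1, 'hate': -1}
NEGATORS = {'not', 'never', 'no'}
def sentence_score(sentence):
    tokens = sentence.lower().split()
    score = 0
    for i, token in enumerate(tokens):
        value = CONNOTATION.get(token, 0)
        # Invert the connotation if a negator sits right next to the word.
        if any(t in NEGATORS for t in tokens[max(0, i - 1):i + 2]):
            value = -value
        score += value
    return score
print(sentence_score('I love my dog'))      # 1
print(sentence_score('I love not my dog'))  # -1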
What? You end it just when I was expecting you to at least say to put the 1s in the input layer. This is how you tokenize in general; nothing to do with AI.
Yay😅
Exam after 9 hours (TT)
The answer was simple all along. It's just dog
Too late to the party, TensorFlow!! It's not 2010. Love the video though, thanks 😎
Ha! I can only produce so many....
This is so complicated
Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.
@@laurencemoroney655 It's great. When will we get the next tutorial?
@@balachkhan1578 We're releasing them weekly
@@laurencemoroney655 TensorFlow can't be installed with Python 3.8. Will the issue be solved, or should I switch to Python 3.7?
@@balachkhan1578 It's constantly being updated...so keep an eye on www.tensorflow.org/install. Right now it's up to 3.7 on there.
Very difficult to learn.
Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.
@@laurencemoroney655 It is a bit complicated the first time. But taking up a small dataset/project for NLP and then revisiting the video makes everything a lot clearer. Plus you pick up on things that slipped your mind the first time :)
Thank you very much