This semester, I don't know how many times I've watched your videos. Thanks!
I'm so glad I found your YouTube channel. Is that you, Snowden? I'm really glad because I have a natural language processing presentation in class and a demo for the test. I really want to say thanks to you.
Would love to see more videos on NLP, keep up the great work! :)
Corpus (Singular)
Corpuses/Corpora (Plural)
Putting this out here for confused learners like me. I hope it helps!
These tutorials are great, thanks for sharing!
New follower/subscriber, and still really enjoying these old videos! Thank you! Do you have any videos or suggestions on how to take tokenized/stemmed words and put them in a dataframe for analytics use? Thanks again for the great content!
"Whoa. Everybody settle down." 😆
Wow, thanks man! Quick question: what if we do stemming first and then follow with PunktSentenceTokenizer? The past tense / present continuous forms would probably be stemmed down to present simple, so the tense might not be recognized by PunktSentenceTokenizer?
I am writing a program that recursively defines a word to a set recursion depth. The problem is that almost all words have multitudes of definitions, so I need to eliminate definitions that are not appropriate to the context. Knowing the POS will be handy to narrow down the possible definitions.
all of your tutorials are totally great :)
Hey I am quite perplexed with the following part of code :-
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
I mean, first "custom_sent_tokenizer = PunktSentenceTokenizer(train_text)" separates the text into sentences and stores the result in custom_sent_tokenizer; then what is "tokenized = custom_sent_tokenizer.tokenize(sample_text)" doing, and how does it work?
Please help!
'PunktSentenceTokenizer' is a class.
'custom_sent_tokenizer = PunktSentenceTokenizer(train_text)' creates an instance of 'PunktSentenceTokenizer' with the parameter 'train_text'; you can say it trains the tokenizer.
Then calling the method, i.e. 'tokenized = custom_sent_tokenizer.tokenize(sample_text)', tokenizes the sample text and stores the returned list in the variable 'tokenized'.
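That two-step usage can be sketched like this (the training and sample strings here are hypothetical stand-ins for the speeches used in the video):

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical stand-ins; in the video these are the 2005 and 2006
# State of the Union speeches loaded from nltk.corpus.state_union.
train_text = "The economy grew last year. We will keep working. Thank you all."
sample_text = "Tonight I report real progress. Our plan is working."

# Creating the instance trains the unsupervised Punkt model on train_text.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

# Calling .tokenize() on the trained instance splits sample_text into sentences.
tokenized = custom_sent_tokenizer.tokenize(sample_text)
print(tokenized)
```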
Hey, really enjoy your videos on Python; they have helped me out a lot! One question though: do you have a page on your site with the definitions for the different letter codes (like what you pasted in around 6:09)? That would be helpful to keep as a comment in some of my code for reference without needing to hunt for it each time. If you could post the URL, it'd be appreciated. Thanks!
+Anthony R Barberini pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/ list is posted there.
Thanks
I downloaded all code examples used in videos from github.com/pythonprogramming, which was very convenient :-)
Hi! Really liked your videos! Some questions though -
We could have used the simple nltk tokenizer for sentence tokenizing the way you did in video #2 right?
Also, I didn't get how PunktSentenceTokenizer is an unsupervised method if you provide it training data here. I don't see how train_text is used for training when it is essentially similar to the 2006 txt file. How does PunktSentenceTokenizer train on it, or is it a pre-trained model being used now?
More than anything, I'm amazed by your typing speed. :))
Can I ask why you catch the exception in the function you have written?
For the first time I didn't get a sentdex lecture. You should have used simple texts and explained POS tagging based on them.
I tried pos_tag() on sample lines, but the output shows letters tagged instead of words. Super confused. :(
Will we learn about using improper datasets in NLTK? Because most uses I have planned won't have people using "proper" English. For example, people could use kute for cute, or kewl for cool. Would this be handled in the stemming step?
Very, very interesting video series... I really love your tutorials. Please keep up the good work. Regards.
Thank you for the kind words!
Your videos are always great! I was wondering if you could use pos_tag in real time text input like with pynput?
Is it better to use PunktSentenceTokenizer rather than the default sentence tokenizer?
If yes, then why?
And what data should I train it with to tokenize a specific type of data (for example, medical records)?
Thanks for the video. I am new to all this NLP/NLTK with Python. I got an assignment asking me to work with two tagsets (Brown and Universal), divide the data into a train set and a test set, find the most common tag (NN), and use it as a baseline. All good there, but I am struggling to understand the second part, which is creating taggers that I should train on the training set and evaluate on the test set. How do I create a tagger? A random one? Then it says the taggers I am going to work with are default/affix/unigram taggers, etc., which confuses me a lot. Then it says these taggers can be created using the nltk.DefaultTagger(), nltk.AffixTagger(), nltk.UnigramTagger(),
nltk.BigramTagger(), and nltk.TrigramTagger() functions respectively. Finally: create four taggers, as isolated taggers and as cascaded taggers? It's confusing. Please help!
I want to know what is rule-based and what is machine learning. Were the tokenizer and stemmer in the previous videos rule-based? It's hard to find out since none of it is really labeled. What else in NLTK is rule-based? What else is model-based? I'd really appreciate a thorough and complete answer!
There is nltk.org, where you can find documentation for all the modules and even the source code itself.
Can you make a video showing how to train the pos tagger? Currently, the default one has a limited number of words in its dictionary, and most of them are not accurate. Is there a way to expand this dictionary and make it more accurate? If so can you link me to site which shows this or can you make a video on it? thanks
Why are we using PunktSentenceTokenizer here? We can simply tokenize the text and use nltk.pos_tag on it.
It would work great. Actually, NLTK's default sent_tokenize uses a pre-trained PunktSentenceTokenizer internally; he used a new sentence tokenizer just to show that there are more options.
okk, got it.
Is it possible to do this without sent_tokenize or PunktSentenceTokenizer: use word_tokenize only once, store the result in a list, and then do the tagging in the function?
Is corpus just the sample database? Because I've been curating my own.
Love you, sir, you are doing a very good job!
Hey HS,
thanks for the great content. I was searching but was not able to find the POS tag list for that corpus. It seems NLTK 3.0 changed things a bit.
Thanks for any advice.
Nevermind I have found it(www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Thanks anyway
Or here :
pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
love u sentdex bro
Could we tokenize the doc using the word tokenizer directly, rather than first using the sentence tokenizer and then the word tokenizer?
Hahaha, I was wondering the same thing.
It might make it harder to create chunks later if all the sentences are mixed together
Why should I use PunktSentenceTokenizer? A sentence tokenizer is already there. And what actually happens when we train using PunktSentenceTokenizer?
Really awesome work. This vid could also have started with a sample sentence and then moved on to state_union. Anyways, you rock; keep up the awesome stuff.
Great suggestion!
@@sentdex Hi, could you please tell me where I can find the files 2005-GWBush.txt and 2006-GWBush.txt?
Why should we use both sample and train texts? What would happen if we used only one of them?
I used only the train text and got a big output, and I couldn't compare the two outputs. Please help. ✌🏼 peace
Hi, thanks for sharing your videos on NLTK; I've found them super useful so far. Quick question: I'm completely new to programming and I'm trying to teach myself as I go along, so maybe for someone experienced this is a quick fix. When I ran the function at 5:10, I got a list of numbers instead of words. Why might that be?
Thanks! :D
Hmm, I am not sure. Try comparing your code to mine: pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
Thanks! I did catch a few mistakes I made, and now I get the POS tags! But I still get the numbers before the tags, haha. Will try again. Thank you for such a prompt reply!
Hi,
Can you make an extension of the video and create a tabular view of the POS tags?
Hi! Thanks for your videos! Do you teach how to do distribution of collocations and keywords from a big dataset (excel)?
I did not get anything printed, nor did I get an error message. Any suggestions?
Please help me. I got the error " 'module' object is not callable"
When I run the program it simply says 'global name 'tagged' is not defined'. Any ideas on what this may be?
I don't get an error, but the output sequence is displayed as numbers, not words.
Any solution?
Me too. Is there any solution for that type of error?
You must have forgotten to call the function!
I'm kind of confused about the use of the two texts (2005-GWBush.txt and 2006-GWBush.txt).
Q : what is the use of PunktSentenceTokenizer() ?
A : It identifies sentence boundaries, just like a sentence tokenizer used in the previous videos.
Q : Why are two texts used? I couldn't understand the use of the 2005 & 2006 texts.
A : Well you want to train the tokenizer first, so you have to train it on different data than you want to use it. In order to get valid results, training set and testing set have to be different.
... still confused.
Where do you find the corpus to see all the .txt articles/documents inside it, so we can reference or use them as needed? Thanks for the vid series.
Got an error while tokenizing a different text:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I don't know how to solve it. Need your help, please.
What is the use of train_text? I'm passing train_text as None and I still get the same, correct output on the sample text.
Sir, why is 2005-GWBush.txt not on my computer? Or is it built into Python?
Love the content, but I couldn't figure out how to use this with a dataframe as opposed to a raw text file. I assume you would need a train_test_split, but even then I didn't get it all together.
How can I extract all POS-tagged words by noun, adjective, etc., based on which are most frequently shown?
process_content() is showing
"'module' object is not callable".
How do I resolve this?
It is a bit confusing why we are using PunktSentenceTokenizer and what it does.
Sir, can you tell me what the syntax and rules in Python are for transforming sentences?
And if you feed in a text.txt with misspelled words, what happens? Thanks.
Hi! Is it possible to create my own custom tags? Also, I am trying to extract certain relations based on my own entities, how exactly can I do that? Thanks!
Just wondering, is there any natural language processing library for the C language?
Why didn't you tokenize by words at the beginning?
Can anyone tell me where he got that POS tag content he pasted, which shows the tags and their corresponding meanings?
stackoverflow.com/a/15389153/12645703
How do I filter or find only the words with a specific tag? Say I only want to find or show the words with the 'NN' tag. Is there a function that handles that already, or is there a way to write an if statement to accomplish it?
+Mystic Bane nvm found it
print([i for i in tagged if i[1] == 'NNP'])
Why have you used "for i in tokenized[:5]" instead of just using the tokenized variable? Why take just the first five sentences?
What is the exception code here? Can you explain the except block?
Isn't there a switch to print out the full parts of speech in the tuple?
Thank you so much for sharing!
Hi, i have this sentence for POS tagging - "What would a Trump presidency mean for current international master’s students on an F1 visa?". When I pass it through PunktSentenceTokenizer, it shows me "math domain error".... Not able to find out why :(
And why are two texts used? I couldn't understand the use of the 2005 & 2006 texts.
+Torrtuga Nooh well you want to train the tokenizer first, so you have to train it on different data than you want to use it. In order to get valid results, training set and testing set have to be different.
Matus Miklos, what if I get words that I didn't train on? Will those be considered new words?
@@harikrishnamalyala6214 Maybe that's why he trained on a similar kind of data, so that most words are already recognized.
Is there a way to go a level higher and start to guess what the true subject and object of a sentence is? So that you could start to tell what the Context is, and when the Context changes?
LazerPotatoe Yes. Subject / object can be done pretty simply with named entity recognition on a basic level. It gets complex fast, but people do it.
Thank you very much for you awesome videos! Where can I find resources to perform a POS tagging on other languages? Such as italian?
Sketch Engine will do POS tagging for you. It is a web-based application.
So I was trying different sentences with pos_tag() and I noticed that in sentences like "Check my email" or "Climb the tree", the verbs "Check" and "Climb" are never identified; instead they are tagged as NNP or NN.
Is there a way to solve or improve this?
Hey, I am trying to run your part-of-speech tagging code. Unfortunately, it gave me a URLError. I looked it up online; people were saying it is due to NLTK 3.2, and they suggested forcing the version back to 3.1. I am not sure if that is the right way to make the code work. Wondering if you can help me with that. I am using Windows 10 with Python 3.5 and NLTK 3.2.
I met the same issue. Deleting NLTK 3.2 and installing NLTK 3.1 solved the problem.
stackoverflow.com/questions/35827859/python-nltk-pos-tag-throws-urlerror
What method is this tagger using? Is it a log-linear model or an HMM?
After 5 minutes of this video, instead of a stream of words I am getting this error when I run the code:
"expected string or bytes-like object"
Please help me out. Thanks.
Can you do it with the nltk NgramTagger method? I'm getting some words tagged as None; is that a training set problem?
I am not getting output in the form of words but in the form of single letters, like [('G', 'VB'), ('O', 'NN')] instead of 'GO'.
Is it possible to achieve multilingual NLP? I mean, in places where I come from, we have too many regional languages and people usually mix different languages even in a simple sentence. So I want a system that can understand what the speaker is trying to say irrespective of the language.
Hey, how do you get the list of all the part-of-speech tags and their descriptions? What code did you use to get that output at 6:11? Thanks!
I am trying to extract only the noun words from the tagged words.
The error being shown is 'too many values to unpack'.
Do you have working code?
Great videos... thanks for all your efforts.
I have one issue: the graph is not displaying for me even after installing matplotlib. Basically nothing happens when I call "chunked.draw()". Do I need to do anything after installing matplotlib? Do I need to refresh anything or something? Please let me know.
Update the libraries and modules.
Hi, I am using Python 3.3.2 but it doesn't support PunktSentenceTokenizer. What should I do?
What if you need to do this in a different (human, not programming) language? It's nice to have resources for English, but what if you don't have any resources for another language? You would need to create these resources, which is insane for the full vocabulary of a language.
So what's the PunktSentenceTokenizer?
Hi, thanks for all the tutorials. I get the error message "unindent does not match any outer indentation level". Why? Thank you.
+M.Ahmet demirtaş indentation is very important with Python. If you aren't following indentation rules, you get errors like this. Compare your indentations to mine, sample code can be found here: pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
Thanks
What is the use of POS tagging?
tokenizer= custom_sent_tokenizer.tokenize(sample_text)
What does this line do?
Is it necessary to train the PunktSentenceTokenizer?
Because I used sent_tokenize and got the same result!
So what exactly is the difference?
+Mohanish Nehete the sent_tokenizer is pre-trained already, sentdex just wanted to show you how to train it by yourself, if you need it.
@@WillakeMattis What do you mean by pre-trained? Thanks anyway.
Where can I get those two txt files?
Does nltk.pos_tag() look forward and backward in the list to figure out whether a word like "running" is a gerund (noun) or a verb?
Just tried it, and, yes, Punkt does seem to recognize the difference. w00t.
what is the use of PunktSentenceTokenizer() ?
It identifies sentence boundaries, just like a sentence tokenizer used in the previous videos.
Ergo... a tokenizer, lol. Just a module within the tokenize side of the lib.
It's late, but let me clarify, so that it can be used by others.
stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk
^that was useful thanks
Hello, nice explanation, thank you.
What is the use of part-of-speech tagging?
Do you know where we can find the files 2005-GWBush.txt and 2006-GWBush.txt used in this video?
Thanks for sharing. In the case where the tag is "None", how can you add the unknown word?
Why is Punkt unsupervised? It uses a training data set, right? So it should be a supervised method!
This is not a "real" training set because it's just text, not delimited into sentences. So no one told Punkt where the correct sentence boundaries are; it only receives raw input and trains itself on it. I don't understand how it does that; it's a mystery to me. Harrison should have explained it more in depth in the video, IMHO.
Here training_text is just raw text without any information or LABELS. If it were supervised, we would need to give it information or LABELS (machine learning jargon).
Labels are feature information, like "if there is a period, it's the end of a sentence".
Thanks for the responses, everyone. Understood now!
How can I print a complete list of tags for reference?
No module named 'nltk.corpus'; 'nltk' is not a package
Did you name your file, any local file, or folder nltk? If so, change the name because you're importing that instead of the actual NLTK package.
How do I add a new word to the tagger? Let's say an organization is recognized as NNP but not as ORG. How can I add that organization to the default NLTK list?
For that, try using the named entity module.
Hi, how can we apply this to all the rows in a dataframe?
How can we use tagging to count the number of verbs, nouns, etc. in a sentence?
Use a for loop to iterate through the tagged list. For each (word, tag) pair, compare the tag (not the word): if it is a verb tag, do verb_count += 1.
Just loop and count the tags in a sentence for the specified part of speech (verb/noun).
I have a problem with state_union; it says there's no such file or directory.
nltk.download(), did you run this?
Yes, I did.
nltk.download('state_union')
Traceback (most recent call last):
File "H:/Python programs/speech_tagging.py", line 9, in
tokenized=custom_sent_tokenizer(sample_text)
TypeError: 'PunktSentenceTokenizer' object is not callable
This is the error I'm getting. Can someone help me with this?
Wow, thanks so much man! I don't even know what this is, but I'll give it a try XD
Why did you tokenize twice?
What was the use of PunktSentenceTokenizer?
I didn't understand the code inside the function definition.
Great video, thanks! Does anyone here know how to search within a tagged corpora? For example, if, after processing the text so it is all tagged, I wanted to find every instance of noun-"is"-noun, how would I do that? Thanks :)
For some reason, a series of numbers of this type was printed before the tagged speech: (0.0006539153179663233 0.0029498525073746312 0.0005192107995846313 6117 339 4 1)