Stop Words - Natural Language Processing With Python and NLTK p.2

  • Published: 29 Jan 2025

Comments • 179

  • @galgilor3241
    @galgilor3241 5 years ago +11

    Thank you very much for taking the time to produce this awesome tutorial! I just wanted to mention that the stop words are all lower case. Thus "A", "The", etc. require more preprocessing, either by creating your own stop words or by lower-casing all the text.
    Thank you!
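
    The point about lower-case stop words can be illustrated with a minimal sketch. The small `stop_words` set and the pre-tokenized `words` list are stand-ins chosen for illustration (assumptions), so the snippet runs without NLTK data; with NLTK installed the set would come from `stopwords.words("english")`.

```python
# Stand-in stop list (assumption): with NLTK installed this would be
# set(stopwords.words("english")), whose entries are all lower case.
stop_words = {"a", "an", "the", "is", "this", "off"}

# Pre-tokenized sentence; in the video the tokens come from word_tokenize.
words = ["This", "is", "a", "sample", "sentence", ",",
         "showing", "off", "The", "filter", "."]

# Case-sensitive check: capitalized stop words slip through.
case_sensitive = [w for w in words if w not in stop_words]

# Lower-casing only for the comparison keeps the original token text.
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

    Lower-casing inside the comparison, rather than lower-casing the whole text, leaves the surviving tokens untouched, which may matter if capitalization is used later (for example to spot names).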

  • @cemsarer9690
    @cemsarer9690 4 years ago +1

    I love this guy's tutorials. For any DS-related tutorial I search for, I always come here first. And boom, he has a tutorial about it. Awesome.

  • @吴琼-c9i
    @吴琼-c9i 9 years ago

    Your video is awesome! Although I'm a Chinese and actually sometimes can't catch up with you, I like it very much and your expressing style attracts me. I've learned a lot from it and I'll keep going. Thanks!

    • @sentdex
      @sentdex  9 years ago

      +吴琼 Great to hear you're enjoying!

  • @pedroportela7304
    @pedroportela7304 5 years ago

    I don't know if you will see this, but your videos are so good.
    You make these subjects not fucking boring.
    Ty so much

  • @ngollo3570
    @ngollo3570 9 years ago +6

    Oh man, thank you so much! After a lot of theory and math I can finally do some practical stuff.

  • @buckbuck2773
    @buckbuck2773 1 year ago

    Useful for getting content carrying words like if you were going to do meta tags for a website.

  • @seanraf2554
    @seanraf2554 9 years ago +1

    Your videos are extremely informative and easy to follow. Keep up the good work man!

  • @chrisv5330
    @chrisv5330 8 years ago +24

    You're putting a 'd' after the 't' in "sentence" because you're used to writing "sentdex"!

  • @georgewang7770
    @georgewang7770 7 years ago

    What a great series of tutorials. Thanks a lot.

    • @sentdex
      @sentdex  7 years ago +1

      Happy to share!

  • @kantshribaronia3513
    @kantshribaronia3513 8 years ago +10

    Subtitles are hilarious!
    nltk jotted down as "auntie care" lol

    • @sentdex
      @sentdex  8 years ago +7

      They usually are good for a laugh!

  • @PaoloPignatelli
    @PaoloPignatelli 8 years ago

    Stop Words and their information content are fascinating from the perspective of a translator - and even more so, a simultaneous interpreter. I know that this is "applied" , but even from a Shannon pov, I would be curious about experiments cutting out "stop words" and seeing "even" a machine translation difference (entropy).

  • @badarraja8107
    @badarraja8107 5 years ago

    im from pakistan. your videos are very informative and helping me. thank u sir. Allah bless u. always be happy :)

  • @Neceros
    @Neceros 9 years ago

    Wow, you could use this in almost any application.

  • @Dexteritye
    @Dexteritye 7 years ago +1

    Another one liner I can think of is:
    set(words) - stop_words
    But that would remove duplicated non-stopwords though.

  • @ashrafal-warraquiy6614
    @ashrafal-warraquiy6614 1 year ago

    Great, thanks! Could you please update the NLTK playlist to cover the new articles that have been added to NLTK? Thanks in advance.

  • @umdbest001
    @umdbest001 4 years ago

    actually "sentence" matches with "sentdex", that's why there is that d again
    love you sir
    big time fan of your channel

  • @jankipatel118
    @jankipatel118 4 years ago

    great teaching and funny way.. it's awesome

  • @viditkhanna3721
    @viditkhanna3721 6 years ago

    Appreciate short duration videos.

  • @GRENKEChess
    @GRENKEChess 5 years ago

    Is there a nice German stop words list out there?

  • @TJ-wo1xt
    @TJ-wo1xt 3 years ago

    great one sentdex

  • @grow3384
    @grow3384 6 years ago

    great video. nlp learners need you ;)

  • @janakeeherath6808
    @janakeeherath6808 4 years ago

    Thank you! It is really helpful

  • @pawandeepsingh2155
    @pawandeepsingh2155 9 years ago

    Your videos are awesome man. keep it up ...

  • @yummiem1811
    @yummiem1811 5 years ago

    I love you!!!!!! Ur videos are really helpful and easy to learn!

  • @yuvrajkadam
    @yuvrajkadam 9 years ago

    Thank you mate. The videos are really helpful.

  • @sbenavides00
    @sbenavides00 7 years ago

    Awesome. Also works for stop words in another language (spanish)!

  • @ShivamSingh-bx5lg
    @ShivamSingh-bx5lg 4 years ago +1

    filtered_sentence = set(word_tokenize(example)) - set(stopwords.words("english"))

  • @rookiebounty4394
    @rookiebounty4394 9 years ago

    Hi, I have been following your tutorials for a while now. They make complex concepts easy to understand and don't drown you under a wave of theory before you can grasp anything, so thanks for making these tutorials.
    I noticed that you mostly only cover topics you are using, but I thought I'd still ask whether you could do a tutorial on distributed computing with multiple nodes and, as an extra, a proper tutorial on multithreading. Most tutorials I have seen limit themselves to 5 or so threads instead of using the maximum amount. The functions the threads call have nothing to do with each other, meaning they don't attempt to reach a common goal and do not return any values. There don't seem to be any threading tutorials that show how to practically use threads for performance. It would be great if you could expand on multithreading where other tutorials don't.
    Thank you

    • @sentdex
      @sentdex  9 years ago

      Rookie Bounty Distributed computing has already been covered with MPI: pythonprogramming.net/learning-use-mpi-python-mpi4py-module/

  • @juanzamorarey7633
    @juanzamorarey7633 4 years ago

    Thank you very much, easy to understand, easy to download and very well explained. And english is not my mother tongue. Thanks

    • @sentdex
      @sentdex  4 years ago

      Glad to hear that

  • @shekhariyer3819
    @shekhariyer3819 1 year ago

    Can you explain the line "for w in words : if w not in stopwords:" at 5:52 ?

  • @prawigya
    @prawigya 6 years ago +1

    Why did you use a set to get the stop words if stopwords.words("english") is already a list? Is there any reason behind that?
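
    One likely reason is lookup cost: `w not in some_list` scans the list item by item, while `w not in some_set` is a hash lookup. The sketch below compares the two on a made-up stop list (`stop_list` and `words` are assumptions standing in for `stopwords.words("english")` and real tokens); actual timings will vary by machine, so none are claimed here.

```python
import timeit

# Stand-in for stopwords.words("english") (assumption: a plain list of
# lower-case strings, padded with dummy entries to give it some length).
stop_list = ["word%d" % i for i in range(200)] + ["the", "is", "a"]
stop_set = set(stop_list)

words = ["the", "quick", "brown", "fox", "is", "a", "fox"] * 1000


def filter_with(container):
    """Return the tokens in `words` that are not in `container`."""
    return [w for w in words if w not in container]


# Both containers give the same filtered result...
same = filter_with(stop_list) == filter_with(stop_set)

# ...but each `in` test scans the list, while the set uses a hash lookup.
list_time = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=5)
print(same, list_time, set_time)
```

    For a single short sentence the difference is negligible; it starts to matter when filtering a large corpus token by token.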

  • @TriptiAgrawalMCA
    @TriptiAgrawalMCA 7 years ago

    Very nice tutorial. I have a question regarding how to create dynamic stop word list or how we can add new words in stop word list?

  • @fazleysanusi1463
    @fazleysanusi1463 7 years ago

    this help me a lot. thanx bro

  • @rossmoffitt7364
    @rossmoffitt7364 6 years ago

    Thank you sir. This is very helpful.

  • @Stevesteacher
    @Stevesteacher 4 years ago

    Thanks dude :)))

  • @vsprout4812
    @vsprout4812 8 years ago +5

    I Love your videos, Super...... enjoying... Thank you

    • @sentdex
      @sentdex  8 years ago +2

      +Umesh D very welcome!

  • @sherazr
    @sherazr 4 years ago

    Great work on the videos. Can you show how to search through a text file using a theme? All related words to that theme are counted and displayed... If you have already done a video on this, please point me to that video... Thanks a million...

  • @shokrymohamed3127
    @shokrymohamed3127 9 years ago +2

    your video is awesome!
    thank you

    • @sentdex
      @sentdex  9 years ago +3

      +Shokry Mohamed Happy to share!

  • @bhargabsarma9030
    @bhargabsarma9030 6 years ago

    thanks... very nicely explained

  • @traildataanalytics407
    @traildataanalytics407 6 years ago

    Great one !

  • @chrisguttler7897
    @chrisguttler7897 5 years ago

    Thank you very much for your great videos on NLTK. I am trying to apply it to my German corpus.
    Is there a special reason to tokenize first and then remove stop words, instead of the other way around: first remove punctuation and stop words and then tokenize? Thanks a lot again!

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 4 years ago

    Nobody who knows what they are doing uses NLTK any more. Could you do this over with spaCy or TensorFlow?

  • @chriscasella2903
    @chriscasella2903 4 years ago

    Love these videos, you did a great job. Thank you!

  • @saurabhav4202
    @saurabhav4202 7 years ago

    Hey! Nice video series, really helpful. A doubt: you said we could train the data according to our own corpora. Are the corpora supposed to be standard? Can we use any text of the similar type? Say I need models on Resume data, is it okay to train the models using a bunch of Resumes only?

  • @apekshatadge5875
    @apekshatadge5875 6 years ago +1

    Sometimes removing the stop words can change the meaning of a sentence, for example with words like "cannot" and "not". How can we handle this? Help needed.

  • @Neceros
    @Neceros 9 years ago

    Question: Does one import nltk.corpus instead of nltk to limit memory usage?

    • @sentdex
      @sentdex  9 years ago +2

      Neceros well, we actually do from nltk.corpus import stopwords
      We do that because we just want to reference stopwords, and nothing else. The corpus is quite large, and yeah... you wouldn't want to import everything from it by doing something like from nltk import corpus. So, instead we do from nltk.corpus import stopwords ...or whatever specific corpora we want.

  • @nuthalapativarun1871
    @nuthalapativarun1871 8 years ago

    You're AWESOME.

    • @sentdex
      @sentdex  8 years ago

      +Nuthalapati Varun You too!

  • @softwaredeveloper4304
    @softwaredeveloper4304 6 years ago

    When I compare the word to the stop words the loop removes the letters, not the word.

  • @danielinformatic
    @danielinformatic 8 years ago

    Very good!!!!

  • @frankolsen144
    @frankolsen144 4 years ago

    Hi sentdex
    First of all, great videos! I'm doing a sentiment analysis for the first time and I want to analyse Twitter data. I have all the tweets in a CSV file with comma-separated columns (source, text, datetime, etc.). I can follow the steps you explain in the video, where you create your own sentence and then tokenize it or remove stop words, but how can I apply that to a file, where I need to iterate through the rows of the CSV file and use only the column with the text (the tweet)?
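
    One way to approach this, sketched with only the standard library: read the rows with `csv.DictReader` and filter the `text` column of each row. The in-memory CSV, its column names, the tiny `stop_words` set and the use of `str.split()` in place of `word_tokenize` are all assumptions made so the example is self-contained.

```python
import csv
import io

# Hypothetical stand-in for the tweets file; with a real file use
# open("tweets.csv", newline="", encoding="utf-8") instead of StringIO.
csv_file = io.StringIO(
    "source,text,datetime\n"
    "web,This is a great day,2020-01-01\n"
    "mobile,The weather is not good,2020-01-02\n"
)

# Stand-in stop list (assumption); NLTK's English list is much longer.
stop_words = {"this", "is", "a", "the"}

filtered_rows = []
for row in csv.DictReader(csv_file):
    # str.split() is a crude tokenizer used so the sketch needs no NLTK data.
    tokens = row["text"].split()
    filtered_rows.append([t for t in tokens if t.lower() not in stop_words])

print(filtered_rows)
```

    With NLTK available, `row["text"].split()` would be replaced by `word_tokenize(row["text"])` and the set by `set(stopwords.words("english"))`.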

  • @shrinivasiyengar5799
    @shrinivasiyengar5799 6 years ago

    Hi. In your previous video, you had an example text "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You should not eat cardboard." When you run the same "stop-word-removal" code on this text, the output is: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'Python', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', 'eat', 'cardboard', '.']
    Don't you think this just omits much of the important parts of the text?

    • @josechavezuriarte9224
      @josechavezuriarte9224 6 years ago

      If you put all the tokenized list elements to a wordcloud the elements that can give the greatest insights are prioritized. This is easy to see when you have more than just a sentence because you are able to find some interesting patterns...

  • @TheDeadking100
    @TheDeadking100 8 years ago

    Hi sentdex, I am working on a project which involves creating a short story generator. Can NLTK be used for this? The general concept is the program will take in a few words from the user, and with those words, the program will generate sentences/ a short story using the application/ meaning of those words with some form of grammatical sense. Is this possible with NLTK? Thanks

  • @mohamednibras53
    @mohamednibras53 7 years ago

    Good explanation. Thank you. In your example, "between" and "during" are listed as stop words, but in my case I don't want them to be stop words. Can I remove them from the predefined NLTK corpus?

    • @rizwandudekula4652
      @rizwandudekula4652 7 years ago +1

      stop_words = set(stopwords.words("english")) -- this line returns a SET of stop words. So to add a new stop word to the set you can use : stop_words.add("my_wanted_stopword") , similarly you can remove unwanted stopword from the set using : stop_words.remove("my_unwanted_stopword")
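
      The set operations described above can be sketched as follows. The starting set is a small stand-in for `set(stopwords.words("english"))` (an assumption), and `discard` is used instead of `remove` only because it does not raise `KeyError` when the word is missing.

```python
# Stand-in for set(stopwords.words("english")) (assumption).
stop_words = {"a", "the", "between", "during", "is"}

# Add project-specific stop words, one at a time or in bulk.
stop_words.add("rt")
stop_words.update({"via", "amp"})

# discard() removes a word and, unlike remove(), does not raise
# KeyError when the word is absent.
stop_words.discard("between")
stop_words.discard("during")
stop_words.discard("not-in-the-set")

words = ["rt", "the", "gap", "between", "sessions", "during", "lunch"]
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
```

      These changes affect only the in-memory set, not the stop-word file that ships with the corpus.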

  • @logosfabula
    @logosfabula 7 years ago

    Thanks for the great video! What about a third option for filtration? Set Difference in Python should go as easy as:
    filtered_words = words - stop_words
    But I guess we're missing the word order in the sentence.
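
    A small sketch of what set difference does here, using a stand-in `stop_words` set and a hand-made `words` list (both assumptions). Python sets are unordered and hold each element once, so the result drops repeated words and its iteration order is not guaranteed to follow the sentence.

```python
stop_words = {"the", "is", "a"}  # stand-in stop list (assumption)
words = ["the", "fox", "is", "a", "fox", "in", "the", "den"]

# A list has no "-" operator with a set, so this raises TypeError.
try:
    words - stop_words
    list_minus_set_works = True
except TypeError:
    list_minus_set_works = False

# Set difference: stop words removed, but duplicates collapse and
# iteration order is not guaranteed to match the sentence.
as_set = set(words) - stop_words

# List comprehension: keeps order and repeated words.
as_list = [w for w in words if w not in stop_words]

print(list_minus_set_works)
print(as_set)
print(as_list)
```

    Set difference is fine when only the vocabulary matters; the list comprehension is the safer choice when word order or word counts are needed afterwards.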

    • @sourabhsharma7805
      @sourabhsharma7805 2 years ago

      "words" isn't even a set. First convert "words" to set with "set()" then proceed with "set(words) - stop_words". Also, no it's not going to mess the order.

  • @AbhishekTiwari-xk5go
    @AbhishekTiwari-xk5go 7 years ago

    Sir, how do I remove special characters from a string? Please make a video on it, it would be really helpful.

  • @vadivel4846
    @vadivel4846 5 years ago

    Pretty simple and clean. Keep it up. If possible, please reduce the keystroke sound.

  • @mohammadzamandehbashi5416
    @mohammadzamandehbashi5416 6 years ago

    thank you man

  • @vineethkanaparthi785
    @vineethkanaparthi785 8 years ago

    how can I further remove the punctuation like commas,dots etc?
    Like, Is there any method to do that?

  • @Bpstive
    @Bpstive 8 years ago

    The stop_words Python package has more stop words than nltk.stopwords. Which is recommended?

  • @homerobaroni1655
    @homerobaroni1655 6 years ago

    Hello, I'm still trying to understand these w letters between the "for" and "in". Are they specific to nltk?

  • @siddharthkakade9989
    @siddharthkakade9989 1 year ago

    Hi, can you help me with a small project I am working on? I would be very grateful. I have two data sets, bugid and feature list, and I have to predict the bugid using nltk: I have to understand what the feature is and correlate it to the bugid.

  • @vaibhavshah4880
    @vaibhavshah4880 8 years ago +5

    Shouldn't it be like:
    filtered_sentence = [w for w in words if not w.lower() in stop_words]
    ?

    • @SandeepKumar-oc2ox
      @SandeepKumar-oc2ox 7 years ago

      yes you are right. "this" is a stop word which will be removed. Any idea of removing the last word ie: " . "

    • @quadeershaikh2884
      @quadeershaikh2884 6 years ago

      use the regular expression library of python named re

  • @asmeratadesa5989
    @asmeratadesa5989 4 years ago

    What about another language? After I define the stop words for my language, it can't remove them. Please help me.

  • @LConstP
    @LConstP 7 years ago

    The NLTK data repository seems to be down. Requests to download the stopword list result in 405s.

  • @shravankumarshetty9012
    @shravankumarshetty9012 6 years ago +1

    Hey I'm getting an error as first download nltk.download("stopwords"). But once I do it, still getting an error as, 'WordListCorpusReader' object has no attribute 'word'
    UPDATE: it's working fine, once I did nltk.download("punkt")

  • @borispsalman
    @borispsalman 4 years ago

    I'm new to NLP, but this makes me wonder why 'and' is considered an irrelevant stop word. In maths and logic, on the contrary, 'and' and 'or' carry significant meaning.

  • @trieule2012
    @trieule2012 6 years ago

    Can I work with other languages in NLTK, or does it only apply to English?

  • @sukumarh3646
    @sukumarh3646 8 years ago

    I believe the stop words should be only 'a', 'an' and 'the'. The context is lost if prepositions and auxiliary verbs are dropped. It would result in less accurate training and hence a bad output.

    • @sentdex
      @sentdex  8 years ago

      There are often multiple operations that go on. We also get to doing things like lemmatizing words, which removes tense, for example. Yes tense has meaning, but, when we're determining pure definition meaning, we want only words of value, then we can later go back and concern ourselves with things like tense.

  • @LYewJie
    @LYewJie 9 years ago

    Awesome tutorial! Just want to know more, how can I remove symbols such as ", . ? ! ( )". Tried to append the list, but it didn't work.
    Any advice? Thanks in advance!

  • @ghezalahmad
    @ghezalahmad 7 years ago

    Hi, I am new to Python. I have a question: you imported the English stop words, but I am working on the Dari language and I have my own list of stop words. How can I import that? Thanks

  • @Brickkzz
    @Brickkzz 8 years ago

    Hi, is there a way of using nltk to split words? (i.e. splitword -> split word) Perhaps using a dictionary?

  • @hadibuxmahessar8903
    @hadibuxmahessar8903 3 years ago

    thanks

  • @nomadsoulkarma
    @nomadsoulkarma 9 years ago

    I wonder how much trouble it would be to write a py/nlp algorithm that would segregate unstructured from structured data so that the sensitive data remains in the RDB and the pass-through stuff goes into JSON perhaps?

  • @Cherry-jr2kq
    @Cherry-jr2kq 8 years ago

    awesome tutorial! just want to know: how can I understand and remember the codes?

  • @hfrnd-hu2kz
    @hfrnd-hu2kz 8 years ago

    You're leet with nltk bro! Question: is nltk useful when creating bots for api.ai, alexa, luis, etc.? I know the answer is probably yes, I just have no idea where or how to parse one with the other lol... any tips or good reference material?

  • @mariacamiladurangobarrera2821
    @mariacamiladurangobarrera2821 4 years ago

    Thanks, crack.

  • @sofianfadli7910
    @sofianfadli7910 7 years ago

    Hello, Bung. I like your video very much. Your explanation is very clear. But I want to ask: where is the file that contains the English stop words located? I'm curious. Thank you very much...

    • @pradeepgowda89
      @pradeepgowda89 7 years ago

      Hi Sofian, I assume you have installed the NLTK corpus on your system. If so, please execute the Python code below to find the paths where your corpus data (this contains stop words, grammars, corpora, stemmers, etc.) is saved. Thank you!
      import nltk
      print(nltk.data.path)

    • @sofianfadli7910
      @sofianfadli7910 7 years ago

      Thanks!!!You are very helpful... :)

  • @gokulsundeep3610
    @gokulsundeep3610 6 years ago

    6:10 how did you do that commenting?

  • @rakeshlr3835
    @rakeshlr3835 5 years ago

    Sir, how to remove stop words from excel sheet, please make a video for it.

    • @muzammilmajeed2532
      @muzammilmajeed2532 5 years ago

      store it in pandas Dataframe and then it will be very easy to remove

  • @bankingstudy7377
    @bankingstudy7377 7 years ago

    Please tell me what this is:
    from nltk.corpus import stopwords
    When I type this it gives: Resource 'corpora/stopwords' not found
    Also, how can I load a .txt file containing Twitter data into Python for data extraction and removal of stop words?

  • @Heyhihello001
    @Heyhihello001 7 years ago

    not sure why the error ' 'filtered' is not defined ' keeps coming up. Im using jupyter notebook if that makes a difference

  • @yosephsolomon7536
    @yosephsolomon7536 6 years ago +6

    hehehe, "there's that D again...."

  • @akshitmadan5232
    @akshitmadan5232 4 years ago

    Cool vro !!

  • @bharatalam1815
    @bharatalam1815 7 years ago

    Sir, I have a problem with my code: the script stops running when I execute it, and it says "Python has stopped working". Please help me, sir.

  • @karanhebbar4567
    @karanhebbar4567 8 years ago

    How is 'not' a stop word?
    It does change the meaning of the sentences, right?

    • @sentdex
      @sentdex  8 years ago

      Any word that we want to either ignore, or stop bothering to read the sentence is a stop word. Different people might have different stop words for different projects.
      It's also highly likely that you're running multiple NLP algorithms when parsing text.
      If you're trying to get absolute meaning of a sentence, then you're trying to parse for grammar, sentiment, and word definitions. Each of those 3 things might have a different algorithm. For grammar, the word "not" is not useful in tasks like trying to figure out which words modify which. For sentiment, sure, you want to make sure that word is flipping the value of the word it modifies, but only after you know which word is actually supposed to be modified.

    • @karanhebbar4567
      @karanhebbar4567 8 years ago

      Okay,I got it!
      I know you're a busy person,Thank you so much for the reply!

  • @cwinhall
    @cwinhall 8 years ago

    Hi Harrison,
    Hoping you can help me out here... I'm getting this error...
    NameError: name 'stopwords' is not defined
    I have copied the code exactly as written in the video and I receive this error. What have I done wrong?

    • @dirtyybiird
      @dirtyybiird 8 years ago

      Is there a typo in your import reference of "stopwords"?

  • @okao08
    @okao08 7 years ago +2

    if you could only mention how to add our own stop words the video would have been perfect

    • @rizwandudekula4652
      @rizwandudekula4652 7 years ago +4

      stop_words = set(stopwords.words("english")) -- this line returns a SET of stop words. So to add a new stop word to the set you can use : stop_words.add("my_wanted_stopword") , similarly you can remove unwanted stopword from the set using : stop_words.remove("my_unwanted_stopword")

    • @ghezalahmad
      @ghezalahmad 7 years ago

      How do I specify the path? My stop words file is on the Desktop. You mentioned simply using add; or do I have to pass the path as a parameter?
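
      A minimal sketch of loading a custom stop-word list from a text file, assuming one word per line. The file is created in a temporary location here so the example is runnable; a real path (for instance something like `os.path.expanduser("~/Desktop/my_stopwords.txt")`, a hypothetical name) would be passed to `open` instead.

```python
import os
import tempfile

# Hypothetical stop-word file with one word per line, written to a
# temporary location so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("foo\nbar\n\nbaz\n")
    path = tmp.name

# Read the file into a set, skipping blank lines.
with open(path, encoding="utf-8") as f:
    custom_stop_words = {line.strip().lower() for line in f if line.strip()}

os.remove(path)  # clean up the temporary file

words = ["foo", "keeps", "Bar", "away"]
filtered = [w for w in words if w.lower() not in custom_stop_words]
print(filtered)
```

      Opening the file with `encoding="utf-8"` matters for non-Latin scripts; the resulting set can also be merged into an existing one with `stop_words.update(custom_stop_words)`.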

  • @vegancyberpunk
    @vegancyberpunk 7 years ago

    how can i add custom stopwords to this set

    • @rizwandudekula4652
      @rizwandudekula4652 7 years ago +1

      stop_words = set(stopwords.words("english")) -- this line returns a SET of stop words. So to add a new stop word to the set you can use : stop_words.add("my_wanted_stopword") , similarly you can remove unwanted stopword from the set using : stop_words.remove("my_unwanted_stopword")

  • @demonlord4712
    @demonlord4712 7 years ago

    how do you comment multiple lines by selecting ? can you tell me shortcut key

    • @shashiraj3865
      @shashiraj3865 6 years ago

      windows = ctrl+'/'

    • @demonlord4712
      @demonlord4712 6 years ago

      Your keys don't work, and what do these keys even mean? windows =??
      I found the real key long ago, under Format in the menu bar. It is Alt+3.

  • @abdulquadirkhan8558
    @abdulquadirkhan8558 5 years ago

    It's giving error no module named nltk

  • @mitu9881
    @mitu9881 9 years ago

    when i run this same code i got Import error . why ?

  • @sidharthbabu4709
    @sidharthbabu4709 6 years ago

    Hi, I'd like to build a question answering system using nltk, and I'd like to know the necessary tools and the procedures/stages for it... hope that you'll notice this comment and consider my request. Thank you ☺

  • @sambithdas921
    @sambithdas921 7 years ago

    how to deal with words like gonna,gotcha etc. Because when i'm giving sentence "Just gonna stand there and watch me burn"
    I'm getting output as ['Just', 'gon', 'na', 'stand', 'watch', 'burn', '.']

  • @mandarkulkarni9999
    @mandarkulkarni9999 4 years ago

    Great .... But I don't know why people use idle .... I have used all and jupyter notebook is way better than all

  • @janellekoh1042
    @janellekoh1042 7 years ago

    can i ask how do you remove the punctuation?

    • @therealanalysisparalysis
      @therealanalysisparalysis 7 years ago

      import string
      no_punc = [w for w in filtered if w not in string.punctuation]
      where filtered is the tokenized words..
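
      A slightly more defensive variant, as a sketch: `w not in string.punctuation` is a substring test against one long string, so a multi-character token such as "..." is not caught. Checking each character against a set of punctuation characters handles that case. The `filtered` list below is hypothetical input.

```python
import string

# Tokens as they might look after stop-word filtering (hypothetical input).
filtered = ["Hello", "Mr.", "Smith", ",", "today", "?", "pinkish-blue", "..."]

# Drop tokens made up entirely of punctuation characters; tokens that merely
# contain punctuation ("Mr.", "pinkish-blue") are kept.
punct = set(string.punctuation)
no_punc = [w for w in filtered if not all(ch in punct for ch in w)]
print(no_punc)
```

      `string.punctuation` covers ASCII punctuation only, so curly quotes or punctuation from other scripts would need to be added to `punct` by hand.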

  • @hdiepeveen8907
    @hdiepeveen8907 3 years ago

    Could someone explain what the code below does exactly? I don't quite understand what's happening.
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(w)

    • @RobDawsonjr
      @RobDawsonjr 2 years ago

      Iterates through the word list and places the element of the list into the variable "w" (This will iterate through the entire list and so the variable "w" will change)
      Checks the value of "w" and determines if it is inside the "stop_words" list
      If the value of "w" is not in "stop_words"
      The value of "w" is placed into a new list called "filtered_sentence"
      Make sure you check out the Python fundamentals before tackling more difficult problems

  • @syedabushra825
    @syedabushra825 6 years ago

    how to remove stopwords in a complete file????how to set a directory of a file???

  • @evanyates965
    @evanyates965 9 years ago +5

    When I print my list of stopwords, I am getting u's before each word. (Ex. [u'i', u'me', u'my', u'myself', u'we', ... ]) do you know why this is? Great video btw!

    • @sergiolucero
      @sergiolucero 8 years ago

      unicode is the python default for strings

    • @mediahighlights-r1w
      @mediahighlights-r1w 8 years ago +3

      In Python, a string with an u in front of it it denotes a unicode string rather than a regular str.

    • @AkashKumar-ip2mg
      @AkashKumar-ip2mg 7 years ago +1

      how to remove 'u'

    • @MuhammadAsif12030
      @MuhammadAsif12030 6 years ago

      How to make your own list of stopwords? can you please guide

    • @vadivel4846
      @vadivel4846 5 years ago +1

      pymotw.com/2/codecs/. Please refer to this tutorial, it will help you
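
      For context, the `u'...'` form in the printed list is how Python 2 displays Unicode strings. In Python 3 every `str` is Unicode, so the prefix is still accepted but changes nothing and is not shown when a list is printed. A minimal sketch, assuming Python 3:

```python
# In Python 3 every str is Unicode, so the u prefix is accepted
# but makes no difference to the value or its type.
prefixed = u"myself"
plain = "myself"
print(prefixed == plain, type(prefixed) is type(plain))

# Printing a list shows no u prefix in Python 3.
print([u"i", u"me"])
```

      In either version the prefix is only part of the printed representation; the words themselves compare equal to plain strings with the same characters.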

  • @krakenmetzger
    @krakenmetzger 5 years ago

    In common parlance these are called "safe words"

  • @ebtesamh9624
    @ebtesamh9624 8 years ago

    how can i remove Arabic stop words