Natural Language Processing - Tokenization (NLP Zero to Hero - Part 1)

  • Published: 11 Sep 2024
  • Welcome to Zero to Hero for Natural Language Processing using TensorFlow! If you’re not an expert on AI or ML, don’t worry -- we’re taking the concepts of NLP and teaching them from first principles with our host Laurence Moroney (@lmoroney).
    In this first lesson we’ll talk about how to represent words in a way that a computer can process them, with a view to later training a neural network to understand their meaning.
    Hands-on Colab → goo.gle/2uO6Gee
    NLP Zero to Hero playlist → goo.gle/nlp-z2h
    Subscribe to the TensorFlow channel → goo.gle/Tensor...

Comments • 148

  • @TheAdamSmithh
    @TheAdamSmithh 4 years ago +22

    Thank you so much! This is so informative, so quickly, in well-structured lessons. I'm using a TensorFlow package for R and this helps me understand my project so much better!

  • @Code4You1
    @Code4You1 1 year ago +12

    Simple and straight to the point I love it!

  • @chiomaanyiam1138
    @chiomaanyiam1138 2 years ago +9

    Wow! Thank you for breaking this down in such an easy way.

  • @jawadmansoor6064
    @jawadmansoor6064 1 year ago +4

    Such a great lecture on NLP wow. I wish I had found it when it was uploaded, saving me two years.

    • @abdulbasitnisar
      @abdulbasitnisar 1 month ago

      What did you do for two years? I mean, which course?

  • @asadanees781
    @asadanees781 3 years ago +5

    Thanks, Laurence Moroney, you are a blessing for us! Awesome information.

  • @rabadaba7
    @rabadaba7 1 year ago +5

    I love your videos! They are very professional and concepts are very clearly explained.

  • @muhammadhananasghar4326
    @muhammadhananasghar4326 3 years ago +2

    Best explanation ever. The best teacher I have ever listened to.

  • @chowadagod
    @chowadagod 4 years ago +4

    I've always been discouraged from learning NLP... but you've just made it a whole lot easier.

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +1

      It's a huge field, and I'm just scratching the surface. I hope it's useful! :)

  • @srikrithibharadwaj6779
    @srikrithibharadwaj6779 4 years ago +20

    Thank you so much 🙏🏻, such great information.

  • @nishalk781
    @nishalk781 4 years ago +5

    Thanks for making it clear, waiting for the next one.

  • @kelvinsmith4894
    @kelvinsmith4894 4 years ago +9

    Lol, you explained this so well that it made me want to implement my own library for tokenization

  • @coded6799
    @coded6799 2 years ago +2

    This is a godsend.
    No other definition is possible.

  • @Idontknowcode512
    @Idontknowcode512 1 year ago +1

    Thanks, you made it so easy for me to understand NLP 🙏

    • @TensorFlow
      @TensorFlow 1 year ago

      We're happy to hear that the video was helpful. If you'd like to learn more about NLP, check out the NLP Zero to Hero playlist → goo.gle/nlp-z2h

    • @Idontknowcode512
      @Idontknowcode512 1 year ago

      @@TensorFlow I have checked it. But I have one request: can we build a model like ChatGPT using TensorFlow? 🤔

  • @harmitchhabra989
    @harmitchhabra989 3 years ago +2

    So, it's like Markov/Lempel compression?

  • @harikrishnanb7273
    @harikrishnanb7273 1 year ago +7

    Tokenizer is deprecated now

  • @mattymallz4207
    @mattymallz4207 4 years ago +3

    Fantastic video! Very informative. Thank you for sharing TensorFlow!

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +1

      Thanks, Matty!

    • @mattymallz4207
      @mattymallz4207 4 years ago

      Laurence Moroney, I have a specific TensorFlow question regarding Beautiful Soup, specifically gathering text from an HTML output. Is there any way we could start a dialogue?

  • @BeGreatttt
    @BeGreatttt 4 years ago +4

    Great explanation, thanks a lot!!!

  • @akshayshah483
    @akshayshah483 4 years ago +1

    Yeah. Zero to Hero is back!

    • @laurencemoroney655
      @laurencemoroney655 4 years ago

      For 3 episodes, and I'm working on another 3 for text generation to come out in the not-too-distant future. I hope!

  • @rishibhatia5056
    @rishibhatia5056 6 months ago +1

    Thanks for making it clear.

  • @18lan
    @18lan 2 years ago +1

    If you are confused, like I was, about why 'love' receives index 1, go to the end of this video where it's explained:
    Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing

  • @ronnierendel9503
    @ronnierendel9503 4 years ago +1

    Amazingly well said

  • @sharawyabdul6222
    @sharawyabdul6222 3 years ago +1

    Thank you so much, this is very well explained.

  • @muskanjain1256
    @muskanjain1256 3 years ago +2

    @lmoroney I have come across chatbot deployments recently. It is said that there is a problem with continued conversation in the case of chatbots. But I have a query: why can't we add an LSTM on top of an LSTM model? I mean, if we could provide memory across sentences as well as memory within a particular sentence, then it might be able to store the essentials of previous conversations. Please help me with this query; I am new to NLP and very excited to learn more.

  • @sunildingankar8657
    @sunildingankar8657 1 month ago

    I have been working in the Marathi language (an Indian regional language) for the last 20-odd years. For the last 8 years I have been working as a writer-translator. If I learn NLP, will I be able to combine Marathi linguistic skills and NLP skills in practical use? If yes, how, and where can I use it?

  • @rahulbhardwaj4568
    @rahulbhardwaj4568 4 years ago +2

    Great, thanks for the info!

  • @819rajiv
    @819rajiv 2 years ago +1

    Thank you so much, sir, for the great videos.

  • @balachkhan1578
    @balachkhan1578 4 years ago +1

    It's great. Waiting for the next one.

  • @quadraticlife8314
    @quadraticlife8314 2 years ago

    Incredibly amazing!

  • @georgesteele4838
    @georgesteele4838 1 year ago

    Excellent presentation.

  • @fahemhamou6170
    @fahemhamou6170 2 years ago +1

    Thank you very much

  • @arpanghoshal6910
    @arpanghoshal6910 3 years ago

    He's a TensorFlow guru!

  • @muhammadyaqoob9129
    @muhammadyaqoob9129 23 days ago

    I need a little more help; can you please mention the books you have followed? Or research papers?
    Basically, I am asking for references, so I can read them myself.

  • @biswanthpinnika7149
    @biswanthpinnika7149 1 year ago

    We can also use tokenization to split sentences into words.

  • @M_Zaroug
    @M_Zaroug 4 years ago +1

    🤩😍😍🤩 Very informative, waiting for the rest

  • @ashimkarki9652
    @ashimkarki9652 4 years ago

    The legend is back

  • @yousefsharrab1093
    @yousefsharrab1093 1 year ago

    Great introduction

  • @benjaminkimmang1962
    @benjaminkimmang1962 1 year ago +1

    Quite informative, thanks.

  • @eyasulencha5136
    @eyasulencha5136 3 years ago

    Amazing presentation. Thanks for the info.

  • @_petrok
    @_petrok 4 years ago +2

    Great introduction which is easy to understand. Can't wait for the next videos of this series!
    But is there any way to group words while ignoring some grammar? Like "He plays piano" vs. "I play piano", where "plays" != "play" but it is basically the same word and tense.
    The part about ignoring the "!" in "dog!" is fascinating.

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +1

      Yeah...that's a little more difficult in preprocessing text. I won't be covering that...sorry!

    • @NelsonYalta
      @NelsonYalta 4 years ago +3

      Those are sub-words, and different tools can be used to obtain them, such as SentencePiece (github.com/google/sentencepiece). In this case the model searches for common sub-words such as "play", and in the case of "plays" it tokenizes it as "play" + "s". It is also possible to tokenize the "s" as a character and "play" as a sub-word.

    • @mayankdewli1010
      @mayankdewli1010 2 years ago +1

      Yup, of course. You can lemmatize or stem these words.
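To make the stemming suggestion above concrete, here is a deliberately crude, pure-Python suffix-stripping sketch. A real project would use a proper stemmer (e.g. NLTK's PorterStemmer) or a lemmatizer; the two rules below are illustrative only:

```python
def crude_stem(word):
    # Toy suffix stripping so "plays" and "play" map to the same token.
    # Real stemmers handle far more cases (irregular forms, doubled
    # consonants, etc.); this is only a sketch of the idea.
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # "flies" -> "fly"
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]  # "plays" -> "play"
    return word

tokens = [crude_stem(w) for w in "He plays piano I play piano".split()]
# "plays" and "play" now share the single token "play"
```

With this preprocessing done before `fit_on_texts`, "plays" and "play" would receive the same index.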

  • @ouissemmouheb5283
    @ouissemmouheb5283 1 year ago

    Thank you so much!

  • @cloudlover9186
    @cloudlover9186 1 month ago

    Good one. Sir, I want to know: with nlp("I have III years of exp"), checking for _.ISNUM is not working. Do we have any workaround for this? Will Roman numerals not be detected?
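Most built-in numeric checks don't treat Roman numerals as numbers, so a common workaround is an explicit pattern match. A hedged sketch (the regex and helper name are ours, not part of any NLP library):

```python
import re

# Matches well-formed Roman numerals 1-3999 (e.g. "III", "XIV").
ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def looks_like_roman(token):
    return bool(token) and ROMAN.match(token.upper()) is not None

tokens = "I have III years of exp".split()
roman = [t for t in tokens if looks_like_roman(t)]
# Note: the pronoun "I" matches too, which is exactly why such
# heuristics need extra context (e.g. part-of-speech) to be reliable.
```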

  • @oumelkheirofficial5216
    @oumelkheirofficial5216 3 years ago +1

    What an amazing and simple way of explaining it, thank you.

  • @singhanubhav
    @singhanubhav 3 years ago

    For NLP freshers: this video is more about encoding than about tokenization itself. Read about both topics separately before going through this video to understand it better.

  • @fakrulislam3140
    @fakrulislam3140 4 years ago

    Amazing presentation

  • @Ricocase
    @Ricocase 3 years ago +1

    Excellent! How do I leverage k-means clustering to find similarities or segment sentences from one another?

  • @rameshsrivastavachandra
    @rameshsrivastavachandra 4 years ago +1

    This code has an Apache license, so can it be reused?

  • @sharjeelzubair4106
    @sharjeelzubair4106 3 months ago

    sentences = [
    'كم سعر الراجحي',
    'ما هي قيمة الراجحي؟',
    'هل تعرف سعر أرامكو؟'
    ]
    {'سعر': 1, 'كم': 2, 'الراجحي': 3, 'ما': 4, 'هي': 5, 'قيمة': 6, 'الراجحي؟': 7, 'هل': 8, 'تعرف': 9, 'أرامكو؟': 10}
    It's putting الراجحي and الراجحي؟ as two tokens; is that because of Arabic?
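Most likely, yes: the default `filters` string of Keras' Tokenizer contains only ASCII punctuation, so the Arabic question mark '؟' (U+061F) is never stripped and 'الراجحي؟' stays a distinct token. Below is a plain-Python sketch of that cleaning step with Arabic punctuation added to the filter set; in Keras the equivalent fix would be passing an extended `filters=` argument to the Tokenizer:

```python
# Keras' default filters cover ASCII punctuation only.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
ARABIC_FILTERS = DEFAULT_FILTERS + '؟،'  # add Arabic question mark and comma

def clean_tokens(sentence, filters):
    # Mimic the Tokenizer: filtered characters become the split character.
    table = {ord(c): ' ' for c in filters}
    return sentence.translate(table).split()

before = clean_tokens('هل تعرف سعر أرامكو؟', DEFAULT_FILTERS)
after = clean_tokens('هل تعرف سعر أرامكو؟', ARABIC_FILTERS)
# before keeps 'أرامكو؟' as one token; after strips the '؟'
```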

  • @ubaydullo_a757
    @ubaydullo_a757 2 years ago

    Thank you, it was helpful :)

  • @narendrapratapsinghparmar91
    @narendrapratapsinghparmar91 8 months ago

    Thanks

  • @yami6499
    @yami6499 3 years ago

    great video

  • @xuantungnguyen9719
    @xuantungnguyen9719 4 years ago

    Hi. What happens if I set num_words to 0? I tried and it still prints all the words.

  • @amaltej9372
    @amaltej9372 1 year ago

    THANKS 😇

  • @harsh.vision
    @harsh.vision 3 months ago

    print(Hashing == Tokenization)
    What's the output??

  • @meg33333
    @meg33333 1 year ago

    Hello everyone,
    I am new to the ML/NLP world and need some tips. My team is working on a project where we want to convert text (especially Hindi or Sanskrit) to a set of specific images. Which algorithm or model should we go for, and where should we start? We have made the dataset, but now what?

  • @TallRiderX
    @TallRiderX 4 years ago +2

    The colab is labeled as Course 3 - Week 1 - Lesson 1.ipynb - where can I sign up for the full course? Thank you!

    • @laurencemoroney655
      @laurencemoroney655 4 years ago

      The colab was adapted from one I wrote for Coursera, where it was course 3 in the TensorFlow: In Practice specialization. There's more there. Otherwise, this is a 3-part series, with part 2 now on the YT channel :)

  • @lencazero4712
    @lencazero4712 4 months ago

    Awesome

  • @theobellash6440
    @theobellash6440 1 year ago

    Nice video

  • @aravindravindranatha4260
    @aravindravindranatha4260 4 years ago

    I need your advice on finding text similarity.

  • @ipekbar
    @ipekbar 3 years ago +1

    Thank you for the video. Sometimes an exclamation mark can be informative for tasks such as sentiment classification, but the tokenizer filters it out. Is there a way to prevent this?

    • @vishnurajyadav8917
      @vishnurajyadav8917 3 years ago

      Did you get an answer for this from any other source?

    • @ipekbar
      @ipekbar 3 years ago +1

      @@vishnurajyadav8917 Yes, we can control it by changing the filters parameter: www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

  • @ishanghutake1566
    @ishanghutake1566 3 years ago

    Suppose you have 30 text files in one folder; how do you tokenize the words?
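One common approach, sketched here with throwaway files standing in for the real folder: read every file into one list of strings, then fit a single word index over all of them. With Keras you would now call `tokenizer.fit_on_texts(texts)`; the index below is built by hand so the sketch runs without TensorFlow.

```python
import glob
import os
import tempfile
from collections import Counter

# Create a throwaway folder with two tiny "documents" to stand in for
# the 30 real text files.
folder = tempfile.mkdtemp()
for i, text in enumerate(["i love my dog", "i love my cat"]):
    with open(os.path.join(folder, f"doc{i}.txt"), "w", encoding="utf-8") as f:
        f.write(text)

# Gather every file's contents into one list of strings...
texts = []
for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
    with open(path, encoding="utf-8") as f:
        texts.append(f.read())

# ...then fit one shared word index over all of them (frequency order,
# ties broken by first appearance, mirroring the Keras Tokenizer).
counts = Counter(w for t in texts for w in t.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
```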

  • @WassupCarlton
    @WassupCarlton 1 month ago

    perhaps this is coming in a later video, but is there any rhyme/reason to the integers that get assigned to the words? or is it PURELY arbitrary?

    • @WassupCarlton
      @WassupCarlton 1 month ago

      ope -- looks like the more frequent you are, the smaller your assigned integer. Correct?
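Correct: the Keras Tokenizer sorts words by descending frequency before assigning indices (ties broken by order of first appearance), so the most frequent word gets index 1. A pure-Python sketch of that rule, using the video's own sentences:

```python
from collections import Counter

sentences = ["i love my dog", "i love my cat", "you love my dog"]

# Count every word, then hand out indices by descending frequency.
counts = Counter(w for s in sentences for w in s.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
# -> {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

"love" and "my" each appear three times, so they take indices 1 and 2, matching the ordering the video shows.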

  • @sunanthakrishnan
    @sunanthakrishnan 4 years ago

    Could you help me with a Python 3.8.2 compatible version of TensorFlow and Keras?

  • @Promptgeek2
    @Promptgeek2 1 year ago

    Better explanation: [impossible].

  • @HuyNguyen-kd5vz
    @HuyNguyen-kd5vz 1 year ago

    This is awesome

  • @LearnWithMilind
    @LearnWithMilind 4 years ago

    How many languages are supported? Or is only English supported?

    • @LaurenceMoroney
      @LaurenceMoroney 4 years ago

      I've only tried English, but this technique should work with most languages. Try the notebook linked, and change the language and see what happens?

  • @oliverli9630
    @oliverli9630 4 years ago

    Can this be called a hack? Or are there reasons that Keras doesn't handle this? (Notice: in "'you're" the left quote is still there, "'" gets recognized as word 11, and num_words=4 doesn't really limit the word count down to 4.)
    from tensorflow.keras.preprocessing.text import Tokenizer
    sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    "Jack said, 'You're gonna love my cat!'"
    ]
    tokenizer = Tokenizer(num_words = 4)
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    print(word_index)
    {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6, 'jack': 7, 'said': 8, "'you're": 9, 'gonna': 10, "'": 11}
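The `num_words` behavior here is (somewhat confusingly) by design: `fit_on_texts` always records every word in `word_index`; `num_words` only takes effect later, in `texts_to_sequences`, which keeps a word only when its index is strictly below `num_words` (so `num_words=4` keeps the top 3 words). A pure-Python sketch of that rule:

```python
# word_index as the Tokenizer would build it -- note it is never truncated.
word_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5}
num_words = 4  # only indices 1..3 survive sequencing

def to_sequence(sentence):
    # Mirror Keras: drop any word whose index is >= num_words.
    return [word_index[w] for w in sentence.lower().split()
            if w in word_index and word_index[w] < num_words]

seq = to_sequence("i love my dog")  # 'dog' (index 4) is dropped -> [3, 1, 2]
```

So printing `word_index` always shows the full vocabulary; the cap only shows up in the sequences.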

    • @rajareivan2417
      @rajareivan2417 1 year ago +2

      I'm also wondering why this is the case; even when I set num_words to 1 or even 0 it still tokenizes all the provided words. Have you got an answer for this?

  • @renderdreality
    @renderdreality 4 years ago

    Does NLP only process English? Could it do another language? My question is really whether it could be used to learn a different language as a basis and go from there.

    • @jinzo1171
      @jinzo1171 2 years ago

      It can be used for any language:)

  • @deepakdakhore
    @deepakdakhore 4 years ago

    Very nice

  • @mohithshivu5475
    @mohithshivu5475 4 years ago

    Sir, my name is Mohith and I am a final-year BE student. Can you help me with some doubts on NLP? I am working on data generalization and data sanitization; our task is identifying whether a given text is sanitized or not, and generalized or not. How would this work in Python? Please help me out, sir; it would be very helpful to me.

  • @shivibhatia1613
    @shivibhatia1613 4 years ago

    Too good

  • @Acumen928
    @Acumen928 4 years ago +1

    Thanks. Where is the next episode?

  • @actu_r
    @actu_r 4 years ago

    Should we keep only nouns when topic modelling? I am quite new to NLP and it seems there is no clear universal rule of thumb for extracting topic information. What would you advise?

    • @Aoitetsugakusha
      @Aoitetsugakusha 4 years ago +2

      You can probably do decently well using just nouns, but you will probably also lose a lot of information if you filter out non-nouns at the Tokenization or pre-processing step. For example, if you only use nouns, you could very well pick up on a topic like "machine learning" in your dataset, but you might miss separate discussions of "deep learning," because "deep" is an adjective that would get filtered out and you would be left with just general "learning." An ultra-crude way you might augment this a bit is to instead do topic modeling on n-grams and keep only those n-grams that contain at least one noun, but I haven't tried this, so I can't assert it will actually work.

    • @laurencemoroney655
      @laurencemoroney655 4 years ago

      @@Aoitetsugakusha +1 Great answer

    • @kumarvikas_134
      @kumarvikas_134 4 years ago

      Depends what your objective is. If the end result is only centered around identifying entities, then keeping NN/NNP may make sense (given your POS tagger is not making errors). It all depends on the objective; for my use case I extracted chunks of SVO (Subject-Verb-Object) phrases and then performed topic modeling. That worked well for me, but I had made adjustments to my POS tagger to do this task well.

  • @samrasoli
    @samrasoli 1 year ago

    useful

  • @RS-vu5um
    @RS-vu5um 4 years ago

    Is the link to Part 2: Sequencing - Turning Sentences into Data available?

    • @laurencemoroney655
      @laurencemoroney655 4 years ago

      Yep, came out yesterday, check yt.com/tf for details

  • @actu_r
    @actu_r 4 years ago

    What are the advantages of using the TF framework instead of other preprocessing methods, such as those spaCy or NLTK provide, for example? :) Thank you

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +1

      I can't compare with the others...but this way they're in a unified framework that makes it less code when I get around to training a NN with them (in episode 3)

  • @yunishuseynzade5630
    @yunishuseynzade5630 3 years ago

    Thank you so much. But I have a question: how can I use words in a language other than English? For example, building NLP for Azerbaijani.

    • @தமிழோன்
      @தமிழோன் 3 years ago

      You need to find and download Azerbaijani corpus from the Internet. You can then prepare the word index using Tensorflow. The rest of the steps should be the same as the English example shown in the video. I don't know about the Azerbaijani language but some languages, like Tamil, don't have separate grammatical words like English. You need to make heaps of preprocessing before you prepare the word index. This is something you need to be aware of. Also, if you can't find a corpus for your language, use something called "hashing trick" (or "feature hashing") to hash the individual words in your language. Luckily, Tensorflow supports hashing trick.
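The "hashing trick" mentioned above can be sketched without TensorFlow: hash each word into one of a fixed number of buckets, so no vocabulary has to be built first. Keras exposes a similar utility (`hashing_trick` in `tf.keras.preprocessing.text`); the md5-based version below is our own stand-in so the example is self-contained.

```python
import hashlib

def hash_token(word, n_buckets=1000):
    # A stable hash (unlike Python's builtin hash(), which is salted per
    # process) so the same word always gets the same id across runs.
    # Collisions between different words are possible by design.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets + 1  # 1-based, like Keras indices

# Works for any language, no prepared corpus needed:
ids = [hash_token(w) for w in "mənim itim var".split()]
```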

  • @hajar2629
    @hajar2629 4 years ago +1

    Thank you.
    How can I run the same example on my Raspberry Pi?

    • @abail7010
      @abail7010 4 years ago

      Medogb Medo, exactly the same way, once you've installed Python and TensorFlow.

  • @rupeshmalpani
    @rupeshmalpani 4 years ago

    Can 1679 15223 2 153692 be a word?

  • @douggale5962
    @douggale5962 1 year ago

    What? You end it when I was expecting you to at least say to put the 1's in the input layer. This is how you tokenize in general, nothing to do with AI.

  • @sujeeshsvalath
    @sujeeshsvalath 4 years ago

    How do you detect the difference between "I love my dog" and "I love not my dog"?

    • @Metalocif
      @Metalocif 4 years ago

      Beyond the obvious "it has one more word", there are several approaches.
      One that is fairly easy is to have a list of all words in a language with their connotations (this can be found online), one possible connotation being negation. Then, you can write code that inverts the connotation of a word if there is a word that implies negation near it.
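The connotation-list idea described above can be sketched in a few lines; the tiny lexicon and negation list here are made up purely for illustration:

```python
# Hypothetical connotation list: +1 positive, -1 negative.
LEXICON = {"love": 1, "hate": -1}
NEGATIONS = {"not", "never", "don't"}

def sentence_score(sentence):
    words = sentence.lower().split()
    score = 0
    for i, w in enumerate(words):
        s = LEXICON.get(w, 0)
        # Invert the connotation if a negation word sits right next to it.
        if any(0 <= j < len(words) and words[j] in NEGATIONS
               for j in (i - 1, i + 1)):
            s = -s
        score += s
    return score
```

A real system would need a larger window and a much bigger lexicon; the trained-classifier approach in the next reply tends to learn this more robustly.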

    • @LaurenceMoroney
      @LaurenceMoroney 4 years ago

      If you have lots of sentences that are similar except for the word 'not', label them accordingly and then train a classifier like we do here; the 'not' would become a really strong signal towards the negative. Give it a try, instead of using the sarcasm dataset. The code would be very similar to this video.

  • @carlossegura403
    @carlossegura403 4 years ago

    Love Tokenizers ❤️

  • @vi.kran.t
    @vi.kran.t 4 years ago

    I want the TensorFlow track jacket that you're wearing.

    • @tcidude
      @tcidude 4 years ago

      Chanson

  • @tingnews7273
    @tingnews7273 4 years ago

    Can anyone tell me what the "first principles" method teaches?

    • @LaurenceMoroney
      @LaurenceMoroney 4 years ago

      "From first principles" means teaching with zero (or at least very few) assumptions

    • @tingnews7273
      @tingnews7273 4 years ago

      @@LaurenceMoroney Thank you, I hope I can figure it out.

    • @felixakwerh5189
      @felixakwerh5189 4 years ago

      'From first principles' could also mean from the smallest to the biggest, from the known to the unknown. Basically, it's a way of breaking concepts down to their simplest form.

  • @danylobaibak317
    @danylobaibak317 4 years ago

    A question about ignoring the "!": it seems the Tokenizer doesn't include "!" because it was filtered out as punctuation. Let's assume we want to keep punctuation and set `filters=''` for the Tokenizer. In this case, the Tokenizer is not smart enough to separate the token "dog" from the token "!".
    Here's an example in Colab: colab.research.google.com/drive/1M6Nf-WQxorf_X9z2jFnCSJ_QjrY3i5BJ
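One workaround for that limitation (our own preprocessing sketch, not a Tokenizer feature): pad the punctuation with spaces before fitting, and pass `filters=''` so Keras doesn't strip the marks again.

```python
import re

def pad_punctuation(text):
    # Insert spaces around punctuation so each mark survives as its
    # own whitespace-separated token.
    return re.sub(r"([!?.,;:])", r" \1 ", text)

tokens = pad_punctuation("I love my dog!").lower().split()
# -> ['i', 'love', 'my', 'dog', '!']
```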

  • @VibhootiKishor
    @VibhootiKishor 4 years ago

    Cool

  • @alexanderpohl1949
    @alexanderpohl1949 4 years ago +3

    04:15 is really misleading for anyone watching this as their entry to NLP. There are too many missing steps that need to be talked about after this point in a 'Zero to Hero' tutorial series, instead of jumping into sequencing. Even steps before this point.
    I see why these aren't included (because they are not included in TensorFlow), but at the same time, this is just setting an unrealistic standard.
    In machine learning terms, I'd say... this video is just mislabeled.

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +2

      ...and what are these steps? With these videos and the codelabs, we'll have everything we need to build a simple text classifier, the beginnings of NLP.

    • @தமிழோன்
      @தமிழோன் 3 years ago

      @@laurencemoroney655 Maybe he's referring to the clean-up required for the grammar (like someone pointed out: play vs plays)? However, TensorFlow cannot include that in the library as he's suggesting, because TensorFlow is not an English-only library but a more generic one.

  • @HealthyFoodBae_
    @HealthyFoodBae_ 4 years ago

    Yay😅

  • @rawnakfreak3539
    @rawnakfreak3539 8 months ago

    Exam after 9 hours (⁠T⁠T⁠)

  • @cr0wzzz
    @cr0wzzz 1 year ago

    The answer was simple all along. It's just dog

  • @siddvideos
    @siddvideos 4 years ago

    Too late to the party, TensorFlow!! It's not 2010. Love the video though, thanks 😎

  • @masternobody1896
    @masternobody1896 4 years ago +1

    This is so complicated

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +2

      Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.

    • @balachkhan1578
      @balachkhan1578 4 years ago

      @@laurencemoroney655 It's great. When will we get the next tutorial?

    • @laurencemoroney655
      @laurencemoroney655 4 years ago

      @@balachkhan1578 We're releasing them weekly

    • @balachkhan1578
      @balachkhan1578 4 years ago +1

      @@laurencemoroney655 TensorFlow can't be installed with Python 3.8. Will the issue be solved, or should I switch to Python 3.7?

    • @LaurenceMoroney
      @LaurenceMoroney 4 years ago

      @@balachkhan1578 It's constantly being updated...so keep an eye on www.tensorflow.org/install. Right now it's up to 3.7 on there.

  • @vinay1057
    @vinay1057 4 years ago +1

    Very difficult to learn

    • @laurencemoroney655
      @laurencemoroney655 4 years ago +1

      Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.

    • @tanvipurwar6048
      @tanvipurwar6048 3 years ago

      @@laurencemoroney655
      It is a bit complicated the first time. But taking up a small dataset/project for NLP and then revisiting the video makes everything a lot clearer. Plus, you pick up on things that slipped your mind the first time :)

  • @khlatarbaatarchuluun787
    @khlatarbaatarchuluun787 3 years ago

    Thank you very much