Natural Language Processing - Tokenization (NLP Zero to Hero - Part 1)

  • Published: 9 Feb 2025
  • Welcome to Zero to Hero for Natural Language Processing using TensorFlow! If you’re not an expert on AI or ML, don’t worry -- we’re taking the concepts of NLP and teaching them from first principles with our host Laurence Moroney (@lmoroney).
    In this first lesson we’ll talk about how to represent words in a way that a computer can process them, with a view to later training a neural network to understand their meaning.
    Hands-on Colab → goo.gle/2uO6Gee
    NLP Zero to Hero playlist → goo.gle/nlp-z2h
    Subscribe to the TensorFlow channel → goo.gle/Tensor...
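
The lesson's core move, mapping each word to an integer, can be sketched in plain Python. This is a rough approximation of what the (now-legacy) Keras `Tokenizer` does under the hood, not the library code itself: lowercase, strip punctuation, then number words from 1 in order of descending frequency.

```python
from collections import Counter
import string

def build_word_index(sentences):
    """Rough sketch of Tokenizer.fit_on_texts: lowercase, strip
    punctuation, then assign indices by descending word frequency."""
    counts = Counter()
    for sentence in sentences:
        cleaned = sentence.lower().translate(
            str.maketrans('', '', string.punctuation))
        counts.update(cleaned.split())
    # most_common() is a stable sort, so ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
print(build_word_index(sentences))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

This sketch just makes the frequency ordering explicit; the real `Tokenizer` from `tensorflow.keras.preprocessing.text` produces the same kind of index in the hands-on Colab.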

Comments • 151

  • @jawadmansoor6064
    @jawadmansoor6064 2 years ago +11

    Such a great lecture on NLP, wow. I wish I had found it when it was uploaded; it would have saved me two years.

    • @abdulbasitnisar
      @abdulbasitnisar 6 months ago

      What did you do for two years? I mean, which course?

  • @Code4You1
    @Code4You1 2 years ago +14

    Simple and straight to the point. I love it!

  • @muhammadhananasghar4326
    @muhammadhananasghar4326 3 years ago +2

    Best explanation ever. The best teacher I have ever listened to.

  • @chiomaanyiam1138
    @chiomaanyiam1138 2 years ago +10

    Wow! Thank you for breaking this down in such an easy way.

  • @TheAdamSmithh
    @TheAdamSmithh 4 years ago +23

    Thank you so much! This is so informative, so quickly, in well-structured lessons. I'm using a TensorFlow package for R and this helps me understand my project so much better!

  • @asadanees781
    @asadanees781 4 years ago +5

    Thanks! Laurence Moroney is a blessing for us. Awesome information.

  • @kelvinsmith4894
    @kelvinsmith4894 4 years ago +10

    Lol, you explained this so well that it made me want to implement my own library for tokenization.

  • @rabadaba7
    @rabadaba7 1 year ago +5

    I love your videos! They are very professional and the concepts are very clearly explained.

  • @chowadagod
    @chowadagod 5 years ago +4

    I've always been discouraged learning NLP... but you've just made it a whole lot easier.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +1

      It's a huge field, and I'm just scratching the surface. I hope it's useful! :)

  • @nishalk781
    @nishalk781 5 years ago +5

    Thanks for making it clear; waiting for the next one.

  • @coded6799
    @coded6799 3 years ago +2

    This is a godsend.
    No other definition is possible.

  • @Idontknowcode512
    @Idontknowcode512 1 year ago +1

    Thanks, you made it so easy for me to understand NLP 🙏

    • @TensorFlow
      @TensorFlow  1 year ago

      We're happy to hear that the video was helpful. If you'd like to learn more about NLP, check out the NLP Zero to Hero playlist → goo.gle/nlp-z2h

    • @Idontknowcode512
      @Idontknowcode512 1 year ago

      @@TensorFlow I have checked it. But I have one request: can we build a model like ChatGPT using TensorFlow? 🤔

  • @srikrithibharadwaj6779
    @srikrithibharadwaj6779 4 years ago +20

    Thank you so much 🙏🏻 for such great information.

  • @rishibhatia5056
    @rishibhatia5056 11 months ago +1

    Thanks for making it clear.

  • @mattymallz4207
    @mattymallz4207 5 years ago +3

    Fantastic video! Very informative. Thank you for sharing, TensorFlow!

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +1

      Thanks, Matty!

    • @mattymallz4207
      @mattymallz4207 4 years ago

      Laurence Moroney, I have a specific TensorFlow question regarding Beautiful Soup, specifically gathering text from an HTML output. Is there any way we could start a dialogue?

  • @quadraticlife8314
    @quadraticlife8314 3 years ago

    Incredibly amazing!

  • @sharawyabdul6222
    @sharawyabdul6222 4 years ago +1

    Thank you so much, this is very well explained.

  • @BeGreatttt
    @BeGreatttt 4 years ago +4

    Great explanation, thanks a lot!!!

  • @balachkhan1578
    @balachkhan1578 5 years ago +1

    It's great. Waiting for the next.

  • @arpanghoshal6910
    @arpanghoshal6910 3 years ago

    He's a TensorFlow guru!

  • @ashimkarki9652
    @ashimkarki9652 5 years ago

    The legend is back!

  • @georgesteele4838
    @georgesteele4838 2 years ago

    Excellent presentation.

  • @ronnierendel9503
    @ronnierendel9503 4 years ago +1

    Amazingly well said.

  • @akshayshah483
    @akshayshah483 5 years ago +1

    Yeah! Zero to Hero is back.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago

      For 3 episodes, and I'm working on another 3 for text generation to come out in the not-too-distant future. I hope!

  • @819rajiv
    @819rajiv 3 years ago +1

    Thank you so much, sir, for the great videos.

  • @M_Zaroug
    @M_Zaroug 5 years ago +1

    🤩😍😍🤩 Very informative, waiting for the rest.

  • @benjaminkimmang1962
    @benjaminkimmang1962 1 year ago +1

    Quite informative, thanks.

  • @yousefsharrab1093
    @yousefsharrab1093 1 year ago

    Great introduction.

  • @singhanubhav
    @singhanubhav 3 years ago

    For NLP freshers: this video is more about encoding than about tokenization itself. Read about both topics separately before going through this video to understand it better.

  • @18lan
    @18lan 2 years ago +1

    If you are confused, like I was, about why 'love' receives index 1, then go to the end of this video, where it's explained:
    Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing

  • @rahulbhardwaj4568
    @rahulbhardwaj4568 4 years ago +2

    Great, thanks for the info!

  • @eyasulencha5136
    @eyasulencha5136 4 years ago

    Amazing presentation. Thanks for the info!

  • @biswanthpinnika7149
    @biswanthpinnika7149 1 year ago

    We can also use tokenization to split sentences into words.

  • @muskanjain1256
    @muskanjain1256 4 years ago +2

    @lmoroney I have come across chatbot deployments recently. It is said that there is a problem with continued conversation in the case of chatbots. But I have a query: why can't we add an LSTM on top of an LSTM model? I mean, if we are able to provide memory across sentences, along with memory within a particular sentence, then it may be able to store the essentials of the previous conversations. Please help me with this query; I am new to NLP and excited to learn more.

  • @ouissemmouheb5283
    @ouissemmouheb5283 2 years ago

    Thank you so much!

  • @khlatarbaatarchuluun787
    @khlatarbaatarchuluun787 3 years ago

    Thank you very much.

  • @Promptgeek2
    @Promptgeek2 1 year ago

    A better explanation? [Impossible].

  • @amaltej9372
    @amaltej9372 2 years ago

    THANKS 😇

  • @narendrapratapsinghparmar91
    @narendrapratapsinghparmar91 1 year ago

    Thanks

  • @ubaydullo_a757
    @ubaydullo_a757 3 years ago

    Thank you, it was helpful :)

  • @yami6499
    @yami6499 4 years ago

    Great video!

  • @fakrulislam3140
    @fakrulislam3140 4 years ago

    Amazing presentation.

  • @harmitchhabra989
    @harmitchhabra989 3 years ago +2

    So, it's like Markov/Lempel compression?

  • @cloudlover9186
    @cloudlover9186 6 months ago

    Good one. Sir, I want to know: for nlp("I have III years of exp"), checking ,_.ISNUM is not working. Do we have any workaround for this? Will Roman numerals not be detected?
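
The attribute in the question looks like a spaCy-style numeric check (such as `like_num`), and those checks typically do not treat Roman numerals as numbers. One workaround, independent of any particular NLP library, is to tag Roman numerals yourself with a small regex. A sketch, to be adapted to your pipeline:

```python
import re

# Matches well-formed uppercase Roman numerals (I, IV, XII, MCMXCIV, ...).
ROMAN_RE = re.compile(r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def is_roman_numeral(token):
    # Guard against '': every group in the pattern is optional.
    return bool(token) and ROMAN_RE.match(token) is not None

print(is_roman_numeral('III'))   # True
print(is_roman_numeral('XIV'))   # True
print(is_roman_numeral('dog'))   # False
```

You could run this check on each token and map matches to their integer values before (or instead of) the library's own numeric detection.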

  • @muhammadyaqoob9129
    @muhammadyaqoob9129 5 months ago

    I need a little more help; can you please mention the books you have followed? Or research papers?
    Basically, I am asking for references, so I can read them myself.

  • @theobellash6440
    @theobellash6440 2 years ago

    Nice video.

  • @oumelkheirofficial5216
    @oumelkheirofficial5216 4 years ago +1

    What an amazing and simple way of explaining. Thank you!

  • @harikrishnanb7273
    @harikrishnanb7273 1 year ago +9

    Tokenizer is deprecated now

  • @sunildingankar8657
    @sunildingankar8657 6 months ago

    I have been working with the Marathi language (an Indian regional language) for the last 20-odd years. For the last 8 years I have been working as a writer-translator. If I learn NLP, will I be able to combine Marathi linguistic skills and NLP skills in practical use? If yes, how, and where can I use it?

  • @WassupCarlton
    @WassupCarlton 6 months ago

    Perhaps this is coming in a later video, but is there any rhyme or reason to the integers that get assigned to the words? Or is it PURELY arbitrary?

    • @WassupCarlton
      @WassupCarlton 6 months ago

      Ope -- looks like the more frequent the word, the smaller its assigned integer. Correct?

  • @lencazero4712
    @lencazero4712 9 months ago

    Awesome

  • @HuyNguyen-kd5vz
    @HuyNguyen-kd5vz 2 years ago

    This is awesome!

  • @meg33333
    @meg33333 1 year ago

    Hello everyone!
    I am new to the ML/NLP world and need some tips. My team is working on a project in which we want to convert text (especially Hindi or Sanskrit) to a set of specific images. Which algorithm or model should we go for, and where should we start? We have made the dataset, but now what?

  • @deepakdakhore
    @deepakdakhore 5 years ago

    Very nice.

  • @Ricocase
    @Ricocase 4 years ago +1

    Excellent! How do I leverage k-means clustering to find similarities, or to segment sentences from one another?

  • @ipekbar
    @ipekbar 4 years ago +1

    Thank you for the video. Sometimes an exclamation mark can be informative for tasks such as sentiment classification, but the tokenizer filters it out. Is there a way to prevent this?

    • @vishnurajyadav8917
      @vishnurajyadav8917 4 years ago

      Did you get an answer for this from any other source?

    • @ipekbar
      @ipekbar 4 years ago +1

      @@vishnurajyadav8917 Yes, we can control it by changing the filters parameter: www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
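
To make the `filters` behavior concrete, here is a plain-Python sketch of the cleanup step the Keras Tokenizer performs (filtered characters are replaced with spaces before splitting). This is an approximation of the library's behavior, not its actual code:

```python
# Keras-style default filter string: ASCII punctuation plus tab/newline.
# Note it does not include the straight apostrophe.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=DEFAULT_FILTERS):
    """Replace filtered characters with spaces, lowercase, then split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

print(simple_tokenize('I love my dog!'))
# ['i', 'love', 'my', 'dog']

# Drop '!' from the filter set and it survives, attached to the word:
keep_bang = DEFAULT_FILTERS.replace('!', '')
print(simple_tokenize('I love my dog!', filters=keep_bang))
# ['i', 'love', 'my', 'dog!']
```

Note that keeping '!' leaves it glued to 'dog' as a single token rather than emitting it as a separate token; splitting punctuation off into its own token needs extra preprocessing before the tokenizer sees the text.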

  • @oliverli9630
    @oliverli9630 5 years ago

    Can this be called a hack? Or are there reasons Keras doesn't handle this? (Notice: in "'you're" the left quote is still there, "'" is recognized as a word with index 11, and num_words=4 doesn't really limit the word count to 4.)
    from tensorflow.keras.preprocessing.text import Tokenizer
    sentences = [
        'i love my dog',
        'I, love my cat',
        'You love my dog!',
        "Jack said, 'You're gonna love my cat!'"
    ]
    tokenizer = Tokenizer(num_words=4)
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    print(word_index)
    {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6, 'jack': 7, 'said': 8, "'you're": 9, 'gonna': 10, "'": 11}

    • @rajareivan2417
      @rajareivan2417 1 year ago +2

      I'm also wondering why this is the case; even when I set num_words to 1 or 0, it still tokenizes all the provided words. Have you got an answer for this?
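
As far as I can tell from the legacy Keras behavior, this is expected: `num_words` never truncates `word_index` (`fit_on_texts` always records every word). The limit is applied later, in `texts_to_sequences`, which drops any word whose index is not strictly below `num_words`. A plain-Python sketch of that behavior (a hypothetical helper mimicking the library, not its code):

```python
def texts_to_sequences(texts, word_index, num_words=None):
    """Sketch of the num_words cutoff: the full word_index is kept,
    but only indices strictly below num_words survive encoding."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is not None and (num_words is None or index < num_words):
                seq.append(index)
        sequences.append(seq)
    return sequences

word_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5}
print(texts_to_sequences(['i love my dog'], word_index, num_words=4))
# [[3, 1, 2]]  ('dog' has index 4, so it is dropped; word_index is untouched)
```

So printing `word_index` will always show the whole vocabulary; the cap only shows up once you encode texts into sequences.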

  • @samrasoli
    @samrasoli 1 year ago

    Useful.

  • @rameshsrivastavachandra
    @rameshsrivastavachandra 4 years ago +1

    This code has an Apache license, so can it be reused?

  • @aravindravindranatha4260
    @aravindravindranatha4260 4 years ago

    I need your advice on finding text similarity.

  • @sharjeelzubair4106
    @sharjeelzubair4106 8 months ago

    sentences = [
        'كم سعر الراجحي',
        'ما هي قيمة الراجحي؟',
        'هل تعرف سعر أرامكو؟'
    ]
    {'سعر': 1, 'كم': 2, 'الراجحي': 3, 'ما': 4, 'هي': 5, 'قيمة': 6, 'الراجحي؟': 7, 'هل': 8, 'تعرف': 9, 'أرامكو؟': 10}
    It's putting الراجحي and الراجحي؟ as two different tokens; is that because of Arabic?
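
Most likely, yes: the default `filters` string only covers ASCII punctuation, so the Arabic question mark '؟' (U+061F) is never stripped and 'الراجحي؟' stays distinct from 'الراجحي'. One fix is to extend the filter set with Arabic punctuation and pass it as the `filters=` argument; the cleanup step itself can be sketched like this:

```python
# Keras-style default filters: ASCII punctuation only.
ASCII_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
# Add the Arabic question mark (U+061F) and Arabic comma (U+060C).
ARABIC_FILTERS = ASCII_FILTERS + '\u061f\u060c'

def clean_and_split(text, filters):
    """Replace filtered characters with spaces, then split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.translate(table).split()

print(clean_and_split('هل تعرف سعر أرامكو؟', ARABIC_FILTERS))
# ['هل', 'تعرف', 'سعر', 'أرامكو']
```

With the extended filter string, the question-marked and plain forms collapse to the same token.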

  • @shivibhatia1613
    @shivibhatia1613 4 years ago

    Too good!

  • @actu_r
    @actu_r 5 years ago

    Should we keep only nouns when topic modelling? I am quite new to NLP and it seems there is no clear universal rule of thumb for extracting topic information. What would you advise?

    • @Aoitetsugakusha
      @Aoitetsugakusha 5 years ago +2

      You can probably do decently well using just nouns, but you will probably also lose a lot of information if you filter out non-nouns at the tokenization or pre-processing step. For example, if you only use nouns, you could very well pick up on a topic like "machine learning" in your dataset, but you might miss separate discussions of "deep learning," because "deep" is an adjective that would get filtered out and you would be left with just general "learning." An ultra-crude way you might augment this a bit is to instead do topic modeling on n-grams and keep only those n-grams that contain at least one noun, but I haven't tried this, so I can't assert it will actually work.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago

      @@Aoitetsugakusha +1 Great answer

    • @kumarvikas_134
      @kumarvikas_134 5 years ago

      Depends what your objective is. If the end result is only centered around identifying entities, then keeping NN/NNP may make sense (given your POS tagger is not making errors). It all depends on the objective. For my use case, I remember I extracted chunks of SVO (Subject-Verb-Object) phrases and then performed topic modeling; that worked well for me, but I had made adjustments to my POS tagger to do this task well.

  • @ishanghutake1566
    @ishanghutake1566 4 years ago

    Suppose you have 30 text files in one folder; how do you tokenize the words?

  • @_petrok
    @_petrok 5 years ago +2

    A great introduction which is easy to understand. Can't wait for the next videos of this series!
    But is there any way to group words ignoring some grammar? Like "He plays piano" vs. "I play piano", where "plays" != "play" but it is basically the same word and tense.
    The part about ignoring the "!" in "dog!" is fascinating.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +1

      Yeah... that's a little more difficult in preprocessing text. I won't be covering that... sorry!

    • @NelsonYalta
      @NelsonYalta 4 years ago +3

      Those are sub-words, and different tools can be used for obtaining them, such as SentencePiece (github.com/google/sentencepiece). In this case the model searches for common sub-words such as "play", and "plays" gets tokenized into sub-word pieces. It is also possible to tokenize at the character level as well as the sub-word level.

    • @mayankdewli1010
      @mayankdewli1010 2 years ago +1

      Yup, of course. You can lemmatize or stem these words.
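
To make the stemming suggestion concrete: a stemmer folds inflected forms onto a shared token before the word index is built. The sketch below is a deliberately naive suffix stripper, just to show the idea; real pipelines would use something like NLTK's `PorterStemmer` or a spaCy lemmatizer instead:

```python
def crude_stem(word):
    """Toy suffix stripper so 'play', 'plays' and 'playing'
    share a single token. Not a real stemming algorithm."""
    for suffix in ('ing', 'es', 's'):
        # The length guard avoids mangling short words like 'is'.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem('plays'))    # 'play'
print(crude_stem('playing'))  # 'play'
print(crude_stem('dog'))      # 'dog'
```

Running every word through a stemmer before `fit_on_texts` would make "He plays piano" and "I play piano" share the same 'play' token.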

  • @sunanthakrishnan
    @sunanthakrishnan 4 years ago

    Could you help me with a Python 3.8.2-compatible version of TensorFlow and Keras?

  • @xuantungnguyen9719
    @xuantungnguyen9719 4 years ago

    Hi. What happens if I set num_words to 0? I tried, and it still prints all the words.

  • @actu_r
    @actu_r 5 years ago

    What are the advantages of using the TF framework instead of other preprocessing methods, such as those spaCy or NLTK provide, for example? :) Thank you!

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +1

      I can't compare with the others... but this way they're in a unified framework, which means less code when I get around to training an NN with them (in episode 3).

  • @yunishuseynzade5630
    @yunishuseynzade5630 4 years ago

    Thank you so much, but I have a question: how can I use words in a language other than English? For example, building NLP in Azerbaijani.

    • @தமிழோன்
      @தமிழோன் 4 years ago

      You need to find and download an Azerbaijani corpus from the Internet. You can then prepare the word index using TensorFlow. The rest of the steps should be the same as the English example shown in the video. I don't know about the Azerbaijani language, but some languages, like Tamil, don't have separate grammatical words like English; you need to do heaps of preprocessing before you prepare the word index. This is something you need to be aware of. Also, if you can't find a corpus for your language, use something called the "hashing trick" (or "feature hashing") to hash the individual words in your language. Luckily, TensorFlow supports the hashing trick.
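
The "hashing trick" mentioned above can be sketched without any framework: instead of fitting a word index on a corpus, each word is hashed straight into a fixed number of buckets, so no vocabulary pass is needed, at the cost of occasional collisions. A minimal sketch using a hash that is stable across runs:

```python
import zlib

def hash_token(word, n_buckets=1000):
    """Map a word to a bucket in [1, n_buckets] without any fitted
    vocabulary. zlib.crc32 is deterministic across runs, unlike
    Python's built-in hash() with string randomization."""
    return zlib.crc32(word.encode('utf-8')) % n_buckets + 1  # 0 left free

sentence = 'i love my dog'
print([hash_token(w) for w in sentence.split()])  # four stable bucket ids
```

Keras exposed a similar helper (`hashing_trick`) in the same legacy preprocessing module, if memory serves; treat that name as an assumption and check the API docs for your version.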

  • @RS-vu5um
    @RS-vu5um 5 years ago

    Is the link to Part 2, Sequencing - Turning Sentences into Data, available?

  • @renderdreality
    @renderdreality 5 years ago

    Does NLP only process English? Could it handle another language? My question is really whether it could be used to learn a different language as a basis and go from there.

    • @jinzo1171
      @jinzo1171 3 years ago

      It can be used for any language :)

  • @harsh.vision
    @harsh.vision 8 months ago

    print(Hashing == Tokenization)
    What's the output??

  • @mohithshivu5475
    @mohithshivu5475 4 years ago

    Sir, my name is Mohith and I am a final-year BE student. Can you help me clear up some doubts on NLP? I am working on data generalization and data sanitization; our task is identifying whether given text is sanitized or not, and generalized or not. How does this work in Python? Could you help me out, sir, please? It would be very helpful to me.

  • @TallRiderX
    @TallRiderX 5 years ago +2

    The colab is labeled as "Course 3 - Week 1 - Lesson 1.ipynb" - where can I sign up for the full course? Thank you!

    • @laurencemoroney655
      @laurencemoroney655 5 years ago

      The colab was adapted from one I wrote at Coursera, where it was course 3 in the TensorFlow: In Practice specialization. There's more there. Otherwise, this is a 3-part series, with part 2 now on the YT channel :)

  • @Acumen928
    @Acumen928 5 years ago +1

    Thanks. Where is the next episode?

  • @LearnWithMilind
    @LearnWithMilind 5 years ago

    How many languages are supported? Or is only English supported?

    • @LaurenceMoroney
      @LaurenceMoroney 5 years ago

      I've only tried English, but this technique should work with most languages. Try the notebook linked, change the language, and see what happens?

  • @carlossegura403
    @carlossegura403 5 years ago

    Love Tokenizers ❤️

  • @hajar2629
    @hajar2629 4 years ago +1

    Thank you!
    How can I run the same example on my Raspberry Pi?

    • @abail7010
      @abail7010 4 years ago

      Medogb Medo, exactly the same way, once you've installed Python and TensorFlow.

  • @tingnews7273
    @tingnews7273 5 years ago

    Can anyone tell me what the "first principles" teaching method is?

    • @LaurenceMoroney
      @LaurenceMoroney 4 years ago

      "From first principles" means teaching with zero (or at least very few) assumptions.

    • @tingnews7273
      @tingnews7273 4 years ago

      @@LaurenceMoroney Thank you, I hope I can figure it out.

    • @felixakwerh5189
      @felixakwerh5189 4 years ago

      "From first principles" could also mean going from the smallest to the biggest, from the known to the unknown; basically, it's a way of breaking concepts down to their simplest form.

  • @vi.kran.t
    @vi.kran.t 4 years ago

    I want the TensorFlow track jacket that you're wearing.

    • @tcidude
      @tcidude 4 years ago

      Chanson

  • @rupeshmalpani
    @rupeshmalpani 4 years ago

    Can "1679 15223 2 153692" be a word?

  • @alexanderpohl1949
    @alexanderpohl1949 5 years ago +3

    04:15 is really misleading for anyone watching this as their entry to NLP. There are too many steps missing that need to be talked about in a 'Zero to Hero' tutorial series after this point, instead of jumping into sequencing, and even steps before this point.
    I see why these aren't included (because they are not included in TensorFlow), but at the same time this is just setting an unrealistic standard.
    In machine learning terms, I'd say... this video is just mislabeled.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +2

      ...and what are these steps? With these videos and the codelabs, we'll have everything we need to build a simple text classifier, the beginnings of NLP.

    • @தமிழோன்
      @தமிழோன் 4 years ago

      @@laurencemoroney655 Maybe he's referring to the cleanup required for grammar (like someone pointed out: play vs. plays)? However, TensorFlow cannot include that in the library as he's suggesting, because TensorFlow is not an English-only library but a more generic one.

  • @VibhootiKishor
    @VibhootiKishor 5 years ago

    Cool

  • @danylobaibak317
    @danylobaibak317 5 years ago

    A question about ignoring the "!": it seems the Tokenizer doesn't include "!" because it was filtered out as punctuation. Let's assume we want to keep punctuation and set `filters=''` for the Tokenizer. In this case, the Tokenizer is not smart enough to separate the token "dog" from the token "!".
    Here's an example in Colab: colab.research.google.com/drive/1M6Nf-WQxorf_X9z2jFnCSJ_QjrY3i5BJ
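
Following up on the observation above: with `filters=''` the punctuation survives but stays glued to the word ('dog!' becomes one token). To get '!' as its own token, the text needs to be pre-split before it reaches the tokenizer, for example with a regex. A sketch (this helper is not part of the Keras API):

```python
import re

def split_keep_punct(text):
    """Split into lowercase word tokens and standalone punctuation
    tokens, keeping internal apostrophes (e.g. you're)."""
    return re.findall(r"\w+'?\w*|[^\w\s]", text.lower())

print(split_keep_punct('I love my dog!'))
# ['i', 'love', 'my', 'dog', '!']
```

The pre-split tokens can then be joined with spaces (or fed as a list) so the tokenizer sees '!' as a separate word.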

  • @sujeeshsvalath
    @sujeeshsvalath 5 years ago

    How do we detect the difference between "I love my dog" and "I love not my dog"?

    • @Metalocif
      @Metalocif 5 years ago

      Beyond the obvious "it has one more word", there are several approaches.
      One that is fairly easy is to have a list of all words in a language with their connotations (this can be found online), one possible connotation being negation. Then you can write code that inverts the connotation of a word if there is a word that implies negation near it.

    • @LaurenceMoroney
      @LaurenceMoroney 4 years ago

      If you have lots of sentences that are similar except for the word 'not', label them accordingly and then train a classifier like we do here; the 'not' would become a really strong signal towards the negative. Give it a try, instead of using the sarcasm dataset. The code would be very similar to this video.

  • @douggale5962
    @douggale5962 1 year ago

    What? You end it just when I was expecting you to at least say to put the 1s in the input layer. This is how you tokenize in general; it has nothing to do with AI.

  • @HealthyFoodBae_
    @HealthyFoodBae_ 4 years ago

    Yay 😅

  • @rawnakfreak3539
    @rawnakfreak3539 1 year ago

    Exam after 9 hours (⁠T⁠T⁠)

  • @cr0wzzz
    @cr0wzzz 1 year ago

    The answer was simple all along. It's just "dog".

  • @siddvideos
    @siddvideos 5 years ago

    Too late to the party, TensorFlow!! It's not 2010. Love the video though, thanks 😎

  • @masternobody1896
    @masternobody1896 5 years ago +1

    This is so complicated.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +2

      Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.

    • @balachkhan1578
      @balachkhan1578 5 years ago

      @@laurencemoroney655 It's great. When will we get the next tutorial?

    • @laurencemoroney655
      @laurencemoroney655 5 years ago

      @@balachkhan1578 We're releasing them weekly.

    • @balachkhan1578
      @balachkhan1578 5 years ago +1

      @@laurencemoroney655 TensorFlow can't be installed with Python 3.8. Will the issue be solved, or should I switch to Python 3.7?

    • @LaurenceMoroney
      @LaurenceMoroney 5 years ago

      @@balachkhan1578 It's constantly being updated... so keep an eye on www.tensorflow.org/install. Right now it's up to 3.7 on there.

  • @Vinaybabuk
    @Vinaybabuk 5 years ago +1

    Very difficult to learn.

    • @laurencemoroney655
      @laurencemoroney655 5 years ago +1

      Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.

    • @tanvipurwar6048
      @tanvipurwar6048 3 years ago

      @@laurencemoroney655
      It is a bit complicated the first time, but taking up a small NLP dataset/project and then revisiting the video makes everything a lot clearer. Plus, you pick up on things that slipped your mind the first time :)

  • @fahemhamou6170
    @fahemhamou6170 2 years ago +1

    Thank you very much.