Train Custom Tokenizer using Hugging Face from Scratch | NLP | Byte Pair Tokenizer

  • Published: 24 Oct 2024
  • Oscar Dataset : oscar-corpus.c...
    Notebook :github.com/kar...
    Connect with me on :
    1. LinkedIn: / karndeepsingh
    2. Github: www.github.com...
    #datascience #nlp #deeplearning #ecommercee

Comments • 8

  • @VLM234 (2 years ago)

    This video is very helpful, especially for modeling new languages. I am waiting for the next part: training the model from scratch on a new language.

  • @englishsimplified786 (1 year ago)

    It was really helpful. Can you continue this series as a step-by-step process? How can we use this tokenized data to predict results?

  • @mohammedarsalan7164 (2 years ago)

    Thanks for sharing.

  • @gunasekharvenkatachennaiah1033 (9 months ago)

    Bro, I have a doubt about choosing the vocab size for my dataset. I used OSCAR's Telugu dataset and got 97 chunk text files in total. In your video you chose a vocab size of 3000, but I have more chunks. What would be the preferred vocab size in my case?

  • @user-lh5fx8if6b (2 years ago)

    Do you have an idea of how to convert a CSV file into a dataset for the same purpose? I want to make a training and data set out of it. It is a CSV with only one column.

    • @karndeepsingh (1 year ago)

      You can convert the CSV data into a Hugging Face Dataset by using the `datasets` Python library.

  • @jesusbaug (1 year ago)

    What are the consequences of using "max_length"?