Train Custom Tokenizer using Hugging Face from Scratch | NLP | Byte Pair Tokenizer
- Published: 24 Oct 2024
- Oscar Dataset : oscar-corpus.c...
Notebook :github.com/kar...
Connect with me on :
1. LinkedIn: / karndeepsingh
2. Github: www.github.com...
#datascience #nlp #deeplearning #ecommercee
This video is very, very helpful, and very useful for modeling new languages. I am waiting for the next part: training the model from scratch on a new language.
It was really helpful. Can you continue this series with a step-by-step process? How can we now use this tokenized data to predict results?
Thanks for sharing.
Bro, I have a doubt about choosing the vocab size for my dataset. I used OSCAR's Telugu dataset and got 97 chunk text files in total. In your video you chose a vocab size of 3000, but I have more chunks. What vocab size would be preferable in my case?
You can choose a vocab size of 30K or 50K.
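For anyone trying this, here is a minimal sketch of where the vocab size goes when training a byte pair tokenizer with the Hugging Face `tokenizers` library. It is not the notebook from the video: the `train_bpe` helper, the special tokens, the whitespace pre-tokenizer, and the tiny sample corpus are all assumptions made for illustration. On a real corpus you would pass something like 30_000 or 50_000 instead of the small value used here.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def train_bpe(texts, vocab_size=30_000):
    """Train a BPE tokenizer on an iterable of strings (hypothetical helper)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    # Split on whitespace and punctuation before learning merges.
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,  # upper bound on the learned vocabulary
        special_tokens=["[UNK]", "[PAD]"],  # assumed special tokens
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Toy corpus, for illustration only; replace with lines from your chunk files.
sample = [
    "train a custom tokenizer from scratch",
    "byte pair encoding merges frequent pairs",
    "a larger corpus can support a larger vocabulary",
]
tok = train_bpe(sample, vocab_size=100)
print(tok.get_vocab_size())
print(tok.encode("train a tokenizer").tokens)
```

The vocab size is only an upper limit: on a small corpus the trainer can stop earlier because it runs out of pairs to merge.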
Do you have an idea of how to convert a csv file into a dataset for the same purpose? I want to make a training and data set out of it. it is only one column csv
You can convert the CSV data into a Hugging Face Dataset using the `datasets` Python library.
What are the consequences of using "max_length"?
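One way to see the effect is to set a maximum length on a tokenizer and compare a long and a short input. This is a sketch with the `tokenizers` library and assumes the question is about truncating and padding to a fixed length; the tiny word-level vocabulary and the length of 4 are invented so the example runs on its own.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary, for illustration only.
vocab = {"[UNK]": 0, "[PAD]": 1, "train": 2, "a": 3, "custom": 4,
         "tokenizer": 5, "from": 6, "scratch": 7}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Cap every encoding at 4 tokens and pad shorter ones up to 4.
tokenizer.enable_truncation(max_length=4)
tokenizer.enable_padding(length=4, pad_id=1, pad_token="[PAD]")

long_enc = tokenizer.encode("train a custom tokenizer from scratch")
short_enc = tokenizer.encode("train a")

print(long_enc.tokens)           # tokens past the limit are cut off
print(short_enc.tokens)          # short input is filled with [PAD]
print(short_enc.attention_mask)  # 0 marks padding positions
```

So the main consequences are that anything beyond the limit is dropped and the model never sees it, while shorter inputs carry padding. A limit that is too small loses information, and one that is too large spends memory and compute on padding.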