I wanted to express my sincere appreciation for your videos on RUclips. They have been immensely helpful to me in my Ph.D. thesis, particularly in understanding how to pre-train using MLM and fine-tune the BERT model.
I thoroughly enjoy watching your videos, and they have provided valuable insights and guidance for my research. Thank you for creating such informative and engaging content.
The best video on how to use BERT in TensorFlow, thank you!
I don't comment on videos, but your video is so clear and easy to understand I had to just say thank you! I have been trying to solve a multi class problem with an LLM for months without significant progress. Using your video, I was able to make more progress by training a BERT model in a few days than I have in months! Please keep posting. It's immensely helpful for the rest of us.
Thank you so much for this tutorial. Most tutorials really piss me off because they always refer back to other videos they made to explain why things work, but you explained each step as you did it, and this is super good for someone with a temperament like mine. Appreciate it, you're a beast!
haha thanks Kenneth, I try to assume we're starting at the start for every video :)
This video is excellent, sir. I was looking for a video like this for two straight days.
That's awesome to hear, happy you found it, thanks!
Thanks so much, James. On my 1st attempt I was able to get to ~51% accuracy. I will need to make some tweaks, but I'm so excited about this! Woohoo!
Very crisp and nicely structured, with the objective of the exercise stated right at the start
thanks, useful to know stating the objective helps!
Thank you so much!! I was really stuck with the prediction part for a very long time. This will help me a lot.
Thank you so much, sir! Best video I've ever seen on RUclips; it clearly explains each step.
At 42:00, in cell 9, it returns an array of what? What do those numbers mean?
Hey, great video! Just got a question: in my dataset some texts have multiple labels. Can I just set multiple labels to 1 in the labels[] array at 13:47?
This was great! One question: what if you wanted to use additional features besides the BERT embeddings in the training dataset? What would be the best approach? Do some type of model stacking, where you take the output of the sentiment model and use that combined with other features as input to another model? Or is there a better way to merge/concatenate the additional features onto the BERT word vector training data?
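One common pattern for that: concatenate the extra features onto BERT's pooled output before the classifier head. A minimal sketch, not from the video, assuming five sentiment classes, 512-token inputs, and a hypothetical 8-dimensional extra feature vector per sample:

import tensorflow as tf
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')

input_ids = tf.keras.layers.Input(shape=(512,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(512,), name='attention_mask', dtype='int32')
extra = tf.keras.layers.Input(shape=(8,), name='extra_features', dtype='float32')  # hypothetical extra features

# pooled [CLS] representation, shape (batch, 768); index [1] is the pooler output
embeddings = bert.bert(input_ids, attention_mask=mask)[1]
x = tf.keras.layers.Concatenate()([embeddings, extra])  # merge BERT vector with the extra features
x = tf.keras.layers.Dense(512, activation='relu')(x)
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)

model = tf.keras.Model(inputs=[input_ids, mask, extra], outputs=y)

The stacking approach you mention also works; concatenation just keeps everything trainable end-to-end in one model.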
This is a fantastic tutorial! Excellent stuff, even for non-experts. I wonder how one would go about adding (domain-specific) tokens to the BERT tokenizer before training. Where in the workflow can that be done?
Hi Simon, there are two approaches: you train from scratch (obviously this takes some time) OR you can add tokens. I want to cover this soon, but here's an example: github.com/huggingface/transformers/issues/1413#issuecomment-538083512
@@jamesbriggs Great! So add tokens to the tokenizer before training on the labeled data, right?
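For later readers, a minimal sketch of the add-tokens route, adapted from the linked issue rather than the video: add the new tokens to the tokenizer, then resize the model's embedding matrix so the new ids have (randomly initialized) vectors:

from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained('bert-base-cased')

# hypothetical domain-specific terms; add_tokens returns how many were actually new
num_added = tokenizer.add_tokens(['electroencephalogram', 'myocardial'])
model.resize_token_embeddings(len(tokenizer))

Do this before encoding the labeled data, so the new ids appear in your training inputs.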
Great tutorial! I wonder if the seq_length has to be that long if we work with short phrases?
This video helped, thanks. Using BERT does need a GPU subscription, though.
Really good tutorial! Thank you so much, an awesome teacher; you made the model easy and simple to understand. Is there any similar tutorial for BERT multi-label sequence classification, or can the same code be used for multi-label classification?
Thanks! You should be able to use the same code, just change the output layer dimensions to align with your new number of output labels :)
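One caveat to add, as a hedged sketch rather than anything from the video: if a single sample can have several labels set to 1 at once, you'd typically also swap the softmax for a sigmoid and use binary cross-entropy, so each label is scored independently:

import tensorflow as tf

n_labels = 10  # your number of labels
pooled = tf.keras.layers.Input(shape=(768,), name='pooled_bert_output')  # stand-in for BERT's pooled output
# sigmoid (not softmax) lets several labels be 'on' at the same time
y = tf.keras.layers.Dense(n_labels, activation='sigmoid', name='outputs')(pooled)

model = tf.keras.Model(inputs=pooled, outputs=y)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.AUC(multi_label=True)])

This also answers the earlier question about setting multiple labels to 1 in the labels[] array: yes, but pair it with a sigmoid output like this.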
James, thank you! I was stuck before on extracting the BERT embedding for a TF layer, as almost everyone shows this part using other libraries like TensorFlow Hub, tensorflow_text, etc., and I cannot use them in my project due to limitations.
Will try your approach. Thanks a lot!
Glad it helps!
Hi, first of all, thank you for this nice video. How can we make a confusion matrix and classification report here?
How can I find the confusion matrix for this kind of dataset?
Hello, thank you for this great video.
I followed the steps but I get an error.
Can you help me please?
Getting this error: Unknown layer: Custom>TFBertMainLayer. Please ensure this object is passed to the `custom_objects`.
Does anybody have any idea?
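A fix that usually works for that one (a sketch, not guaranteed for every transformers version): pass the layer class under the exact name from the error message:

import tensorflow as tf
import transformers

model = tf.keras.models.load_model(
    'your_model.h5',  # hypothetical path to your saved model
    custom_objects={'TFBertMainLayer': transformers.TFBertMainLayer})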
Thank you for this very explanatory video. I tried following along with another dataset, but each time I try to one-hot-encode my labels with these 3 lines of code:

arr = df['rating'].values
labels = np.zeros((num_samples, arr.max()))  # my label values are from 1-10
labels[np.arange(num_samples), arr] = 1

I get "'numpy.float64' object cannot be interpreted as an integer".
Hi Sir, great explanation, and I followed along to implement the same, but I got this error when training the model:
InvalidArgumentError: Data type mismatch at component 0: expected double but got int32.
[[node IteratorGetNext (defined at :1) ]] [Op:__inference_train_function_20701]
It seems like one of the datatypes for (probably) your inputs is wrong. You will need to add something like dtype=float32 to your input layer definitions,
OR it may be that your data must be converted to float first, before being processed by the model.
Just convert Xids and Xmask to float64 before creating the pipeline:

Xids = np.float64(Xids)
Xmask = np.float64(Xmask)
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))
This is awesome.
Hi James Briggs, I found that with this way of dividing the validation/train data, the validation and train sets vary every time. When I save the trained model and load it to evaluate on the validation data again, I get different results on each run. Should I divide the train and validation data from the beginning, rather than using SPLIT = 0.9 at this stage? Does it compromise the accuracy of the trained model? Thanks.
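In case it helps anyone with the same question: one option, as a sketch rather than the video's exact approach, is to seed the shuffle and stop it reshuffling between iterations, so the same take/skip split comes out every run. Assuming the dataset and num_samples from the video's pipeline:

import tensorflow as tf

SPLIT = 0.9
size = int(num_samples * SPLIT)  # counts elements; if you batch first, count batches instead

# a fixed seed plus reshuffle_each_iteration=False makes the split reproducible
dataset = dataset.shuffle(100000, seed=42, reshuffle_each_iteration=False)
train_ds = dataset.take(size)
val_ds = dataset.skip(size)

Splitting once up front and saving both sets to disk works just as well, and avoids the question entirely.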
Excellent! Where can I find the code used in the video?
Code is split between a few different notebooks on Github - they're all in this repo folder: github.com/jamescalam/transformers/tree/main/course/project_build_tf_sentiment_model - hope it helps :)
@@jamesbriggs Thanks. That surely helps! Keep up the good work James, I see you are working on a Transformers course. Will be looking forward to it!
After training I get around 60% accuracy, but when I try to predict, I never get the model to predict sentiment 0 or 4. Do you have any idea why the model has problems with these?
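If anyone else hits this: sentiments 0 and 4 are typically much rarer in this dataset than 1-3, so the model can minimize loss while ignoring them. One hedged option, assuming the one-hot labels array from the video and TF >= 2.2 (earlier versions don't support class_weight with tf.data), is to weight classes inversely to frequency:

import numpy as np

counts = labels.sum(axis=0)  # per-class sample counts from the one-hot array
class_weight = {i: len(labels) / (len(counts) * c) for i, c in enumerate(counts)}

model.fit(train_ds, validation_data=val_ds, epochs=3, class_weight=class_weight)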
Great tutorial! Thanks
Thank you for such an amazing collection :) Just one question: while loading the model, I get this error: ValueError: Cannot assign to variable bert/embeddings/token_type_embeddings/embeddings:0 due to variable shape (2, 768) and value shape (512, 768) are incompatible.
Can you let me know why that is? Thank you so much in advance.
Hey Asim, I would double-check that you are tokenizing everything correctly. The 512 that you see is the standard number of tokens consumed by BERT, which we set when encoding our text with the tokenizer :)
@@jamesbriggs I got it and solved the problem. Thank you so much :)
Nice example! Could you also use the same technique if you want to classify text into more than 5 categories, for example 10 or 20? And if each class is not perfectly balanced and it is NOT English text? 😉
Haha, yes you could. There are pretrained BERT models for different languages; if your language wasn't available, we'd want to train from scratch on the new language (mentioned in the last comment). As for training with more categories: yes, we can do that using the same code we use here; we just switch our training data to the new 10-20 class data and update the classifier layer output size to match :)
Great tutorial, like always, thanks!
Thanks I appreciate these comments a lot! :)
Hi there, I am trying to generate a confusion matrix, but because the dataset is shuffled I'm not able to, and it's giving me random values. Any ideas what to do? (The accuracy and loss are pretty good while training the model.)
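One way around the shuffling, as a sketch assuming the val_ds pipeline from the video: collect labels and predictions from the same pass over the batches, so they cannot be misaligned:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_true, y_pred = [], []
for batch_inputs, batch_labels in val_ds:
    probs = model(batch_inputs, training=False).numpy()
    y_pred.extend(np.argmax(probs, axis=1))
    y_true.extend(np.argmax(batch_labels.numpy(), axis=1))

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))

This also covers the earlier questions about building a confusion matrix and classification report here.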
Hi James, first of all, great tutorial!
I tried implementing the same architecture with a different dataset, but the model training time is insane (+50h). Do you have any clue why it takes so much time?
Thank you!
It can take a long time; it will depend on the hardware setup you have. I'm using a 3090 GPU, so it is reasonably fast. I would double-check that you are actually using the GPU (if you have a compatible one). If you search something like 'tensorflow GPU setup' you should find some good explanations. Hope that helps!
Nice, informative video. It would be nice if you could help me understand how to change this to PyTorch:
# create the dataset object
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

def map_func(input_ids, masks, labels):
    # convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

# then we use the dataset map method to apply this transformation
dataset = dataset.map(map_func)
I'm not using PyTorch for sentiment analysis in this example (I use it for masked language modeling instead), but the dataset build logic is very similar; see this video at ~14:57:
ruclips.net/video/R6hcxMMOrPE/видео.html
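For the PyTorch question above, a rough equivalent of that build, as a sketch assuming Xids, Xmask, and labels are the numpy arrays from the video:

import torch
from torch.utils.data import DataLoader, TensorDataset

# the analogue of from_tensor_slices: one dataset of aligned tensors
dataset = TensorDataset(
    torch.tensor(Xids, dtype=torch.long),
    torch.tensor(Xmask, dtype=torch.long),
    torch.tensor(labels, dtype=torch.float))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# the map_func step happens at batch time: unpack into the dict BERT expects
for input_ids, attention_mask, batch_labels in loader:
    batch = {'input_ids': input_ids, 'attention_mask': attention_mask}
    # outputs = model(**batch); loss = loss_fn(outputs.logits, batch_labels)
    break  # just demonstrating the first batch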
Amazing tutorial 👏👏👏.
If you are going to use your model on another machine, it's better to save to h5 format.

# Saving the model
model.save('your_model.h5')

# Loading the model on another machine
import tensorflow as tf
import transformers
model = tf.keras.models.load_model(
    'your_model.h5',
    custom_objects={'TFBertMainLayer': transformers.TFBertMainLayer})
Hey Fares, thanks, appreciate the info! I assume you recommend it because we then only have a single file to transfer, rather than several?
@@jamesbriggs
I am working on a hate speech detection project. I trained the model on Kaggle, and after saving it, it worked in the same notebook but not on my local machine. Saving directly requires saving the configuration too;
I didn't find out how to do that, so I saved the model in h5 format.
Hi sir, I'm following step by step on Google Colab, but it's running out of RAM. They give me 12.69GB; in most cases that happens due to code problems. Any idea? Thank you!
Google Colab can be difficult with the amount of memory you're given, transformers use *a lot* - one thing that can help is loading your data in batches (so you're not storing it all in memory), one of my recent videos covers this, it might help: ruclips.net/video/r-zQQ16wTCA/видео.html
@@jamesbriggs Okay, I'll see it. Thank you!
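For anyone else hitting the limit, a hedged sketch of the batching idea: tokenize the dataframe in chunks and write into memory-mapped arrays on disk instead of holding everything in RAM. Assuming the df (with a hypothetical 'Phrase' column) and tokenizer from the video:

import numpy as np

SEQ_LEN = 512
num_samples = len(df)

# memory-mapped arrays live on disk, so they barely touch Colab RAM
Xids = np.memmap('xids.dat', dtype='int32', mode='w+', shape=(num_samples, SEQ_LEN))
Xmask = np.memmap('xmask.dat', dtype='int32', mode='w+', shape=(num_samples, SEQ_LEN))

chunk = 1000
for start in range(0, num_samples, chunk):
    phrases = df['Phrase'].iloc[start:start + chunk].tolist()
    tokens = tokenizer(phrases, max_length=SEQ_LEN, truncation=True,
                       padding='max_length', return_tensors='np')
    Xids[start:start + len(phrases)] = tokens['input_ids']
    Xmask[start:start + len(phrases)] = tokens['attention_mask']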
Thank you for this great video. I tried following along with another dataset, but each time I try to one-hot-encode my labels I keep getting an error that says "'numpy.float64' object cannot be interpreted as an integer". Any idea how to fix this? Thank you.
Same here, did you find any solution?
@@abAbhi105 Yes, I did. I cast my array elements to integers:

arr = arr.astype(int)
labels[np.arange(num_samples), arr - 1] = 1
Hi there, please can you share the notebook?
Hey, it's not necessarily exactly the same, but you will find very similar code here: github.com/jamescalam/transformers/tree/main/course/project_build_tf_sentiment_model
Is it an implementation from scratch, or fine-tuning?
#model.layers[2].trainable = False
Hey Amit, this sets the internal BERT layers to not train, but still allows us to train the classifier layers (which are layers 3, 4, etc.). We can actually train the BERT layer too by removing that line, but training time will be much longer.
I want something from scratch
Hi, excellent tutorial! I have a problem: when I replicate your code, in the part where I use tokenizer.encode_plus() I get ValueError: could not broadcast input array from shape (15) into shape (512). It says that the error is here: Xids[i, :] = tokens['input_ids']. Thanks.
Does it work if you write Xids[:, i] = tokens['input_ids']? Otherwise, double-check the Xids dimensionality with Xids.shape and make sure it lines up with what we would expect (e.g. num_samples and 512).
I had the same issue, and I solved it by putting pad_to_max_length = True instead of padding = 'max_length'.
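For reference, the broadcast error means padding wasn't applied, so the token list stayed at its natural length (15) instead of 512. A call that should pad correctly, as a sketch assuming the tokenizer, text, and arrays from the video (argument names shifted between transformers versions, hence both flags):

tokens = tokenizer.encode_plus(
    text,
    max_length=512,
    truncation=True,
    padding='max_length',      # newer transformers versions
    # pad_to_max_length=True,  # the older flag mentioned above
    add_special_tokens=True,
    return_attention_mask=True)
Xids[i, :] = tokens['input_ids']
Xmask[i, :] = tokens['attention_mask']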
Hi James, at the very end, when you predicted your new sentiment data with your model, you assigned it to:

probs = model.predict(test)

I would like to know how to export that predicted data to CSV format so that one can submit it on Kaggle:

test['sentiment'] = model.predict(test['phrase'])
submission = test[['tweetid', 'sentiment']]
submission.to_csv('bertmodel.csv', index=False)

Is this the correct way of going about it? :) Because I want sentiment values when exported.
I think you might need to perform a np.argmax() operation on the model.predict output, to convert from output logits to predicted labels, but otherwise it looks good :)
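Putting that together, a sketch of the export, assuming probs from model.predict and the tweetid column above:

import numpy as np

preds = np.argmax(probs, axis=1)  # logits/probabilities -> predicted class per row

submission = test[['tweetid']].copy()
submission['sentiment'] = preds
submission.to_csv('bertmodel.csv', index=False)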
Hi James, great explanation, and I followed along to implement the same, but I got this error:
InvalidArgumentError: indices[2,2] = 29200 is not in [0, 28996)
[[node model/bert/embeddings/Gather (defined at /usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_tf_bert.py:188) ]] [Op:__inference_train_function_488497]
I know it's related to the embedding token id. Can you help me resolve this?
Luckily, I got the solution :)
@@madhavimourya1157 Oh good to hear, was it in your dataset definition?
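For anyone else who lands on this error: 28996 is the bert-base-cased vocabulary size, so an id of 29200 usually means the tokenizer and model checkpoints don't match, or tokens were added to the tokenizer without resizing the model's embeddings. A hedged sanity-check sketch:

from transformers import BertTokenizer, TFBertModel

# both must come from the *same* checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained('bert-base-cased')

# if the tokenizer can emit ids the model doesn't know, grow the embedding matrix
if len(tokenizer) > model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))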