Thank you so much. I had been looking for this for a long time.
Great video!
Great video! Thanks.
Hi, I just wanted to ask: are we using SSD-MobileNet-V2 or just SSD-MobileNet?
Thank you.
Great tutorial, thanks. But how can we print the coordinates (x, y and the centre point) of each object?
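If the model runs through jetson-inference's detectNet (as in this series), each detection exposes Left, Top, Right and Bottom fields (recent versions also expose a Center property). A minimal sketch of computing the centre point from such a box; the helper name box_centre is my own, not from the repo:

```python
def box_centre(left, top, right, bottom):
    """Return the (x, y) centre of a bounding box given its corner coordinates."""
    return ((left + right) / 2.0, (top + bottom) / 2.0)

# Example: a detection spanning x 40..120, y 30..90 has its centre at (80, 60).
print(box_centre(40, 30, 120, 90))  # (80.0, 60.0)
```

In a detectNet loop you would call this with detection.Left, detection.Top, detection.Right, detection.Bottom for each returned detection.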
I'm getting a "no module named torch.fx" error, even though torch is installed.
Also, how do I train using the GPU instead of the CPU, which is being used by default?
Thanks for sharing. Is any pre-installation of libraries required? There were lots of import errors when running the training script.
Thank you for the tutorial. How can I extract the mAP@0.5 and the training loss together?
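One approach, assuming the log format train_ssd.py prints (as in the excerpts further down this thread), is to parse the training log with a regular expression; parse_training_log below is an illustrative helper, not part of the repo:

```python
import re

# Matches lines like:
# 2023-06-28 14:01:23 - Epoch: 0, Training Loss: 78.1188, ...
LOSS_RE = re.compile(r"Epoch: (\d+), Training Loss: ([\d.]+)")

def parse_training_log(text):
    """Return a list of (epoch, training_loss) tuples found in a log dump."""
    return [(int(e), float(l)) for e, l in LOSS_RE.findall(text)]

log = "2023-06-28 14:01:23 - Epoch: 0, Training Loss: 78.1188, Training Regression Loss 71.2975"
print(parse_training_log(log))  # [(0, 78.1188)]
```

mAP is not printed by the training loop itself; if the repo's evaluation script is available, you could run it per checkpoint and join its mAP@0.5 output with the parsed losses by epoch.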
Thanks for the video. I have a couple of questions: you set epochs=500 at 12:00, but you get the best checkpoint at epoch 963 at 15:33. How does that happen? Did you train more after epoch 500? If so, which command did you use? Could you please help me?
Yes, I first trained it for 500 epochs, but then I retrained it for 1000 epochs. I retrained it from scratch, but you can resume training from your last checkpoint by using --resume. Read more here: forums.developer.nvidia.com/t/how-to-resume-pytorch-trainning-in-jetson-nano/84197
@@RocketSystems Hi again. I have a few more questions. We have trained our model for around 1200 epochs. Our model has 5 classes, and for each class we took 48 images. We used Roboflow to annotate those images, and then used Roboflow's augmentation feature to produce new images from these 5*48 = 240 images, so overall our dataset has around 600 images across the 5 classes. We trained our SSD model using the jetson-train GitHub repo on those 600 images. However, at epoch 1200 our loss was around 1.64, and we thought it would not decrease any further.
My first question: would the loss have decreased if we had trained the model for longer? If so, how low does the loss need to be for a model to be considered good? Some sources say a loss of 0.01 is OK.
Second: how many images per class should the dataset contain to get a well-trained model, and for how many epochs do you suggest we train? Should we train until the loss drops to around 0.1 or 0.01, assuming it will reach that threshold at some point?
Last question: we tested our model by showing the camera the objects it was trained on, and the confidence was around 90-100%. However, when we showed it an object it had not seen before, the confidence dropped to around 65%. Is it possible that our model is overfitting? If so, how can we detect it, and how can we prevent it?
I would be grateful if you could answer my questions.
Thanks a lot.
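On detecting overfitting: the usual signal is validation loss stalling or rising while training loss keeps falling, rather than any absolute threshold like 0.01. A small illustrative heuristic (the function and its patience parameter are my own sketch, not from the repo):

```python
def looks_overfit(train_losses, val_losses, patience=3):
    """Heuristic: overfitting is likely if validation loss has not improved
    for `patience` consecutive epochs while training loss kept decreasing."""
    best = min(val_losses)
    epochs_since_best = len(val_losses) - 1 - val_losses.index(best)
    train_still_falling = train_losses[-1] < train_losses[-1 - patience]
    return epochs_since_best >= patience and train_still_falling

train = [3.0, 2.5, 2.1, 1.8, 1.6, 1.4]
val   = [3.2, 2.8, 2.6, 2.7, 2.9, 3.1]   # best at epoch 2, worsening since
print(looks_overfit(train, val))  # True
```

train_ssd.py logs a validation loss at each checkpoint interval, so these two curves can be compared directly; common remedies are more (and more varied) images per class, stronger augmentation, and stopping at the epoch with the best validation loss.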
Hi, I'm getting an error when doing my image training on my PC, which has a GPU. The error is:
"Traceback (most recent call last):
File "train_ssd.py", line 13, in
from torch.utils.tensorboard import SummaryWriter
File "/home/kml/rohin/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 1, in
import tensorboard
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/__init__.py", line 4, in
from .writer import FileWriter, SummaryWriter
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/writer.py", line 28, in
from .summary import scalar, histogram, image, audio, text
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/summary/__init__.py", line 22, in
from tensorboard.summary import v1 # noqa: F401
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/summary/v1.py", line 21, in
from tensorboard.plugins.audio import summary as _audio_summary
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/plugins/audio/summary.py", line 34, in
from tensorboard.plugins.audio import metadata
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/plugins/audio/metadata.py", line 18, in
from tensorboard.compat.proto import summary_pb2
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/compat/proto/summary_pb2.py", line 17, in
from tensorboard.compat.proto import histogram_pb2 as tensorboard_dot_compat_dot_proto_dot_histogram__pb2
File "/home/kml/rohin/lib/python3.8/site-packages/tensorboard/compat/proto/histogram_pb2.py", line 18, in
DESCRIPTOR = _descriptor.FileDescriptor(
File "/home/kml/rohin/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 1024, in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "tensorboard/compat/proto/histogram.proto":
tensorboard.HistogramProto.min: "tensorboard.HistogramProto.min" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto.max: "tensorboard.HistogramProto.max" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto.num: "tensorboard.HistogramProto.num" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto.sum: "tensorboard.HistogramProto.sum" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto.sum_squares: "tensorboard.HistogramProto.sum_squares" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto.bucket_limit: "tensorboard.HistogramProto.bucket_limit" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto.bucket: "tensorboard.HistogramProto.bucket" is already defined in file "tensorboard/src/summary.proto".
tensorboard.HistogramProto: "tensorboard.HistogramProto" is already defined in file "tensorboard/src/summary.proto".
I got my low-loss model; I just need to test it with other input images in Colab to check the model. I got confused from the .pth file onwards. I didn't install/download the Jetson files.
You can test your .pth file with images. But what is shown in the video is how to train a model for Jetson hardware, so you need a Jetson board to convert and build the .pth file into TensorRT.
Hello, after training my custom dataset and exporting it to ONNX, it gives me an error saying: OSError: couldn't find valid .pth checkpoint under 'models/TuodMango'
You need to transfer your best checkpoint to the Jetson device and then export it to ONNX.
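The export step looks for a .pth checkpoint under models/<dataset-name>, so the error above usually means that directory is empty or on the wrong machine. A sketch of how the lowest-loss checkpoint can be picked by filename, assuming checkpoints are named like mb1-ssd-Epoch-90-Loss-1.64.pth as the training script produces (the helper itself is illustrative, not the repo's code):

```python
import re

def best_checkpoint(filenames):
    """Pick the checkpoint with the lowest loss encoded in its filename.
    Assumes names like 'mb1-ssd-Epoch-90-Loss-1.64.pth'."""
    loss_re = re.compile(r"Loss-([\d.]+)\.pth$")
    scored = []
    for name in filenames:
        match = loss_re.search(name)
        if match:  # skip files that don't follow the naming scheme
            scored.append((float(match.group(1)), name))
    return min(scored)[1] if scored else None

ckpts = ["mb1-ssd-Epoch-10-Loss-3.21.pth",
         "mb1-ssd-Epoch-90-Loss-1.64.pth"]
print(best_checkpoint(ckpts))  # mb1-ssd-Epoch-90-Loss-1.64.pth
```

If best_checkpoint returns None for your models/ directory listing, no validly named .pth file is present, which matches the OSError above.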
Do I need to use Ubuntu for this, or is Windows okay?
I tried adapting the scripts to Windows format with GPT, but they didn't work properly, so I used a Jetson Nano to train it.
Hi, what if you already have JPEG files and don't need to extract them from a video? I want to create those directories, but I don't know the script.
If you already have your images, you do not need to use the script mentioned. Just make sure you follow the exact directory structure shown in this video.
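For reference, the repo expects a Pascal VOC-style layout for custom datasets. A sketch that creates the empty skeleton to drop your own JPEGs and annotations into (the dataset name "fruit" is just an example):

```python
import os

def make_voc_skeleton(root):
    """Create the empty Pascal VOC-style directory layout that
    train_ssd.py expects for a custom dataset."""
    for sub in ("Annotations", os.path.join("ImageSets", "Main"), "JPEGImages"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # ImageSets/Main holds the train/val split lists (one image ID per line).
    for split in ("train.txt", "val.txt", "trainval.txt", "test.txt"):
        open(os.path.join(root, "ImageSets", "Main", split), "a").close()
    # labels.txt lists one class name per line.
    open(os.path.join(root, "labels.txt"), "a").close()

make_voc_skeleton("fruit")
```

Your XML annotations go in Annotations/, the JPEGs in JPEGImages/, and each image's base name (without extension) goes into the split lists.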
I have a question: how do I select the "mb2-ssd-lite" model? And can you give me the "mb2-ssd-lite-mp-0_686.pth" file? I get a bug when training. Please!
This is already provided in the repo.
Can I convert the trained model to frozen_inference_graph.pb?
Hey,
I need to make a mobile application for food nutrition detection using AI for diabetic patients. Can you help me achieve this?
Hi, I do not have much experience with mobile application development.
How do I detect from a camera?
What do you need to do, exactly?
I'm unable to install labelImg. Any idea why?
labelImg only supports lower Python versions, like 3.9.
Can I use this on a Raspberry Pi 4?
I didn't get you. Do you want to train on the Raspberry Pi? The Raspberry Pi should not be used for training. If you just want to run the model, then yes, you can use it, but it will be very slow.
@@RocketSystems No, instead of the Jetson Nano I want to deploy it on a Raspberry Pi 4, and I will train it on Colab. Can you help me out with that?
@@lemonbitter7641 Did you manage to do it, or not yet?
Hello, can you please provide me a link for pre-trained datasets?
Do you need the dataset which I created for the fruits, or is there anything else you are asking for?
@@RocketSystems for the fruits only
@@RocketSystems Also, do you have knowledge of YOLOv3? I have some errors in that too.
I do not have the dataset, but you can download the model files from the repository mentioned in the description.
@@RocketSystems What about YOLO? Do you have knowledge of that? I have an annoying error that won't go away.
I have this error, can you help me, please?
2023-06-28 14:00:55 - Epoch: 0, Step: 10/60, Avg Loss: 77.0541, Avg Regression Loss 66.9335, Avg Classification Loss: 10.1206
2023-06-28 14:01:00 - Epoch: 0, Step: 20/60, Avg Loss: 72.4313, Avg Regression Loss 64.5434, Avg Classification Loss: 7.8879
2023-06-28 14:01:07 - Epoch: 0, Step: 30/60, Avg Loss: 74.7886, Avg Regression Loss 67.5160, Avg Classification Loss: 7.2726
2023-06-28 14:01:12 - Epoch: 0, Step: 40/60, Avg Loss: 65.2462, Avg Regression Loss 60.2236, Avg Classification Loss: 5.0226
2023-06-28 14:01:18 - Epoch: 0, Step: 50/60, Avg Loss: 80.1739, Avg Regression Loss 75.2484, Avg Classification Loss: 4.9255
2023-06-28 14:01:23 - Epoch: 0, Training Loss: 78.1188, Training Regression Loss 71.2975, Training Classification Loss: 6.8214
Traceback (most recent call last):
File "/content/jetson-train-main/train_ssd.py", line 400, in
val_loss, val_regression_loss, val_classification_loss = test(val_loader, net, criterion, DEVICE)
File "/content/jetson-train-main/train_ssd.py", line 200, in test
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/jetson-train-main/vision/nn/multibox_loss.py", line 41, in forward
classification_loss = F.cross_entropy(confidence.reshape(-1, num_classes), labels[mask], size_average=False)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
IndexError: Target 4 is out of bounds.
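The "Target 4 is out of bounds" from cross_entropy means some ground-truth label index is >= the number of classes the network was built with, which in this repo usually points to a mismatch between labels.txt and the annotations (the VOC loader typically prepends a BACKGROUND class, so 4 listed classes give valid targets 0-4 with num_classes = 5). A small illustrative pre-check, not part of the repo:

```python
def check_targets(targets, num_classes):
    """Raise early, with a clearer message, if any label index would be
    out of range for a cross-entropy loss over `num_classes` classes."""
    bad = [t for t in targets if not 0 <= t < num_classes]
    if bad:
        raise ValueError(
            f"labels {bad} out of range for {num_classes} classes; "
            "check that labels.txt lists every annotated class")
    return True

print(check_targets([0, 1, 2, 3], 4))  # True
# check_targets([0, 4], 4) would raise: label 4 needs num_classes >= 5.
```

In practice: count the distinct class names in your annotation XMLs and make sure labels.txt contains all of them, then retrain so the network's output size matches.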