See full course on Object Detection: ruclips.net/p/PL1GQaVhO4f_jLxOokW7CS5kY_J1t1T17S and Subscribe to my channel
If you found this tutorial useful, please share it with your friends (WhatsApp/iMessage/Messenger/WeChat/Line/KaTalk/Telegram) and on social media (LinkedIn/Quora/Reddit).
Tag @cogneethi on twitter.com
Let me know your feedback @ cogneethi.com/contact
This is the simplest, most lucid explanation of the topic I've heard.
There are very few good explanations of this on the net, including in the online courses. You really made it easier to understand, thanks!
Hey, can you help with the coding parts? I am a beginner.
Thank you so much, sir. I was looking for an explanation and unfortunately could not find one, even after watching a lot of YouTube videos and reading articles. But after watching this video, my confusion is cleared up. Thanks again.
That was an amazing explanation and insight about the bounding box regressor. Rare video about this topic. I appreciate your efforts.
Thanks Ganesh!
@@Cogneethi By the way, are you still planning to cover the YOLO framework?
@@ganeshchalamalasetti2884 Not yet. I'm caught up with some projects, so I'm not getting the time. Maybe later.
@@Cogneethi Makes sense. All the best 👍
Very good videos. Short and crisp. Easy to understand. Thank you very much.
Which model is good for detecting bounding boxes of customer demographic fields on national ID cards?
This is excellent! If you want to understand YOLO, YOU NEED TO WATCH THIS VIDEO!!! I had no idea how bounding boxes were predicted. Now it's clear; the only thing I have to figure out is how to split my last layers into a classifier and a regressor.
Thanks !!!!!
Welcome Valentin!
@@Cogneethi The theory explanation is good, but can you also share the code for this?
Most simplified video on YouTube, keep it up bro, hats off for your explanation. I would like to learn the coding part of it.
Amazing video!!! You deserve more subs!
When training the neural network, should all the other bounding boxes be zeros? Say there are three classes: *people, boat, tv.* If the image contains only a boat, what is the ground truth for people and tv?
We will still have bounding box estimates for the other classes, but they will not be considered. Only the BBox corresponding to the highest-scoring class is considered.
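For readers wondering how that selection step looks in code, here is a minimal PyTorch sketch (the names cls_prob and bbox_pred are illustrative, not taken from the video's code):

import torch

num_classes = 3                                  # e.g. people, boat, tv
cls_prob = torch.tensor([0.1, 0.8, 0.1])         # per-class probabilities from softmax
bbox_pred = torch.randn(num_classes * 4)         # 4 coordinates predicted per class

cls_id = int(torch.argmax(cls_prob))             # highest-scoring class (here: boat)
box = bbox_pred[cls_id * 4 : (cls_id + 1) * 4]   # keep only that class's 4 values
print(cls_id, box)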
Do we need a 4-output layer for the bounding box regressor?
You're a good teacher. I know that, as you said on your website, video/audio recording is time consuming and is not for the faint of heart 😀, but I hope you'll continue to do these tutorials.
One question: do you have examples of actually coding the neural networks you explain in your videos? I looked on your website and your github account but didn't find anything. It might be that I didn't do a very good job at searching. 😊
@Bianca,
Regarding the code, I have not posted it on GitHub or my website. I will probably post it to GitHub; I will comment here and let you know as soon as I do. (But first I have to find it, I don't know where I saved it :( )
Thank you for the encouragement. I will try to do more of these as and when I find time. :)
And let me know if I have made any mistakes and how I can improve, since, like you, I am still learning and not an expert yet!
Meanwhile, these are the libraries that I used in this tutorial:
HOG: scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html
SVM: scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
VGG & Faster RCNN: github.com/endernewton/tf-faster-rcnn
Code and PPT are here: drive.google.com/drive/folders/120KC9i3F0WMhqksngS-dWS1iJNP-mXAv?usp=sharing
Since the feature maps separate into two paths, one for classification and the other for the bounding box regressor, how does one know which classification result (object1, object2, object3) belongs to which bounding box (box1, box2, box3)?
What I mean is, the result could be object1 -> box2, but in your explanation, object1 -> box1.
My question is: how do we decide which object belongs to which bounding box?
Thank you so much for explaining this!!
Very clear explanation. Keep creating.
Sir, what you said is theoretically very good, thank you. I trained the model with Faster RCNN. The model gives me good outputs and draws the boxes, but I can't get the coordinates of the boxes from the code. How can I do this?
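In case it helps other readers stuck on the same point: with a torchvision Faster R-CNN (which may differ from the model used in the question above), the box coordinates come back directly in the output dict. A rough sketch, assuming a pretrained torchvision model:

import torch
import torchvision

# A pretrained detector; the questioner's model may differ, this is only a sketch.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)          # stand-in for a real image tensor in [0, 1]
with torch.no_grad():
    outputs = model([image])             # list with one dict per input image

boxes = outputs[0]["boxes"]              # (N, 4) tensor of (x1, y1, x2, y2)
scores = outputs[0]["scores"]
for box, score in zip(boxes, scores):
    if score > 0.5:                      # keep confident detections only
        print(box.tolist())              # the numeric coordinates of one box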
Amazing explanation!
At 4:10, what do you mean by modifying the last FC layer to obtain the BBox coordinates? Are these coordinates obtained from 4 layers of the FC network or from only the last FC layer? Please clarify.
There should be 2 loss functions used here, right?
1. For the classification layer
2. For the regression layer
Do we sum the losses before backpropagating through the network? What I understand is: the classification loss backpropagates "only" through the classification layer, AND the regression loss backpropagates "only" through the regression layer (only through the 4 neurons corresponding to the class predicted by the classification layer). Both losses are then "added" at the neurons of the last shared FC layer, let's say FC7, through the backprop steps of both branches (which split into the class and regression layers).
Is this right? Can you please clarify this for me?
How to change the last fully connected layer to give the co-ordinates of bounding boxes?
The output you get in the last FC layer depends on the loss function that you use. If you use softmax and use class labels to calculate loss, then eventually after many steps of training, it will learn to predict the correct class labels. Here you need just 1 output per class. This is a case of 'classification'.
In the code, you might use something like this: pytorch.org/docs/stable/nn.html?highlight=loss#torch.nn.CrossEntropyLoss
Instead, if you use L2 loss, and use the coordinates of the 'ground truth' bounding box as input to the L2 loss function, after many training steps it will learn to predict the coordinates of the bounding box. In this case, since a bbox needs 4 values, your last FC layer will have 4 outputs per class. This is a case of 'regression'.
In the code, you might use something like this: pytorch.org/docs/stable/nn.html?highlight=loss#torch.nn.MSELoss
In general, it all depends on what output you are expecting and what kind of loss function you are using. Based on this, you decide the number of outputs in your last FC layer.
If you see pytorch.org/docs/stable/search.html?q=loss&check_keywords=yes&area=default, there are different types of loss functions suitable for different use cases.
Let me know if I need to elaborate further.
Let's say your last fully connected layer before classification and bbox regression is called 'fc7'.
Then, from this, to get the classification probabilities, you do:
cls_score = fully_connected(fc7, num_classes,...)
cls_prob = softmax(cls_score)
# here number of outputs = num_classes
And to get the bbox co-ordinates, you do:
bbox = fully_connected(fc7, num_classes * 4,...)
# here num of outputs = num_classes * 4.
That is all there is to it. In fact, when I was studying, I too was confused by it. But after seeing the code, it was clear.
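A runnable PyTorch version of the same idea, combining the two heads above with the two losses from the earlier reply (layer sizes and variable names are illustrative, not taken from the actual Fast/Faster RCNN code):

import torch
import torch.nn as nn

num_classes = 21                               # e.g. 20 object classes + background
fc7 = torch.randn(8, 4096)                     # a batch of 8 fc7 feature vectors

cls_head = nn.Linear(4096, num_classes)        # 1 score per class
reg_head = nn.Linear(4096, num_classes * 4)    # 4 coordinates per class

cls_score = cls_head(fc7)
bbox_pred = reg_head(fc7)

labels = torch.randint(0, num_classes, (8,))   # ground-truth class ids
# Simplified: random regression targets for all classes. In practice only the
# 4 values of the ground-truth class are regressed, as sketched further down.
gt_boxes = torch.randn(8, num_classes * 4)

cls_loss = nn.CrossEntropyLoss()(cls_score, labels)  # applies softmax internally
reg_loss = nn.MSELoss()(bbox_pred, gt_boxes)         # the L2-style loss
total_loss = cls_loss + reg_loss               # summed, then one backward pass
total_loss.backward()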
I have covered the coding part a bit more in the last chapter, 'Faster-RCNN'.
ruclips.net/video/09DRku3USAs/видео.html
@@Cogneethi If possible, may I see the code implementing this lecture?
Does the last FC layer branch into 2 different layers:
1. For classification, consisting of Nclass+1 (background) neurons and a softmax function
2. For bounding box regression, consisting of Nclass*4 neurons (4 for the coordinates of each class)
Is this true? (I believe this is what pavithran was asking)
Edit: reading the code you provided in the first reply to this comment, I believe what I said is right. Thanks :)
I have a doubt: you told us that we do backpropagation based on the L2 loss between the actual and predicted bounding box coordinates. But we actually have bounding box coordinates only for the correct class, which means all the coordinates for the incorrect classes are fixed to zero.
Can you please explain the loss function? Will you put the correct coordinates only for the actual label?
It is the L2 loss, as used in this example: heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
"Will you put the correct coordinates only for the actual label?"
That is correct.
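A minimal sketch of that masking, assuming the bbox head lays its outputs out as num_classes * 4 (names are illustrative): only the 4 coordinates of each sample's ground-truth class enter the L2 loss, so the other classes receive no gradient.

import torch
import torch.nn.functional as F

num_classes, batch = 3, 2
bbox_pred = torch.randn(batch, num_classes * 4, requires_grad=True)
labels = torch.tensor([1, 0])            # ground-truth class id per sample
gt_boxes = torch.randn(batch, 4)         # coordinates exist only for the true class

# Gather the 4 predicted coordinates belonging to each sample's true class.
idx = labels.unsqueeze(1) * 4 + torch.arange(4).unsqueeze(0)  # shape (batch, 4)
pred_for_gt = bbox_pred.gather(1, idx)

# L2 loss only on the ground-truth class's coordinates.
reg_loss = F.mse_loss(pred_for_gt, gt_boxes)
reg_loss.backward()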
Thank you!!
Thank you for the good explanation. Can we find an approximate bbox for semantic segmentation masks?
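For anyone with the same question: for a binary segmentation mask, a tight box is just the min/max of the foreground pixel coordinates. A minimal numpy sketch:

import numpy as np

mask = np.zeros((100, 100), dtype=bool)
mask[30:60, 20:80] = True                # toy segmentation mask

ys, xs = np.nonzero(mask)                # row/column indices of foreground pixels
x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()
print(x1, y1, x2, y2)                    # -> 20 30 79 59, a tight bounding box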
Many thanks for this video.
Hello sir
Can we completely remove/wipe out text from an image, using Python libraries like easyocr or pytesseract?
Maybe you can first identify the text position, remove it, and use an 'image in-painting' technique to fill the gaps.
Not sure about the quality, but it should work.
@@Cogneethi Hey, please, why can't you make a video on it and explain how it actually works? That would be very beneficial to me.
Moreover, I identified the text position in an image using EasyOCR.
Could you have a try with easyocr?
@@vineethgogu2309 Unfortunately, as of now, I don't have the bandwidth for new videos.
But I found a blog on inpainting which might help.
heartbeat.fritz.ai/guide-to-image-inpainting-using-machine-learning-to-edit-and-correct-defects-in-photos-3c1b0e13bbd0
paperswithcode.com/task/image-inpainting
Once you have the text coordinates from any OCR library, you can just mask those pixels (set them to ones or zeros) and try inpainting.
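A rough sketch of that pipeline with easyocr and OpenCV's inpainting (both are real libraries, but treat the exact usage as an untested outline; the file names are hypothetical and, as noted above, quality will vary):

import cv2
import easyocr
import numpy as np

reader = easyocr.Reader(["en"])
img = cv2.imread("photo.jpg")            # hypothetical input file

# Build a mask that is white over every detected text region.
mask = np.zeros(img.shape[:2], dtype=np.uint8)
for box, text, conf in reader.readtext(img):
    pts = np.array(box, dtype=np.int32)  # 4 corner points of one text region
    cv2.fillPoly(mask, [pts], 255)

# Fill the masked pixels from the surrounding image content.
result = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite("no_text.jpg", result)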
@@Cogneethi Thank you so much, sir, for providing the blog links...
👍👍👍👍👍👍👍
wonderful video
Thanks for the explanation, it is very clear and easy to understand.
Hi, between 6:00 and 6:03, regarding the initial bbox that you mentioned as a hypothetical one: will it be a bbox of any one of the feature maps in the stack at the last FC layer? Also, given that the last FC layer basically holds the stack of different features as a vector, will the stack have the entire boat as one of the features, so that based on the ground truth coordinates and bbox regression we use the L2 loss to narrow down on that one as the location? Basically, backtracking the 4 ground truth bbox coordinates into the feature vector space for any input?
No, the entire boat will not be a feature.
It is difficult to gauge exactly what is happening.
You may check this: distill.pub/2020/circuits/zoom-in/
and this:
ruclips.net/video/AgkfIQ4IGaM/видео.html
The network basically learns patterns in the data, and based on those patterns it will approximately gauge the location of the object.
And we fine-tune the detection part based on the Ground Truth.
@@Cogneethi Hi, thanks for getting back. I had a look at the links and am also aware of the visualization toolbox. However, the thing still not clear to me is the initial hypothetical bbox coordinates that you mentioned. Let's say that when we start training, even before the first backpropagation step, the network's FC layer will have a stack of feature maps, and as you say, none of them will be an entire boat or whatever object we are trying to detect. If this is the case, then the output bbox coordinates that we try to detect are around an entire boat, right? So how does the stack of feature maps in the FC layer, with each representing just part of a feature extracted through the CNN operations coupled with pooling, stride, etc., return a set of bbox coordinates that is supposed to represent a bbox for an entire boat?
2nd question: if the 1st bbox coordinate is hypothetical, then does it not correlate with the features of the boat, and is it through the ground truth and L2 loss that we force the network to spit out the final numbers? Or, if the initial bbox coordinates do correlate with or are formed from the features of a boat, then can you show an explanation similar to the one you gave for HOG+SVM of how we form the bbox coordinates from the features in the FC layer stack (the transformation), even though they are not accurate?
@@harishkumaranandan5946
Sorry, at this point in time, I don't have an easy answer to the 1st question.
Regarding the 2nd one, I will have to dig deeper into visualization and show some examples as you suggested.
I have received similar queries from other viewers.
But to briefly answer your 2nd question:
Yes, initially the network just spits out random numbers.
Later on as the training proceeds, once the network sees 1000s of boat images, we are using the ground truth values to force the network to learn the correct bbox co-ordinates from the feature maps extracted by the CNN.
This way, the network learns to read the feature maps and guess the correct bbox values.
I have given some sort of imperfect demo at the end of the 8th chapter. That might help your intuition a bit.
Meanwhile, I will keep this in mind when I try to expand the course, I will probably include more visualizations for better understanding.
@@Cogneethi Hi Cogneethi, thanks for getting back. I appreciate it. I will keep in touch.
Great 👍🙏
Best explanation, thank you sir.
Thanks for the crystal clear explanation.
How did you find the coordinates (200,250) and (600,400)? I didn't understand, please explain.
amazing tutorial!
Sir, how do we find the distance from the camera to the bounding box?
Hello sir, thanks for the explanations. If possible, can you map that theory onto real source code?
Unfortunately, I have not covered the coding part. Maybe at a later date I will add some kind of explanation of the code.
But at the end of the course, I have covered the code for Faster RCNN.
ruclips.net/video/09DRku3USAs/видео.html
drive.google.com/drive/folders/120KC9i3F0WMhqksngS-dWS1iJNP-mXAv?usp=sharing
github.com/endernewton/tf-faster-rcnn
Please let me know which libraries you have used for coding.
@Saurav, these are the libraries that I used in this tutorial:
HOG: scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html
SVM: scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
VGG & Faster RCNN: github.com/endernewton/tf-faster-rcnn
Thanks for the awesome tutorial. Just one doubt: so this video assumes that a given image has only 1 instance of the object?
Yes, that is the assumption in the case of 'Localization'. For 'Detection', there will be multiple objects in an image.
It would be extremely fruitful if you explained the code along with this theory.
Yes, I think that is one mistake that I made, which I realised later. "If" I make more videos in the future, I will definitely include code. :)
Sir, could you provide some good reference videos to understand how YOLOv3 works?
The theory explanation is good, but can you also share the code for this?
Good explanation 👍
Thank you Khabar
What if the camera sensor outputs bounding boxes as a 3rd-order polynomial?
How do I decode it?
Sorry, I don't know about this.
Can anyone link to a Keras implementation of the object detector model?
You did not cover the YOLO model?
Not yet, I will do so in a few months' time.
Thank you
Please explain YOLO... it would be a great help.
Yes, in a few months.
very good
How do you get the expected values to compare against?
They are manually annotated for each image by a person. See this: ruclips.net/video/e4G9H18VYmA/видео.html