I have read multiple blog posts on YOLO, along with the original paper, but this video provides the intuition at a different level. Amazing!
The same concept is used in YOLO v3, but instead of a softmax activation over all classes, logistic regression is applied to each class independently (meaning an object can belong to two classes at once).
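For anyone who wants to see the difference concretely, here's a minimal NumPy sketch (the class scores are made-up numbers):

```python
import numpy as np

logits = np.array([2.0, 1.5, -1.0])   # raw class scores for one box (made-up)

# YOLO v1/v2 style: softmax, so class probabilities compete and sum to 1
softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax)   # ~[0.60, 0.37, 0.03]

# YOLO v3 style: an independent sigmoid per class, each its own [0, 1] probability
sigmoid = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid)   # ~[0.88, 0.82, 0.27]: two classes can both be "on" at once
```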
So, at 1:49 we give the pc value to the 2nd anchor box, and not to the 1st, because it had the higher IoU. To generalize: check whether there's an object in the grid cell; if there is, assign the associated pc value to the anchor box with the highest IoU.
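A small sketch of that assignment rule, under the usual simplification that anchors and ground-truth boxes are compared by shape alone (co-centered boxes; the numbers are made up for illustration):

```python
def best_anchor(gt_w, gt_h, anchors):
    """Pick the anchor whose shape has the highest IoU with the ground-truth
    box, comparing shapes only (both boxes imagined centered at the same point)."""
    best, best_iou = None, 0.0
    for i, (aw, ah) in enumerate(anchors):
        inter = min(gt_w, aw) * min(gt_h, ah)          # overlap of co-centered boxes
        iou = inter / (gt_w * gt_h + aw * ah - inter)  # intersection over union
        if iou > best_iou:
            best, best_iou = i, iou
    return best, best_iou

anchors = [(0.2, 0.5), (0.6, 0.25)]        # one tall, one wide prior (made-up sizes)
print(best_anchor(0.25, 0.6, anchors))     # a tall object matches anchor 0, IoU ~0.67
```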
YOLO algorithm *CORRECTION*
At time 5:00, for the slide titled "Outputting the non-max suppressed output" the text should read "For each grid cell" instead of "For each grid call".
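For reference, a minimal sketch of the non-max suppression step that slide describes, using the thresholds from the lecture (drop boxes with pc below 0.6, suppress overlaps with IoU of 0.5 or more); YOLO runs this once per class:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_thresh=0.5, score_thresh=0.6):
    """Drop low-pc boxes, then repeatedly keep the highest-pc box
    and discard any remaining box that overlaps it too much."""
    remaining = sorted(
        ((s, b) for s, b in zip(scores, boxes) if s >= score_thresh),
        key=lambda sb: sb[0], reverse=True)
    kept = []
    while remaining:
        score, best = remaining.pop(0)      # highest remaining pc
        kept.append((score, best))
        remaining = [(s, b) for s, b in remaining if iou(best, b) < iou_thresh]
    return kept
```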
Thank you very much for all your YOLO videos. They are just great :)
@0:56 I think something is wrong. According to the YOLO paper we have [S, S, (B*5 + C)], which means each cell has one set of C class probabilities, but here you said that each anchor box has its own C class probabilities, i.e. [S, S, B*(5+C)].
[S, S, (B*5 + C)] is YOLOv1. He is talking about YOLOv2. The two models are pretty different in the label encoding and also in the definition of the loss, if I understood correctly. fairyonice.github.io/Part_4_Object_Detection_with_Yolo_using_VOC_2012_data_loss.html
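To make the shape difference concrete with the lecture's numbers (a 3x3 grid, 2 anchors, 3 classes):

```python
S, B, C = 3, 2, 3                # grid size, boxes/anchors per cell, classes

yolov1_depth = B * 5 + C         # 2*5 + 3 = 13: one set of class probs per cell
yolov2_depth = B * (5 + C)       # 2*(5+3) = 16: class probs per anchor box
print((S, S, yolov1_depth))      # (3, 3, 13)
print((S, S, yolov2_depth))      # (3, 3, 16), the 3x3x16 used in the video
```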
@@yumik4990 YOLOv1 predicts 2 boxes for each grid cell. From what I saw, YOLOv1 doesn't predefine 2 anchor boxes; it predicts the boxes directly from the grid cell. And during training, YOLOv1 assigns only 1 box per grid cell to an object, the one with the highest IoU. I don't see how this stays consistent during training: in the 1st epoch box 1 can have the higher IoU, but in the 2nd epoch box 2 can have the higher IoU.
the best AI teacher, thank you
@5:28 How come the bounding box sizes are different? How is the bounding box size changing?
Have the same question. How exactly are the bounding boxes being predicted at every step?
Bounding boxes are parameters to be learned/trained, basically a continuous/regression output, hence the predicted bounding boxes change in size. Anchor boxes do not; they are fixed in size.
This algorithm simplified the bounding box regression by having a 3x3 (or some other) grid output, right? What I didn't understand is how anchor boxes are used in this algorithm...
YOLO and Faster R-CNN have something in common, and that is anchor boxes, which are used to simulate the famous image pyramids commonly seen in conventional SVM-classifier training. Regardless of why we use anchors, YOLO v2 uses 5 anchor boxes (instead of 9, unlike Faster R-CNN) for each cell of the 3x3 grid here. Faster R-CNN uses 9 of them, but not per grid cell; it slides them over the feature maps produced by an intermediate layer of a CNN. As far as I understood, YOLO v2 uses 5 anchor boxes at each cell of its 13x13 output feature map and predicts 5 coordinates for each box. Since it constrains the location prediction (by using grids, which Faster R-CNN does not use), the parametrization is easier to learn and it makes the network more stable. Hope you got the point. ;)
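For what it's worth, the YOLO9000 paper chooses those 5 anchor shapes by running k-means over the training-set box dimensions with d = 1 - IoU as the distance. A rough sketch, assuming box_whs is an (N, 2) array of ground-truth (w, h) pairs (function names are mine):

```python
import numpy as np

def shape_iou(wh, centers):
    """IoU between one (w, h) and each center, with boxes sharing a center."""
    inter = np.minimum(wh[0], centers[:, 0]) * np.minimum(wh[1], centers[:, 1])
    return inter / (wh[0] * wh[1] + centers[:, 0] * centers[:, 1] - inter)

def kmeans_anchors(box_whs, k=5, iters=50, seed=0):
    """k-means with distance d = 1 - IoU over ground-truth box shapes."""
    rng = np.random.default_rng(seed)
    centers = box_whs[rng.choice(len(box_whs), size=k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmax(shape_iou(wh, centers)) for wh in box_whs])
        centers = np.array([box_whs[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return centers
```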
@@smart_world7928 k
Yeah, I didn't get it either. It just increases the size of the output: instead of making one prediction per cell, the algorithm now makes two predictions per cell. I don't understand what the purpose of the predefined boxes is here.
@@azharhussian4326 Did you get it in the end? I've been through every post on the internet about anchor boxes and no one has come close to correctly explaining how they are used in the process. Pretty frustrating.
@@thelonespeaker Yeah, still kind of stuck. But anchor boxes are usually used in the loss function.
I am not clear on how this works at inference time. How can I map the model's output bounding boxes back into original-image coordinates? Kindly give me the mathematics for how to compute it.
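One common answer, following the YOLO v2 parametrization (the variable names are mine, and the anchor sizes are assumed to be given as fractions of the image):

```python
import math

def decode_box(tx, ty, tw, th, row, col, anchor_w, anchor_h, S, img_w, img_h):
    """Map one anchor's raw outputs in grid cell (row, col) to pixel coordinates.

    Sigmoids keep the predicted center inside its cell; exp() rescales the
    fixed anchor prior. S is the grid size (e.g. 3 or 19)."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (col + sig(tx)) / S * img_w          # center x in pixels
    by = (row + sig(ty)) / S * img_h          # center y in pixels
    bw = anchor_w * math.exp(tw) * img_w      # width in pixels
    bh = anchor_h * math.exp(th) * img_h      # height in pixels
    return (bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2)  # (x1, y1, x2, y2)
```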
At 6:01, how is this lady's bounding box made? I thought there was a separate CNN for each grid cell... can somebody explain?
The 2 bounding boxes come from the 2 anchor boxes. Maybe this question comes from your previous question on the previous video.
Is this YOLO or YOLO 9000? According to the YOLO paper, I think the y should be 3x3x((2x5)+3), so y is 3x3x13. Is this right?
amazing educator
Thank you Andrew !
OK, so how many objects can one cell of YOLOv1 predict? The article says 'we only predict one set of class probabilities per grid cell regardless of the number of boxes'? It seems that the article skirts around the fact that the model can only predict at most 1 object/cell, but the wording above does not exclude, for example, the case when all B objects belong to the same class. So how many?
I think the answer is: One cell can predict, at maximum, one object for every anchor box.
great explanation
What are the values of the "don't care" question marks? Is it up to the labeler, or is there a convention?
Did you get the answer?
@@vaneEAE From my research it seems it's just up to the labeler; I haven't found any convention anywhere.
@@jamieabw4517 I saw that these terms are not considered in the loss function. Therefore, it is of no interest to know what value these terms take.
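Right: in a squared-error version of the loss like the one sketched in this course, the "?" entries are simply multiplied by a 0/1 objectness mask, so they never produce a gradient. A rough sketch (the tensor layout is made up for illustration):

```python
import numpy as np

def yolo_loss_sketch(pred, target):
    """pred/target: arrays of shape (S, S, A, 5 + C) laid out as
    [pc, bx, by, bw, bh, c1..cC] per anchor per cell."""
    obj_mask = target[..., 0]              # 1 where an object is assigned, else 0
    # Confidence trains everywhere; box/class terms only where obj_mask == 1,
    # so the "don't care" (?) entries contribute nothing to the gradient.
    conf_loss = np.sum((pred[..., 0] - target[..., 0]) ** 2)
    box_loss = np.sum(obj_mask[..., None] * (pred[..., 1:5] - target[..., 1:5]) ** 2)
    cls_loss = np.sum(obj_mask[..., None] * (pred[..., 5:] - target[..., 5:]) ** 2)
    return conf_loss + box_loss + cls_loss
```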
If we divide the image into a 3*3 = 9 grid of small boxes, why do we still need the box coordinate variables bx, by, bh, bw?
The object may not lie at the center of that grid cell. The bounding box coordinates specify the box that actually fits the object.
Is Non-Max suppression used during training?
Clear and good enough. Thank you.
Is this a graduate or undergraduate level course?
Is YOLO a deep learning algorithm???
How is this grid cell segmentation actually encoded in the neural network? Is it encoded at all?
If I understood correctly, the segmentation is only encoded into the training data, and the network is supposed to "learn" to output the y=3x3x16 that matches the locations of the objects relative to the grid cell on the training data. In other words, the network has no information about any image grid.
In the previous videos, it's shown that the grid is actually the "cut down" version of the image after passing through multiple convolution layers.
That's the grid!
Amazing tutorial!! Thank you so much!
Thank you!!
Let's say I have an object spanning 3 of the grid cells. Then the outputs of all 3 of those grid cells should be identical, with the same values of bx, by, bh, bw. Am I correct?
Not really. The object should be assigned to the cell that contains the object's center. The remaining cells should predict "background".
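Exactly: only the cell containing the object's midpoint gets the object. A tiny sketch of that rule, assuming coordinates normalized to [0, 1]:

```python
def assign_cell(x_center, y_center, S=3):
    """Return (row, col) of the grid cell responsible for an object whose
    center is at (x_center, y_center), both normalized to [0, 1]."""
    col = min(int(x_center * S), S - 1)
    row = min(int(y_center * S), S - 1)
    return row, col

# An object spanning several cells is still assigned to exactly one:
print(assign_cell(0.5, 0.8))  # (2, 1): only this cell's target vector gets pc = 1
```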
Thanks for the video, it brought me back to light:)
I however still have a question: in the YOLO v1 paper it is described that the final convolutional output layer is a tensor of dimension 7x7x1024 (Darknet), and then the detection follows, where grid cells of dimension 7x7 are defined. My assumption here is: since the dimension of the conv output is the same as the grid's, can one say that one grid cell represents one pixel, and hence the detection proceeds one "pixel" at a time?
One grid cell represents a small 'crop' (e.g. 20x20 pixels) of the image, not necessarily a single pixel.
Another thing to note: the algorithm processes all grid cells simultaneously, in one shot. It doesn't process them sequentially.
@@MrAmgadHasan How can YOLO predict the final output at 3 different scales? YOLOv3 has 3 scales with different feature maps.
I read some documents and I know YOLO uses HSV; can you explain why?
Can someone tell me: at training time we use the term "anchor box", but at prediction time it becomes "bounding box", is that right? At prediction time anchor boxes aren't used, only bounding boxes, right?
Yes, you are correct. The anchor box is only used to check the IoU match with the ground-truth bounding box. If the IoU between an anchor box and the ground-truth box of a particular object is greater than 0.5, then we consider that anchor responsible for the object, and its label becomes [object confidence = 1 (for the object with IoU > 0.5), the bounding box coordinates of that object, class label = 1 for that object's class and 0 for the remaining classes].
what if an object spans more than one grid cell?
They often do. The bh and bw ground truths are defined taking into account the size of the original image, meaning that an object assigned to a cell can have bh and bw that extend beyond the boundaries of the cell. That's not a problem, because you regress towards these values; the cell serves only to mark where the object should be detected. If the center point of the object is in one cell, then the target vector for that cell is the only one that will carry bx, by, bh, bw for that object.
thank you
How do you define the anchor boxes' boundaries?
How do you get the values c1, c2, c3?
c1, c2 and c3 are the classification part of the algorithm. They basically mean "if the bounding box intersects an object, what is its type?". During training, your data should be annotated, so each bbox has a position and, if applicable, a class. When you train your NN, you check whether a box is over some IoU threshold with an object, and if it is, then you train c1, c2, c3 like any other classifier.
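So with the lecture's 3 classes (say pedestrian, car, motorcycle), an annotated car would give a target vector along these lines (the numbers are made up):

```python
#     pc    bx    by    bw    bh   c1  c2  c3
y = [1.0, 0.40, 0.70, 0.30, 0.25,  0,  1,  0]   # c2 = 1: it's a car
# For anchors/cells with no assigned object: pc = 0 and the rest are "don't care".
```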
How to get the programming exercise
What happens when the expected training output was assigned to bounding box 1, but the network's output was box 2, and the coordinates in the expected output were incorrectly marked close to box 1 when they should have been close to box 2?
Just wanted to let you know that this video has been ripped and re-uploaded:
ruclips.net/video/3Pv66biqc1E/видео.html
I know, I thought that too! Actually, Andrew Ng is the founder of Deeplearning.ai, so technically it isn't a re-upload; he must've just wanted to consolidate everything.
source code?