General flow diagram: https://wikidocs.net/167833
The YOLO backbone is a convolutional neural network that pools image pixels to form features at different granularities. The backbone is typically pretrained on a classification dataset, usually ImageNet.
The backbone is the deep learning architecture that acts as a feature extractor. All of the backbone models are essentially classification models. I assume that everyone is familiar with at least VGG19, one of the earliest deep learning classifiers. Three more models can be used as a backbone besides the ones mentioned above, namely SqueezeNet, MobileNet, and ShuffleNet, but these are lightweight networks intended for detectors that run on CPU platforms.
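To make "features at different granularities" concrete, here is a minimal sketch of the grid sizes a backbone hands to the rest of the detector. The strides 8, 16, and 32 and the 640x640 input are assumptions chosen for illustration (they are common in YOLO-style models but are not stated in these notes), and `feature_map_sizes` is a hypothetical helper, not part of any library.

```python
def feature_map_sizes(image_size, strides=(8, 16, 32)):
    """Return the (height, width) of the feature map produced at each stride.

    A stride of s means the backbone has downsampled the image by a factor
    of s, so one feature-map cell covers an s x s patch of pixels.
    The default strides are an assumption, not taken from the notes.
    """
    height, width = image_size
    return {s: (height // s, width // s) for s in strides}


sizes = feature_map_sizes((640, 640))
for stride, (h, w) in sizes.items():
    # Small strides give fine grids (small objects),
    # large strides give coarse grids (large objects).
    print(f"stride {stride}: {h}x{w} grid")
```

The finer grids keep spatial detail, while the coarser grids see more context per cell, which is why the later stages of the pipeline consume several of them at once.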
The YOLO neck (FPN is chosen above) combines and mixes the ConvNet layer representations before passing them on to the prediction head.
The neck is part of the bag of specials. It collects feature maps from different stages of the backbone; in simple terms, it is a feature aggregator. The neck of the object detection pipeline will be discussed in more detail in the later sections.
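The aggregation idea can be sketched in plain Python as an FPN-style top-down pass: the coarsest map is upsampled and added to the next finer one, and so on. This is a simplified illustration under stated assumptions, not the implementation of any particular YOLO version. Real necks use learned convolutions (for example 1x1 lateral convolutions) and multi-channel tensors, while this sketch uses single-channel nested lists, and the names `upsample2x`, `add_maps`, and `top_down_merge` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list of numbers."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add_maps(a, b):
    """Element-wise sum of two 2D lists with identical shapes."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(pyramid):
    """Merge backbone maps ordered fine -> coarse.

    Assumes each map is exactly half the size of the previous one.
    Returns the merged maps in the same fine -> coarse order.
    """
    merged = [pyramid[-1]]                         # start from the coarsest map
    for fmap in reversed(pyramid[:-1]):
        merged.append(add_maps(fmap, upsample2x(merged[-1])))
    return merged[::-1]


# Toy backbone outputs: 4x4, 2x2, and 1x1 single-channel maps.
c3 = [[1] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]
p3, p4, p5 = top_down_merge([c3, c4, c5])
```

After the merge, every cell of the finest map also carries information that originated in the coarser stages, which is the sense in which the neck "mixes" representations.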
The YOLO head is the part of the network that makes the bounding box and class predictions. It is guided by the three YOLO loss functions for class, box, and objectness.
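As a rough sketch of how three such terms can be combined for a single positive prediction, the example below sums a box term, an objectness term, and a class term. The specific choices are assumptions for illustration only: plain `1 - IoU` for the box, binary cross-entropy for objectness and class, and equal weights. Actual YOLO versions differ in the box loss they use and in how the terms are weighted, and `detection_loss` is a hypothetical name.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and target y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_loss(pred_box, gt_box, pred_obj, pred_cls, gt_cls,
                   w_box=1.0, w_obj=1.0, w_cls=1.0):
    """Weighted sum of box, objectness, and class terms (weights are assumed)."""
    box_loss = 1.0 - iou(pred_box, gt_box)
    obj_loss = bce(pred_obj, 1.0)  # this prediction is matched to an object
    cls_loss = sum(bce(p, 1.0 if i == gt_cls else 0.0)
                   for i, p in enumerate(pred_cls))
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss


good = detection_loss((0, 0, 2, 2), (0, 0, 2, 2), 0.99, [0.98, 0.01], gt_cls=0)
bad = detection_loss((1, 1, 3, 3), (0, 0, 2, 2), 0.30, [0.40, 0.60], gt_cls=0)
```

Here `good` is close to zero and `bad` is much larger, since the second prediction is off in location, confidence, and class at the same time.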
The head is also known as the object detector. In a two-stage detector, the first stage only finds the regions where an object might be present and does not say which object is in each region; a one-stage detector predicts boxes and classes in a single pass. Both two-stage and one-stage detectors are further subdivided into anchor-based and anchor-free detectors.
A coupled head has traditionally been preferred in SSD. The authors of YOLO-X argue in favor of a decoupled head, and it does show improvements in the metrics.
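The difference between the two head styles shows up in how the output channels are organized. The sketch below only counts channels, assuming each prediction consists of 4 box values, 1 objectness score, and one score per class. That layout and the function names are assumptions for illustration, not taken from the SSD or YOLO-X code.

```python
def coupled_head_channels(num_classes, num_anchors=1):
    """One shared output tensor: box, objectness, and classes side by side."""
    return num_anchors * (4 + 1 + num_classes)


def decoupled_head_channels(num_classes, num_anchors=1):
    """Separate branches, each with its own output tensor."""
    return {
        "cls": num_anchors * num_classes,  # classification branch
        "reg": num_anchors * 4,            # box regression branch
        "obj": num_anchors * 1,            # objectness branch
    }


coupled = coupled_head_channels(num_classes=80, num_anchors=3)
decoupled = decoupled_head_channels(num_classes=80)
```

The total number of predicted values per anchor is the same in both cases. What changes is that a decoupled head gives classification and localization their own layers instead of forcing them to share one set of features.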
SimOTA: used for dynamic label assignment.
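A simplified sketch of the dynamic part of the assignment follows. The idea, as I understand it, is that each ground-truth box gets its own number of positive candidates, k, estimated from the IoUs of its best candidates, and then takes the k candidates with the lowest cost. The cost and IoU matrices are taken as given inputs here, and the `top_k=10` default, the tie-breaking rule, and the name `simota_assign` are assumptions of this sketch rather than a copy of the YOLO-X code.

```python
def simota_assign(cost, ious, top_k=10):
    """Assign candidates to ground-truth boxes with a per-box dynamic k.

    cost[g][c] and ious[g][c] describe ground truth g versus candidate c.
    Returns a dict mapping candidate index -> ground-truth index.
    """
    num_gt = len(cost)
    num_cand = len(cost[0]) if num_gt else 0
    chosen = {}  # candidate -> (ground truth, cost of that match)

    for g in range(num_gt):
        # Dynamic k: sum of the best IoUs, truncated, but at least 1.
        top_ious = sorted(ious[g], reverse=True)[:top_k]
        dynamic_k = max(int(sum(top_ious)), 1)

        # Take the dynamic_k candidates with the lowest cost for this box.
        best = sorted(range(num_cand), key=lambda c: cost[g][c])[:dynamic_k]
        for c in best:
            # If a candidate is claimed twice, keep the cheaper match.
            if c not in chosen or cost[g][c] < chosen[c][1]:
                chosen[c] = (g, cost[g][c])

    return {c: g for c, (g, _) in chosen.items()}


ious = [[0.9, 0.8, 0.7, 0.1],
        [0.1, 0.2, 0.6, 0.5]]
cost = [[0.2, 0.3, 0.5, 3.0],
        [3.0, 2.5, 0.4, 0.6]]
assignment = simota_assign(cost, ious)
```

In this toy case the first box overlaps well with several candidates and receives two positives, while the second box receives one, so the number of positives adapts to each object rather than being fixed in advance.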