General flow diagram: https://wikidocs.net/167833

Backbone, neck, head

Backbone

The YOLO backbone is a convolutional neural network that pools image pixels to form features at different granularities. It is usually pretrained on a classification dataset, typically ImageNet.

The backbone is the deep learning architecture that acts as the feature extractor. Backbone models are essentially classification models; VGG19, one of the earlier deep learning classifiers, is a familiar example. Three lighter models can also serve as the backbone, namely SqueezeNet, MobileNet, and ShuffleNet, but they are intended for detectors that run on a CPU rather than a GPU.
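To make "features at different granularities" concrete, here is a minimal sketch of the grid sizes a backbone would produce for one input image. The strides 8, 16, and 32 and the 640-pixel input are assumptions chosen for illustration (they are common in YOLO-style models, but these notes do not fix them), and `feature_map_sizes` is a hypothetical helper name.

```python
def feature_map_sizes(image_size, strides=(8, 16, 32)):
    """Return the side length of the square feature grid at each stride.

    The default strides are an assumption for illustration, not something
    stated in the notes.
    """
    sizes = {}
    for stride in strides:
        if image_size % stride != 0:
            raise ValueError(
                f"image_size {image_size} is not divisible by stride {stride}"
            )
        # Each grid cell summarises a stride x stride patch of input pixels.
        sizes[stride] = image_size // stride
    return sizes


if __name__ == "__main__":
    for stride, side in feature_map_sizes(640).items():
        print(f"stride {stride:>2}: {side} x {side} grid")
```

Small strides give fine grids that suit small objects; large strides give coarse grids that cover more of the image per cell.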

Neck

The YOLO neck (FPN is chosen above) combines and mixes the ConvNet layer representations before passing on to the prediction head.

The neck belongs to the bag of specials. It collects feature maps from different stages of the backbone; in simple terms, it is a feature aggregator. The neck of the object detection pipeline will be discussed in more detail in later sections.
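The aggregation step can be sketched in a few lines. The example below shows only the top-down merge used in FPN-style necks: the coarse map is upsampled 2x and added element by element to the finer map. It assumes single-channel maps stored as nested lists and leaves out the convolutions a real neck applies around this step; `upsample2x` and `merge_top_down` are hypothetical helper names.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list of numbers."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))  # repeat each row
    return out


def merge_top_down(coarse, fine):
    """Add the upsampled coarse map to the fine map, element by element."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [
        [u + f for u, f in zip(up_row, fine_row)]
        for up_row, fine_row in zip(up, fine)
    ]


if __name__ == "__main__":
    coarse = [[1, 2], [3, 4]]            # e.g. a deep, low-resolution stage
    fine = [[1] * 4 for _ in range(4)]   # e.g. an earlier, higher-resolution stage
    for row in merge_top_down(coarse, fine):
        print(row)
```

The merged map keeps the resolution of the finer stage while carrying information from the deeper one, which is the sense in which the neck "mixes" representations.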

Head

This is the part of the network that makes the bounding box and class predictions. It is guided by the three YOLO loss functions for class, box, and objectness.

The head is also known as the object detector. In a two-stage detector, the first stage only finds regions where an object might be present, without saying which object it is, and a second stage classifies those regions; a one-stage detector predicts boxes and classes in a single pass. Both kinds are further subdivided into anchor-based and anchor-free detectors.
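As a rough illustration of how the class, box, and objectness terms could combine for a single matched prediction, consider the sketch below. It is a simplification under stated assumptions: the box term is plain `1 - IoU` (actual YOLO versions use other box losses), the other two terms are binary cross-entropy, and the weights `w_box`, `w_obj`, `w_cls` are hypothetical knobs rather than values from any particular release.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability, clamped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def detection_loss(pred_box, true_box, pred_obj, pred_cls, true_cls,
                   w_box=1.0, w_obj=1.0, w_cls=1.0):
    """Weighted sum of box, objectness and class terms (illustrative only)."""
    box_loss = 1.0 - iou(pred_box, true_box)   # assumption: plain IoU loss
    obj_loss = bce(pred_obj, 1.0)              # a matched prediction should say "object"
    cls_loss = sum(
        bce(p, 1.0 if i == true_cls else 0.0) for i, p in enumerate(pred_cls)
    ) / len(pred_cls)
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss


if __name__ == "__main__":
    loss = detection_loss(
        pred_box=(0, 0, 2, 2), true_box=(1, 1, 3, 3),
        pred_obj=0.8, pred_cls=[0.1, 0.7, 0.2], true_cls=1,
    )
    print(f"total loss: {loss:.4f}")
```

Predictions that are not matched to any object would contribute only an objectness term with target 0, which this single-prediction sketch leaves out.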

A coupled head has traditionally been preferred in detectors such as SSD. The YOLOX authors argue in favor of a decoupled head, and it does show improvements in the reported metrics.
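The difference between the two designs is easiest to see in the output layout. The sketch below only counts output channels per grid location; it does not model the convolution layers inside either head. The function names, the branch names, and the default of one prediction per location are assumptions made for illustration.

```python
def coupled_head_channels(num_classes, num_anchors=1):
    """One shared output holding box (4), objectness (1) and class scores."""
    return {"pred": num_anchors * (4 + 1 + num_classes)}


def decoupled_head_channels(num_classes, num_anchors=1):
    """Separate outputs, so classification and localisation do not share one tensor."""
    return {
        "cls": num_anchors * num_classes,  # class scores
        "box": num_anchors * 4,            # box coordinates
        "obj": num_anchors * 1,            # objectness
    }


if __name__ == "__main__":
    print(coupled_head_channels(num_classes=80, num_anchors=3))
    print(decoupled_head_channels(num_classes=80))
```

Both layouts carry the same total number of values per prediction; the decoupled design simply lets each task have its own branch and features.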

SimOTA: for dynamic label assignment