This paper is important for fine-grained object recognition because the authors propose two novel and effective modules that mimic a psychological process of object recognition.
Main task: Fine-grained object recognition.
Problem & Concept: Traditional methods focus only on learning salient patterns and ignore the holistic structural composition of the object. Because they rely on partial information, they are easily fooled, as shown in Figure 1 (3rd column), especially in fine-grained object recognition (birds, airplanes, and cars in Figure 1). In this paper, the network learns holistic structural information so that it does not focus only on partial information. As a result, the proposed method (Look-into-Object, LIO) uses holistic structural information to predict the label correctly, as shown in Figure 1 (4th column).
Figure 1. The maximally responding feature maps from the Ground Truth and Predicted labels (ResNet-50 Baseline)
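According to the paper, from a psychological point of view, recognizing an object can naturally be divided into two stages: roughly localizing the whole extent of the object, and then parsing the structure among the parts within it.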
Based on this psychological point of view, the authors propose Look-into-Object (LIO), which consists of two modules:
Object Extent Learning (OEL), for roughly localizing the whole extent of the object rather than an object part.
Spatial Context Learning (SCL), for parsing the structure among parts within the object.
Figure 2. Look-into-Object (LIO) approach
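To make "the structure among parts" a little more concrete, here is a minimal sketch of one plausible way to describe it: the feature map is treated as a grid of regions, and each region is described by its polar coordinates (distance and angle) relative to a reference region. The function name `relative_polar_coords`, the 7x7 grid, and the choice of the centre cell as reference are all assumptions made for illustration; this is not taken from the authors' code.

```python
import math

def relative_polar_coords(grid_size, ref):
    """Map every grid cell (row, col) to (distance, angle) relative to `ref`.

    distance: Euclidean distance to the reference cell, divided by grid_size.
    angle:    atan2 of the offset, rescaled from [-pi, pi] to [0, 1].
    Both normalisations are illustrative assumptions.
    """
    ref_row, ref_col = ref
    coords = {}
    for row in range(grid_size):
        for col in range(grid_size):
            d_row, d_col = row - ref_row, col - ref_col
            distance = math.hypot(d_row, d_col) / grid_size
            angle = (math.atan2(d_row, d_col) + math.pi) / (2 * math.pi)
            coords[(row, col)] = (distance, angle)
    return coords

# Hypothetical 7x7 grid with the centre cell as the reference region.
coords = relative_polar_coords(grid_size=7, ref=(3, 3))
```

A module in the spirit of SCL could then be trained to predict such relative positions from region features, which pushes the network to encode where parts sit with respect to each other rather than only which parts are salient.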
Contribution: the main contributions, in my view, are:
Another strong point: the authors won 1st place in the Aliproduct competition at CVPR 2020 by using LIO together with DCL (their previous work), so LIO has also been validated on a real-world dataset.
In this section, I only introduce the concept; the details are described in Section 2 with code.
Figure 3. The overall pipeline of our Look-into-object (LIO) framework.
Look-into-Object (LIO) consists of OEL and SCL (LIO = OEL + SCL).
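The composition LIO = OEL + SCL can be read as a training objective in which the two modules contribute auxiliary terms on top of the usual classification loss. The sketch below assumes exactly that reading; the function name `lio_total_loss` and the weights `alpha` and `beta` (with their default values) are hypothetical and chosen only for illustration.

```python
def lio_total_loss(cls_loss, oel_loss, scl_loss, alpha=1.0, beta=1.0):
    """Combine the classification loss with the two auxiliary LIO terms.

    alpha and beta are hypothetical weights that balance the OEL and SCL
    terms against the classification term.
    """
    return cls_loss + alpha * oel_loss + beta * scl_loss

# Example with made-up loss values.
total = lio_total_loss(cls_loss=1.0, oel_loss=0.5, scl_loss=0.25,
                       alpha=0.5, beta=2.0)
```

Under this reading, both modules act only as extra supervision during training, while the classification term remains the main objective.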