This report was written for the Visual Media class at UTokyo. It summarizes the paper and adds my personal opinions. (See also my implementation repository.)

1. Why is this paper important?

This paper is important for fine-grained object recognition because the authors created two novel and effective modules that mimic the psychological process of object recognition.

- Summary of this paper:

Main task: Fine-grained object recognition.
Problem & Concept: Traditional methods focus only on learning salient patterns and ignore the holistic structural composition of the object. Because they rely on partial information, they are easily fooled, as shown in Figure 1 (3rd column), especially in fine-grained object recognition (birds, airplanes, and cars in Figure 1). In this paper, the network instead learns holistic structural information, which keeps it from focusing only on partial information. As a result, the proposed method (Look-into-Object, LIO) predicts the label correctly, as shown in Figure 1 (4th column).

Figure 1. The maximally responding feature maps from the Ground Truth and Predicted labels (ResNet-50 Baseline)

According to the paper,

From the psychological point of view, recognizing an object can be naturally divided into two stages:
1. roughly localizing the object extent (the whole extent of the object rather than object part) in the image
2. parsing the structure among parts within the object.
Based on this psychological point of view, the authors propose Look-into-Object (LIO), which consists of two modules:
1. Object Extent Learning (OEL), for roughly localizing the whole extent of the object rather than an object part.
2. Spatial Context Learning (SCL), for parsing the structure among parts within the object.
Figure 2. Look-into-Object (LIO) approach
Contribution: in my view, the main contributions are:
1. Two novel modules: an object-extent learning module for localizing the object extent, and a self-supervised spatial context learning module for modeling the structural composition of the object.
2. For practical applications, the proposed method needs no additional annotation and introduces no computational overhead at inference time. Moreover, the proposed modules can be plugged into any CNN-based recognition model.
Another strong point: the authors won 1st place at the Aliproduct competition in CVPR2020 by using LIO together with DCL (their previous work), so LIO has also been validated on a real-world dataset.

- What is the technical core?

In this section, I only introduce the concepts. The details are described in section 2 with code.

Figure 3. The overall pipeline of the Look-into-Object (LIO) framework.

Look-into-Object (LIO) consists of OEL and SCL (LIO = OEL + SCL).

Object Extent Learning (OEL) roughly localizes the whole extent of the object. To do so, the network with OEL learns object localization from a weak pseudo mask that stands in for a ground-truth localization mask. The pseudo mask for an input image is created from other images of the same category, so no extra annotation is needed.
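
To make the pseudo-mask idea concrete, here is a minimal sketch in plain Python. It rests on my own assumptions rather than on the authors' code: each position of the input feature map is scored by its best channel-normalized dot product against any position of a same-category image, and the scores are averaged over the same-category images. The names `correlation_map` and `pseudo_mask` are hypothetical, and real feature maps would be 7x7 tensors rather than small nested lists.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def correlation_map(feat, feat_pos):
    """Score each position of `feat` by its best match in `feat_pos`.

    Both inputs are H x W grids of C-dimensional vectors (nested lists).
    Assumption: the match score is the dot product divided by C.
    """
    channels = len(feat[0][0])
    pos_vectors = [vec for row in feat_pos for vec in row]
    return [
        [max(dot(vec, p) for p in pos_vectors) / channels for vec in row]
        for row in feat
    ]


def pseudo_mask(feat, positives):
    """Average the correlation maps over all same-category images."""
    maps = [correlation_map(feat, pos) for pos in positives]
    height, width = len(feat), len(feat[0])
    return [
        [sum(m[i][j] for m in maps) / len(maps) for j in range(width)]
        for i in range(height)
    ]


# Toy 2x2 feature maps with 2 channels (illustrative values only).
feat = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
pos_a = [[[1, 0], [1, 0]], [[1, 0], [1, 0]]]
pos_b = [[[0, 1], [0, 1]], [[0, 1], [0, 1]]]
mask = pseudo_mask(feat, [pos_a, pos_b])  # [[0.25, 0.0], [0.0, 0.25]]
```

Positions whose features also appear in the same-category images get a high value, and background positions stay near zero. That is why such a mask can serve as a rough localization target.
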
Spatial Context Learning (SCL) parses the structure among parts within the object. To do so, the network with SCL learns to predict relative polar coordinates. First, the 7x7 feature map (last convolutional layer of ResNet-50) is regarded as a 7x7 grid of Cartesian coordinates. The network with SCL then learns to convert these Cartesian coordinates into relative polar coordinates: it predicts where the origin of the polar coordinate system is, and then predicts the relative distance and the polar angle of each of the 7x7 positions with respect to that origin.
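
The target side of this conversion can be sketched as follows. This is only an illustration under my own assumptions: the origin is passed in instead of being predicted by the network, the distance is normalized by the longest possible offset on the grid, and the angle is mapped from [-pi, pi] to [0, 1]. The function name `relative_polar_targets` is hypothetical.

```python
import math


def relative_polar_targets(origin, size=7):
    """Return (distance, angle) grids relative to `origin` = (row, col).

    Assumptions: distances are divided by the grid diagonal so they fall
    in [0, 1], and angles from atan2 are rescaled to [0, 1].
    """
    origin_row, origin_col = origin
    max_dist = math.sqrt(2) * (size - 1)  # longest offset on the grid
    dist = [[0.0] * size for _ in range(size)]
    angle = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            d_row, d_col = i - origin_row, j - origin_col
            dist[i][j] = math.hypot(d_row, d_col) / max_dist
            angle[i][j] = (math.atan2(d_row, d_col) + math.pi) / (2 * math.pi)
    return dist, angle


# Example: take the centre cell of the 7x7 grid as the origin.
dist, angle = relative_polar_targets((3, 3))
```

The network's predicted distance and angle maps would then be compared against grids like these. Because the targets come from the grid geometry alone, this kind of supervision needs no manual annotation.
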

- Why was the paper accepted?

The two modules are based on a theory that is easy to agree with.
The two modules are novel and simple, and they work as intended.