个人理解

创新点：打破了当前主流的范式“tracking-by-detection”，实现MOT不再需要双阶段了：①目标检测；②特征提取与数据关联。提出的CenterTrack对于MOT直接一步到位，合并检测阶段和匹配阶段到同一个网络，加速MOT系统的整体检测速度，实验证明，CenterTrack的确是一种速度和精度trade-off的模型。
**为什么：**当前主流的多目标跟踪的范式是tracking-by-detection，即该范式将MOT分为了两步：①目标检测；②特征提取与数据关联，而这类范式虽然表现出不俗的跟踪精度，但是其最大的缺点就是速度慢，因为其将物体检测和（外观）特征提取分开了，则模型的速度自然就慢了，因此，作者考虑到这点缺陷，想要设计出一种MOT算法是不用双阶段的，即合并检测和追踪过程到同一个网络，故作者提出了CenterTrack
**怎么做：**作者在CenterNet目标检测网络的基础上进行改进，设计出了CenterTrack，该模型首先输入三张图，分别是①当前帧，②前一帧，③前一帧的heatmap。然后将三张图分别进行初步的特征提取，得到三张特征图，再将这三张特征图按位相加，得到融合的特征图，然后将融合后的特征图输入到骨干网络，在骨干网络之后则有四个分支，分别是①heatmap：检测框中心点位置分布热力图；②Confidence：预测前景中心的置信度图；③Weight&Height：预测中心点对应的检测框的宽高；④Displacement prediction：检测框中心点在前后帧的位移（有点类似于光流）

一、摘要

Tracking has traditionally been the art of following interest points through space and time. This changed with the rise of powerful deep networks. Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection. We present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That’s it. CenterTrack is simple, online (no peeking into the future), and real-time. It achieves 67*.8% MOTA on the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at 15 FPS, setting a new state of the art on both datasets. CenterTrack is easily extended to monocular 3D tracking by regressing additional 3D attributes. Using monocular video input, it achieves 28.*3% [email protected] on the newly released nuScenes 3D tracking benchmark, substantially outperforming the monocular baseline on this benchmark while running at 28 FPS.

传统上，目标跟踪是通过空间和时间跟踪感兴趣点的技术，
如今，而随着深度网络的发展，目标跟踪的主要由**目标检测→时序关联(数据关联)**的范式主导，即 Tracking-by-Detection。
作者提出了CenterTrack（同时检测与跟踪的方法，即合并检测阶段和匹配阶段），该方法对一对图像应用检测模型，并利用前一帧的检测结果。给定最小输入，CenterTrack 可以定位目标，并预测它们和前一帧的关联。
- **输入：**两张RGB图像（当前帧和前一帧,(H,W,3)）+一张Heatmap图像（前一帧的检测结果,(H,W,1)）
- **输出：**Heatmap（H/4,W/4,80），Confidence（H/4,W/4,1），Hight&Width（H/4,W/4,2），Displacement prediction（H/4,W/4,2）