Parallel Sensor Fusion Algorithm for Radar and Camera Data


We are going to implement a radar and camera sensor fusion algorithm for 2D object depth reconstruction on both multi-core CPU and GPU platforms, and measure the throughput and latency of the algorithm.


We want to explore parallelism in the context of edge AI. The application we target is reconstructing a 2D point cloud of cars from radar and camera sensor data. The goal is to achieve low latency and high throughput when processing scenes on constrained resources (the multi-core CPU and GPU provided by the course).

For the radar data, we need to implement an FFT to obtain the 2D distance map. For the image data, we will first run image segmentation to find object positions and, potentially in parallel, run monocular depth estimation to get an estimated depth heat map. We will then implement a simple sensor fusion algorithm to produce the final object depth point cloud.
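To make the radar branch concrete, here is a minimal sketch of the sequential baseline for the FFT stage: a naive O(N²) DFT over one radar chirp, followed by peak-bin extraction (in an FMCW radar, the dominant beat-frequency bin maps linearly to target range). The function names `dft_magnitude` and `peak_bin` are our own illustrative choices, not from the paper, and a real implementation would use a proper FFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

const double kPi = std::acos(-1.0);

// Naive O(N^2) DFT magnitude spectrum: the sequential baseline that the
// parallelized FFT will later replace.
std::vector<double> dft_magnitude(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> mag(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc{0.0, 0.0};
        for (std::size_t t = 0; t < n; ++t) {
            double angle = -2.0 * kPi * double(k) * double(t) / double(n);
            acc += x[t] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        mag[k] = std::abs(acc);
    }
    return mag;
}

// Strongest bin in the lower half of the spectrum (skipping DC); for an FMCW
// radar this beat-frequency index maps linearly to the target's range.
std::size_t peak_bin(const std::vector<double>& mag) {
    std::size_t best = 1;
    for (std::size_t k = 2; k < mag.size() / 2; ++k)
        if (mag[k] > mag[best]) best = k;
    return best;
}
```

Both loops over `k` are embarrassingly parallel, which is where the within-operation parallelism discussed below comes from.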

The performance metrics we pick are throughput and latency. We want to produce output at the highest frame rate we can achieve, while also keeping latency as low as possible, since low latency means high responsiveness to the input data. We will first implement a sequential version as the baseline, and then parallelize it on the multi-core CPU and GPU provided by the course.
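A minimal sketch of how we plan to collect these two metrics, using `std::chrono` wall-clock timing around the per-frame pipeline (the `measure` helper and `FrameStats` struct are hypothetical names for illustration):

```cpp
#include <cassert>
#include <chrono>
#include <functional>

struct FrameStats {
    double latency_ms;       // mean time to process one frame
    double throughput_fps;   // frames processed per second
};

// Runs the per-frame pipeline `frames` times and reports mean latency and
// overall throughput. For the real measurement we would also record
// per-frame latency distributions, not just the mean.
FrameStats measure(const std::function<void()>& process_frame, int frames) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i) process_frame();
    auto end = std::chrono::steady_clock::now();
    double total_ms =
        std::chrono::duration<double, std::milli>(end - start).count();
    return {total_ms / frames, frames * 1000.0 / total_ms};
}
```

Note that throughput and latency can diverge once the pipeline overlaps frames: a pipelined implementation may raise frames per second while each individual frame still takes just as long end to end.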

Parallelism exists at three levels: 1) within operations such as the FFT, 2) between operations such as image segmentation and monocular depth estimation, and 3) between sensors.
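The between-sensor level can be sketched with plain `std::thread`: the radar branch and the camera branch share no data until the fusion step, so they can run concurrently within a frame. The `sum` placeholder below stands in for the real per-sensor work (FFT on one side, segmentation plus depth estimation on the other); all names here are illustrative.

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Placeholder for real per-sensor work (FFT / segmentation / depth).
double sum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Sensor-level parallelism: radar and camera branches run on separate
// threads; fusion happens only after both joins, which is the sync point.
void process_frame(const std::vector<double>& radar,
                   const std::vector<double>& camera,
                   double& radar_out, double& camera_out) {
    std::thread radar_thread([&] { radar_out = sum(radar); });
    std::thread camera_thread([&] { camera_out = sum(camera); });
    radar_thread.join();
    camera_thread.join();
    // fusion would consume radar_out and camera_out here
}
```

The join before fusion is exactly the between-operation dependency mentioned above; levels 1 and 2 would then parallelize the work inside each branch.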


The main challenge is to learn and implement the algorithm in C++ and to parallelize the computation on GPU and multi-core architectures. The workload described in the “BACKGROUND” section has dependencies between operations, and both the FFT and the image processing algorithms exhibit spatial locality. We think the communication-to-computation ratio depends on the cache size relative to the sensor data size, which determines whether a frame of sensor data fits entirely in cache. The CPU has a constrained number of cores, and parallelizing across sensors on the GPU is hard because GPUs favor applications with instruction-stream coherence.


We are going to implement the naive algorithm from the paper “Exploring mmWave Radar and Camera Fusion for High-Resolution and Long-Range Depth Imaging”. We need to learn the algorithm and implement and parallelize it from scratch. We do not have the entire dataset for this course, so we will use only a few samples to validate performance.