3D Object Detection based on 3D LiDAR Point Clouds

Technical details of the implementation

1. Network architecture

The model is a ResNet-based Keypoint Feature Pyramid Network (KFPN), as proposed in the RTM3D paper.
Input:
- The model takes a birds-eye-view (BEV) map as input.
- The BEV map has three channels that encode the height, intensity, and density of the 3D LiDAR point cloud. Assume that the size of the BEV input is (H, W, 3).
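
The encoding described above could be sketched with NumPy as follows. This is a minimal illustration, not the project's actual preprocessing: the detection ranges, the 608 × 608 map size, the per-cell maximum used for height and intensity, the log-based density normalization, and the channel order are all assumptions.

```python
import numpy as np

def make_bev_map(points, x_range=(0.0, 50.0), y_range=(-25.0, 25.0),
                 z_range=(-2.73, 1.27), height=608, width=608):
    """Encode (N, 4) points [x, y, z, intensity] as an (H, W, 3) BEV map.

    All ranges and the map size are illustrative assumptions.
    """
    pts = np.asarray(points, dtype=np.float64)

    # Keep only the points inside the assumed detection volume.
    inside = ((pts[:, 0] >= x_range[0]) & (pts[:, 0] < x_range[1]) &
              (pts[:, 1] >= y_range[0]) & (pts[:, 1] < y_range[1]) &
              (pts[:, 2] >= z_range[0]) & (pts[:, 2] < z_range[1]))
    pts = pts[inside]

    # Discretize x and y into row and column indices of the grid.
    rows = ((pts[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * height).astype(np.int64)
    cols = ((pts[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * width).astype(np.int64)
    rows = np.clip(rows, 0, height - 1)
    cols = np.clip(cols, 0, width - 1)

    height_map = np.zeros((height, width))
    intensity_map = np.zeros((height, width))
    counts = np.zeros((height, width))

    # Height channel: highest normalized z value that falls into each cell.
    z_norm = (pts[:, 2] - z_range[0]) / (z_range[1] - z_range[0])
    np.maximum.at(height_map, (rows, cols), z_norm)
    # Intensity channel: largest intensity in each cell (assumed rule).
    np.maximum.at(intensity_map, (rows, cols), pts[:, 3])
    # Density channel: log-normalized point count, capped at 1 (assumed rule).
    np.add.at(counts, (rows, cols), 1.0)
    density_map = np.minimum(1.0, np.log(counts + 1.0) / np.log(64.0))

    return np.stack([height_map, intensity_map, density_map], axis=-1).astype(np.float32)
```

The unbuffered `np.maximum.at` and `np.add.at` calls are used because several points can fall into the same cell, and plain fancy-index assignment would keep only one of them.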
Outputs:
- Main center heatmap: (H/S, W/S, C), where S = 4 is the down-sampling ratio and C = 3 is the number of classes
- Center offset: (H/S, W/S, 2)
- The heading angle (yaw): (H/S, W/S, 2). The model estimates the imaginary and real parts of the heading, i.e. sin(yaw) and cos(yaw).
- Dimension (h, w, l): (H/S, W/S, 3)
- z coordinate: (H/S, W/S, 1)
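
As a quick sanity check on the output sizes listed above, the small helper below computes the shape of each head from the input size. The function name and the dictionary keys are hypothetical labels chosen for this sketch, and it assumes that H and W are divisible by S.

```python
def head_shapes(h, w, num_classes=3, down_ratio=4):
    """Return the output shape of each head for a BEV input of size (h, w, 3).

    The dictionary keys are illustrative labels, not confirmed layer names.
    """
    if h % down_ratio or w % down_ratio:
        raise ValueError("h and w are assumed to be divisible by down_ratio")
    fh, fw = h // down_ratio, w // down_ratio
    return {
        "heatmap": (fh, fw, num_classes),  # main center heatmap, one channel per class
        "center_offset": (fh, fw, 2),      # sub-cell offset of the center
        "direction": (fh, fw, 2),          # sin(yaw) and cos(yaw)
        "dimension": (fh, fw, 3),          # h, w, l
        "z": (fh, fw, 1),                  # z coordinate of the center
    }
```

For a hypothetical 608 × 608 input, every head has a spatial size of 152 × 152, and the heads produce 11 channels in total.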
Targets: 7 degrees of freedom (7-DOF) of objects: (cx, cy, cz, l, w, h, θ)
- cx, cy, cz: The center coordinates.
- l, w, h: length, width, height of the bounding box.
- θ: The heading angle in radians of the bounding box.
Objects: Cars, Pedestrians, Cyclists.

Losses:
- Main center heatmap: focal loss.
- Heading angle (yaw): the imaginary and real parts are regressed directly with an L1 loss.
- z coordinate and the 3 dimensions (height, width, length): balanced L1 loss, as proposed in the paper "Libra R-CNN: Towards Balanced Learning for Object Detection".
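
The three losses named above can be sketched in NumPy as follows. Two points are assumptions: the text only says "focal loss", so the penalty-reduced variant commonly used for Gaussian center heatmaps is shown here, and the hyperparameters (alpha, beta, gamma) are commonly used defaults rather than values confirmed for this project.

```python
import numpy as np

def heatmap_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss for center heatmaps (assumed variant).

    pred holds probabilities in (0, 1); target is 1 at object centers and
    decays towards 0 around them.
    """
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    pos = target == 1.0
    # Positive cells: down-weight predictions that are already confident.
    pos_loss = -((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    # Negative cells: cells close to a center are penalized less.
    neg_loss = -((1.0 - target) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos].sum()
    # Normalize by the number of object centers (at least 1).
    return float((pos_loss + neg_loss) / max(int(pos.sum()), 1))

def l1_loss(pred, target):
    """Mean absolute error, e.g. for the (sin, cos) heading outputs."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5, beta=1.0):
    """Balanced L1 loss in the form given by Libra R-CNN."""
    diff = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64))
    # b is chosen so that the two branches meet continuously at diff == beta.
    b = np.exp(gamma / alpha) - 1.0
    small = alpha / b * (b * diff + 1.0) * np.log(b * diff / beta + 1.0) - alpha * diff
    large = gamma * diff + gamma / b - alpha * beta
    return float(np.mean(np.where(diff < beta, small, large)))
```

Compared with a plain L1 loss, the balanced form increases the gradient contribution of small errors (inliers) while keeping the gradient of large errors bounded by gamma.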

Inference:
- A 3 × 3 max-pooling operation is applied to the center heatmap, and then at most 50 predictions with a center confidence above 0.2 are kept.
- The heading angle is recovered as yaw = arctan(imaginary part / real part); the two-argument form atan2(imaginary part, real part) preserves the quadrant.
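
The decoding steps above can be sketched as follows. This is an illustration under assumptions: the heatmap is taken to hold already-activated scores in [0, 1], the top 50 candidates are selected across all classes together, and the function names are hypothetical.

```python
import math
import numpy as np

def extract_peaks(heatmap, k=50, threshold=0.2):
    """Return up to k local maxima of an (H, W, C) heatmap as
    (score, row, col, class) tuples, sorted by descending score."""
    heatmap = np.asarray(heatmap, dtype=np.float64)
    h, w, _ = heatmap.shape
    padded = np.pad(heatmap, ((1, 1), (1, 1), (0, 0)),
                    mode="constant", constant_values=-np.inf)

    # 3 x 3 max pooling with stride 1, applied per class channel.
    pooled = np.full_like(heatmap, -np.inf)
    for dr in range(3):
        for dc in range(3):
            pooled = np.maximum(pooled, padded[dr:dr + h, dc:dc + w, :])

    # A cell is a peak if it equals the maximum of its 3 x 3 neighbourhood.
    peaks = np.where(pooled == heatmap, heatmap, 0.0)

    # Take the k highest-scoring peaks, then drop those below the threshold.
    flat = peaks.ravel()
    order = np.argsort(flat)[::-1][:k]
    order = order[flat[order] > threshold]
    rows, cols, classes = np.unravel_index(order, peaks.shape)
    return [(float(flat[i]), int(r), int(c), int(cls))
            for i, r, c, cls in zip(order, rows, cols, classes)]

def decode_yaw(imaginary, real):
    """Recover the heading angle in radians from its sin and cos estimates."""
    return math.atan2(imaginary, real)
```

The max-pooling step acts as a cheap non-maximum suppression: a cell next to a stronger neighbour never equals its own pooled value, so only one detection survives per local cluster. Using `atan2` rather than `atan(imaginary / real)` matters for headings outside (-π/2, π/2), where the plain quotient would be off by π.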

The model was evaluated with the Intersection over Union (IoU) metric. It reached an IoU of 0.87 after 200 epochs of training.
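
For reference, IoU is the area of overlap between a predicted box and a ground-truth box divided by the area of their union. The text does not say which variant was used (2D BEV, rotated, or 3D), so the sketch below assumes the simplest case of axis-aligned 2D boxes; the function name and box layout are illustrative.

```python
def iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Boxes with a heading angle, as predicted by this model, would require a rotated-polygon intersection instead of the min/max overlap used here.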

The model can be trained with more classes and a larger detection area by modifying the configuration in the config/kitti_dataset.py file, which could lead to better results.