Object detection is one of the fundamental tasks in computer vision: locating and classifying the objects in an image. Most classical detectors are based on convolutional architectures and rely on many hand-designed components, including region proposal mechanisms, anchor generation, and non-maximum suppression, to reach good detection performance. Detection Transformer (DETR) is a Transformer-based object detector that formulates detection as a set prediction problem. It eliminates the need for these hand-designed components and achieves performance comparable to well-tuned classical CNN-based detectors. However, DETR suffers from slow training convergence and performs poorly on small objects.
This project focuses on improving the performance of Detection Transformer (DETR) along two axes: detection accuracy and training efficiency. The DETR model will be trained on the PASCAL VOC dataset. The baseline model source code was provided, with fixed hyperparameters for training:
Learning rate (lr): 5e-5
Weight decay: 1e-4
Number of epochs for training: 10
Batch size: 4
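As a minimal sketch of how these fixed hyperparameters might be wired into a training setup, the snippet below builds a PyTorch AdamW optimizer (the optimizer family used in the original DETR recipe). The `Linear` model is a stand-in for illustration only; in the project it would be the provided DETR baseline.

```python
# Sketch only: plugging the project's fixed hyperparameters into an optimizer.
import torch

LR = 5e-5            # learning rate from the project spec
WEIGHT_DECAY = 1e-4  # weight decay from the project spec
NUM_EPOCHS = 10      # training epochs from the project spec
BATCH_SIZE = 4       # batch size from the project spec

# Placeholder model; the real project uses the provided DETR baseline.
model = torch.nn.Linear(8, 4)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY
)
print(optimizer.defaults["lr"], optimizer.defaults["weight_decay"])
```

The epoch count and batch size would then be consumed by the training loop and `DataLoader`, respectively.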
The three figures below show a COCO example processed by the DETR model with detection thresholds of 0.9, 0.7, and 0.0, respectively. These results show that a low threshold leads to higher recall: the model produces more detections, but at the cost of lower precision and more false positives. In contrast, a high threshold requires the model to be very confident before reporting a detection, resulting in fewer predicted boxes and more false negatives.
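The effect of the threshold can be sketched with a toy filter over per-box confidence scores. The scores below are hypothetical, not taken from the actual model output; the point is only that lowering the threshold monotonically increases the number of surviving detections.

```python
# Illustrative sketch (not the project code): thresholding detections
# by confidence, as done when visualizing DETR outputs.
def filter_detections(scores, threshold):
    """Keep only detections whose confidence exceeds the threshold."""
    return [s for s in scores if s > threshold]

# Hypothetical per-box confidence scores for one image.
scores = [0.95, 0.82, 0.71, 0.40, 0.12]

for t in (0.9, 0.7, 0.0):
    kept = filter_detections(scores, t)
    print(f"threshold={t}: {len(kept)} detections kept")
```

With these scores, a threshold of 0.9 keeps a single high-confidence box (high precision, more false negatives), while 0.0 keeps all five (high recall, more false positives), mirroring the behavior seen in the figures.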