Convolutional neural networks have come a long way in identifying objects in images and videos. Many networks, such as VGG19, ResNet, YOLO, SSD, RCNN, DensepathNet, DualNet, Xception, Inception, PolyNet, MobileNet and many more, have evolved over time. They have been put to practical use in detecting free spaces in parking lots, analysing satellite images to track ships and agricultural output, radiology, counting people, reading words on vehicle license plates and storefronts, fault analysis of circuits and machinery, medical diagnosis and so on.
The Facebook AI Research team has recently published the RetinaNet architecture, which uses a Feature Pyramid Network (FPN) built on top of ResNet. This architecture achieves higher accuracy and is useful in situations where speed is not very important.
Google Research offers a benchmark comparison of the tradeoff between speed and accuracy for various networks. Google used the MS COCO dataset to train these models in TensorFlow. The comparison helps identify the models that best balance speed and accuracy. It concludes that Faster R-CNN is more accurate, whereas R-FCN and SSD are faster. In this comparison, Inception and ResNet are used as feature extractors in the Faster R-CNN implementations, and MobileNet in the SSD implementation.
Across feature extractors, the overall mAP (mean average precision) is highest for the Faster R-CNN implementations at around 30, and the corresponding accuracy is also highest at around 80.5%. The MobileNet R-FCN implementation has a lower mAP of around 15, with accuracy dropping to around 71.5%. For object detection, SSD implementations work best when the objects are large. Faster R-CNN and R-FCN are better at detecting small objects.
On the COCO dataset, Faster R-CNN has an mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP@[0.5, 0.95]) of 21.9%. R-FCN has an mAP of 31.5%. SSD300 and SSD512 have mAPs of 23.2% and 26.8% respectively. YOLO-V2 is at 21.6%, whereas YOLO-V3 is at 33%. FPN delivers 33.9%. RetinaNet stands highest at 40.8%.
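To make the mAP@[0.5, 0.95] notation concrete, here is a minimal sketch of the IoU (intersection over union) computation and the ten COCO-style thresholds over which precision is averaged. The box format (x1, y1, x2, y2) and the helper names are assumptions made for illustration; they are not taken from any particular library.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which the prediction counts as a match for the ground truth."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]
```

A prediction that overlaps its ground-truth box loosely only counts as correct at the lower thresholds, which is why a score averaged over 0.5 to 0.95 rewards tight localisation.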
Foreground and background classes and the need for balance
A one-stage detector scans for candidate objects sampled at around 100000 locations that densely cover the spatial extent of the image. Because almost all of these locations are background, the background and foreground classes are heavily imbalanced. A two-stage detector first narrows the candidates down to at most about 2000 locations and separates them from the background in the first stage. In the second stage, each candidate object is classified. This keeps the classes balanced, but because the number of sampled locations is very small, many objects might escape detection. Faster R-CNN is an implementation of a two-stage detector. RetinaNet, an implementation of a one-stage detector, addresses this class imbalance while efficiently detecting all objects.
A new loss function: Focal loss
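This function focuses training on hard negatives. It is defined as

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where p_t = p for a foreground sample and p_t = 1 - p otherwise, and p is the sigmoid output score.
The Greek letters α (the class-balancing weight) and γ (the focusing parameter) are hyperparameters.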
When a sample is misclassified and p_t is small, the modulating factor (1 - p_t)^γ is close to 1 and the loss is unaffected. Gamma is the focusing parameter and adjusts the rate at which easy samples are down-weighted. Samples get down-weighted when they are well classified and p_t is close to 1. When gamma is 0, the focal loss reduces to the (α-weighted) cross-entropy loss. As gamma increases, the effect of the modulating factor increases.
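The behaviour of the modulating factor can be checked numerically with a minimal pure-Python sketch of the loss for a single binary prediction. The defaults alpha=0.25 and gamma=2.0 are the values commonly reported for RetinaNet and are an assumption here, as is the small clipping constant used to avoid log(0).

```python
import math


def cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction; y is 1 (foreground) or 0 (background)."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one prediction: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy background sample (score 0.01) and a hard one (score 0.9).
    for name, p in (("easy", 0.01), ("hard", 0.9)):
        ce = cross_entropy(p, 0)
        fl = focal_loss(p, 0)
        print(f"{name}: CE={ce:.6f} FL={fl:.6f} FL/CE={fl / ce:.6f}")
```

With gamma set to 0 the modulating factor equals 1, so only the alpha weighting remains. With gamma at 2, the easy background sample contributes a tiny fraction of its cross-entropy loss, while the hard sample keeps most of it.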
The new loss function, called focal loss, increases accuracy significantly. Essentially, RetinaNet is a one-stage detector built on a Feature Pyramid Network, with the cross-entropy loss replaced by focal loss. Hard negative mining in single shot detectors and Faster R-CNN addresses the class imbalance by downsampling the dominant samples. In contrast, RetinaNet addresses it by changing the weights in the loss function. The architecture is shown in the following diagram.
ResNet is used for deep feature extraction, and FPN is used on top of ResNet to construct a multi-scale feature pyramid from a single-resolution image. FPN is fast to compute and works well across multiple scales.
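As a rough sketch of what the multi-scale pyramid means for a one-stage detector, the snippet below estimates the size of each pyramid level and the total number of anchor boxes for a given input size. It assumes pyramid levels P3 to P7 with stride 2**level and 9 anchors per location, which is the configuration described for RetinaNet but is not stated in this article. The ceiling division only approximates the feature-map sizes a real network would produce.

```python
import math

# Assumed configuration (not stated in this article): levels P3..P7,
# stride 2**level, and 9 anchor boxes per spatial location.
PYRAMID_LEVELS = (3, 4, 5, 6, 7)
ANCHORS_PER_LOCATION = 9


def pyramid_shapes(height, width, levels=PYRAMID_LEVELS):
    """Approximate (rows, cols) of the feature map at each pyramid level."""
    shapes = {}
    for level in levels:
        stride = 2 ** level
        shapes[level] = (math.ceil(height / stride), math.ceil(width / stride))
    return shapes


def total_anchors(height, width, levels=PYRAMID_LEVELS,
                  anchors_per_location=ANCHORS_PER_LOCATION):
    """Total number of anchor boxes summed over all pyramid levels."""
    shapes = pyramid_shapes(height, width, levels)
    return sum(rows * cols * anchors_per_location for rows, cols in shapes.values())


if __name__ == "__main__":
    # 640x640 is a hypothetical input size chosen for illustration.
    print(pyramid_shapes(640, 640))
    print(total_anchors(640, 640))
```

Under these assumptions the count lands in the tens of thousands for a moderately sized image, the same order of magnitude as the roughly 100000 candidate locations a one-stage detector scans.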
Here we have used ResNet50-FPN pre-trained on MS COCO and tried to identify humans in the photo. Only detections with a score above 0.5 are kept. The results are shown in the marked images below, along with confidence values.
We further tried to detect other objects like chairs.
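The filtering step described above can be sketched as follows. The detection format (a list of dicts with "label", "score" and "box" keys) and the function name are hypothetical, and the sample values are made up purely for illustration; the actual output format depends on the RetinaNet implementation used.

```python
def filter_detections(detections, wanted_label, score_threshold=0.5):
    """Keep detections of one class whose score is above the threshold,
    sorted from most to least confident."""
    kept = [
        d for d in detections
        if d["label"] == wanted_label and d["score"] > score_threshold
    ]
    return sorted(kept, key=lambda d: d["score"], reverse=True)


if __name__ == "__main__":
    # Made-up detections, for illustration only.
    sample = [
        {"label": "person", "score": 0.91, "box": (34, 50, 120, 300)},
        {"label": "chair", "score": 0.62, "box": (200, 180, 280, 310)},
        {"label": "person", "score": 0.42, "box": (300, 60, 360, 250)},
    ]
    for det in filter_detections(sample, "person"):
        print(det["label"], det["score"], det["box"])
```

Changing wanted_label to "chair" gives the chair-only view, and dropping the label check would return every class the model recognises.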
In conclusion, it is great to know that a model trained on the COCO dataset can be used to detect objects in unknown scenes. Object detection in the scenes took between 5 and 7 seconds. So far we have filtered the results to humans or chairs, but RetinaNet can detect all the identifiable objects in the scene.
The different objects detected, with their scores, are listed below.
Next, we are interested in working towards a model that is good at detecting objects at greater depth in the image, which the current ResNet50-FPN could not do.