Better dense shape detection in live imagery with RetinaNet

Convolutional neural networks have come a long way in identifying objects in images and videos. Many networks, such as VGG19, ResNet, YOLO, SSD, RCNN, DensepathNet, DualNet, Xception, Inception, PolyNet and MobileNet, have evolved over time. They have been put to practical use in detecting available space in parking lots, satellite image analysis to track ships and agricultural output, radiology, people counting, reading text on vehicle license plates and storefronts, fault analysis of circuits and machinery, medical diagnosis and more.

The Facebook AI Research team has recently published the RetinaNet architecture, which is built on a Feature Pyramid Network (FPN) on top of ResNet. This architecture achieves higher accuracy and is useful in situations where speed is not very important.



Google Research offers a benchmark comparison of the tradeoff between speed and accuracy for various networks. Google used the MS COCO dataset to train these models in TensorFlow. The benchmark helps identify the models that best balance speed and accuracy. It concludes that Faster R-CNN is more accurate, whereas R-FCN and SSD are faster. Inception and ResNet are feature extractors commonly used with Faster R-CNN, while MobileNet is a feature extractor commonly used with SSD.

Across feature extractors, Faster R-CNN implementations reach the highest overall mAP (mean average precision), around 30, and the accuracy of their feature extractor is also the highest, at around 80.5%. The MobileNet R-FCN implementation has a lower mAP of around 15, with accuracy dropping to around 71.5%. For object detection, SSD implementations work best when the objects are large, while Faster R-CNN and R-FCN are better at detecting small objects.

Speed & Accuracy Benchmark


On the COCO dataset, Faster R-CNN has an mAP, averaged over IoU thresholds from 0.5 to 0.95 (mAP@[0.5, 0.95]), of 21.9%. R-FCN has an mAP of 31.5%. SSD300 and SSD512 have mAPs of 23.2% and 26.8% respectively. YOLO-V2 is at 21.6%, whereas YOLO-V3 is at 33%. FPN delivers 33.9%. RetinaNet stands highest at 40.8%.
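The IoU (intersection over union) thresholds behind mAP@[0.5, 0.95] measure how well a predicted box overlaps a ground-truth box. The following is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function name `iou` and the example boxes are illustrative and not taken from any benchmark code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero for boxes that do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged in mAP@[0.5, 0.95]: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

# Made-up boxes, for illustration only.
predicted = (10, 10, 50, 50)
ground_truth = (15, 15, 55, 55)

overlap = iou(predicted, ground_truth)
# A prediction counts as a match at every threshold its IoU reaches.
matched_at = [t for t in COCO_IOU_THRESHOLDS if overlap >= t]
print(f"IoU = {overlap:.3f}, matched at {len(matched_at)} of 10 thresholds")
```

A detector is rewarded at the stricter thresholds only when its boxes are tightly localized, which is why mAP@[0.5, 0.95] is lower than mAP at a single IoU of 0.5.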

COCO AP vs Inference time (ms)

The figure above compares two variations of RetinaNet on AP against inference time (ms).


Foreground and background classes and the need for balance


A one-stage detector scans around 100000 candidate object locations that densely cover the spatial extent of the image. Almost all of these locations are background, so the foreground and background classes are heavily imbalanced. A two-stage detector first narrows the candidates down to at most about 2000 locations and separates them from the background in the first stage. In the second stage, each candidate object is classified, so class balance is manageable. But because far fewer locations are sampled, some objects might escape detection. Faster R-CNN is an implementation of a two-stage detector. RetinaNet, an implementation of a one-stage detector, addresses this class imbalance while still efficiently detecting all objects.
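To see why this imbalance matters, the sketch below sums the ordinary cross-entropy loss over a hypothetical set of candidate locations. All counts and probabilities are assumptions chosen for illustration, not measurements from any detector.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy loss for one sample, where p_t is the probability
    the model assigns to the true class."""
    return -math.log(p_t)


# Hypothetical numbers, for illustration only.
n_easy_background = 100_000   # densely sampled background locations
p_t_easy = 0.99               # background is already classified confidently
n_hard_objects = 20           # a handful of real objects, still misclassified
p_t_hard = 0.10

loss_easy = n_easy_background * cross_entropy(p_t_easy)
loss_hard = n_hard_objects * cross_entropy(p_t_hard)

print(f"total loss from easy background: {loss_easy:.1f}")
print(f"total loss from hard objects:    {loss_hard:.1f}")
```

Under these assumed numbers, each easy background location contributes a tiny loss, yet together they outweigh the hard objects by more than twenty times, so the training signal is dominated by samples the model already handles well.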


A new loss function: Focal loss


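This function focuses training on hard negatives. It is defined as

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $p_t = p$ when the ground-truth label is the foreground class and $p_t = 1 - p$ otherwise, and p = sigmoid output score.

The Greek letters $\alpha_t$ and $\gamma$ are hyperparameters.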


When a sample is misclassified and p_t is small, the modulating factor (1 - p_t)^gamma is close to 1 and the loss is unaffected. Samples get down-weighted when they are well classified and p_t is close to 1. Gamma is the focusing parameter and adjusts the rate at which these easy samples are down-weighted. When gamma is 0, the focal loss reduces to the cross-entropy loss. As gamma increases, the effect of the modulating factor increases.
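The behaviour described above can be checked numerically. The following is a minimal sketch for a single binary prediction, assuming the standard form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function names are illustrative, and the defaults alpha = 0.25 and gamma = 2 are commonly cited values assumed here, not settings confirmed by this article.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: sigmoid output score for the foreground class, in (0, 1)
    y: ground-truth label, 1 for foreground and 0 for background
    """
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # alpha_t balances foreground against background (assumed convention).
    alpha_t = alpha if y == 1 else 1.0 - alpha
    modulating_factor = (1.0 - p_t) ** gamma
    return -alpha_t * modulating_factor * math.log(p_t)


def weighted_cross_entropy(p, y, alpha=0.25):
    """Focal loss with gamma = 0, i.e. alpha-weighted cross-entropy."""
    return focal_loss(p, y, alpha=alpha, gamma=0.0)


easy = focal_loss(p=0.95, y=1)   # well-classified foreground sample
hard = focal_loss(p=0.10, y=1)   # misclassified foreground sample

print("easy sample keeps", easy / weighted_cross_entropy(0.95, 1), "of its loss")
print("hard sample keeps", hard / weighted_cross_entropy(0.10, 1), "of its loss")
```

With gamma = 2, the well-classified sample at p_t = 0.95 has its loss scaled by (0.05)^2 = 0.0025, while the misclassified sample at p_t = 0.10 is scaled by (0.9)^2 = 0.81, so hard samples keep most of their weight.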


RetinaNet backbone

The new loss function, focal loss, increases accuracy significantly. Essentially, RetinaNet is a one-stage detector built on a Feature Pyramid Network, with the cross-entropy loss replaced by focal loss. Hard negative mining in single shot detectors and Faster R-CNN addresses the class imbalance by downsampling the dominant samples. In contrast, RetinaNet addresses it by changing the weights in the loss function. The architecture is illustrated in the following diagram.


ResNet is used for deep feature extraction, and FPN is used on top of ResNet to construct a multi-scale feature pyramid from a single-resolution image. FPN is fast to compute and works well across multiple scales.
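To give a feel for the multi-scale pyramid, the sketch below computes the feature map size at each pyramid level for a given input image. It assumes the configuration described in the RetinaNet paper, which this article does not spell out: levels P3 to P7, a stride of 2^l at level l, and 9 anchor boxes per location. Real implementations may round sizes differently depending on padding, and the function name `pyramid_shapes` is hypothetical.

```python
import math


def pyramid_shapes(height, width, levels=(3, 4, 5, 6, 7)):
    """Return {level: (rows, cols)}, assuming level l has stride 2**l."""
    shapes = {}
    for level in levels:
        stride = 2 ** level
        shapes[level] = (math.ceil(height / stride), math.ceil(width / stride))
    return shapes


ANCHORS_PER_LOCATION = 9  # assumed: 3 scales x 3 aspect ratios

shapes = pyramid_shapes(800, 800)
total_locations = sum(rows * cols for rows, cols in shapes.values())
total_anchors = total_locations * ANCHORS_PER_LOCATION

for level, (rows, cols) in shapes.items():
    print(f"P{level}: stride {2 ** level:>3}, feature map {rows} x {cols}")
print("candidate boxes:", total_anchors)
```

Under these assumptions, an 800 x 800 input yields on the order of 100000 candidate boxes, in line with the figure quoted earlier for one-stage detectors. The fine levels cover small objects and the coarse levels cover large ones.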



Here we have used ResNet50-FPN pre-trained on MS COCO to identify humans in photos. Only detections with a score above 0.5 are kept. The results are shown in the marked images below, along with confidence values.

Human Detection

Human Detection

We further tried to detect other objects, such as chairs.

Chairs Detection
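The filtering by score and by class used for the images above can be sketched as follows. This is not the code used for the experiments. The function name `filter_detections`, the (label, score) pair format and the sample values are all hypothetical.

```python
from collections import Counter


def filter_detections(detections, min_score=0.5, keep_labels=None):
    """Keep detections scoring above min_score, optionally restricted
    to a set of labels, sorted from most to least confident."""
    kept = []
    for label, score in detections:
        if score <= min_score:
            continue
        if keep_labels is not None and label not in keep_labels:
            continue
        kept.append((label, score))
    return sorted(kept, key=lambda d: d[1], reverse=True)


# Hypothetical raw detector output: (label, score) pairs made up for illustration.
raw = [
    ("human", 0.91), ("chair", 0.62), ("human", 0.48),
    ("bottle", 0.55), ("chair", 0.31), ("laptop", 0.70),
]

humans = filter_detections(raw, keep_labels={"human"})
everything = filter_detections(raw)
counts = Counter(label for label, _ in everything)

print("humans:", humans)
print("objects per class:", dict(counts))
```

Passing `keep_labels=None` removes the class filter, so every object above the score threshold is reported.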


In conclusion, it is great to see that a model trained on the COCO dataset can be used to detect objects in unknown scenes. Object detection in the scenes took between 5 and 7 seconds. So far we have filtered the results to humans or chairs, but RetinaNet can detect all the identifiable objects in a scene.

Dense Shape Detection

The objects detected, with their scores, are listed below:


human     0.74903154
human     0.7123633
laptop    0.69287986
human     0.68936586
bottle    0.67716646
human     0.66410005
human     0.5968385
chair     0.5855772
human     0.5802317
bottle    0.5792091
chair     0.5783555
chair     0.538948
human     0.52267283


Next, we are interested in working towards a model that is good at detecting objects at greater depth in the image, which the current ResNet50-FPN could not do.
