Detecting and recognizing objects in video is a core component of many video tasks, including indexing, segmentation, action and event recognition, and surveillance.
In spite of the remarkable progress in image-based object detectors and region proposal networks, Convolutional Neural Network (CNN) based methods are inherently compute-intensive, so applying object detectors to video in a real-world use case is not straightforward.
One could simply run the object detector uniformly over a subset of the video frames and use temporal association as post-processing, but this approach will not yield the best results in terms of mean average precision (mAP), since different parts of the video may be more challenging than others.
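For concreteness, the naive baseline above can be sketched as follows. This is an illustrative simplification, not the paper's method: the `detector` callable, the fixed `stride`, and the single-box-per-frame assumption are all hypothetical (a real detector returns many scored boxes per frame), and linear interpolation of boxes stands in for a full temporal-association step.

```python
# Naive baseline sketch: run a black-box detector on every `stride`-th frame,
# then fill the skipped frames by linearly interpolating the boxes between
# neighboring detected frames (a very simple temporal association).
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def detect_uniform(frames: List, detector: Callable[[object], Box],
                   stride: int) -> Dict[int, Box]:
    """Run the (expensive) detector only on every `stride`-th frame."""
    return {i: detector(f) for i, f in enumerate(frames) if i % stride == 0}


def interpolate_boxes(dets: Dict[int, Box]) -> Dict[int, Box]:
    """Fill skipped frames by linear interpolation between detected frames."""
    keys = sorted(dets)
    out = dict(dets)
    for a, b in zip(keys, keys[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)  # interpolation weight in [0, 1]
            out[t] = tuple((1 - w) * p + w * q
                           for p, q in zip(dets[a], dets[b]))
    return out


if __name__ == "__main__":
    # Toy "frames" and a toy detector whose box drifts with the frame index.
    frames = list(range(5))
    toy_detector = lambda f: (float(f), float(f), f + 10.0, f + 10.0)
    sparse = detect_uniform(frames, toy_detector, stride=2)   # frames 0, 2, 4
    dense = interpolate_boxes(sparse)                         # frames 0..4
    print(dense[1])  # interpolated between frame 0 and frame 2
```

As the abstract notes, the weakness of this scheme is that the stride is fixed in advance, so it spends the same compute on easy and hard segments of the video alike.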
Another approach is to optimize the object detector itself using model compression to achieve higher throughput; however, this requires additional human effort.
Our model treats the object detection model as a black box: given a new object detection model, no additional human effort is required to apply it to video. We use a novel approximation technique both to generate frame-level detections and to leverage tubelets to produce more accurate results.
Overall, our model both dramatically cuts processing time and achieves higher accuracy (mAP), and it can be applied to various object detection architectures, since we are agnostic to the architecture and treat it as a black box.