Video Action Detection

Video action detection requires an algorithm to detect and classify human actions in a video clip. Tackling the problem requires addressing challenges that lie at the intersection of low-level and high-level video understanding. First, fine-grained and discriminative spatio-temporal features are needed to represent video chunks in a compact and manageable form.
Second, detecting and understanding human actions is not just a matter of extracting mid-level features; it also demands higher-level reasoning. We devise a high-level module for video action detection that considers interactions between different people in the scene as well as interactions between actors and objects. We also take long-range temporal dependencies into account by connecting consecutive clips during learning and inference.
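
As a rough illustration of the kind of reasoning described above, the sketch below refines each actor feature by attending over the other actors, the objects in the scene, and features carried over from neighbouring clips. It is a minimal sketch under stated assumptions, not the module proposed here: the use of scaled dot-product attention, the residual update, and all names (`attend`, `refine_actors`, `neighbor_clips`) are hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, context):
    """Scaled dot-product attention of one query vector over context vectors.

    Returns the weighted sum of the context rows and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * c for q, c in zip(query, row)) / math.sqrt(d)
        for row in context
    ]
    weights = softmax(scores)
    update = [
        sum(w * row[j] for w, row in zip(weights, context))
        for j in range(d)
    ]
    return update, weights


def refine_actors(actors, objects, neighbor_clips=()):
    """Refine actor features with actor-actor, actor-object and temporal context.

    actors, objects: lists of feature vectors from the current clip.
    neighbor_clips: iterable of lists of feature vectors from adjacent clips
    (an assumed way of linking consecutive clips, for illustration only).
    """
    context = list(actors) + list(objects)
    for clip in neighbor_clips:
        context.extend(clip)

    refined, all_weights = [], []
    for actor in actors:
        update, weights = attend(actor, context)
        # Residual update: keep the original feature and add the attended context.
        refined.append([a + u for a, u in zip(actor, update)])
        all_weights.append(weights)
    return refined, all_weights


# Toy features: two actors, one object, one feature from the previous clip.
actors = [[1.0, 0.0], [0.0, 1.0]]
objects = [[0.5, 0.5]]
previous_clip = [[1.0, 1.0]]
refined, weights = refine_actors(actors, objects, [previous_clip])
```

In this toy setup each actor attends over four context vectors (two actors, one object, one feature from the previous clip), so each row of `weights` has four entries that sum to one. A learned model would typically add projection layers before the dot product; they are omitted here to keep the example short.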

