NVIDIA AI Technical Center at Unimore
UNIMORE takes part in the NVIDIA AI Nation program, launched in Modena last January, with a local NVIDIA AI Technical Center, NVAITC@Unimore. The goal of the technical center is to support computationally intensive research activities, which at AImageLab range from video analysis to fully-attentive models and self-supervision. These activities are carried out in close cooperation between UNIMORE (and consequently AImageLab) and NVIDIA, and they share a need for high-performance computing due to their computational requirements: the collaboration opens up the possibility of tackling these tasks, also thanks to NVIDIA's expertise in the field.
At present, two research streams are active: one on human action recognition and one on transformer-based models, with a focus on Novel Object Captioning. In the near future we plan to open further research streams with intensive computational demands.
Human action recognition.
Deep architectures able to handle video clips have gained increasing attention in the last few years, especially for tasks involving human action understanding. Nevertheless, existing models are still far from being satisfactory when compared to state-of-the-art image-based deep networks. The third dimension, i.e. the time axis, strongly affects model complexity: how to handle the temporal domain and its heavy redundancy in order to keep this complexity under control is still an open question. In this project, with the support of NVAITC, we search for alternative solutions in the field of action recognition. The goal is to learn better spatio-temporal representations that can be easily transferred to a number of more specific tasks involving human behavior understanding, from spatio-temporal action localization to temporal action detection. We expect these models to require significant computational resources (from I/O to GPUs), and we aim to exploit the capabilities of the new accelerated partition of CINECA (Marconi100).
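One simple way to curb the temporal redundancy mentioned above is uniform frame subsampling before feeding a clip to a network. The following is only an illustrative NumPy sketch (the function name and clip layout are hypothetical, not part of the project's actual pipeline):

```python
import numpy as np

def subsample_clip(clip: np.ndarray, num_frames: int) -> np.ndarray:
    """Uniformly subsample a video clip along the time axis.

    clip: array of shape (T, H, W, C); num_frames: frames to keep.
    """
    t = clip.shape[0]
    # Evenly spaced frame indices spanning the whole clip.
    idx = np.linspace(0, t - 1, num_frames).astype(int)
    return clip[idx]

# Example: reduce a 64-frame clip to 8 frames.
clip = np.random.rand(64, 112, 112, 3)
short = subsample_clip(clip, 8)
print(short.shape)  # (8, 112, 112, 3)
```

Subsampling trades temporal resolution for an 8x reduction in the time dimension, which directly lowers the cost of any 3D convolution or attention applied afterwards.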
Novel Object Captioning.
Image Captioning is the task of producing natural language sentences that describe the visual content of an image. One of the most critical limitations of current models is that they are trained on image-caption pairs which provide only a shallow view of in-domain objects. Novel Object Captioning (NOC) goes a step further by trying to generalize the model to real-world scenarios, describing novel objects unseen during the training phase. Indeed, in this task, test images contain previously unobserved, 'novel' objects drawn from a target distribution that differs from the source/training distribution. These tasks have usually been tackled with Recurrent Neural Networks (RNNs) in most of the literature, but the recent advent of fully-attentive models has opened new opportunities in terms of performance and representation capabilities, as testified by the Transformer and BERT architectures. In this project, therefore, we will investigate the influence of the attentive paradigm, along with other emerging ones such as self-supervision, to enhance model effectiveness when describing unseen objects.
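The core operation behind the fully-attentive models cited above (Transformer, BERT) is scaled dot-product attention. A minimal NumPy sketch, purely for illustration (shapes and names are assumptions, not the project's implementation):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention over a set of feature vectors.

    q, k, v: arrays of shape (seq_len, d); returns an array of shape (seq_len, d).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # pairwise query-key similarities
    # Softmax over the key axis, with max-subtraction for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # attention-weighted sum of values

# Example: 4 image-region features of dimension 8 attending over themselves.
feats = np.random.rand(4, 8)
out = scaled_dot_product_attention(feats, feats, feats)
print(out.shape)  # (4, 8)
```

In a captioning model, such attention lets each generated word condition on every image region at once, instead of the sequential bottleneck of an RNN.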