Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning

Abstract: Controllable image captioning has recently gained attention as a way to increase the diversity of image captioning algorithms and their applicability to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which must be followed during the generation of the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model outperforms previous methods and achieves a new state of the art on both image and video controllable captioning.


Citation:

Cornia, Marcella; Baraldi, Lorenzo; Tal, Ayellet; Cucchiara, Rita. "Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning". COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 237, pp. 1-10, 2023. DOI: 10.1016/j.cviu.2023.103857
