Learning to Describe Salient Objects in Images with Vision and Language
Abstract: Replicating the human ability to connect vision and language has recently attracted considerable attention in computer vision, artificial intelligence, and natural language processing, resulting in new models and architectures capable of automatically describing images with textual descriptions. This task, called image captioning, requires not only recognizing the salient objects in an image and understanding their interactions, but also verbalizing them in natural language, which makes it very challenging. In this thesis, we present state-of-the-art solutions for these problems, covering all aspects involved in the generation of natural sentences. When humans describe a scene, they look at an object before naming it in a sentence, as selective mechanisms draw their gaze to salient and relevant parts of the scene. Motivated by the importance of automatically estimating the human focus of attention on images, the first part of the dissertation introduces two saliency prediction models based on deep neural networks. In the first model, we combine image features extracted at different levels of a convolutional neural network to estimate the saliency of an image. In the second model, we instead employ a recurrent architecture together with neural attentive mechanisms that focus on the most salient regions of the input image to iteratively refine the predicted saliency map. Although saliency prediction identifies the most relevant regions of an image, it has never been incorporated into a captioning architecture, even though such supervision could result in better image captioning performance. Following this intuition, we show how incorporating saliency prediction can effectively enhance the quality of image descriptions, and we introduce a captioning model that extends the classical machine attention paradigm to take into account salient regions as well as the context of the image.
Inspired by the recent advent of fully attentive models, we also investigate the use of the Transformer model in image captioning and propose a novel captioning architecture in which the recurrent relation is abandoned in favor of self-attention. While an image can be described in multiple ways, standard captioning approaches provide no way of controlling which regions are described and what importance is given to each region. This lack of controllability creates a gap between human and machine intelligence, as humans can manage the variety of ways in which an image can be described and select the most appropriate one depending on the task and the context at hand. Most importantly, it also limits the applicability of captioning algorithms to complex scenarios in which some control over the generation process is needed. To explicitly address these shortcomings, we present an image captioning model that can generate diverse natural language captions depending on a control signal, which can be given either as a sequence or as a set of image regions that need to be described. On a side note, we also explore a different application scenario that requires conditioning the language model, i.e., that of naming characters in movies. In the last part of the thesis, we present solutions for cross-modal retrieval, another task related to vision and language that consists of finding images corresponding to a given textual query, and vice versa. Finally, we also show the application of retrieval techniques in a challenging scenario, i.e., that of digital humanities and cultural heritage, obtaining promising results using both supervised and unsupervised models.
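As an illustration of the cross-modal retrieval task described above, the following minimal sketch ranks images for a textual query, and texts for an image query, by cosine similarity. It assumes that images and sentences have already been mapped into a shared embedding space; that assumption, the toy vectors, and the names `cosine`, `rank`, `image_embeddings`, and `text_embeddings` are hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, gallery):
    """Return the gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)


# Hypothetical toy embeddings in an assumed shared 3-dimensional space.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.0, 0.2, 0.9],
}
text_embeddings = {
    "a dog playing": [0.8, 0.2, 0.1],
    "waves on the sand": [0.2, 0.8, 0.1],
}

# Text-to-image retrieval: rank all images for one sentence.
text_to_image = rank(text_embeddings["a dog playing"], image_embeddings)

# Image-to-text retrieval ("vice versa"): rank all sentences for one image.
image_to_text = rank(image_embeddings["beach.jpg"], text_embeddings)
```

The same ranking function serves both directions because both modalities are assumed to live in one space; how such a space is learned is outside the scope of this sketch.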
Citation: Cornia, Marcella. "Imparare a descrivere gli oggetti salienti presenti nelle immagini tramite la visione e il linguaggio." 2020.