Transforming Vision and Language with Attention
Abstract: Attention mechanisms and Transformer-based architectures have recently revolutionized the artificial intelligence landscape in almost every field. Since their introduction, they have become ubiquitous components of nearly every deep learning breakthrough, from Natural Language Processing to Computer Vision and Bioinformatics, mainly thanks to their superior ability to model long-range interactions across data. In this thesis, we investigate the frontier of Transformer-based architectures at the intersection of Vision and Language, where machines are required to replicate the human ability to semantically connect different domains.

In the first part, we present state-of-the-art solutions for the image captioning task, which consists of automatically describing images with natural language sentences, from the understanding of the visual content, objects, and their interactions, to the generation of a syntactically and semantically correct sentence. We first provide a thorough survey of the literature in the deep learning era, and we propose a novel image captioning model that is among the first to embrace self-attention in place of recurrent networks. Experimentally, our architecture reaches a new state of the art, achieving first place on the public leaderboard of the most important captioning benchmark. We then explore new training strategies, proposing a method based on the interplay between two distinct language models that exploits the mean teacher paradigm and knowledge distillation, providing state-of-the-art caption quality with a reduced number of parameters. Despite the remarkable results obtained by captioning models, moving to real-life scenarios remains challenging because of the larger variety of visual concepts not covered in existing datasets. For this reason, we propose a novel approach for novel object captioning that learns to select the most relevant objects in an image, regardless of their presence in the training set, and constrains the generative process accordingly.

We then present solutions for cross-modal retrieval, another task related to vision and language, which consists of finding the images corresponding to a given textual query and, vice versa, retrieving the texts that describe a given query image. Since both images and texts are usually encoded as sets or sequences of elements, we propose an attentive reduction method that transforms a set of elements into a single response, leading to a performance increase. Moreover, we propose an efficient Transformer architecture that bridges the gap between effectiveness and efficiency by learning a shared embedding space and distilling the scores obtained from fine-grained alignments. Our approach competes with state-of-the-art large models while being almost 90 times faster.

Moving to more complex and challenging scenarios, we also investigate visual-semantic models in the artistic and digital humanities domain. To this aim, we propose a cross-modal retrieval method that also identifies whether a sentence describes the visual content or the context of a painting, and a visual-semantic embedding that can automatically align illustrations and texts without paired supervision.

Finally, we expand the scope of attentive models to the language of life: the genetic code. We propose a new class of deep learning models based on the Perceiver architecture, which builds upon the Transformer, leverages asymmetric attention, and can scale to longer sequences. We present a model able to predict gene expression (mRNA level) from the corresponding DNA sequence and, for the first time, a model that predicts protein expression from the amino-acid sequence. We demonstrate the effectiveness of our methods and highlight promising future opportunities.
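Two of the attention-based components mentioned above, the attentive reduction of a set of elements into a single response and the Perceiver-style asymmetric attention over long sequences, can both be viewed as cross-attention with a small set of learned latent queries. The following is a minimal, illustrative sketch of that idea in PyTorch; it is not the thesis's actual implementation, and the class name, dimensions, and hyper-parameters are hypothetical.

import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Cross-attention from a small set of learned latent queries to an input set.

    Hypothetical, illustrative module: with num_latents=1 it reduces a set of
    features to a single vector; with more latents it mimics the asymmetric
    attention pattern of Perceiver-style models.
    """

    def __init__(self, dim: int = 256, num_latents: int = 1, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, dim), e.g. image region features or token embeddings.
        q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        # Queries come from the latents, keys and values from the input set,
        # so the attention cost grows with num_latents * set_size, not set_size ** 2.
        pooled, _ = self.attn(query=q, key=x, value=x)
        return self.norm(pooled)  # (batch, num_latents, dim)

if __name__ == "__main__":
    regions = torch.randn(2, 36, 256)                      # e.g. 36 detected image regions
    vector = LatentCrossAttention(num_latents=1)(regions)  # attentive reduction to one response
    print(vector.shape)                                    # torch.Size([2, 1, 256])

With num_latents=1 the module behaves as an attentive pooling of image regions or word embeddings; with more latents and a much longer input, such as a genomic sequence, the attention cost grows linearly in the input length rather than quadratically, which is the property Perceiver-style models exploit to scale to longer sequences.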
Citation:
Stefanini, Matteo. "Transforming Vision and Language with Attention." 2023.