
Attentive Multimodal Deep Learning Architectures for Visual-Semantic Understanding
Abstract: Computer Vision has experienced rapid advancements in recent years, driven by the advent of attentive and Transformer-based models. These architectures have revolutionized the field, enabling complex data interactions and pushing the boundaries of Artificial Intelligence (AI). Central to this evolution is attention modeling, which facilitates a sophisticated and nuanced understanding of diverse data types, such as text, images, and videos. The integration of these data types has given rise to Multimodal Deep Learning, which aims to emulate human-like perception and reasoning across multiple modalities, enhancing performance and broadening AI applications in fields like healthcare and autonomous vehicles. The research presented in this thesis investigates critical challenges associated with multimodal attentive architectures, including improving semantic segmentation accuracy, enabling open-vocabulary segmentation, and advancing video question answering.

A critical challenge in semantic segmentation is accurately delineating object boundaries between different semantic classes. Misclassification of pixels in these transition areas can lead to errors that affect downstream tasks. To address this issue, we introduce novel boundary-level objectives and develop modified geometric distance functions to enhance boundary accuracy in complex environments. Additionally, we emphasize the importance of comprehensive evaluation metrics by proposing a fine-grained error analysis method for semantic segmentation. This approach provides deeper insights into model performance and facilitates targeted improvements in segmentation models.

Building upon this foundation, we explore open-vocabulary semantic segmentation, a cutting-edge multimodal task that enables the segmentation of arbitrary categories expressed in textual form. We introduce innovative approaches such as prototype retrieval and synthetic references to bridge the gap between global features and pixel-level semantics. These methods effectively address the domain shift problem and enable open-vocabulary segmentation capabilities without relying on extensive training or large annotated datasets.

Significant contributions are made to enhance Vision Transformer (ViT) architectures for semantic segmentation. A novel superpixel-based positional encoding technique is proposed, integrating semantic priors with self-attentive features to improve performance without increasing model complexity. Our research also investigates two-dimensional downsampling in ViT models and self-supervised learning techniques, aiming to increase efficiency while boosting performance in visual tasks such as image classification.

In the realm of multimodal video understanding, we present a text-guided temporal querying transformer for video question answering. This component effectively bridges frame-wise visual perception with the reasoning capabilities of large language models, advancing multimodal video comprehension. The broader implications of multimodal deep learning are further explored through a systematic study of multimodal deepfake detection. By leveraging contrastive-based disentangling strategies, we analyze the interplay between textual semantics and low-level visual cues in the context of advanced diffusion models.
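To make the boundary-level objectives described above more concrete, the following is a minimal, self-contained sketch of a boundary-weighted segmentation loss. It assumes a distance-transform-based weighting around ground-truth class boundaries; the function names, the Gaussian decay, and the `sigma` and `lam` parameters are illustrative assumptions, not the thesis's actual formulation.

```python
# Sketch of a boundary-weighted segmentation objective (illustrative only).
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt


def boundary_weight_map(labels: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Per-pixel weights that peak on class boundaries and decay with distance."""
    # A pixel is marked as boundary if a 4-neighbour carries a different label.
    boundary = np.zeros_like(labels, dtype=bool)
    boundary[:-1, :] |= labels[:-1, :] != labels[1:, :]
    boundary[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    # Euclidean distance of every pixel to the nearest boundary pixel.
    dist = distance_transform_edt(~boundary)
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))


def boundary_aware_ce(logits: torch.Tensor, target: torch.Tensor,
                      weight_map: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Standard cross-entropy plus a boundary-weighted term (logits: B x C x H x W)."""
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # B x H x W
    return per_pixel.mean() + lam * (weight_map * per_pixel).mean()


# Toy usage: a 2-class 8x8 label map with a vertical boundary.
labels = np.zeros((8, 8), dtype=np.int64)
labels[:, 4:] = 1
w = torch.from_numpy(boundary_weight_map(labels)).float().unsqueeze(0)
logits = torch.randn(1, 2, 8, 8, requires_grad=True)
target = torch.from_numpy(labels).unsqueeze(0)
loss = boundary_aware_ce(logits, target, w)
loss.backward()
```

The design choice illustrated here is to keep the usual per-pixel term and add a second term that concentrates the gradient signal on the transition areas where misclassifications are most damaging.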
The research presented in this thesis spans a wide spectrum of computer vision challenges, from low-level semantic segmentation to high-level video reasoning and deepfake detection. By developing novel methodologies and architectures, we contribute to expanding the possibilities of artificial visual intelligence. Our findings have implications for various applications, including medical imaging, robotics, and autonomous driving, paving the way for future research in visual-semantic understanding.
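Likewise, the text-guided temporal querying transformer mentioned above can be pictured with a short sketch. This is a hypothetical Q-Former-style module, assuming a fixed set of learnable queries that, steered by the question's text embeddings, cross-attend over per-frame visual tokens and emit a compact prefix for a large language model; all dimensions and layer counts are illustrative assumptions, not the architecture used in the thesis.

```python
# Sketch of a text-guided temporal querying module (illustrative only).
import torch
import torch.nn as nn


class TextGuidedTemporalQuerying(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 32,
                 num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        # Learnable query slots that will hold the compressed video evidence.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor):
        """frame_feats: B x (T*N) flattened frame tokens x D,
        text_feats: B x L question embeddings x D."""
        batch = frame_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Text tokens steer the queries through self-attention inside the
        # decoder; cross-attention pools temporal visual evidence into them.
        tgt = torch.cat([queries, text_feats], dim=1)
        out = self.decoder(tgt=tgt, memory=frame_feats)
        # Keep only the query slots as the compressed video representation.
        return out[:, : self.queries.size(0)]


# Toy usage: 2 videos, 8 frames x 16 tokens each, a 12-token question.
module = TextGuidedTemporalQuerying()
video_tokens = torch.randn(2, 8 * 16, 256)
question_tokens = torch.randn(2, 12, 256)
llm_prefix = module(video_tokens, question_tokens)  # 2 x 32 x 256
```

Compressing an arbitrary number of frame tokens into a fixed number of query slots is what lets frame-wise visual perception plug into the fixed-length context of a language model's reasoning.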
Citation:
Amoroso, Roberto. "Architetture Multimodali Attentive di Deep Learning per la Comprensione Visivo-Semantica" [Attentive Multimodal Deep Learning Architectures for Visual-Semantic Understanding]. Thesis, 2025.