CRoss-modal understanding and gEnerATIon of Visual and tExtual content

Recent advances in two key fields of AI, Computer Vision (CV) and Natural Language Processing (NLP), have shown that textual and visual information can be processed effectively, opening up new possibilities at both the scientific and the industrial level. While content is increasingly available in mixed modalities (text, images, videos, etc.), most efforts have focused on monomodal deep learning approaches that deal primarily with either text or visual content. The need to process mixed content is particularly important in creative contexts, where the combination of text and visual content can enable new possibilities.

The convergence of deep learning with Computer Vision and NLP has enabled not only effective understanding and retrieval but also, more recently, the generation of textual and visual information. However, the two communities often work independently, missing the benefit of innovative approaches that work cross-modally, i.e. moving from text to visual content and vice versa, or even mixing them. What is still missing is a unified methodology that integrates the different modalities seamlessly through multimodal processing of visual and textual data.

The CREATIVE project aims to take a substantial step forward by investigating and developing innovative cross-modal neural models that can manipulate and transform different types of data seamlessly. CREATIVE will enable:

  • cross-modal processing of textual and visual input to create efficient and reusable representations in a shared space
  • cross-modal understanding of textual and visual content for retrieval of digital data
  • cross-modal generation, e.g. producing text from visual content and vice versa, including mixed content.
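
As a minimal illustration of what retrieval in a shared space can look like, the sketch below ranks a few visual items against a text query by cosine similarity. It assumes that text and visual encoders (not shown here) have already mapped their inputs to vectors of the same dimension; the vectors, file names and function names are hypothetical placeholders and do not represent the project's models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, gallery, top_k=1):
    """Return the names of the top_k gallery items most similar to the query."""
    ranked = sorted(
        gallery,
        key=lambda name: cosine(query_vec, gallery[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical hand-written embeddings standing in for visual encoder outputs.
image_embeddings = {
    "dress.jpg": [0.9, 0.1, 0.0],
    "pasta.jpg": [0.1, 0.9, 0.1],
    "news_clip.mp4": [0.0, 0.2, 0.9],
}

# Hypothetical embedding standing in for an encoded text query about a dress.
text_query = [0.8, 0.2, 0.1]

best = retrieve(text_query, image_embeddings)
print(best)
```

Because both modalities live in the same vector space, the same function can be called with a visual embedding as the query and a gallery of text embeddings, which is the sense in which the representations are reusable.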

At the core of the project lies a new unifying paradigm that aims to find synergies between supervised neural networks (going beyond current convolutional autoencoders, GANs, Transformer-based NNs, Capsules and graph-based networks) and symbolic representations, such as those obtained from multilingual lexical-semantic knowledge graphs.


This 3-year project brings together the research experience and expertise of three internationally recognized research teams: the NLP Group at Roma Sapienza, AImageLab at UNIMORE and the Mhug Group at UNITN in multimedia and human understanding, encompassing NLP, vision and multimedia. The project proposes foundational research with direct industrial exploitation. Three use cases have been proposed with clearly identified stakeholders: cross-modal media retrieval and image animation with RAI; image, video and textual generation for virtual try-on services in the fashion industry (with the YNAP luxury e-commerce company); and cross-modal retrieval and generation of recipes and food descriptions for restaurant e-commerce (with Italian Restaurants). We foresee an enormous potential benefit for made-in-Italy industries, as well as in paving the way to new research directions in several areas of AI.


1 Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita "Retrieval-Augmented Transformer for Image Captioning" Proceedings of the 19th International Conference on Content-based Multimedia Indexing, Graz, Austria, 2022 | DOI: 10.1145/3549555.3549585 Conference
2 Morelli, Davide; Fincato, Matteo; Cornia, Marcella; Landi, Federico; Cesari, Fabio; Cucchiara, Rita "Dress Code: High-Resolution Multi-Category Virtual Try-On" Proceedings of the European Conference on Computer Vision (Lecture Notes in Computer Science), vol. 13668, Tel Aviv, pp. 345-362, October 23-27, 2022 | DOI: 10.1007/978-3-031-20074-8_20 Conference

Project Info




01/06/2022 - 01/06/2025

Project Number

MIUR_PRIN 2020 2020ZSL9F9

Funded by:

MUR - Italian Ministry of University and Research

Project type:

PRIN (National Interest Research Project)