
Multimodal Image Editing
Multimodal-conditioned fashion image editing aims to generate realistic images of a person wearing new garments by conditioning on multiple types of input. Unlike standard text-to-image generation, this task leverages a combination of modalities — such as human pose, garment sketches, and textual descriptions — to guide the image synthesis process. Our research explores how to effectively integrate these diverse constraints to enable fine-grained control over garment appearance and fit, while preserving the person’s identity and pose.

Textual-inverted Multimodal Garment Designer
Abstract. Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities, modifying the structure of the denoising network to take multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let different cross-attention layers of the denoising network attend to textual and texture information, thus injecting conditioning details at different levels of granularity. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence with respect to the provided multimodal inputs.
Keywords: Latent Diffusion Models, Textual Inversion, Generative AI
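The core of the textual-inversion idea described above can be illustrated with a minimal, self-contained sketch: a frozen "text encoder" (here just a fixed random linear map, a hypothetical stand-in) stays untouched while a single new pseudo-token embedding is optimized so that its encoding matches a target texture feature. All names and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "text encoder": a fixed linear map from token
# embeddings (dim 64) to the conditioning space (dim 32) attended to
# by the cross-attention layers. Its weights are never updated.
ENC = rng.standard_normal((64, 32)) * 0.1

def encode(token_embedding):
    return token_embedding @ ENC

# Target: a feature vector extracted from a fabric-texture patch
# (random here, standing in for a real texture-encoder output).
texture_feat = rng.standard_normal(32)

# Textual inversion: learn ONE new pseudo-token embedding v_star so
# that the frozen encoder maps it close to the texture feature.
v_star = rng.standard_normal(64) * 0.01
lr = 0.05
for step in range(2000):
    residual = encode(v_star) - texture_feat
    grad = 2 * ENC @ residual          # gradient of ||v @ ENC - t||^2
    v_star -= lr * grad

final_loss = float(np.sum((encode(v_star) - texture_feat) ** 2))
print(f"final loss: {final_loss:.6f}")
```

After optimization, the learned embedding can be inserted into a prompt in place of an ordinary word token, so the same cross-attention mechanism that handles text also carries the texture information.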
Multimodal Garment Designer
Publication:
"Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing"
A. Baldrati, D. Morelli, G. Cartella, M. Cornia, M. Bertini, R. Cucchiara
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Abstract. Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Unlike previous works, which mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs.
Keywords: Latent Diffusion Models, Generative AI
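A rough sketch of how the spatial multimodal inputs could be assembled for the denoising network: pose map and garment sketch are concatenated channel-wise with the noisy latent before the first convolution, while text conditions the network through cross-attention instead. The shapes and channel counts below are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

# Hypothetical latent resolution (e.g. 512x384 images downsampled 8x
# by the VAE encoder of a latent diffusion model).
H, W = 64, 48

noisy_latent = np.random.randn(4, H, W)   # 4-channel noisy latent z_t
inpaint_mask = np.random.rand(1, H, W)    # region of the body to redraw
pose_map     = np.random.rand(18, H, W)   # one heatmap per body keypoint
sketch       = np.random.rand(1, H, W)    # binarized garment sketch

# Spatial modalities are stacked along the channel axis; the first
# convolution of the denoising U-Net is widened to accept this input.
denoiser_input = np.concatenate(
    [noisy_latent, inpaint_mask, pose_map, sketch], axis=0
)
print(denoiser_input.shape)  # (24, 64, 48)
```

Because each modality is resized to the latent resolution, adding or removing a conditioning signal only changes the input channel count of the first layer, leaving the rest of the denoising network unchanged.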
Publications
1. Cartella, Giuseppe; Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita, "OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data", Proceedings of the 22nd International Conference on Image Analysis and Processing, vol. 14233, Udine, Italy, pp. 245-256, September 11-15, 2023. DOI: 10.1007/978-3-031-43148-7_21 (Conference)
2. Baldrati, Alberto; Morelli, Davide; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita, "Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, pp. 23336-23345, October 2-6, 2023. DOI: 10.1109/ICCV51070.2023.02138 (Conference)