
Augmenting and Mixing Transformers with Synthetic Data for Image Captioning

Abstract: Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.
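The Mixup-style blending of real and synthetic images described in the abstract can be sketched as follows. This is a minimal illustration only: the function name, the Beta parameter `alpha=0.2`, and the use of NumPy arrays are assumptions, not details taken from the paper.

```python
import numpy as np

def mixup_real_synthetic(real_img, synth_img, alpha=0.2, rng=None):
    """Blend a real image with a synthetic counterpart via Mixup.

    `alpha` parameterizes the Beta distribution that samples the
    mixing coefficient; 0.2 is a common default, not the paper's value.
    """
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))  # mixing coefficient in [0, 1]
    # Convex combination of the two images (same shape, float dtype).
    mixed = lam * real_img + (1.0 - lam) * synth_img
    return mixed, lam

# Usage with dummy arrays standing in for a COCO image and its
# Stable Diffusion counterpart.
real = np.random.rand(224, 224, 3)
synth = np.random.rand(224, 224, 3)
mixed, lam = mixup_real_synthetic(real, synth)
```

In standard Mixup the coefficient drawn from Beta(α, α) weights the two inputs, so the resulting training image interpolates between the real photo and its synthetic version.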


Citation:

Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. "Augmenting and Mixing Transformers with Synthetic Data for Image Captioning." Image and Vision Computing, vol. 162, pp. 1-31, 2025. DOI: 10.1016/j.imavis.2025.105661

