
The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis
Abstract: Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.
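As an illustration of the setup studied in the paper (not the authors' code), the following minimal sketch shows how visual features can be extracted from a pre-trained CLIP encoder and then handed to a captioning decoder in place of detector-based region features. It assumes the official `clip` package (pip install git+https://github.com/openai/CLIP.git) and a hypothetical image file `example.jpg`.

```python
# Minimal sketch, assuming the official OpenAI `clip` package and PyTorch.
# It extracts CLIP image features that a captioning decoder (e.g. a
# Transformer language model) could attend over instead of object-detector
# regions; the image path and chosen backbone are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x16", device=device)  # one CLIP variant among those compared

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    # Global image embedding from CLIP's visual encoder; shape is [1, D],
    # where D depends on the CLIP variant chosen above.
    features = model.encode_image(image)

print(features.shape)
```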
Citation:
Barraco, Manuele; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita, "The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis," IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, Louisiana, pp. 4661-4669, June 19-24, 2022. DOI: 10.1109/CVPRW56347.2022.00512
Paper download:
- Author version:
- DOI: 10.1109/CVPRW56347.2022.00512