Unimore logo AImageLab

Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences

Matteo Stefanini; Marcella Cornia; Lorenzo Baraldi; Massimiliano Corsini; Rita Cucchiara


As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset that contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset.


Artpedia contains a collection of 2,930 painting images, each associated to a variable number of textual descriptions. Each sentence is labelled either as a visual sentence or as a contextual sentence, if does not describe the visual content of the artwork. Contextual sentences can describe the historical context of the artwork, its author, the artistic influence or the place where the painting is exhibited. As in standard cross-modal datasets, the association between sentences and painting is also provided. Overall, the dataset contains a total of 28,212 sentences, 9,173 labelled as visual sentences and the remaining 19,039 as contextual sentences. On average, each painting is associated with 3.1 visual and 6.5 contextual sentences.

Download: artpedia.zip


Please cite with the following BibTeX:

  	  title={{Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences}},
  	  author={Stefanini, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Corsini, Massimiliano and Cucchiara, Rita},
  	  booktitle={Proceedings of the International Conference on Image Analysis and Processing},


If you have any question, drop us an e-mail at matteo.stefanini [at] unimore.it or marcella.cornia [at] unimore.it.