What Vision-Language Models 'See' when they See Scenes

Cafagna, Michele; van Deemter, Kees; Gatt, A.

DOI: https://doi.org/10.48550/arXiv.2109.07301

(2021) Utrecht University Repository

(Preprint)

Abstract

Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of both types with images. We compare 3 ...

Publisher: arXiv