Publication:
ViLMA: A zero-shot benchmark for linguistic and temporal grounding in video-language models

School / College / Institute

Organizational Unit

Program

KU Authors

Co-Authors

Pedrotti, Andrea
Dogan, Mustafa
Cafagna, Michele
Parcalabescu, Letitia
Calixto, Iacer
Frank, Anette
Gatt, Albert

Publication Date

Language

Embargo Status

Journal Title

Journal ISSN

Volume Title

Alternative Title

Abstract

With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
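
To make the zero-shot evaluation protocol described in the abstract concrete, the sketch below shows one plausible way such counterfactual tests could be scored: a model passes an example when it assigns a higher video-text matching score to the correct caption than to its counterfactual foil, and a combined metric only credits examples whose accompanying proficiency test is also passed. The function names and the (true_score, foil_score) pair interface are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import List, Tuple

def pairwise_accuracy(scores: List[Tuple[float, float]]) -> float:
    """Fraction of examples where the correct caption outscores its counterfactual foil.

    Each item is an assumed (true_caption_score, foil_score) pair produced by
    some VidLM's video-text matching head; this interface is hypothetical.
    """
    if not scores:
        return 0.0
    return sum(1 for true_s, foil_s in scores if true_s > foil_s) / len(scores)

def combined_accuracy(proficiency: List[Tuple[float, float]],
                      main: List[Tuple[float, float]]) -> float:
    """Credit an example only when the model passes both its proficiency test
    and the main counterfactual test, mirroring the idea of factoring
    proficiency performance into the headline result."""
    if not main:
        return 0.0
    passed = sum(
        1
        for (p_true, p_foil), (m_true, m_foil) in zip(proficiency, main)
        if p_true > p_foil and m_true > m_foil
    )
    return passed / len(main)
```

Under this reading, a drop from pairwise_accuracy to combined_accuracy would indicate that a model sometimes solves the counterfactual test without demonstrating the basic capability it presupposes.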

Publisher

International Conference on Learning Representations, ICLR

Subject

Electrical and electronics engineering

Citation

Has Part

Source

12th International Conference on Learning Representations, ICLR 2024

Book Series Title

Edition

DOI

Link

Rights

Copyrights Note
