Publication: Spherical vision transformers for audio-visual saliency prediction in 360° videos
Program
KU-Authors
Co-Authors
Ozsoy, Halit
Imamoglu, Nevrez Efe
Ozcinar, Cagri
Ayhan, Inci
Erdem, Erkut
Publication Date
Language
Type
Embargo Status
No
Journal Title
Journal ISSN
Volume Title
Alternative Title
Abstract
Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at: https://cyberiada.github.io/SalViT360/. © 2025 Elsevier B.V., All rights reserved.
Source
Publisher
IEEE Computer Society
Subject
Engineering, Computer science
Citation
Has Part
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Book Series Title
Edition
DOI
10.1109/TPAMI.2025.3604091
Link
Rights
CC BY-NC-ND (Attribution-NonCommercial-NoDerivs)
Copyrights Note
Creative Commons license
Except where otherwise noted, this item's license is described as CC BY-NC-ND (Attribution-NonCommercial-NoDerivs)

