Publication: Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos
| dc.contributor.coauthor | Cokelek, Mert (57240049200) | |
| dc.contributor.coauthor | Ozsoy, Halit (60080191600) | |
| dc.contributor.coauthor | Imamoglu, Nevrez Efe (38061650600) | |
| dc.contributor.coauthor | Ozcinar, Cagri (35847657400) | |
| dc.contributor.coauthor | Ayhan, Inci (35087819700) | |
| dc.contributor.coauthor | Erdem, Erkut (13410837300) | |
| dc.contributor.coauthor | Erdem, Aykut (13410510300) | |
| dc.date.accessioned | 2025-12-31T08:24:29Z | |
| dc.date.available | 2025-12-31 | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at: https://cyberiada.github.io/SalViT360/. © 2025 Elsevier B.V., All rights reserved. | |
| dc.description.fulltext | Yes | |
| dc.description.harvestedfrom | Manual | |
| dc.description.indexedby | Scopus | |
| dc.description.publisherscope | International | |
| dc.description.readpublish | N/A | |
| dc.description.sponsoredbyTubitakEu | N/A | |
| dc.identifier.doi | 10.1109/TPAMI.2025.3604091 | |
| dc.identifier.embargo | No | |
| dc.identifier.issn | 0162-8828 | |
| dc.identifier.quartile | N/A | |
| dc.identifier.scopus | 2-s2.0-105014780882 | |
| dc.identifier.uri | https://doi.org/10.1109/TPAMI.2025.3604091 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14288/31786 | |
| dc.keywords | 360° Videos | |
| dc.keywords | Adapter Fine-tuning | |
| dc.keywords | Audio-visual Saliency Prediction | |
| dc.keywords | Vision Transformers | |
| dc.keywords | Audio Acoustics | |
| dc.keywords | Computer Vision | |
| dc.keywords | Forecasting | |
| dc.keywords | User Experience | |
| dc.keywords | User Interfaces | |
| dc.keywords | Virtual Reality | |
| dc.keywords | Visualization | |
| dc.keywords | 360° Video | |
| dc.keywords | Audio-visual | |
| dc.keywords | Eye-tracking | |
| dc.keywords | Field Of Views | |
| dc.keywords | Fine Tuning | |
| dc.keywords | Spatial Audio | |
| dc.keywords | Vision Transformer | |
| dc.keywords | Visual Saliency | |
| dc.keywords | Spheres | |
| dc.language.iso | eng | |
| dc.publisher | IEEE Computer Society | |
| dc.relation.affiliation | Koç University | |
| dc.relation.collection | Koç University Institutional Repository | |
| dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | |
| dc.relation.openaccess | Yes | |
| dc.rights | CC BY-NC-ND (Attribution-NonCommercial-NoDerivs) | |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.title | Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos | |
| dc.type | Journal Article | |
| dspace.entity.type | Publication |
