Publication: Spherical vision transformers for audio-visual saliency prediction in 360° videos
| dc.contributor.coauthor | Ozsoy, Halit | |
| dc.contributor.coauthor | Imamoglu, Nevrez Efe | |
| dc.contributor.coauthor | Ozcinar, Cagri | |
| dc.contributor.coauthor | Ayhan, Inci | |
| dc.contributor.coauthor | Erdem, Erkut | |
| dc.contributor.department | Graduate School of Sciences and Engineering | |
| dc.contributor.department | Department of Computer Engineering | |
| dc.contributor.kuauthor | Çökelek, Mert | |
| dc.contributor.kuauthor | Erdem, Aykut | |
| dc.contributor.schoolcollegeinstitute | GRADUATE SCHOOL OF SCIENCES AND ENGINEERING | |
| dc.contributor.schoolcollegeinstitute | College of Engineering | |
| dc.date.accessioned | 2025-12-31T08:24:29Z | |
| dc.date.available | 2025-12-31 | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at: https://cyberiada.github.io/SalViT360/. © 2025 IEEE. All rights reserved. | |
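The abstract describes SalViT360-AV as adding transformer adapters conditioned on audio input to a frozen visual backbone. As a rough illustration only, the sketch below shows one common way such an audio-conditioned bottleneck adapter can be wired up; the class name AudioConditionedAdapter, the sigmoid-gating scheme, and all dimensions are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of an audio-conditioned bottleneck adapter.
# All names and the gating design are illustrative assumptions,
# not the published SalViT360-AV architecture.
import torch
import torch.nn as nn

class AudioConditionedAdapter(nn.Module):
    """Bottleneck adapter whose hidden activation is gated by an
    audio embedding; inserted after a frozen transformer block."""

    def __init__(self, dim: int, audio_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)               # project tokens down
        self.up = nn.Linear(bottleneck, dim)                 # project back up
        self.audio_proj = nn.Linear(audio_dim, bottleneck)   # audio -> gate
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim); audio: (batch, audio_dim)
        h = self.act(self.down(tokens))
        gate = torch.sigmoid(self.audio_proj(audio)).unsqueeze(1)
        h = h * gate                     # audio-conditioned gating
        return tokens + self.up(h)       # residual keeps frozen features intact

# Usage: adapt frozen ViT features with a pooled audio embedding.
adapter = AudioConditionedAdapter(dim=768, audio_dim=128)
x = torch.randn(2, 196, 768)   # visual tokens from one ViT block
a = torch.randn(2, 128)        # e.g., a pooled spectrogram embedding
y = adapter(x, a)              # (2, 196, 768)
```

Because only the small adapter layers are trained while the backbone stays frozen, this style of fine-tuning keeps the parameter count low, which is the general appeal of adapter-based conditioning.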
| dc.description.fulltext | Yes | |
| dc.description.harvestedfrom | Manual | |
| dc.description.indexedby | Scopus | |
| dc.description.indexedby | PubMed | |
| dc.description.publisherscope | International | |
| dc.description.readpublish | N/A | |
| dc.description.sponsoredbyTubitakEu | TÜBİTAK | |
| dc.description.sponsorship | This work was supported in part by the KUIS AI Center Research Award, Unvest R&D Center, TÜBİTAK-1001 Program (No. 120E501), the TÜBA-GEBİP 2018 Award to E. Erdem, and the BAGEP 2021 Award to A. Erdem. | |
| dc.identifier.doi | 10.1109/TPAMI.2025.3604091 | |
| dc.identifier.embargo | No | |
| dc.identifier.grantno | 120E501 | |
| dc.identifier.issn | 0162-8828 | |
| dc.identifier.pubmed | 40880339 | |
| dc.identifier.quartile | Q1 | |
| dc.identifier.scopus | 2-s2.0-105014780882 | |
| dc.identifier.uri | https://doi.org/10.1109/TPAMI.2025.3604091 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14288/31786 | |
| dc.keywords | 360° videos | |
| dc.keywords | Adapter fine-tuning | |
| dc.keywords | Audio-visual saliency prediction | |
| dc.keywords | Vision transformers | |
| dc.keywords | Audio acoustics | |
| dc.keywords | Computer vision | |
| dc.keywords | Forecasting | |
| dc.keywords | User experience | |
| dc.keywords | User interfaces | |
| dc.keywords | Virtual reality | |
| dc.keywords | Visualization | |
| dc.keywords | 360° video | |
| dc.keywords | Audio-visual | |
| dc.keywords | Eye-tracking | |
| dc.keywords | Field of views | |
| dc.keywords | Fine tuning | |
| dc.keywords | Spatial audio | |
| dc.keywords | Vision transformer | |
| dc.keywords | Visual saliency | |
| dc.keywords | Spheres | |
| dc.language.iso | eng | |
| dc.publisher | IEEE Computer Society | |
| dc.relation.affiliation | Koç University | |
| dc.relation.collection | Koç University Institutional Repository | |
| dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | |
| dc.relation.openaccess | Yes | |
| dc.rights | CC BY-NC-ND (Attribution-NonCommercial-NoDerivs) | |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.subject | Engineering | |
| dc.subject | Computer science | |
| dc.title | Spherical vision transformers for audio-visual saliency prediction in 360° videos | |
| dc.type | Journal Article | |
| dspace.entity.type | Publication | |
| person.familyName | Çökelek | |
| person.familyName | Erdem | |
| person.givenName | Mert | |
| person.givenName | Aykut | |
| relation.isOrgUnitOfPublication | 3fc31c89-e803-4eb1-af6b-6258bc42c3d8 | |
| relation.isOrgUnitOfPublication | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 3fc31c89-e803-4eb1-af6b-6258bc42c3d8 | |
| relation.isParentOrgUnitOfPublication | 434c9663-2b11-4e66-9399-c863e2ebae43 | |
| relation.isParentOrgUnitOfPublication | 8e756b23-2d4a-4ce8-b1b3-62c794a8c164 | |
| relation.isParentOrgUnitOfPublication.latestForDiscovery | 434c9663-2b11-4e66-9399-c863e2ebae43 |
