Publication:
Spherical vision transformers for audio-visual saliency prediction in 360° videos

dc.contributor.coauthorOzsoy, Halit
dc.contributor.coauthorImamoglu, Nevrez Efe
dc.contributor.coauthorOzcinar, Cagri
dc.contributor.coauthorAyhan, Inci
dc.contributor.coauthorErdem, Erkut
dc.contributor.departmentGraduate School of Sciences and Engineering
dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.kuauthorÇökelek, Mert
dc.contributor.kuauthorErdem, Aykut
dc.contributor.schoolcollegeinstituteGRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.date.accessioned2025-12-31T08:24:29Z
dc.date.available2025-12-31
dc.date.issued2025
dc.description.abstractOmnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at: https://cyberiada.github.io/SalViT360/. © 2025 IEEE, All rights reserved.
dc.description.fulltextYes
dc.description.harvestedfromManual
dc.description.indexedbyScopus
dc.description.indexedbyPubMed
dc.description.publisherscopeInternational
dc.description.readpublishN/A
dc.description.sponsoredbyTubitakEuTÜBİTAK
dc.description.sponsorshipThis work was supported in part by the KUIS AI Center Research Award, Unvest R&D Center, TÜBİTAK-1001 Program (No. 120E501), the TÜBA-GEBİP 2018 Award to E. Erdem, and the BAGEP 2021 Award to A. Erdem.
dc.identifier.doi10.1109/TPAMI.2025.3604091
dc.identifier.embargoNo
dc.identifier.grantno120E501
dc.identifier.issn0162-8828
dc.identifier.pubmed40880339
dc.identifier.quartileQ1
dc.identifier.scopus2-s2.0-105014780882
dc.identifier.urihttps://doi.org/10.1109/TPAMI.2025.3604091
dc.identifier.urihttps://hdl.handle.net/20.500.14288/31786
dc.keywords360° videos
dc.keywordsAdapter fine-tuning
dc.keywordsAudio-visual saliency prediction
dc.keywordsVision transformers
dc.keywordsAudio acoustics
dc.keywordsComputer vision
dc.keywordsForecasting
dc.keywordsUser experience
dc.keywordsUser interfaces
dc.keywordsVirtual reality
dc.keywordsVisualization
dc.keywords360° video
dc.keywordsAudio-visual
dc.keywordsEye-tracking
dc.keywordsField of views
dc.keywordsFine tuning
dc.keywordsSpatial audio
dc.keywordsVision transformer
dc.keywordsVisual saliency
dc.keywordsSpheres
dc.language.isoeng
dc.publisherIEEE Computer Society
dc.relation.affiliationKoç University
dc.relation.collectionKoç University Institutional Repository
dc.relation.ispartofIEEE Transactions on Pattern Analysis and Machine Intelligence
dc.relation.openaccessYes
dc.rightsCC BY-NC-ND (Attribution-NonCommercial-NoDerivs)
dc.rights.urihttps://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subjectEngineering
dc.subjectComputer science
dc.titleSpherical vision transformers for audio-visual saliency prediction in 360° videos
dc.typeJournal Article
dspace.entity.typePublication
person.familyNameÇökelek
person.familyNameErdem
person.givenNameMert
person.givenNameAykut
relation.isOrgUnitOfPublication3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isParentOrgUnitOfPublication434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery434c9663-2b11-4e66-9399-c863e2ebae43