Publication:
Leveraging auxiliary image descriptions for dense video captioning

dc.contributor.coauthorBoran, Emre
dc.contributor.coauthorİkizler-Cinbiş, Nazlı
dc.contributor.coauthorErdem, Erkut
dc.contributor.coauthorMadhyastha, Pranava
dc.contributor.coauthorSpecia, Lucia
dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.kuauthorErdem, Aykut
dc.contributor.kuprofileFaculty Member
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.contributor.yokid20331
dc.date.accessioned2024-11-09T22:51:43Z
dc.date.issued2021
dc.description.abstractCollecting textual descriptions is an especially costly task for dense video captioning, since each event in the video needs to be annotated separately and a long descriptive paragraph needs to be provided. In this paper, we investigate a way to mitigate this heavy burden and propose to leverage captions of visually similar images as auxiliary context. Our model successfully fetches visually relevant images and combines noun and verb phrases from their captions to generate coherent descriptions. To this end, we use a generator and discriminator design, together with an attention-based fusion technique, to incorporate image captions as context in the video caption generation process. Experiments on the challenging ActivityNet Captions dataset demonstrate that our proposed approach produces more accurate and more diverse video descriptions than a strong baseline, as measured by METEOR, BLEU and CIDEr-D metrics and qualitative evaluations. (c) 2021 Published by Elsevier B.V.
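dc.description.noteThe attention-based fusion described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; it is an assumption-laden illustration of the general idea: a single decoder step attends separately over video features and over embeddings of noun/verb phrases retrieved from captions of visually similar images, then gates the two context vectors together before predicting the next word. All class names, dimensions, and the gating scheme are hypothetical.

# Minimal sketch (assumed design, not the paper's code) of attention-based
# fusion of auxiliary image-caption phrases into a video caption decoder step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveFusionDecoderStep(nn.Module):
    def __init__(self, hidden_dim=512, video_dim=1024, phrase_dim=300, vocab_size=10000):
        super().__init__()
        # Project both modalities into the decoder's hidden space.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.phrase_proj = nn.Linear(phrase_dim, hidden_dim)
        # Separate additive-attention scorers per modality (illustrative choice).
        self.video_att = nn.Linear(hidden_dim * 2, 1)
        self.phrase_att = nn.Linear(hidden_dim * 2, 1)
        # Gate deciding how much auxiliary caption context to mix in (assumed).
        self.fuse_gate = nn.Linear(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def attend(self, query, keys, scorer):
        # query: (B, H); keys: (B, N, H) -> context vector: (B, H)
        q = query.unsqueeze(1).expand(-1, keys.size(1), -1)
        scores = scorer(torch.cat([q, keys], dim=-1)).squeeze(-1)   # (B, N)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)

    def forward(self, hidden, video_feats, phrase_embs):
        # hidden: decoder state (B, H); video_feats: (B, T, Dv);
        # phrase_embs: (B, P, Dp) noun/verb phrases from retrieved image captions.
        v_ctx = self.attend(hidden, self.video_proj(video_feats), self.video_att)
        p_ctx = self.attend(hidden, self.phrase_proj(phrase_embs), self.phrase_att)
        gate = torch.sigmoid(self.fuse_gate(torch.cat([v_ctx, p_ctx], dim=-1)))
        fused = gate * v_ctx + (1.0 - gate) * p_ctx
        return self.out(torch.cat([hidden, fused], dim=-1))        # next-word logits

# Example usage with random tensors.
step = AttentiveFusionDecoderStep()
logits = step(torch.randn(2, 512), torch.randn(2, 20, 1024), torch.randn(2, 8, 300))
print(logits.shape)  # torch.Size([2, 10000])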
dc.description.indexedbyWoS
dc.description.indexedbyScopus
dc.description.openaccessNO
dc.description.publisherscopeInternational
dc.description.sponsorshipTUBA GEBIP fellowship
dc.description.sponsorshipMMVC project - TUBITAK
dc.description.sponsorshipBritish Council via the Newton Fund Institutional Links grant [219E054, 352343575]
dc.description.sponsorshipTUBITAK through 2210-A fellowship program. This work was supported in part by a TUBA GEBIP fellowship awarded to E. Erdem, and the MMVC project funded by TUBITAK and the British Council via the Newton Fund Institutional Links grant programme (Grant ID 219E054 and 352343575). E. Boran received support from TUBITAK through the 2210-A fellowship program.
dc.description.volume146
dc.identifier.doi10.1016/j.patrec.2021.02.009
dc.identifier.eissn1872-7344
dc.identifier.issn0167-8655
dc.identifier.quartileQ2
dc.identifier.scopus2-s2.0-85103245218
dc.identifier.urihttp://dx.doi.org/10.1016/j.patrec.2021.02.009
dc.identifier.urihttps://hdl.handle.net/20.500.14288/6881
dc.identifier.wos646018400010
dc.keywordsVideo captioning
dc.keywordsAdversarial training
dc.keywordsAttention
dc.languageEnglish
dc.publisherElsevier
dc.sourcePattern Recognition Letters
dc.subjectComputer science
dc.subjectArtificial intelligence
dc.titleLeveraging auxiliary image descriptions for dense video captioning
dc.typeJournal Article
dspace.entity.typePublication
local.contributor.authorid0000-0002-6280-8422
local.contributor.kuauthorErdem, Aykut
relation.isOrgUnitOfPublication89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery89352e43-bf09-4ef4-82f6-6f9d0174ebae