Publication: Leveraging auxiliary image descriptions for dense video captioning
dc.contributor.coauthor | Boran, Emre | |
dc.contributor.coauthor | İkizler-Cinbiş, Nazlı | |
dc.contributor.coauthor | Erdem, Erkut | |
dc.contributor.coauthor | Madhyastha, Pranava | |
dc.contributor.coauthor | Specia, Lucia | |
dc.contributor.department | Department of Computer Engineering | |
dc.contributor.kuauthor | Erdem, Aykut | |
dc.contributor.kuprofile | Faculty Member | |
dc.contributor.schoolcollegeinstitute | College of Engineering | |
dc.contributor.yokid | 20331 | |
dc.date.accessioned | 2024-11-09T22:51:43Z | |
dc.date.issued | 2021 | |
dc.description.abstract | Collecting textual descriptions is an especially costly task for dense video captioning, since each event in the video needs to be annotated separately and a long descriptive paragraph needs to be provided. In this paper, we investigate a way to mitigate this heavy burden and propose to leverage captions of visually similar images as auxiliary context. Our model successfully fetches visually relevant images and combines noun and verb phrases from their captions to generate coherent descriptions. To this end, we use a generator and discriminator design, together with an attention-based fusion technique, to incorporate image captions as context in the video caption generation process. Experiments on the challenging ActivityNet Captions dataset demonstrate that our proposed approach produces more accurate and more diverse video descriptions than a strong baseline, as measured by METEOR, BLEU and CIDEr-D metrics and qualitative evaluations. (c) 2021 Published by Elsevier B.V. | |
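As a rough illustration of the attention-based fusion mentioned in the abstract, the sketch below attends over embeddings of retrieved image captions using the video representation as the query, then fuses the attended context back into the video feature. It is a minimal sketch under assumed dimensions and inputs, not the authors' implementation; the class name, layer sizes, and tensors are all hypothetical.

```python
# Illustrative sketch only (not the published model): attention-based fusion
# of auxiliary image-caption embeddings with a video feature vector.
import torch
import torch.nn as nn


class CaptionContextFusion(nn.Module):
    """Attend over retrieved image-caption embeddings with the video feature
    as the query, then fuse the attended context with the video feature."""

    def __init__(self, video_dim=512, caption_dim=300, hidden_dim=512):
        super().__init__()
        self.query = nn.Linear(video_dim, hidden_dim)
        self.key = nn.Linear(caption_dim, hidden_dim)
        self.value = nn.Linear(caption_dim, hidden_dim)
        self.fuse = nn.Linear(video_dim + hidden_dim, hidden_dim)

    def forward(self, video_feat, caption_feats):
        # video_feat: (batch, video_dim); caption_feats: (batch, n_captions, caption_dim)
        q = self.query(video_feat).unsqueeze(1)        # (batch, 1, hidden)
        k = self.key(caption_feats)                    # (batch, n, hidden)
        v = self.value(caption_feats)                  # (batch, n, hidden)
        # Scaled dot-product attention weights over the retrieved captions.
        scores = torch.softmax((q * k).sum(-1, keepdim=True) / k.size(-1) ** 0.5, dim=1)
        context = (scores * v).sum(1)                  # (batch, hidden)
        # Concatenate video feature and caption context, then project.
        return torch.tanh(self.fuse(torch.cat([video_feat, context], dim=-1)))


# Usage with random placeholder tensors (2 videos, 5 retrieved captions each).
fusion = CaptionContextFusion()
fused = fusion(torch.randn(2, 512), torch.randn(2, 5, 300))
print(fused.shape)  # torch.Size([2, 512])
```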
dc.description.indexedby | WoS | |
dc.description.indexedby | Scopus | |
dc.description.openaccess | NO | |
dc.description.publisherscope | International | |
dc.description.sponsorship | TUBA GEBIP fellowship | |
dc.description.sponsorship | MMVC project - TUBITAK | |
dc.description.sponsorship | British Council via the Newton Fund Institutional Links grant [219E054, 352343575] | |
dc.description.sponsorship | TUBITAK through 2210-A fellowship program. This work was supported in part by a TUBA GEBIP fellowship awarded to E. Erdem, and the MMVC project funded by TUBITAK and the British Council via the Newton Fund Institutional Links grant programme (Grant ID 219E054 and 352343575). E. Boran received support from TUBITAK through the 2210-A fellowship program. | |
dc.description.volume | 146 | |
dc.identifier.doi | 10.1016/j.patrec.2021.02.009 | |
dc.identifier.eissn | 1872-7344 | |
dc.identifier.issn | 0167-8655 | |
dc.identifier.quartile | Q2 | |
dc.identifier.scopus | 2-s2.0-85103245218 | |
dc.identifier.uri | http://dx.doi.org/10.1016/j.patrec.2021.02.009 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14288/6881 | |
dc.identifier.wos | 646018400010 | |
dc.keywords | Video captioning | |
dc.keywords | Adversarial training | |
dc.keywords | Attention | |
dc.language | English | |
dc.publisher | Elsevier | |
dc.source | Pattern Recognition Letters | |
dc.subject | Computer science | |
dc.subject | Artificial intelligence | |
dc.title | Leveraging auxiliary image descriptions for dense video captioning | |
dc.type | Journal Article | |
dspace.entity.type | Publication | |
local.contributor.authorid | 0000-0002-6280-8422 | |
local.contributor.kuauthor | Erdem, Aykut | |
relation.isOrgUnitOfPublication | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
relation.isOrgUnitOfPublication.latestForDiscovery | 89352e43-bf09-4ef4-82f6-6f9d0174ebae |