Publication: Investigating contributions of speech and facial landmarks for talking head generation
dc.contributor.department | N/A | |
dc.contributor.department | Department of Computer Engineering | |
dc.contributor.kuauthor | Kesim, Ege | |
dc.contributor.kuauthor | Erzin, Engin | |
dc.contributor.kuprofile | Faculty Member | |
dc.contributor.other | Department of Computer Engineering | |
dc.contributor.schoolcollegeinstitute | Graduate School of Sciences and Engineering | |
dc.contributor.schoolcollegeinstitute | College of Engineering | |
dc.contributor.yokid | N/A | |
dc.contributor.yokid | 34503 | |
dc.date.accessioned | 2024-11-09T13:10:40Z | |
dc.date.issued | 2021 | |
dc.description.abstract | Talking head generation is an active research problem. It has been widely studied as a direct speech-to-video or a two-stage speech-to-landmarks-to-video mapping problem. In this study, our main motivation is to assess the individual and joint contributions of speech and facial landmarks to talking head generation quality through a state-of-the-art generative adversarial network (GAN) architecture. Incorporating frame and sequence discriminators and a feature matching loss, we investigate the performance of speech-only, landmark-only, and joint speech- and landmark-driven talking head generation on the CREMA-D dataset. Objective evaluations using the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and landmark distance (LMD) indicate that while landmarks bring PSNR and SSIM improvements to the speech-driven system, speech brings an LMD improvement to the landmark-driven system. Furthermore, feature matching is observed to improve the speech-driven talking head generation models significantly. | |
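The abstract evaluates generated frames with PSNR, SSIM, and landmark distance (LMD). As a minimal illustrative sketch (not the authors' code), the PSNR and LMD metrics can be computed with NumPy as below; function names are illustrative, frames are assumed to be uint8 images in [0, 255], and landmarks are assumed to be (N, 2) arrays of (x, y) points.

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    """Peak signal-to-noise ratio between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def landmark_distance(ref_pts, gen_pts):
    """LMD: mean Euclidean distance between corresponding facial landmarks."""
    return float(np.mean(np.linalg.norm(
        np.asarray(ref_pts, dtype=np.float64) - np.asarray(gen_pts, dtype=np.float64),
        axis=-1)))
```

SSIM is omitted here because its windowed computation is longer; in practice one would use an existing implementation such as `skimage.metrics.structural_similarity`.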
dc.description.fulltext | YES | |
dc.description.indexedby | Scopus | |
dc.description.openaccess | YES | |
dc.description.publisherscope | International | |
dc.description.sponsoredbyTubitakEu | N/A | |
dc.description.sponsorship | N/A | |
dc.description.version | Publisher version | |
dc.format | ||
dc.identifier.doi | 10.21437/Interspeech.2021-1585 | |
dc.identifier.embargo | NO | |
dc.identifier.filenameinventoryno | IR03341 | |
dc.identifier.isbn | 9.78171E+12 | |
dc.identifier.issn | 2308-457X | |
dc.identifier.link | https://doi.org/10.21437/Interspeech.2021-1585 | |
dc.identifier.quartile | N/A | |
dc.identifier.scopus | 2-s2.0-85119170398 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14288/2823 | |
dc.keywords | Speech driven animation | |
dc.keywords | Talking head generation | |
dc.language | English | |
dc.publisher | International Speech Communication Association (ISCA) | |
dc.relation.grantno | N/A | |
dc.relation.uri | http://cdm21054.contentdm.oclc.org/cdm/ref/collection/IR/id/10127 | |
dc.source | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | |
dc.title | Investigating contributions of speech and facial landmarks for talking head generation | |
dc.type | Conference proceeding | |
dspace.entity.type | Publication | |
local.contributor.authorid | N/A | |
local.contributor.authorid | 0000-0002-2715-2368 | |
local.contributor.kuauthor | Kesim, Ege | |
local.contributor.kuauthor | Erzin, Engin | |
relation.isOrgUnitOfPublication | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
relation.isOrgUnitOfPublication.latestForDiscovery | 89352e43-bf09-4ef4-82f6-6f9d0174ebae |