Publication:
Multimodal speaker/speech recognition using lip motion, lip texture and audio

dc.contributor.department: N/A
dc.contributor.department: Department of Computer Engineering
dc.contributor.department: Department of Computer Engineering
dc.contributor.department: Department of Computer Engineering
dc.contributor.kuauthor: Çetingül, Hasan Ertan
dc.contributor.kuauthor: Erzin, Engin
dc.contributor.kuauthor: Yemez, Yücel
dc.contributor.kuauthor: Tekalp, Ahmet Murat
dc.contributor.kuprofile: Master Student
dc.contributor.kuprofile: Faculty Member
dc.contributor.kuprofile: Faculty Member
dc.contributor.kuprofile: Faculty Member
dc.contributor.other: Department of Computer Engineering
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.yokid: N/A
dc.contributor.yokid: 34503
dc.contributor.yokid: 107907
dc.contributor.yokid: 26207
dc.date.accessioned: 2024-11-09T23:14:57Z
dc.date.issued: 2006
dc.description.abstract: We present a new multimodal speaker/speech recognition system that integrates the audio, lip texture, and lip motion modalities. Fusion of the audio and face texture modalities has been investigated in the literature before; the emphasis of this work is to investigate the benefits of including the lip motion modality for two distinct cases: speaker recognition and speech recognition. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) along with their first and second derivatives, whereas the lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box around the lip region. In this paper, we employ a new lip motion representation based on discriminative analysis of the dense motion vectors within the same bounding box for speaker/speech recognition. The fusion of the audio, lip texture, and lip motion modalities is performed by the so-called reliability weighted summation (RWS) decision rule. Experimental results show that including the lip motion modality provides further performance gains over those obtained by fusing audio and lip texture alone, in both speaker identification and isolated word recognition scenarios. (c) 2006 Published by Elsevier B.V.
dc.description.indexedby: WoS
dc.description.indexedby: Scopus
dc.description.issue: 12
dc.description.openaccess: NO
dc.description.volume: 86
dc.identifier.doi: 10.1016/j.sigpro.2006.02.045
dc.identifier.eissn: 1872-7557
dc.identifier.issn: 0165-1684
dc.identifier.scopus: 2-s2.0-33749436578
dc.identifier.uri: http://dx.doi.org/10.1016/j.sigpro.2006.02.045
dc.identifier.uri: https://hdl.handle.net/20.500.14288/10249
dc.identifier.wos: 242182700004
dc.keywords: Speaker identification
dc.keywords: Isolated word recognition
dc.keywords: Lip reading
dc.keywords: Lip motion
dc.keywords: Decision fusion
dc.keywords: Identification
dc.keywords: Speech
dc.keywords: Face
dc.language: English
dc.publisher: Elsevier
dc.source: Signal Processing
dc.subject: Engineering
dc.subject: Electrical electronic engineering
dc.title: Multimodal speaker/speech recognition using lip motion, lip texture and audio
dc.type: Journal Article
dspace.entity.type: Publication
local.contributor.authorid: N/A
local.contributor.authorid: 0000-0002-2715-2368
local.contributor.authorid: 0000-0002-7515-3138
local.contributor.authorid: 0000-0003-1465-8121
local.contributor.kuauthor: Çetingül, Hasan Ertan
local.contributor.kuauthor: Erzin, Engin
local.contributor.kuauthor: Yemez, Yücel
local.contributor.kuauthor: Tekalp, Ahmet Murat
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
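
Note on the fusion rule: the abstract above describes a reliability weighted summation (RWS) decision rule that combines per-class scores from the audio, lip texture, and lip motion modalities. The Python sketch below illustrates the general idea only; the function name, the min-max score normalization, and the fixed reliability weights are illustrative assumptions of this sketch, not the paper's exact formulation.

import numpy as np

def rws_fuse(modality_scores, reliabilities):
    """Illustrative reliability weighted summation (RWS) fusion.

    modality_scores: list of (num_classes,) score arrays, one per
        modality (e.g., audio, lip texture, lip motion).
    reliabilities: list of nonnegative weights, one per modality,
        assumed here to sum to 1.
    Returns the index of the winning class.
    """
    fused = np.zeros(len(modality_scores[0]), dtype=float)
    for scores, weight in zip(modality_scores, reliabilities):
        s = np.asarray(scores, dtype=float)
        # Min-max normalize so scores from different modalities are
        # on a comparable scale before weighting (an assumption of
        # this sketch; the paper may normalize differently).
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)
        fused += weight * s
    return int(np.argmax(fused))

# Hypothetical per-class scores for a 3-class task:
audio   = np.array([2.1, 0.3, 1.7])
texture = np.array([1.2, 0.9, 2.4])
motion  = np.array([0.5, 0.2, 2.0])
print(rws_fuse([audio, texture, motion], [0.5, 0.3, 0.2]))

In practice the reliability weights would be estimated from each modality's confidence on the given input rather than fixed a priori; consult the paper at the DOI above for the exact weighting scheme.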

Files