Publication:
Multimodal speaker identification with audio-video processing

dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.departmentN/A
dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.departmentDepartment of Electrical and Electronics Engineering
dc.contributor.kuauthorYemez, Yücel
dc.contributor.kuauthorKanak, Alper
dc.contributor.kuauthorErzin, Engin
dc.contributor.kuauthorTekalp, Ahmet Murat
dc.contributor.kuprofileFaculty Member
dc.contributor.kuprofileMaster Student
dc.contributor.kuprofileFaculty Member
dc.contributor.kuprofileFaculty Member
dc.contributor.otherDepartment of Computer Engineering
dc.contributor.otherDepartment of Electrical and Electronics Engineering
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.contributor.schoolcollegeinstituteGraduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.contributor.yokid107907
dc.contributor.yokidN/A
dc.contributor.yokid34503
dc.contributor.yokid26207
dc.date.accessioned2024-11-10T00:06:40Z
dc.date.issued2003
dc.description.abstractIn this paper we present a multimodal audio-visual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system decomposes the information existing in a video stream into three components: speech, face texture and lip motion. Lip motion between successive frames is first computed in terms of optical flow vectors and then encoded as a feature vector in a magnitude-direction histogram domain. The feature vectors obtained along the whole stream are then interpolated to match the rate of the speech signal and fused with mel-frequency cepstral coefficients (MFCC) of the corresponding speech signal. The resulting joint feature vectors are used to train and test a Hidden Markov Model (HMM) based identification system. Face texture images are treated separately in the eigenface domain and integrated into the system through decision fusion. Experimental results are also included to demonstrate the system performance.
dc.description.indexedbyWoS
dc.description.indexedbyScopus
dc.description.openaccessNO
dc.description.publisherscopeInternational
dc.identifier.doiN/A
dc.identifier.isbn0-7803-7750-8
dc.identifier.scopus2-s2.0-0345565788
dc.identifier.urihttps://hdl.handle.net/20.500.14288/16652
dc.identifier.wos187010500002
dc.keywordsSpeech
dc.languageEnglish
dc.publisherIEEE
dc.source2003 International Conference on Image Processing, Vol 3, Proceedings
dc.subjectComputer Science
dc.subjectArtificial intelligence
dc.subjectImaging systems
dc.subjectPhotography
dc.titleMultimodal speaker identification with audio-video processing
dc.typeConference proceeding
dspace.entity.typePublication
local.contributor.authorid0000-0002-7515-3138
local.contributor.authorid0000-0003-2541-7753
local.contributor.authorid0000-0002-2715-2368
local.contributor.authorid0000-0003-1465-8121
local.contributor.kuauthorYemez, Yücel
local.contributor.kuauthorKanak, Alper
local.contributor.kuauthorErzin, Engin
local.contributor.kuauthorTekalp, Ahmet Murat
relation.isOrgUnitOfPublication89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication21598063-a7c5-420d-91ba-0cc9b2db0ea0
relation.isOrgUnitOfPublication.latestForDiscovery21598063-a7c5-420d-91ba-0cc9b2db0ea0
