Researcher: Sargın, Mehmet Emre
Name Variants: Sargın, Mehmet Emre
Search Results (7 publications)
Publication (Metadata only): Audio-visual correlation analysis for lip feature extraction (Institute of Electrical and Electronics Engineers (IEEE), 2005)
Authors: Yemez, Yücel; Erzin, Engin; Sargın, Mehmet Emre; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
In this paper, correlations between audio features and different lip features are investigated. The audio features are the Mel Frequency Cepstral Coefficients (MFCC) of the audio signal. Three lip features are considered for the visual lip information: 2D DCT coefficients of the intensity-based lip image, optical flow vectors within the lip region, and distances between predefined points on the lip contour, which carry the lip shape information. A new methodology based on class-conditional probability is used to estimate and compare the correlations between the audio features and each lip feature, and the lip feature with the highest correlation to the audio features is identified. Isolating lip features that are highly correlated with the audio signal can be used for audio-visual speech recognition, audio-visual lip synchronization, and estimation of lip shapes from the audio signal for visual synthesis.
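The feature extraction described above pairs frame-level MFCCs with low-frequency 2D DCT coefficients of the lip region. The following is a minimal sketch of that pairing, not the authors' implementation: it assumes librosa and SciPy are available, and the video frame rate, lip ROI size, and number of retained DCT coefficients are illustrative choices.

```python
import numpy as np
import librosa
from scipy.fft import dct

def audio_mfcc(wav_path, n_mfcc=13, video_fps=25):
    """MFCCs hopped so that one audio frame aligns with one video frame."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = sr // video_fps
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop).T

def lip_dct(lip_roi, keep=8):
    """2D DCT of a grayscale lip ROI; keep the low-frequency keep x keep block."""
    coeffs = dct(dct(lip_roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs[:keep, :keep].ravel()

# One audio and one visual feature vector per video frame; lip_frames stands in
# for a hypothetical sequence of per-frame grayscale lip crops.
# mfcc_feats = audio_mfcc("speaker.wav")                       # shape (T, 13)
# dct_feats = np.stack([lip_dct(frame) for frame in lip_frames])  # shape (T, 64)
```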
Publication (Metadata only): Speech driven 3D head gesture synthesis (IEEE, 2006)
Authors: Erdem, A. Tanju; Sargın, Mehmet Emre; Erzin, Engin; Yemez, Yücel; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
In this paper, we present a speech-driven natural head gesture analysis and synthesis system. The proposed system assumes that sharp head movements are correlated with prominence in speech. For analysis, a binocular camera system is employed to capture the head motion of a talking person. The motion parameters associated with the 3D head motion are then used to extract repetitive head gestures. In parallel, prosodic events are detected using an HMM structure with pitch, formant frequencies, and speech intensity as audio features. For synthesis, the head motion parameters are estimated from the prosodic events based on a gesture-speech correlation model, and the associated Euler angles are then used for speech-driven animation of a personalized 3D talking head model. Results on head motion feature extraction, prosodic event detection, and correlation modelling are provided.

Publication (Metadata only): Prosody-driven head-gesture animation (Institute of Electrical and Electronics Engineers (IEEE), 2007)
Authors: Erdem, A. T.; Erdem, C.; Özkan, M.; Sargın, Mehmet Emre; Erzin, Engin; Yemez, Yücel; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
We present a new framework for joint analysis of the head gesture and speech prosody patterns of a speaker, towards automatic realistic synthesis of head gestures from speech prosody. The proposed two-stage analysis aims to "learn" both elementary prosody and head gesture patterns for a particular speaker, as well as the correlations between these head gesture and prosody patterns, from a training video sequence. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme produces natural-looking head gestures for the speaker with any input test speech.
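The speech-driven head-gesture work above detects prosodic events with an HMM over pitch, formant, and intensity features. Below is a minimal, hypothetical sketch of extracting two of those frame-level tracks (pitch and intensity); the papers' actual feature set, frame rates, and HMM configuration are not given here, and formant tracking is omitted.

```python
import numpy as np
import librosa

def prosodic_features(wav_path, hop_ms=10):
    """Frame-level pitch (Hz) and log-RMS intensity (dB), stacked as (T, 2)."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * hop_ms / 1000)                    # 10 ms frame step
    # Fundamental frequency track via the YIN estimator (64 ms analysis window)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=1024, hop_length=hop)
    # Speech intensity as log RMS energy over 25 ms windows
    rms = librosa.feature.rms(y=y, frame_length=400, hop_length=hop)[0]
    intensity = 20 * np.log10(rms + 1e-8)
    n = min(len(f0), len(intensity))
    return np.stack([f0[:n], intensity[:n]], axis=1)
```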
Publication (Metadata only): Speech driven 3D head gesture synthesis (Institute of Electrical and Electronics Engineers (IEEE), 2006)
Authors: Erdem, Tanju A.; Yemez, Yücel; Erzin, Engin; Sargın, Mehmet Emre; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
In this paper, we present a speech-driven natural head gesture analysis and synthesis system. The proposed system assumes that sharp head movements are correlated with prominence in speech. For analysis, a binocular camera system is employed to capture the head motion of a talking person. The motion parameters associated with the 3D head motion are then used to extract repetitive head gestures. In parallel, prosodic events are detected using an HMM structure with pitch, formant frequencies, and speech intensity as audio features. For synthesis, the head motion parameters are estimated from the prosodic events based on a gesture-speech correlation model, and the associated Euler angles are then used for speech-driven animation of a personalized 3D talking head model. Results on head motion feature extraction, prosodic event detection, and correlation modelling are provided.

Publication (Metadata only): Lip feature extraction based on audio-visual correlation (European Association for Signal Processing, 2005)
Authors: Yemez, Yücel; Erzin, Engin; Sargın, Mehmet Emre; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
In this paper, the lip feature that has the highest correlation with audio features is investigated. The audio features are the Mel Frequency Cepstral Coefficients (MFCC) of the audio signal. Three lip features are considered for the visual lip information: 2D DCT coefficients of the intensity-based lip image, optical flow vectors within the lip region, and distances between pre-defined points on the lip contour, which carry the lip shape information. We present two techniques, based on class-conditional probability analysis and canonical correlation analysis, to estimate and compare the correlations between the audio features and each lip feature, and the lip feature with the highest correlation to the audio features is identified. Isolating lip features that are highly correlated with the audio signal can be used for audio-visual speech recognition, audio-visual lip synchronization, and estimation of lip shapes from the audio signal for visual synthesis.
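The paper above ranks candidate lip features by their correlation with the audio stream, with canonical correlation analysis as one of its two techniques. A minimal sketch of such a CCA-based comparison follows; it assumes scikit-learn's CCA, frame-synchronous feature matrices, and a single retained canonical component, and the papers' exact procedure may differ.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def canonical_corr(audio_feats, lip_feats, n_components=1):
    """First canonical correlation between two (T, d) feature matrices."""
    cca = CCA(n_components=n_components)
    a_proj, l_proj = cca.fit_transform(audio_feats, lip_feats)
    return np.corrcoef(a_proj[:, 0], l_proj[:, 0])[0, 1]

# Example: pick the lip representation most correlated with the MFCC stream
# (feature matrices below are hypothetical placeholders).
# candidates = {"dct": dct_feats, "flow": flow_feats, "contour": contour_feats}
# best = max(candidates, key=lambda k: canonical_corr(mfcc_feats, candidates[k]))
```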
Publication (Metadata only): Multimodal speaker identification using canonical correlation analysis (IEEE, 2006)
Authors: Sargın, Mehmet Emre; Erzin, Engin; Yemez, Yücel; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
In this work, we explore the use of canonical correlation analysis to improve the performance of multimodal recognition systems that involve multiple correlated modalities. More specifically, we consider the audiovisual speaker identification problem, where the speech and lip texture (or intensity) modalities are fused in an open-set identification framework. Our motivation is based on the following observation: the late integration strategy, also referred to as decision or opinion fusion, is effective especially when the contributing modalities are uncorrelated and the resulting partial decisions are therefore statistically independent, whereas early integration techniques are favored when the modalities are highly correlated. However, coupled modalities such as audio and lip texture also contain components that are mutually independent. We therefore first perform a cross-correlation analysis on the audio and lip modalities to extract the correlated part of the information, and then employ an optimal combination of early and late integration techniques to fuse the extracted features. Results of experiments testing the performance of the proposed system are also provided.

Publication (Open Access): Audiovisual synchronization and fusion using canonical correlation analysis (Institute of Electrical and Electronics Engineers (IEEE), 2007)
Authors: Sargın, Mehmet Emre; Yemez, Yücel; Erzin, Engin; Tekalp, Ahmet Murat (Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering)
It is well known that early integration (also called data fusion) is effective when the modalities are correlated, and late integration (also called decision or opinion fusion) is optimal when the modalities are uncorrelated. In this paper, we propose a new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features. We also propose a method for high-precision synchronization of the speech and lip features using CCA prior to the proposed fusion. Experimental results show that (i) the proposed fusion strategy yields the best equal error rates (EER), which are used to quantify the performance of the fusion strategy for open-set speaker identification, and (ii) precise synchronization prior to fusion improves the EER; hence, the best EER is obtained when the proposed synchronization scheme is employed together with the proposed fusion strategy. We note that the proposed fusion strategy outperforms the others because the features used in late integration are truly uncorrelated, being outputs of the CCA analysis.
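The last paper above synchronizes the speech and lip streams with CCA before fusing them. A minimal sketch of one way such a CCA-based alignment could work, sliding one stream over candidate frame offsets and keeping the offset that maximizes the first canonical correlation, is given below; the offset range and CCA settings are illustrative assumptions rather than the paper's parameters.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def estimate_av_offset(audio_feats, lip_feats, max_shift=10):
    """Return the lip-stream shift (in frames) best aligned with the audio."""
    best_shift, best_corr = 0, -1.0
    for shift in range(-max_shift, max_shift + 1):
        # Slice both streams so that frame t of one is paired with frame
        # t + shift of the other.
        if shift >= 0:
            a, l = audio_feats[shift:], lip_feats[:len(lip_feats) - shift]
        else:
            a, l = audio_feats[:shift], lip_feats[-shift:]
        n = min(len(a), len(l))
        a_s, l_s = CCA(n_components=1).fit_transform(a[:n], l[:n])
        corr = np.corrcoef(a_s[:, 0], l_s[:, 0])[0, 1]
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr
```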