Researcher:
Erzin, Engin

Job Title: Faculty Member
First Name: Engin
Last Name: Erzin

Name Variants: Erzin, Engin

Search Results

Now showing 1 - 10 of 135
  • Publication
    Enhancement of throat microphone recordings by learning phone-dependent mappings of speech spectra
    (Institute of Electrical and Electronics Engineers (IEEE), 2013) Turan, Mehmet Ali Tuğtekin (PhD Student); Erzin, Engin (Faculty Member); Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering
    We investigate the spectral envelope mapping problem with joint analysis of throat- and acoustic-microphone recordings to enhance throat-microphone speech. A new phone-dependent GMM-based spectral envelope mapping scheme, which performs the minimum mean square error (MMSE) estimation of the acoustic-microphone spectral envelope, has been proposed. Experimental evaluations are performed to compare the proposed mapping scheme to the state-of-the-art GMM-based estimator using both objective and subjective evaluations. Objective evaluations are performed with the log-spectral distortion (LSD) and the wideband perceptual evaluation of speech quality (PESQ) metrics. Subjective evaluations are performed with the A/B pair comparison listening test. Both objective and subjective evaluations show that the proposed phone-dependent mapping consistently improves performance over the state-of-the-art GMM estimator.
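For readers who want a concrete picture of the GMM-based MMSE mapping this abstract describes, the following is a minimal sketch of the phone-independent baseline, assuming parallel throat- and acoustic-microphone spectral-envelope features are already available as NumPy arrays; the phone-dependent variant would simply repeat the same fit per phone class. Function names, model sizes, and feature dimensions are illustrative, not taken from the paper.

```python
# Sketch: GMM-based MMSE mapping from throat-mic features x to acoustic-mic
# features y. A joint GMM is fit on z = [x, y]; the MMSE estimate is
# E[y | x] = sum_k p(k | x) * (mu_y_k + S_yx_k S_xx_k^{-1} (x - mu_x_k)).
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def fit_joint_gmm(X, Y, n_components=8):
    # X: (N, dx) throat-mic envelope frames, Y: (N, dy) acoustic-mic targets
    Z = np.hstack([X, Y])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=0).fit(Z)

def mmse_map(gmm, X, dx):
    # Responsibilities p(k | x) from the x-marginal of each joint component
    lik = np.stack([
        gmm.weights_[k] *
        multivariate_normal(gmm.means_[k][:dx],
                            gmm.covariances_[k][:dx, :dx],
                            allow_singular=True).pdf(X)
        for k in range(gmm.n_components)], axis=1)            # (N, K)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # Weighted sum of per-component conditional means E[y | x, k]
    Y_hat = np.zeros((X.shape[0], gmm.means_.shape[1] - dx))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k][:dx], gmm.means_[k][dx:]
        S = gmm.covariances_[k]
        A = S[dx:, :dx] @ np.linalg.inv(S[:dx, :dx])
        Y_hat += resp[:, [k]] * (mu_y + (X - mu_x) @ A.T)
    return Y_hat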
  • Publication
    Robust speech recognition using adaptively denoised wavelet coefficients
    (IEEE, 2004) Tekalp, Ahmet Murat (Faculty Member); Erzin, Engin (Faculty Member); Akyol, Emrah (Master Student); Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering
    The existence of additive noise affects the performance of speech recognition in real environments. We propose a new set of feature vectors for robust speech recognition using denoised wavelet coefficients. The use of wavelet coefficients in speech processing is motivated by the ability of the wavelet transform to capture both time and frequency information and by the non-stationary behaviour of speech signals. We use one set of noisy data, such as speech with car noise, and apply hard thresholding in the best basis for denoising. We use isolated digits as the database in our HMM-based speech recognition system. A performance comparison of hard-thresholded denoised wavelet coefficients and MFCC feature vectors is presented.
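As a concrete illustration of the denoising step mentioned in the abstract, here is a minimal sketch of hard-thresholding wavelet coefficients, assuming the PyWavelets (pywt) package; the wavelet family, decomposition level, and threshold rule are illustrative choices rather than the paper's exact best-basis setup.

```python
# Sketch: hard-thresholding of wavelet detail coefficients for denoising.
import numpy as np
import pywt

def denoise_hard(signal, wavelet="db8", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Universal threshold from a MAD noise estimate on the finest detail band
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    # Keep the approximation band, hard-threshold every detail band
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="hard")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[:len(signal)]

# The denoised coefficients (or features derived from them) would then feed
# the HMM recognizer in place of MFCCs.
noisy = np.random.randn(16000)   # stand-in for a noisy speech segment
clean_est = denoise_hard(noisy)
```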
  • Publication
    An audio-driven dancing avatar
    (Springer, 2008) Balci, Koray; Kizoglu, Idil; Akarun, Lale; Canton-Ferrer, Cristian; Tilmanne, Joelle; Bozkurt, Elif; Erdem, A. Tanju; Yemez, Yücel (Faculty Member); Ofli, Ferda (PhD Student); Demir, Yasemin (Master Student); Erzin, Engin (Faculty Member); Tekalp, Ahmet Murat (Faculty Member); Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering
    We present a framework for training and synthesis of an audio-driven dancing avatar. The avatar is trained for a given musical genre using the multicamera video recordings of a dance performance. The video is analyzed to capture the time-varying posture of the dancer's body whereas the musical audio signal is processed to extract the beat information. We consider two different marker-based schemes for the motion capture problem. The first scheme uses 3D joint positions to represent the body motion whereas the second uses joint angles. Body movements of the dancer are characterized by a set of recurring semantic motion patterns, i.e., dance figures. Each dance figure is modeled in a supervised manner with a set of HMM (Hidden Markov Model) structures and the associated beat frequency. In the synthesis phase, an audio signal of unknown musical type is first classified, within a time interval, into one of the genres that have been learnt in the analysis phase, based on mel frequency cepstral coefficients (MFCC). The motion parameters of the corresponding dance figures are then synthesized via the trained HMM structures in synchrony with the audio signal based on the estimated tempo information. Finally, the generated motion parameters, either the joint angles or the 3D joint positions of the body, are animated along with the musical audio using two different animation tools that we have developed. Experimental results demonstrate the effectiveness of the proposed framework.
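The genre-classification step of the synthesis phase can be sketched as follows: one Gaussian HMM per musical genre trained on MFCC sequences, with a new clip assigned to the genre whose model gives the highest log-likelihood. This is only an illustration of that one step, assuming the librosa and hmmlearn packages; file paths, state counts, and sampling rate are illustrative, and the motion-synthesis side is out of scope.

```python
# Sketch: MFCC-based genre classification with one Gaussian HMM per genre.
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_sequence(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)

def train_genre_hmm(training_paths, n_states=6):
    seqs = [mfcc_sequence(p) for p in training_paths]
    X, lengths = np.vstack(seqs), [len(s) for s in seqs]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, random_state=0)
    return model.fit(X, lengths)

def classify(path, genre_models):
    # genre_models: {"salsa": trained_hmm, "waltz": trained_hmm, ...}
    feats = mfcc_sequence(path)
    return max(genre_models, key=lambda g: genre_models[g].score(feats))
```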
  • Publication
    Multimodal analysis of speech prosody and upper body gestures using hidden semi-Markov models
    (Institute of Electrical and Electronics Engineers (IEEE), 2013) Bozkurt, Elif (PhD Student); Asta, Shahriar (PhD Student); Özkul, Serkan (Master Student); Yemez, Yücel (Faculty Member); Erzin, Engin (Faculty Member); Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering
    Gesticulation is an essential component of face-to-face communication, and it contributes significantly to the natural and affective perception of human-to-human communication. In this work we investigate a new multimodal analysis framework to model relationships between intonational and gesture phrases using hidden semi-Markov models (HSMMs). The HSMM framework effectively associates longer-duration gesture phrases to shorter-duration prosody clusters, while maintaining realistic gesture phrase duration statistics. We evaluate the multimodal analysis framework by generating speech-prosody-driven gesture animation and employing both subjective and objective metrics.
  • Publication
    Multicamera audio-visual analysis of dance figures
    (IEEE, 2007) Ofli, Ferda (PhD Student); Erzin, Engin (Faculty Member); Yemez, Yücel (Faculty Member); Tekalp, Ahmet Murat (Faculty Member); Department of Computer Engineering; Department of Electrical and Electronics Engineering; Graduate School of Sciences and Engineering; College of Engineering
    We present an automated system for multicamera motion capture and audio-visual analysis of dance figures. The multiview video of a dancing actor is acquired using 8 synchronized cameras. The motion capture technique is based on 3D tracking of the markers attached to the person's body in the scene, using stereo color information without need for an explicit 3D model. The resulting set of 3D points is then used to extract the body motion features as 3D displacement vectors, whereas MFC coefficients serve as the audio features. In the first stage of multimodal analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of the audio and body motion features, separately, to determine the recurrent elementary audio and body motion patterns. Then, in the second stage, we investigate the correlation of body motion patterns with audio patterns, which can be used for estimation and synthesis of realistic audio-driven body animation.
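The first analysis stage, HMM-based unsupervised temporal segmentation, can be sketched in a few lines: fit a Gaussian HMM to a single feature stream (3D displacement vectors or audio features) and read the Viterbi state sequence as a segmentation into recurrent elementary patterns. hmmlearn is assumed; the number of states is an illustrative choice, not the paper's.

```python
# Sketch: unsupervised temporal segmentation of one feature stream with an HMM.
import numpy as np
from hmmlearn import hmm

def segment(features, n_patterns=8):
    # features: (n_frames, n_dims) array for one recording
    model = hmm.GaussianHMM(n_components=n_patterns, covariance_type="diag",
                            n_iter=100, random_state=0).fit(features)
    states = model.predict(features)                  # one pattern label per frame
    boundaries = np.flatnonzero(np.diff(states)) + 1  # frames where the label changes
    return states, boundaries
```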
  • Publication
    Audio-facial laughter detection in naturalistic dyadic conversations
    (IEEE-Inst Electrical Electronics Engineers Inc, 2017) Türker, Bekir Berker (PhD Student); Yemez, Yücel (Faculty Member); Sezgin, Tevfik Metin (Faculty Member); Erzin, Engin (Faculty Member); Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering
    We address the problem of continuous laughter detection over audio-facial input streams obtained from naturalistic dyadic conversations. We first present meticulous annotation of laughters, cross-talks and environmental noise in an audio-facial database with explicit 3D facial mocap data. Using this annotated database, we rigorously investigate the utility of facial information, head movement and audio features for laughter detection. We identify a set of discriminative features using mutual information-based criteria, and show how they can be used with classifiers based on support vector machines (SVMs) and time delay neural networks (TDNNs). Informed by the analysis of the individual modalities, we propose a multimodal fusion setup for laughter detection using different classifier-feature combinations. We also effectively incorporate bagging into our classification pipeline to address the class imbalance problem caused by the scarcity of positive laughter instances. Our results indicate that a combination of TDNNs and SVMs leads to superior detection performance, and bagging effectively addresses data imbalance. Our experiments show that our multimodal approach supported by bagging compares favorably to the state of the art in the presence of detrimental factors such as cross-talk, environmental noise, and data imbalance.
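To make the bagging-against-imbalance idea concrete, here is a minimal sketch under the assumption that frame-level features and binary laughter labels are already extracted: each bag keeps all of the scarce positive frames and undersamples the negative class, an SVM is trained per bag, and the ensemble averages probabilities. scikit-learn is assumed; the TDNN branch and the fusion setup are out of scope, and all parameter values are illustrative.

```python
# Sketch: bagged SVMs with per-bag undersampling of the majority class.
import numpy as np
from sklearn.svm import SVC

def train_bagged_svms(X, y, n_bags=10, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    models = []
    for _ in range(n_bags):
        sampled_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sampled_neg])
        models.append(SVC(kernel="rbf", probability=True).fit(X[idx], y[idx]))
    return models

def predict(models, X):
    # Average positive-class probabilities across bags, threshold at 0.5
    p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (p >= 0.5).astype(int)
```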
  • Publication
    Affect burst detection using multi-modal cues
    (IEEE, 2015) Sezgin, Tevfik Metin (Faculty Member); Yemez, Yücel (Faculty Member); Türker, Bekir Berker (PhD Student); Erzin, Engin (Faculty Member); Marzban, Shabbir (Master Student); Department of Computer Engineering; College of Engineering; Graduate School of Sciences and Engineering
    Recently, affect bursts have gained significant importance in the field of emotion recognition, since they can serve as a prior in recognizing the underlying affective state. In this paper we propose a data-driven approach for detecting affect bursts using multimodal input streams such as audio and facial landmark points. The proposed Gaussian mixture model (GMM) based method learns each modality independently and then combines the probabilistic outputs to form a decision. This gives us an edge over feature-fusion-based methods, as it allows us to handle events where one of the modalities is too noisy or not available. We demonstrate the robustness of the proposed approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which contains realistic and natural dyadic conversations. This database is annotated by three annotators to segment and label affect bursts to be used for training and testing purposes. We also present a performance comparison between SVM-based and GMM-based methods for the same configuration of experiments.
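The decision-fusion idea described above can be sketched as follows: one GMM per class is trained independently for each modality, and at test time the per-modality log-likelihoods are combined (a modality can be dropped or down-weighted when noisy or missing) before taking the arg-max class. scikit-learn is assumed; class names, dictionary structure, and weights are illustrative, not the paper's implementation.

```python
# Sketch: per-modality class-conditional GMMs with score-level fusion.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_modality_gmms(features_by_class, n_components=4):
    # features_by_class: {"affect_burst": (N1, d) array, "other": (N2, d) array}
    return {c: GaussianMixture(n_components=n_components, random_state=0).fit(F)
            for c, F in features_by_class.items()}

def fuse_and_decide(frame_by_modality, gmms_by_modality, weights=None):
    # frame_by_modality: {"audio": (1, da) array, "face": (1, df) array};
    # a missing modality is simply absent from the dict and skipped.
    classes = list(next(iter(gmms_by_modality.values())).keys())
    scores = {c: 0.0 for c in classes}
    for m, frame in frame_by_modality.items():
        w = 1.0 if weights is None else weights[m]
        for c in classes:
            scores[c] += w * gmms_by_modality[m][c].score(frame)  # mean log-likelihood
    return max(scores, key=scores.get)
```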
  • Publication
    Source and filter estimation for throat-microphone speech enhancement
    (IEEE-Inst Electrical Electronics Engineers Inc, 2016) Turan, Mehmet Ali Tuğtekin (PhD Student); Erzin, Engin (Faculty Member); Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering
    In this paper, we propose a new statistical enhancement system for throat-microphone recordings through source and filter separation. Throat microphones (TM) are skin-attached piezoelectric sensors that can capture speech sound signals in the form of tissue vibrations. Due to their limited bandwidth, TM-recorded speech suffers from reduced intelligibility and naturalness. In this paper, we investigate learning phone-dependent Gaussian mixture model (GMM) based statistical mappings using parallel recordings of acoustic microphone (AM) and TM for enhancement of the spectral envelope and excitation signals of the TM speech. The proposed mappings address the phone-dependent variability of tissue conduction with TM recordings. While the spectral envelope mapping estimates the line spectral frequency (LSF) representation of AM from TM recordings, the excitation mapping is constructed based on the spectral energy difference (SED) of AM and TM excitation signals. The excitation enhancement is modeled as an estimation of the SED features from the TM signal. The proposed enhancement system is evaluated using both objective and subjective tests. Objective evaluations are performed with the log-spectral distortion (LSD), the wideband perceptual evaluation of speech quality (PESQ) and mean-squared error (MSE) metrics. Subjective evaluations are performed with an A/B comparison test. Experimental results indicate that the proposed phone-dependent mappings exhibit enhancements over phone-independent mappings. Furthermore, enhancement of the TM excitation through statistical mappings of the SED features introduces significant objective and subjective performance improvements to the enhancement of TM recordings.
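Of the objective metrics named in this abstract, the log-spectral distortion (LSD) is simple enough to sketch directly: the RMS difference of log-magnitude spectra, averaged over frames. The frame length and FFT size below are illustrative; PESQ requires a dedicated implementation and is not reproduced here.

```python
# Sketch: frame-averaged log-spectral distortion between reference and estimate.
import numpy as np

def log_spectral_distortion(ref, est, frame=512, hop=256, eps=1e-10):
    n = min(len(ref), len(est))
    lsd_frames = []
    for start in range(0, n - frame, hop):
        R = np.abs(np.fft.rfft(ref[start:start + frame]))
        E = np.abs(np.fft.rfft(est[start:start + frame]))
        d = 20 * np.log10(R + eps) - 20 * np.log10(E + eps)
        lsd_frames.append(np.sqrt(np.mean(d ** 2)))   # RMS over frequency bins
    return float(np.mean(lsd_frames))                 # average over frames, in dB
```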
  • Publication
    Agreement and disagreement classification of dyadic interactions using vocal and gestural cues
    (Institute of Electrical and Electronics Engineers (IEEE), 2016) Khaki, Hossein (PhD Student); Bozkurt, Elif (PhD Student); Erzin, Engin (Faculty Member); Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering
    In human-to-human communication, gesture and speech co-exist in time with tight synchrony, and we tend to use gestures to complement or to emphasize speech. In this study, we investigate the roles of vocal and gestural cues in identifying a dyadic interaction as agreement or disagreement. In this investigation we use the JESTKOD database, which consists of speech and full-body motion capture data recordings of dyadic interactions under agreement and disagreement scenarios. Spectral features of the vocal channel and upper-body joint angles of the gestural channel are employed to obtain unimodal and multimodal classification performances. Both modalities attain classification rates significantly above chance level, and the multimodal classifier achieves a classification rate above 80% over 15-second utterances using statistical features of speech and motion.
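The classification setup can be sketched at a high level: frame-level features (spectral features of speech, upper-body joint angles) are summarized over a 15-second window by simple statistical functionals, and a binary SVM decides agreement versus disagreement. scikit-learn is assumed; the choice of functionals and the feature extraction from the JESTKOD recordings are illustrative and out of scope.

```python
# Sketch: window-level statistical functionals followed by a binary SVM.
import numpy as np
from sklearn.svm import SVC

def window_functionals(frames):
    # frames: (n_frames, n_dims) features for one 15-second utterance window
    return np.concatenate([frames.mean(0), frames.std(0),
                           frames.min(0), frames.max(0)])

def train_classifier(windows, labels):
    # windows: list of (n_frames, n_dims) arrays; labels: 1 = agreement, 0 = disagreement
    X = np.stack([window_functionals(w) for w in windows])
    return SVC(kernel="rbf").fit(X, np.asarray(labels))
```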
  • Publication
    Artificial bandwidth extension of speech excitation
    (IEEE, 2015) Erzin, Engin (Faculty Member); Turan, Mehmet Ali Tuğtekin (PhD Student); Department of Computer Engineering; College of Engineering; Graduate School of Sciences and Engineering
    In this paper, a new approach that extends narrowband excitation signals to synthesize wideband speech is proposed. The bandwidth extension problem is analyzed within a source-filter separation framework, where a speech signal is decomposed into two independent components. For spectral envelope extension, our earlier work based on hidden Markov models is used. For excitation signal extension, the proposed method shifts the spectrum based on correlation analysis, so that the spacing between the harmonics and the structure of the excitation signal are preserved in the high band. In the experimental studies, we also apply two other well-known extension techniques for excitation signals comparatively and evaluate the overall performance of the proposed system using the PESQ metric. Our findings indicate that the proposed extension method outperforms the other two techniques. © 2015 IEEE. / Öz (translated from Turkish): In this study, a new approach is proposed that synthesizes wideband speech by increasing the bandwidth of narrowband excitation signals. The bandwidth extension problem is handled separately for two independent components with the help of source-filter analysis. While the spectral envelope that shapes the filter structure is enhanced using our earlier hidden Markov model based work, we propose a new spectral-copying-based method for extending the narrowband excitation signal. In this new method, the harmonic structure in the high-frequency components of the narrowband excitation signal is extended through correlation analysis to synthesize a wideband excitation signal. To measure the performance of the proposed enhancement, two extension methods frequently used in the literature are also evaluated comparatively. In the experimental studies, the objective performance of the proposed extension is demonstrated with the PESQ metric.
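As a generic illustration of excitation bandwidth extension by spectral copying (a simplified stand-in for the correlation-based placement described above, not the paper's method), the narrowband residual can be upsampled, its occupied low band copied into the empty high band, and the result inverted back to the time domain. SciPy is assumed; sampling rates are illustrative.

```python
# Sketch: naive spectral copying of a narrowband excitation into the high band.
import numpy as np
from scipy.signal import resample_poly

def extend_excitation(exc_nb, fs_nb=8000, fs_wb=16000):
    exc_up = resample_poly(exc_nb, fs_wb // fs_nb, 1)   # upsample; high band is empty
    spec = np.fft.rfft(exc_up)
    half = len(spec) // 2
    # Copy the occupied low band into the empty high band (spectral translation)
    spec[half:2 * half] = spec[:half]
    return np.fft.irfft(spec, n=len(exc_up))
```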