Publications with Fulltext

Permanent URI for this collection: https://hdl.handle.net/20.500.14288/6

Search Results

Now showing 1 - 6 of 6
  • Publication (Open Access)
    A deep learning approach for data driven vocal tract area function estimation
    (Institute of Electrical and Electronics Engineers (IEEE), 2018) Department of Computer Engineering; Department of Electrical and Electronics Engineering; Erzin, Engin; Asadiabadi, Sasan; Faculty Member; Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Sciences; Graduate School of Sciences and Engineering; 34503; N/A
    In this paper, we present a data-driven vocal tract area function (VTAF) estimation method using Deep Neural Networks (DNN). We approach the VTAF estimation problem with sequence-to-sequence learning neural networks, where regression over a sliding window is used to learn an arbitrary non-linear one-to-many mapping from the input feature sequence to the target articulatory sequence. We propose two schemes for efficient estimation of the VTAF: (1) a direct estimation of the area function values and (2) an indirect estimation via predicting the vocal tract boundaries. We consider acoustic speech and phone sequence as two possible input modalities for the DNN estimators. Experimental evaluations are performed over a large dataset comprising acoustic and phonetic features with parallel articulatory information from the USC-TIMIT database. Our results show that the proposed direct and indirect schemes perform the VTAF estimation with mean absolute error (MAE) rates lower than 1.65 mm, where the direct estimation scheme is observed to perform better than the indirect scheme.
  • Publication (Open Access)
    End to end rate distortion optimized learned hierarchical bi-directional video compression
    (Institute of Electrical and Electronics Engineers (IEEE), 2022) Department of Electrical and Electronics Engineering; Tekalp, Ahmet Murat; Yılmaz, Mustafa Akın; Faculty Member; Department of Electrical and Electronics Engineering; College of Engineering; 26207; N/A
    Conventional video compression (VC) methods are based on motion compensated transform coding, and the steps of motion estimation, mode and quantization parameter selection, and entropy coding are optimized individually due to the combinatorial nature of the end-to-end optimization problem. Learned VC allows end-to-end rate-distortion (R-D) optimized training of the nonlinear transform, motion, and entropy models simultaneously. Most works on learned VC consider end-to-end optimization of a sequential video codec based on R-D loss averaged over pairs of successive frames. It is well known in conventional VC that hierarchical, bi-directional coding outperforms sequential compression because of its ability to use both past and future reference frames. This paper proposes a learned hierarchical bi-directional video codec (LHBDC) that combines the benefits of hierarchical motion-compensated prediction and end-to-end optimization. Experimental results show that we achieve the best R-D results that are reported for learned VC schemes to date in both PSNR and MS-SSIM. Compared to conventional video codecs, the R-D performance of our end-to-end optimized codec exceeds that of both x265 and SVT-HEVC encoders ("veryslow" preset) in PSNR and MS-SSIM, as well as that of the HM 16.23 reference software in MS-SSIM. We present ablation studies showing performance gains due to proposed novel tools such as learned masking, flow-field subsampling, and temporal flow vector prediction. The models and instructions to reproduce our results can be found at https://github.com/makinyilmaz/LHBDC/.
  • Publication (Open Access)
    Emotion dependent domain adaptation for speech driven affective facial feature synthesis
    (Institute of Electrical and Electronics Engineers (IEEE), 2022) Department of Electrical and Electronics Engineering; Erzin, Engin; Sadiq, Rizwan; Faculty Member; Department of Electrical and Electronics Engineering; Koç Üniversitesi İş Bankası Yapay Zeka Uygulama ve Araştırma Merkezi (KUIS AI)/ Koç University İş Bank Artificial Intelligence Center (KUIS AI); College of Engineering; 34503; N/A
    Although speech-driven facial animation has been studied extensively in the literature, works focusing on the affective content of the speech are limited. This is mostly due to the scarcity of affective audio-visual data. In this article, we improve affective facial animation using domain adaptation by partially reducing the data scarcity. We first define a domain adaptation to map affective and neutral speech representations to a common latent space in which cross-domain bias is smaller. Then the domain adaptation is used to augment affective representations for each emotion category, including angry, disgust, fear, happy, sad, surprise, and neutral, so that we can better train emotion-dependent deep audio-to-visual (A2V) mapping models. Based on the emotion-dependent deep A2V models, the proposed affective facial synthesis system is realized in two stages: first, speech emotion recognition extracts soft emotion category likelihoods for the utterances; then a soft fusion of the emotion-dependent A2V mapping outputs forms the affective facial synthesis. Experimental evaluations are performed on the SAVEE audio-visual dataset. The proposed models are assessed with objective and subjective evaluations. The proposed affective A2V system achieves significant MSE loss improvements in comparison to the recent literature. Furthermore, the resulting facial animations of the proposed system are preferred over the baseline animations in the subjective evaluations.
  • Publication (Open Access)
    Federated dropout learning for hybrid beamforming with spatial path index modulation in multi-user MMWave-MIMO systems
    (Institute of Electrical and Electronics Engineers (IEEE), 2021) Mishra, Kumar Vijay; Department of Electrical and Electronics Engineering; Ergen, Sinem Çöleri; Elbir, Ahmet Musab; Faculty Member; Department of Electrical and Electronics Engineering; College of Engineering; 7211; N/A
    Millimeter wave multiple-input multiple-output (mmWave-MIMO) systems with a small number of radio-frequency (RF) chains have limited multiplexing gain. Spatial path index modulation (SPIM) is helpful in improving this gain by utilizing additional signal bits modulated by the indices of spatial paths. In this paper, we introduce model-based and model-free frameworks for beamformer design in multi-user SPIM-MIMO systems. We first design the beamformers via a model-based manifold optimization algorithm. Then, we leverage federated learning (FL) with dropout learning (DL) to train a learning model on the local dataset of users, who estimate the beamformers by feeding the model with their channel data. The DL randomly selects a different set of model parameters during training, thereby further reducing the transmission overhead compared to conventional FL. Numerical experiments show that the proposed framework exhibits higher spectral efficiency than the state-of-the-art SPIM-MIMO methods and mmWave-MIMO, which relies on the strongest propagation path. Furthermore, the proposed FL approach provides at least 10 times lower transmission overhead than centralized learning techniques.
  • Publication (Open Access)
    Generalized Polytopic Matrix Factorization
    (Institute of Electrical and Electronics Engineers (IEEE), 2021) Department of Electrical and Electronics Engineering; Erdoğan, Alper Tunga; Tatlı, Gökcan; Faculty Member; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering; 41624; N/A
    Polytopic Matrix Factorization (PMF) is introduced as a flexible data decomposition tool with potential applications in unsupervised learning. PMF assumes a generative model where observations are lossless linear mixtures of some samples drawn from a particular polytope. Assuming that these samples are sufficiently scattered inside the polytope, a determinant maximization based criterion is used to obtain latent polytopic factors from the corresponding observations. This article aims to characterize all eligible polytopic sets that are suitable for the PMF framework. In particular, we show that any polytope whose set of vertices has only permutation and/or sign invariances qualifies for the PMF framework. Such a rich set of possibilities enables elastic modeling of independent/dependent latent factors with a combination of features such as relatively sparse/antisparse subvectors and mixtures of signed/nonnegative components with optionally prescribed domains.
  • Publication (Open Access)
    Training socially engaging robots: modeling backchannel behaviors with batch reinforcement learning
    (Institute of Electrical and Electronics Engineers (IEEE), 2022) Department of Computer Engineering; Department of Electrical and Electronics Engineering; Hussain, Nusrah; Erzin, Engin; Sezgin, Tevfik Metin; Yemez, Yücel; PhD Student; Faculty Member; Faculty Member; Faculty Member; Department of Computer Engineering; Department of Electrical and Electronics Engineering; College of Engineering; Graduate School of Sciences and Engineering; N/A; 34503; 18632; 107907
    A key aspect of social human-robot interaction is natural non-verbal communication. In this work, we train an agent with batch reinforcement learning to generate nods and smiles as backchannels in order to increase the naturalness of the interaction and to engage humans. We introduce the Sequential Random Deep Q-Network (SRDQN) method to learn a policy for backchannel generation that explicitly maximizes user engagement. The proposed SRDQN method outperforms the existing vanilla Q-learning methods when evaluated using off-policy policy evaluation techniques. Furthermore, to verify the effectiveness of SRDQN, a human-robot experiment has been designed and conducted with an expressive 3D robot head. The experiment is based on a story-shaping game designed to create an interactive social activity with the robot. The engagement of the participants during the interaction is computed from the user's social signals such as backchannels, mutual gaze, and adjacency pairs. The subjective feedback from participants and the engagement values strongly indicate that our framework is a step forward towards the autonomous learning of a socially acceptable backchanneling behavior.