Researcher: Erdem, Aykut
Name Variants
Erdem, Aykut
Search Results
Now showing 1 - 10 of 21 results
Publication (Metadata only): Perception-distortion trade-off in the SR space spanned by flow models (IEEE Signal Processing Society, 2022)
Authors: Erdem, Erkut; Doğan, Zafer; Tekalp, Ahmet Murat; Erdem, Aykut; Korkmaz, Cansu
Abstract: Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. Diversity of SR solutions increases with the temperature (τ) of the latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fusion approach to obtain a single SR image that eliminates random artifacts and improves fidelity without significantly compromising perceptual quality. We achieve this by benefiting from a diverse set of feasible photo-realistic solutions in the SR space spanned by flow models. We propose different image ensembling and fusion strategies which offer multiple paths to move sample solutions in the SR space to more desired destinations in the perception-distortion plane in a controllable manner, depending on the fidelity vs. perceptual quality requirements of the task at hand. Experimental results demonstrate that our image ensembling/fusion strategy achieves a more promising perception-distortion trade-off than sample SR images produced by flow models and adversarially trained models, in terms of both quantitative metrics and visual quality. © 2022 IEEE.
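
The ensembling idea in the entry above is straightforward to prototype. The sketch below assumes a hypothetical `flow_sr_model` callable that returns one stochastic SR sample per call at a given latent temperature; the paper's actual fusion strategies are more elaborate, so treat this as a minimal pixel-wise-mean baseline rather than the authors' method.

```python
# Minimal sketch: average several stochastic SR samples drawn from a flow
# model to suppress sample-specific texture artifacts. `flow_sr_model` is a
# hypothetical callable, not an API from the paper.
import numpy as np

def ensemble_sr(flow_sr_model, lr_image, n_samples=8, tau=0.8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    samples = [
        flow_sr_model(lr_image, temperature=tau, seed=int(rng.integers(1 << 31)))
        for _ in range(n_samples)
    ]
    # Pixel-wise mean keeps the structure shared across samples and cancels
    # the random texture variations introduced by high-temperature sampling.
    return np.mean(np.stack(samples, axis=0), axis=0)
```

Weighted or frequency-selective fusion instead of a plain mean would let one trade distortion against perceptual sharpness, which is the controllable movement in the perception-distortion plane the abstract refers to.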

Publication (Metadata only): Using synthetic data for person tracking under adverse weather conditions (Elsevier, 2021)
Authors: Kerim, Abdulrahman; Çelikcan, Ufuk; Erdem, Erkut; Erdem, Aykut
Abstract: Robust visual tracking plays a vital role in many areas such as autonomous cars, surveillance and robotics. Recent trackers were shown to achieve adequate results under normal tracking scenarios with clear weather, standard camera setups and good lighting conditions. Yet, the performance of these trackers, whether correlation filter-based or learning-based, degrades under adverse weather conditions. The lack of videos with such weather conditions in the available visual object tracking datasets is the prime issue behind the low performance of learning-based tracking algorithms. In this work, we provide a new person tracking dataset of real-world sequences (PTAW172Real) captured under foggy, rainy and snowy weather conditions to assess the performance of current trackers. We also introduce a novel person tracking dataset of synthetic sequences (PTAW217Synth) procedurally generated by our NOVA framework, spanning the same weather conditions in varying severity, to mitigate the problem of data scarcity. Our experimental results demonstrate that the performance of state-of-the-art deep trackers under adverse weather conditions can be boosted when the available real training sequences are complemented with our synthetically generated dataset during training. © 2021 Elsevier B.V. All rights reserved.

Publication (Metadata only): NOVA: rendering virtual worlds with humans for computer vision tasks (Wiley, 2021)
Authors: Kerim, Abdulrahman; Aslan, Cem; Çelikcan, Ufuk; Erdem, Erkut; Erdem, Aykut
Abstract: Today, the cutting edge of computer vision research greatly depends on the availability of large datasets, which are critical for effectively training and testing new methods. Manually annotating visual data, however, is not only a labor-intensive process but also prone to errors. In this study, we present NOVA, a versatile framework to create realistic-looking 3D rendered worlds containing procedurally generated humans with rich pixel-level ground truth annotations. NOVA can simulate various environmental factors such as weather conditions or different times of day, and bring an exceptionally diverse set of humans to life, each having a distinct body shape, gender and age. To demonstrate NOVA's capabilities, we generate two synthetic datasets for person tracking. The first one includes 108 sequences, each with different levels of difficulty such as tracking in crowded scenes or at nighttime, and aims at testing the limits of current state-of-the-art trackers. A second dataset of 97 sequences with normal weather conditions is used to show how our synthetic sequences can be utilized to train and boost the performance of deep-learning-based trackers. Our results indicate that the synthetic data generated by NOVA represents a good proxy of the real world and can be exploited for computer vision tasks.

Publication (Metadata only): Modulating bottom-up and top-down visual processing via language-conditional filters (IEEE, 2022)
Authors: Erdem, Erkut; Kesen, İlker; Can, Ozan Arkan; Erdem, Aykut; Yüret, Deniz
Abstract: How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that using language to also condition the bottom-up processing from pixels to high-level features can benefit the overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing, in addition to top-down attention, leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves the segmentation of objects especially when the input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
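
As a rough illustration of conditioning bottom-up features on language, the block below uses FiLM-style per-channel scale and shift derived from a sentence embedding. The paper itself generates convolutional filters from language, so this PyTorch snippet is a simpler stand-in; the class and argument names are chosen here for illustration only.

```python
# FiLM-style language conditioning of a convolutional (bottom-up) block.
# Simplified stand-in for the language-generated filters in the paper.
import torch
import torch.nn as nn

class LanguageConditionedBlock(nn.Module):
    def __init__(self, in_ch, out_ch, lang_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.to_scale_shift = nn.Linear(lang_dim, 2 * out_ch)

    def forward(self, x, lang_emb):
        h = torch.relu(self.conv(x))                          # (B, out_ch, H, W)
        scale, shift = self.to_scale_shift(lang_emb).chunk(2, dim=-1)
        # Broadcast the language-derived modulation over spatial positions.
        return h * (1 + scale[..., None, None]) + shift[..., None, None]
```

Stacking such blocks in a U-Net encoder conditions the pixels-to-features path on the text, while the decoder can keep the usual top-down, language-guided attention.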

Publication (Metadata only): MustGAN: multi-stream generative adversarial networks for MR image synthesis (Elsevier, 2021)
Authors: Yurt, Mahmut; Dar, Salman U. H.; Oguz, Kader K.; Cukur, Tolga; Erdem, Erkut; Erdem, Aykut
Abstract: Multi-contrast MRI protocols increase the level of morphological information available for diagnosis. Yet, the number and quality of contrasts are limited in practice by various factors including scan time and patient motion. Synthesis of missing or corrupted contrasts from other high-quality ones can alleviate this limitation. When a single target contrast is of interest, common approaches for multi-contrast MRI involve either one-to-one or many-to-one synthesis methods depending on their input. One-to-one methods take as input a single source contrast, and they learn a latent representation sensitive to unique features of the source. Meanwhile, many-to-one methods receive multiple distinct sources, and they learn a shared latent representation more sensitive to common features across sources. For enhanced image synthesis, we propose a multi-stream approach that aggregates information across multiple source images via a mixture of multiple one-to-one streams and a joint many-to-one stream. The complementary feature maps generated in the one-to-one streams and the shared feature maps generated in the many-to-one stream are combined with a fusion block. The location of the fusion block is adaptively modified to maximize task-specific performance. Quantitative and radiological assessments on T1-, T2-, PD-weighted, and FLAIR images clearly demonstrate the superior performance of the proposed method compared to previous state-of-the-art one-to-one and many-to-one methods. © 2020 Elsevier B.V. All rights reserved.

Publication (Metadata only): Multi-contrast MRI synthesis with channel-exchanging-network (IEEE, 2022)
Authors: Dalmaz, Onat; Aytekin, İdil; Dar, Salman Ul Hassan; Erdem, Erkut; Çukur, Tolga; Erdem, Aykut
Abstract: Magnetic resonance imaging (MRI) is used in many diagnostic applications as it has high soft-tissue contrast and is a non-invasive medical imaging method. MR signal levels differ according to the parameters T1, T2 and PD, which vary with the chemical structure of the tissues. However, long scan times may limit the acquisition of multiple contrasts, and when multi-contrast images are acquired, they can be noisy. To overcome this limitation of MRI, multi-contrast synthesis can be utilized. In this paper, we propose a deep learning method based on the Channel-Exchanging-Network (CEN) for multi-contrast image synthesis. Demonstrations are provided on the IXI dataset. The proposed model based on CEN is compared against alternative methods based on CNNs and GANs. Our results show that the proposed model achieves superior performance to the competing methods.
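
The channel-exchanging idea can be illustrated with a small PyTorch function: channels that one contrast's stream barely uses (near-zero batch-norm scale) are replaced by the corresponding channels of the other stream. This is a toy paraphrase of the mechanism, not the published CEN implementation, and the threshold value is arbitrary.

```python
# Toy channel exchange between two contrast-specific streams: channels with
# near-zero batch-norm scaling in one stream are borrowed from the other.
# Illustrative only; not the published CEN code.
import torch

def exchange_channels(feat_a, feat_b, gamma_a, gamma_b, threshold=1e-2):
    """feat_*: (B, C, H, W) feature maps; gamma_*: (C,) BN scale factors."""
    out_a, out_b = feat_a.clone(), feat_b.clone()
    weak_a = gamma_a.abs() < threshold   # channels stream A barely uses
    weak_b = gamma_b.abs() < threshold
    out_a[:, weak_a] = feat_b[:, weak_a]
    out_b[:, weak_b] = feat_a[:, weak_b]
    return out_a, out_b
```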

Publication (Metadata only): CRAFT: a benchmark for causal reasoning about forces and interactions (Association for Computational Linguistics, 2022)
Authors: Ates, Tayfun; Atesoglu, M. Samil; Yigit, Cagatay; Kesen, İlker; Kobaş, Mert; Erdem, Aykut; Göksun, Tilbe; Yüret, Deniz
Abstract: Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step in this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include previously studied descriptive and counterfactual questions. Additionally, inspired by Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet cope with the challenges posed by our benchmark.

Publication (Metadata only): Synthetic18K: learning better representations for person re-ID and attribute recognition from 1.4 million synthetic images (Elsevier, 2021)
Authors: Uner, Onur Can; Aslan, Cem; Ercan, Burak; Ates, Tayfun; Celikcan, Ufuk; Erdem, Erkut; Erdem, Aykut
Abstract: Learning robust representations is critical for the success of person re-identification and attribute recognition systems. However, to achieve this, we must use a large dataset of diverse person images as well as annotations of identity labels and/or a set of different attributes. Apart from the obvious privacy concerns, the manual annotation process is both time-consuming and costly. In this paper, we instead propose to use synthetic person images to address these difficulties. Specifically, we first introduce Synthetic18K, a large-scale dataset of over 1 million computer-generated person images of 18K unique identities with relevant attributes. Moreover, we demonstrate that pretraining simple deep architectures on Synthetic18K for person re-identification and attribute recognition, and then fine-tuning on real data, leads to significant improvements in prediction performance, giving results better than or comparable to state-of-the-art models.
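
The pretrain-on-synthetic, fine-tune-on-real recipe described above amounts to standard transfer learning. The sketch below assumes placeholder data loaders, identity counts and hyperparameters, and is not tied to the Synthetic18K release.

```python
# Hedged sketch of synthetic pretraining followed by fine-tuning on real data
# for person re-ID; loaders, label counts and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 weights API assumed

def build_reid_model(num_ids):
    backbone = models.resnet50(weights=None)
    backbone.fc = nn.Linear(backbone.fc.in_features, num_ids)
    return backbone

def train(model, loader, epochs=10, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, ids in loader:
            opt.zero_grad()
            loss_fn(model(images), ids).backward()
            opt.step()

# 1) pretrain on synthetic identities, 2) replace the classifier head with the
#    real dataset's identity count, 3) fine-tune at a lower learning rate.
```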

Publication (Metadata only): Leveraging semantic saliency maps for query-specific video summarization (Springer, 2022)
Authors: Cizmeciler, Kemal; Erdem, Erkut; Erdem, Aykut
Abstract: The immense number of videos being uploaded to video sharing platforms makes it impossible for a person to watch them all and understand what happens in them. Hence, machine learning techniques are now deployed to index videos by recognizing key objects, actions and scenes or places. Summarization is another alternative, as it offers to extract only the important parts while covering the gist of the video content. Ideally, the user may prefer to analyze a certain action or scene by searching a query term within the video. Current summarization methods generally do not take queries into account or require exhaustive data labeling. In this work, we present a weakly supervised query-focused video summarization method. Our proposed approach makes use of semantic attributes as an indicator of query relevance and semantic attention maps to locate related regions in the frames, and utilizes both within a submodular maximization framework (a minimal sketch of this greedy selection step appears after the final entry below). We conducted experiments on the recently introduced RAD dataset and obtained highly competitive results. Moreover, to better evaluate the performance of our approach on longer videos, we collected a new dataset, which consists of 10 YouTube videos annotated with multiple shot-level attributes. Our dataset enables a much more diverse set of queries that can be used to summarize a video from different perspectives with more degrees of freedom.

Publication (Metadata only): Multi3Generation: multitask, multilingual, multimodal language generation (European Association for Machine Translation, 2022)
Authors: Barreiro, Anabela; de Souza, José G.C.; Gatt, Albert; Bhatt, Mehul; Lloret, Elena; Gkatzia, Dimitra; Moniz, Helena; Russo, Irene; Kepler, Fabio; Calixto, Iacer; Paprzycki, Marcin; Portet, François; Augenstein, Isabelle; Alhasani, Mirela; Erdem, Aykut
Abstract: This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action - Multi3Generation (CA18231), an interdisciplinary network of research groups working on different aspects of language generation. This "meta-paper" will serve as a reference for citations of the Action in future publications. It presents the objectives, challenges, and links to the achieved outcomes.
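
For the query-specific summarization entry above, the selection step can be prototyped with plain greedy maximization of a monotone submodular score; `score_fn` here is a placeholder standing in for the paper's combination of query relevance and coverage.

```python
# Greedy maximization of a (monotone, submodular) summary score under a shot
# budget. `score_fn` is an illustrative placeholder, not the paper's objective.
def greedy_summary(shots, score_fn, budget):
    summary = []
    while len(summary) < budget:
        gains = {s: score_fn(summary + [s]) - score_fn(summary)
                 for s in shots if s not in summary}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            break
        summary.append(best)
    return summary
```

For monotone submodular objectives with a cardinality budget, this greedy rule is the usual choice because it carries the classical (1 - 1/e) approximation guarantee.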