Researcher: Erten, Didem Unat
Name Variants
Erten, Didem Unat
34 results
Search Results
Now showing 1 - 10 of 34
Publication Metadata only
Object placement for high bandwidth memory augmented with high capacity memory (IEEE, 2017)
N/A; N/A; Department of Computer Engineering; Laghari, Mohammad; Erten, Didem Unat; Master Student; Faculty Member; Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering; N/A; 219274
High bandwidth memory (HBM) is an emerging technology that aims to improve the performance of bandwidth-limited applications. Even though it provides high bandwidth, it must be augmented with DRAM to meet the memory capacity requirements of applications. Due to this limitation, objects in an application should be optimally placed on the heterogeneous memory subsystems. In this study, we propose an object placement algorithm that, when the capacity of fast memory is insufficient to hold all the objects, places program objects in fast or slow memory so as to increase overall application performance. Our algorithm uses the reference counts and type of references (read or write) to make an initial placement of data. In addition, we perform various memory bandwidth benchmarks on the Intel Knights Landing (KNL) architecture to be used in our placement algorithm. Not surprisingly, high bandwidth memory sustains higher read bandwidth than write bandwidth; however, placing write-intensive data on HBM results in better overall performance because write-intensive data is penalized by the DRAM speed more severely than read-intensive data. Moreover, our benchmarks demonstrate that, in some cases, a basic block that makes references to both types of memory performs worse than one that makes references to only one type of memory. We test our proposed placement algorithm with 6 applications under various system configurations.
By allocating objects according to our placement scheme, we are able to achieve a speedup of up to 2x.

Publication Metadata only
ComScribe: identifying intra-node GPU communication (Springer Science and Business Media Deutschland GmbH, 2021)
N/A; N/A; N/A; Department of Computer Engineering; Akhtar, Palwisha; Tezcan, Erhan; Qararyah, Fareed Mohammad; Erten, Didem Unat; Master Student; Master Student; PhD Student; Faculty Member; Department of Computer Engineering; Graduate School of Sciences and Engineering; Graduate School of Sciences and Engineering; Graduate School of Sciences and Engineering; College of Engineering; N/A; N/A; N/A; 219274
GPU communication plays a critical role in the performance and scalability of multi-GPU accelerated applications. With the ever-increasing methods and types of communication, it is often hard for the programmer to know the exact amount and type of communication taking place in an application. Though there are prior works that detect communication in distributed systems for MPI and multi-threaded applications on shared memory systems, to our knowledge, none of these works identify intra-node GPU communication. We propose a tool, ComScribe, that identifies and categorizes types of communication among all GPU-GPU and CPU-GPU pairs in a node. Built on top of NVIDIA's profiler nvprof, ComScribe visualizes data movement as a communication matrix or bar chart for explicit communication primitives, Unified Memory operations, and Zero-copy Memory transfers. To validate our tool on 16 GPUs, we present communication patterns of 8 micro- and 3 macro-benchmarks from the NVIDIA, Comm|Scope, and MGBench benchmark suites. To demonstrate the tool's capabilities in real-life applications, we also present insightful communication matrices of two deep neural network models. All in all, ComScribe can guide the programmer in identifying which groups of GPUs communicate, in what volume, and by using which primitives.
This offers avenues to detect performance bottlenecks and, more importantly, communication bugs in an application. © 2021, Springer Nature Switzerland AG.

Publication Metadata only
Nonintrusive AMR asynchrony for communication optimization (Springer International Publishing AG, 2017)
Nguyen, Tan; Zhang, Weiqun; Almgren, Ann; Shalf, John; N/A; Department of Computer Engineering; Farooqi, Muhammad Nufail; Erten, Didem Unat; PhD Student; Faculty Member; Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering; N/A; 219274
Adaptive Mesh Refinement (AMR) is a well-known method for efficiently solving partial differential equations. A straightforward AMR algorithm typically exhibits many synchronization points even during a single time step, where costly communication often degrades the performance. This problem will be even more pronounced on future supercomputers containing billion-way parallelism, which will raise the communication cost further. Re-designing AMR algorithms to avoid synchronization is not a viable solution due to the large code size and complex control structures. We present a nonintrusive asynchronous approach to hiding the effects of communication in an AMR application. Specifically, our approach reasons about data dependencies automatically using domain knowledge about AMR applications, allowing asynchrony to be discovered with only a modest amount of code modification.
Using this approach, we optimize the synchronous AMR algorithm in the BoxLib software framework without severely affecting the productivity of the application programmer. We observe around 27-31% performance improvement for an advection solver on the Hazel Hen supercomputer using 12288 cores.

Publication Metadata only
Structured adaptive mesh refinement adaptations to retain performance portability with increasing heterogeneity (IEEE Computer Society, 2021)
Dubey, Anshu; Berzins, Martin; Burstedde, Carsten; Norman, Michael L.; Wahib, Mohammed; Department of Computer Engineering; Erten, Didem Unat; Faculty Member; Department of Computer Engineering; College of Engineering; 219274
Adaptive mesh refinement (AMR) is an important method that enables many mesh-based applications to run at effectively higher resolution within limited computing resources by allowing high resolution only where really needed. This advantage comes at a cost, however: greater complexity in the mesh management machinery and challenges with load distribution. With the current trend of increasing heterogeneity in hardware architecture, AMR presents an orthogonal axis of complexity. The usual techniques that are necessary to obtain reasonable performance, such as asynchronous communication and hierarchy management for parallelism and memory, are very challenging to reason about with AMR. Different groups working with AMR are bringing different approaches to this challenge.
Here, we examine the design choices of several AMR codes and also the degree to which demands placed on them by their users influence these choices.

Publication Metadata only
Program analysis for process migration (Assoc Computing Machinery, 2019)
Department of Computer Engineering; Department of Computer Engineering; Department of Computer Engineering; Yılmaz, Buse; Turimbetov, İlyas; Erten, Didem Unat; Researcher; PhD Student; Faculty Member; Department of Computer Engineering; College of Engineering; College of Engineering; College of Engineering; N/A; N/A; 219274
Today's computer systems have become increasingly heterogeneous. Data centers integrate accelerators and CPUs with heterogeneous cores and various ISAs, which exhibit different performance and power characteristics. Mobile phones, following a similar trend, switch between fast and energy-efficient cores. Process migration is an important technique to leverage such specialization and heterogeneity. In this work, we target OS-capable heterogeneous platforms that support process migration and address how to obtain better performance through program analysis: we address the challenge of defining migration points at which the program state is the same across machines, and whether these points match phase changes, that is, changes in the program behavior. Our tool-chain employs both static and dynamic analysis to compensate for the disadvantages of each technique and to reduce the analysis overhead.
Six out of ten benchmarks from different benchmark suites benefit from migration, and the migration cost is compensated by the performance gained from migrating.

Publication Metadata only
Perilla: metadata-based optimizations of an asynchronous runtime for adaptive mesh refinement (IEEE Computer Society, 2016)
Nguyen, Tan; Zhang, Weiqun; Almgren, Ann; Shalf, John; Department of Computer Engineering; N/A; Erten, Didem Unat; Farooqi, Muhammad Nufail; Faculty Member; PhD Student; Department of Computer Engineering; College of Engineering; Graduate School of Sciences and Engineering; 219274; N/A
Hardware architecture is increasingly complex, urging the development of asynchronous runtime systems with advanced resource and locality management support. However, this support may come at the cost of complicating the user interface, while programming remains one of the major constraints to wide adoption of asynchronous runtimes in practice. In this paper, we propose a solution that leverages application metadata to enable challenging optimizations as well as to facilitate the task of transforming legacy code to an asynchronous representation. We develop Perilla, a task graph-based runtime system that requires only modest programming effort. Perilla utilizes metadata of an AMR software framework to enable various optimizations at the communication layer without complicating its API. Experimental results with different applications on up to 24K processor cores show that Perilla can realize up to 1.44x speedup over the synchronous code variant.
The metadata-enabled optimizations account for 25% to 100% of the performance improvement.

Publication Metadata only
Phase-based data placement scheme for heterogeneous memory systems (IEEE, 2018)
N/A; N/A; N/A; Department of Computer Engineering; Laghari, Mohammad; Ahmad, Najeeb; Erten, Didem Unat; PhD Student; PhD Student; Faculty Member; Department of Computer Engineering; Graduate School of Sciences and Engineering; Graduate School of Sciences and Engineering; College of Engineering; N/A; N/A; 219274
Heterogeneous memory systems are equipped with two or more types of memories, which work in tandem to complement the capabilities of each other. The multiple memories can vary in latency, bandwidth and capacity characteristics across systems, and they come in various configurations that can be managed by the programmer. This introduces added complexity for the programmer. In this paper, we present a dynamic phase-based data placement scheme to assist the programmer in making decisions about program object allocations. We devise a cost model to assess the benefit of having an object in one type of memory over the other and apply the cost model at every application phase to capture the dynamic behaviour of an application. Our cost model takes into account the reference counts of objects and the incurred transfer overhead when making a suggestion. In addition, objects can be transferred across memories asynchronously between phases to mask some of the transfer overhead. We test our cost model with a diverse set of applications from the NAS Parallel and Rodinia benchmarks and perform experiments on Intel KNL, which is equipped with a high bandwidth memory (MCDRAM) and a high capacity memory (DDR).
Our dynamic phase-based data placement performs better than initial placement and achieves comparable or better performance than the cache mode of MCDRAM.

Publication Metadata only
A split execution model for SpTRSV (IEEE Computer Society, 2021)
Yilmaz, Buse; N/A; Department of Computer Engineering; Ahmad, Najeeb; Erten, Didem Unat; PhD Student; Faculty Member; Department of Computer Engineering; Graduate School of Sciences and Engineering; College of Engineering; N/A; 219274
Sparse Triangular Solve (SpTRSV) is an important and extensively used kernel in scientific computing. Parallelism within SpTRSV depends upon the matrix sparsity pattern and, in many cases, is non-uniform from one computational step to the next. In cases where the SpTRSV computational steps have contrasting parallelism characteristics (some steps are more parallel, others more sequential in nature), the performance of an SpTRSV algorithm may be limited by this contrast. In this work, we propose a split-execution model for SpTRSV to automatically divide the SpTRSV computation into two sub-SpTRSV systems and an SpMV, such that one of the sub-SpTRSVs has more parallelism than the other. Each sub-SpTRSV is then computed using a different SpTRSV algorithm, possibly executed on a different platform (CPU or GPU). By analyzing the SpTRSV Directed Acyclic Graph (DAG) and matrix sparsity features, we use a heuristics-based approach to (i) automatically determine the suitability of an SpTRSV for split-execution, (ii) find the appropriate split-point, and (iii) execute SpTRSV in a split fashion using two SpTRSV algorithms while managing any required inter-platform communication.
Experimental evaluation of the execution model on two CPU-GPU machines with a matrix dataset of 327 matrices from the SuiteSparse Matrix Collection shows that our approach correctly selects the fastest SpTRSV method (split or unsplit) for 88 percent of matrices on the Intel Xeon Gold (6148) + NVIDIA Tesla V100 and 83 percent on the Intel Core I7 + NVIDIA G1080 Ti platform, achieving speedups of up to 10x and 6.36x, respectively.

Publication Metadata only
Access pattern-aware data placement for hybrid DRAM/NVM (TUBITAK Scientific and Technical Research Council Turkey, 2017)
Department of Computer Engineering; Erten, Didem Unat; Faculty Member; Department of Computer Engineering; College of Engineering; 219274
In recent years, increased interest in data-centric applications has led to an increasing demand for large capacity memory systems. Nonvolatile memory (NVM) technologies enable new opportunities in terms of process-scaling and energy consumption, and have become an attractive memory technology that serves as a secondary memory at low cost. However, NVM has certain disadvantages for write references, due to its high dynamic energy consumption for writes and low bandwidth compared to DRAM writes. In this paper, we propose an access-aware placement of objects in the application code for two types of memories. Given the desired power savings and acceptable performance loss, our placement algorithm suggests candidate variables for NVM.
We present an evaluation of the proposed technique on two applications and study the energy and performance consequences of different placements.

Publication Metadata only
Phase asynchronous AMR execution for productive and performant astrophysical flows (Institute of Electrical and Electronics Engineers Inc., 2019)
Nguyen, Tan; Zhang, Weiqun; Almgren, Ann S.; Shalf, John; Department of Computer Engineering; N/A; Erten, Didem Unat; Farooqi, Muhammad Nufail; Faculty Member; PhD Student; Department of Computer Engineering; College of Engineering; Graduate School of Sciences and Engineering; 219274; N/A
Adaptive Mesh Refinement (AMR) is an approach to solving PDEs that reduces the computational and memory requirements at the expense of increased communication. Although adopting asynchronous execution can overcome communication issues, manually restructuring an AMR application to realize asynchrony is extremely complicated and hinders readability and long-term maintainability. To balance performance against productivity, we design a user-friendly API and adopt a phase asynchronous execution model where all subgrids at an AMR level can be computed asynchronously. We apply phase asynchrony to transform a real-world AMR application, CASTRO, which solves multicomponent compressible hydrodynamic equations for astrophysical flows. We evaluate the performance and programming effort required to use our carefully designed API and execution model for transitioning large legacy codes from synchronous to asynchronous execution on up to 278,528 Intel KNL cores. CASTRO is about 100K lines of code, but changes to less than 0.2% of the code are required to achieve a significant performance improvement.