Researcher: Ahmad, Najeeb
Name Variants: Ahmad, Najeeb
Search Results: 7 publications
Publication (Metadata only): Phase-based data placement scheme for heterogeneous memory systems (IEEE, 2018)
Authors: Laghari, Mohammad; Ahmad, Najeeb; Erten, Didem Unat (Department of Computer Engineering)
Heterogeneous memory systems are equipped with two or more types of memories, which work in tandem to complement each other's capabilities. The multiple memories can vary in latency, bandwidth, and capacity across systems, and they come in various configurations that can be managed by the programmer, which adds programming complexity. In this paper, we present a dynamic phase-based data placement scheme to assist the programmer in making decisions about program object allocations. We devise a cost model to assess the benefit of having an object in one type of memory over the other and apply the cost model at every application phase to capture the dynamic behaviour of an application. Our cost model takes into account the reference counts of objects and the incurred transfer overhead when making a suggestion. In addition, objects can be transferred across memories asynchronously between phases to mask some of the transfer overhead. We test our cost model with a diverse set of applications from the NAS Parallel and Rodinia benchmarks and perform experiments on Intel KNL, which is equipped with a high-bandwidth memory (MCDRAM) and a high-capacity memory (DDR). Our dynamic phase-based data placement performs better than initial placement and achieves comparable or better performance than the cache mode of MCDRAM.
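To make the flavor of such a cost model concrete, here is a minimal C++ sketch of a per-phase placement decision, assuming a simple traffic/bandwidth model. The `Object` struct, the bandwidth constants, and the `prefer_fast` helper are illustrative inventions, not the paper's actual model or API.

```cpp
// Hypothetical sketch of a phase-based placement decision: compare the
// estimated access cost of an object in fast vs. slow memory for the next
// phase, charging a one-off migration cost when the object must move.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Object {
    std::size_t bytes;   // allocation size
    double ref_count;    // estimated references in the upcoming phase
    bool in_fast;        // currently resident in the fast memory?
};

// Illustrative machine parameters in GB/s; real values would be measured.
constexpr double kFastBW = 400.0;  // MCDRAM-like bandwidth
constexpr double kSlowBW = 90.0;   // DDR-like bandwidth
constexpr double kCopyBW = 90.0;   // migration bandwidth between memories

// Time to serve ref_count references of the object from a memory with the
// given bandwidth: a plain traffic/bandwidth estimate.
double access_cost(const Object& o, double bw) {
    return o.ref_count * static_cast<double>(o.bytes) / (bw * 1e9);
}

// One-off cost of migrating the object before the phase starts.
double transfer_cost(const Object& o) {
    return static_cast<double>(o.bytes) / (kCopyBW * 1e9);
}

// The object should live in fast memory for the next phase only if the
// bandwidth benefit outweighs any migration overhead.
bool prefer_fast(const Object& o) {
    double fast = access_cost(o, kFastBW) + (o.in_fast ? 0.0 : transfer_cost(o));
    double slow = access_cost(o, kSlowBW) + (o.in_fast ? transfer_cost(o) : 0.0);
    return fast < slow;
}

int main() {
    std::vector<Object> objs = {
        {1u << 30, 50.0, false},  // hot 1 GiB object: migration pays off
        {1u << 30, 1.0,  false},  // cold object: migration does not pay off
    };
    for (const Object& o : objs)
        std::printf("prefer fast memory: %s\n", prefer_fast(o) ? "yes" : "no");
}
```

Applying the decision at every phase boundary, rather than once at allocation time, is what captures the dynamic behaviour the abstract describes.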
Publication (Metadata only): A split execution model for SpTRSV (IEEE Computer Society, 2021)
Authors: Yilmaz, Buse; Ahmad, Najeeb; Erten, Didem Unat (Department of Computer Engineering)
Sparse Triangular Solve (SpTRSV) is an important and extensively used kernel in scientific computing. Parallelism within SpTRSV depends upon the matrix sparsity pattern and, in many cases, is non-uniform from one computational step to the next. Where the computational steps have contrasting parallelism characteristics (some steps more parallel, others more sequential in nature), a single SpTRSV algorithm may be limited by those characteristics. In this work, we propose a split-execution model for SpTRSV that automatically divides the SpTRSV computation into two sub-SpTRSV systems and an SpMV, such that one of the sub-SpTRSVs has more parallelism than the other. Each sub-SpTRSV is then computed using a different SpTRSV algorithm, possibly executed on a different platform (CPU or GPU). By analyzing the SpTRSV Directed Acyclic Graph (DAG) and matrix sparsity features, we use a heuristics-based approach to (i) automatically determine the suitability of an SpTRSV for split execution, (ii) find the appropriate split point, and (iii) execute SpTRSV in a split fashion using two SpTRSV algorithms while managing any required inter-platform communication. Experimental evaluation of the execution model on two CPU-GPU machines with a dataset of 327 matrices from the SuiteSparse Matrix Collection shows that our approach correctly selects the fastest SpTRSV method (split or unsplit) for 88 percent of matrices on an Intel Xeon Gold 6148 + NVIDIA Tesla V100 platform and 83 percent on an Intel Core i7 + NVIDIA GTX 1080 Ti platform, achieving speedups of up to 10x and 6.36x, respectively.
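The algebra behind the split is straightforward: partitioning a lower-triangular system at row s gives blocks [L11 0; A21 L22], solved as L11*x1 = b1, then b2 -= A21*x1 (the SpMV), then L22*x2 = b2. The following is a minimal serial C++ sketch of that decomposition; in the paper each sub-solve would run a different, possibly GPU-based, algorithm, and all names here are illustrative.

```cpp
// Split-execution sketch for a lower-triangular CSR solve: two diagonal-block
// solves coupled by one SpMV update. Each row stores its off-diagonals first
// and its diagonal entry last.
#include <cstdio>
#include <vector>

struct Csr {
    int n;
    std::vector<int> rowptr, col;
    std::vector<double> val;
};

// Forward substitution on the diagonal block covering rows/cols [lo, hi);
// entries with column < lo belong to the coupling block and are skipped.
void sptrsv_rows(const Csr& A, std::vector<double>& x,
                 const std::vector<double>& b, int lo, int hi) {
    for (int i = lo; i < hi; ++i) {
        double s = b[i];
        int end = A.rowptr[i + 1] - 1;             // last entry = diagonal
        for (int k = A.rowptr[i]; k < end; ++k)
            if (A.col[k] >= lo) s -= A.val[k] * x[A.col[k]];
        x[i] = s / A.val[end];
    }
}

// SpMV coupling term: b[i] -= sum over A(i,j)*x[j] for j < split, i >= split.
void update_rhs(const Csr& A, std::vector<double>& b,
                const std::vector<double>& x, int split) {
    for (int i = split; i < A.n; ++i)
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1] - 1; ++k)
            if (A.col[k] < split) b[i] -= A.val[k] * x[A.col[k]];
}

int main() {
    // 3x3 lower-triangular example: rows {2}, {1,2}, {1,2} (values).
    Csr A{3, {0, 1, 3, 5}, {0, 0, 1, 1, 2}, {2.0, 1.0, 2.0, 1.0, 2.0}};
    std::vector<double> b{2.0, 5.0, 4.0}, x(3, 0.0);
    const int split = 1;               // split point chosen by the heuristics
    sptrsv_rows(A, x, b, 0, split);    // sub-SpTRSV 1 (e.g., on the GPU)
    update_rhs(A, b, x, split);        // SpMV coupling term
    sptrsv_rows(A, x, b, split, A.n);  // sub-SpTRSV 2 (e.g., on the CPU)
    for (double v : x) std::printf("%g\n", v);  // prints 1, 2, 1
}
```

The interesting engineering in the paper lies in choosing `split` from the DAG and sparsity features and in managing the CPU-GPU transfer of x1 between the two sub-solves; the sketch stops at the numerical skeleton.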
Publication (Metadata only): Adaptive level binning: a new algorithm for solving sparse triangular systems (Information Processing Society of Japan (IPSJ), 2020)
Authors: Erten, Didem Unat; Yılmaz, Buse; Ahmad, Najeeb; Sipahioğlu, Buğra (Department of Computer Engineering)
Sparse triangular solve (SpTRSV) is an important scientific kernel used in several applications, such as preconditioners for Krylov methods. Parallelizing SpTRSV on multi-core systems is challenging since it exhibits limited parallelism due to computational dependencies and introduces high parallelization overhead due to the fine-grained and unbalanced nature of the workloads. We propose a novel method, named Adaptive Level Binning (ALB), that addresses these challenges by eliminating redundant synchronization points and adapting the work granularity with an efficient load-balancing strategy. Similar to the commonly used level-set methods for solving SpTRSV, ALB constructs level sets of rows, where each level can be computed in parallel. Differently, ALB bins rows to levels adaptively and reduces redundant dependencies between rows. On an Intel Xeon Gold 6148 processor and an NVIDIA Tesla V100 GPU, ALB obtains a 1.83x speedup on average and up to a 5.28x speedup over Intel MKL and, over NVIDIA cuSPARSE, an average speedup of 2.80x and a maximum speedup of 39.40x for 29 matrices selected from the SuiteSparse Matrix Collection.

Publication (Metadata only): An analysis for the performance of reservoir simulations on a multicore CPU (Springer Nature, 2020)
Authors: Ahmad, Najeeb; Bakar, Recep (Graduate School of Sciences and Engineering)
Reservoir simulations are widely used in engineering applications in many industries, and their computational performance matters: the faster the results are obtained, the lower the cost in time and money. This study analyzes the performance of a reservoir simulation on Intel KNL, a multicore CPU, using different data formats, problem sizes, and vectorization modes. A dual-porosity model with 3D single-phase flow, implemented using the PETSc library, was used to carry out the simulations. Four fundamental cases in terms of problem size were simulated on KNL with two data formats, CSR and SELL, and four vectorization modes: AVX, AVX2, AVX-512, and no vectorization. In the simulated cases, the best performance was achieved with the number of processes equal to the number of KNL cores for all configurations. SELL with AVX-512 yielded the best performance for problem sizes occupying less than 50% of High Bandwidth Memory (HBM), followed by AVX2 and AVX. The performance of both SELL and CSR deteriorated as the problem size approached the capacity of HBM; CSR with AVX-512 was the best among the CSR configurations and marginally better than SELL with AVX. As HBM usage grew further, the best performance was obtained using CSR with AVX. In general, the performance of both CSR and SELL under any vectorization mode declined as the problem size increased, but the decline was steeper for SELL than for CSR; among the CSR configurations, CSR with AVX degraded the least. Finally, to the best of our knowledge, this study is the first to investigate the performance of SELL and CSR with different vectorization modes for numerical simulations with problem sizes approaching and exceeding the size of HBM.
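The CSR/SELL distinction the study measures comes down to data layout. The following C++ sketch contrasts a sparse matrix-vector product (SpMV) in both formats: SELL (sliced ELLPACK, as in SELL-C-sigma) pads rows within a slice to equal length and stores them column-major so the inner loop can vectorize across rows, while CSR stores each row contiguously with irregular lengths. The slice height and the toy matrix are made up for illustration; neither is taken from the paper.

```cpp
// CSR vs. SELL storage, demonstrated on the same 4x4 matrix.
#include <cstdio>
#include <vector>

constexpr int C = 2;  // slice height; real codes match the SIMD width

// y = A*x with CSR: one dot product per row, irregular inner-loop lengths.
void spmv_csr(int n, const std::vector<int>& rowptr, const std::vector<int>& col,
              const std::vector<double>& val, const double* x, double* y) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) s += val[k] * x[col[k]];
        y[i] = s;
    }
}

// y += A*x with SELL: all C rows of a slice advance in lockstep over the
// padded width, so the loop over r is vectorization-friendly.
void spmv_sell(int n, const std::vector<int>& slice_ptr,  // offsets per slice
               const std::vector<int>& col,               // padded, col-major
               const std::vector<double>& val, const double* x, double* y) {
    int nslices = (n + C - 1) / C;
    for (int s = 0; s < nslices; ++s) {
        int width = (slice_ptr[s + 1] - slice_ptr[s]) / C;
        for (int j = 0; j < width; ++j)
            for (int r = 0; r < C; ++r) {
                int i = s * C + r;
                int k = slice_ptr[s] + j * C + r;
                if (i < n) y[i] += val[k] * x[col[k]];  // padding stores 0.0
            }
    }
}

int main() {
    const int n = 4;
    double x[n] = {1, 1, 1, 1}, y1[n] = {0}, y2[n] = {0};
    // Same matrix in both formats: row values {1}, {2,3}, {4}, {5,6}.
    spmv_csr(n, {0, 1, 3, 4, 6}, {0, 0, 1, 2, 1, 3}, {1, 2, 3, 4, 5, 6}, x, y1);
    spmv_sell(n, {0, 4, 8}, {0, 0, 0, 1, 2, 1, 0, 3},
              {1, 2, 0, 3, 4, 5, 0, 6}, x, y2);
    for (int i = 0; i < n; ++i) std::printf("%g %g\n", y1[i], y2[i]);
}
```

The padding is also why SELL's advantage can erode at large problem sizes: padded zeros consume memory traffic, which matters more as the working set approaches the HBM capacity.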
Publication (Metadata only): A prediction framework for fast sparse triangular solves (Springer International Publishing AG, 2020)
Authors: Ahmad, Najeeb; Yılmaz, Buse; Erten, Didem Unat (Department of Computer Engineering)
Sparse triangular solve (SpTRSV) is an important linear algebra kernel with extensive uses in numerical and scientific computing. Parallel implementation of SpTRSV is a challenging task due to the sequential nature of the steps involved, which makes it, in many cases, one of the most time-consuming operations in an application. Many approaches for efficient SpTRSV on CPU and GPU systems have been proposed in the literature, but no single implementation or platform (CPU or GPU) gives the fastest solution for all input sparse matrices. In this work, we propose a machine-learning-based framework to predict, from the structural features of a given sparse matrix, the SpTRSV implementation with the fastest execution time. The framework is tested with six SpTRSV implementations on a state-of-the-art CPU-GPU machine (Intel Xeon Gold CPU, NVIDIA V100 GPU). Experimental results with 998 matrices taken from the SuiteSparse Matrix Collection show a classifier prediction accuracy of 87% for the fastest SpTRSV algorithm for a given input matrix. The predicted SpTRSV implementations achieve average speedups (harmonic mean) in the range of 1.4-2.7x against the six SpTRSV implementations used in the evaluation.

Publication (Open Access): Phase-based data placement scheme for heterogeneous memory systems (Institute of Electrical and Electronics Engineers (IEEE), 2018)
Authors: Erten, Didem Unat; Laghari, Mohammad; Ahmad, Najeeb
Open-access record of the first publication listed above; the abstract is identical.

Publication (Open Access): Load balancing for parallel multiphase flow simulation (Hindawi, 2018)
Authors: Ahmad, Najeeb; Farooqi, Muhammad Nufail; Erten, Didem Unat (Department of Computer Engineering)
This paper presents a scalable dynamic load-balancing scheme for a multiphase flow simulation based on a parallel front-tracking method. In this simulation, which employs both Lagrangian and Eulerian grids, processes operating on the Lagrangian grid are susceptible to load imbalance because the Lagrangian grid points (bubbles) move and load is distributed according to the spatial location of bubbles. To balance these processes, we distribute load in view of both the current processor load distribution and bubble spatial locality, and we remap interprocess communication. The result is a uniform processor load distribution and a predictable, less expensive communication scheme. Scalability studies on the Hazel Hen supercomputer demonstrate excellent scaling, with exponential savings in execution time as the problem size becomes increasingly large. While moderate speedup is observed for strong scaling, a speedup of up to 30% over the non-load-balanced version is achieved when simulating 13824 bubbles on 4096 cores in weak scaling studies.
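To illustrate the two forces the scheme balances, here is a toy C++ sketch of bubble reassignment: bubbles migrate from overloaded to underloaded ranks, preferring ranks that are spatially close so that remapped communication stays cheap. The greedy heuristic, the `Bubble`/`Rank` structs, and the distance-based ordering are all illustrative assumptions, not the paper's actual scheme.

```cpp
// Toy load balancer: move bubbles off overloaded ranks onto the nearest
// underloaded rank, visiting the most spatially misplaced bubbles first.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Bubble { double x, y, z; int owner; };
struct Rank   { double cx, cy, cz; int load = 0; };  // subdomain center

double dist(const Bubble& b, const Rank& r) {
    return std::hypot(b.x - r.cx, std::hypot(b.y - r.cy, b.z - r.cz));
}

void rebalance(std::vector<Bubble>& bubbles, std::vector<Rank>& ranks) {
    for (Rank& r : ranks) r.load = 0;
    for (const Bubble& b : bubbles) ranks[b.owner].load++;
    const int target =
        (int)((bubbles.size() + ranks.size() - 1) / ranks.size());  // ceil avg
    // Visit bubbles farthest from their current owner first, so the
    // spatially misplaced ones are the ones that migrate.
    std::vector<int> order(bubbles.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return dist(bubbles[a], ranks[bubbles[a].owner]) >
               dist(bubbles[b], ranks[bubbles[b].owner]);
    });
    for (int idx : order) {
        Bubble& b = bubbles[idx];
        if (ranks[b.owner].load <= target) continue;   // owner not overloaded
        int best = -1;
        double best_d = 0.0;
        for (int r = 0; r < (int)ranks.size(); ++r) {  // nearest underloaded
            if (ranks[r].load >= target) continue;
            double d = dist(b, ranks[r]);
            if (best < 0 || d < best_d) { best = r; best_d = d; }
        }
        if (best >= 0) {                               // remap ownership
            ranks[b.owner].load--;
            ranks[best].load++;
            b.owner = best;
        }
    }
}

int main() {
    std::vector<Rank> ranks = {{0, 0, 0}, {1, 0, 0}};
    std::vector<Bubble> bubbles = {
        {0.1, 0, 0, 0}, {0.2, 0, 0, 0}, {0.9, 0, 0, 0}, {0.8, 0, 0, 0}};
    rebalance(bubbles, ranks);  // rank 0 starts with all four bubbles
    for (const Bubble& b : bubbles) std::printf("owner=%d\n", b.owner);
}
```

In the paper's setting, reassignment would additionally drive the remapping of interprocess communication between the Lagrangian and Eulerian grids; the sketch stops at ownership.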