Researcher:
Qararyah, Fareed Mohammad

Job Title

PhD Student

First Name

Fareed Mohammad

Last Name

Qararyah

Name Variants

Qararyah, Fareed Mohammad

Search Results

  • Publication
    ComScribe: identifying intra-node GPU communication
    (Springer Science and Business Media Deutschland GmbH, 2021) Akhtar, Palwisha (Master Student, Graduate School of Sciences and Engineering); Tezcan, Erhan (Master Student, Graduate School of Sciences and Engineering); Qararyah, Fareed Mohammad (PhD Student, Graduate School of Sciences and Engineering); Erten, Didem Unat (Faculty Member, Department of Computer Engineering, College of Engineering; 219274)
    GPU communication plays a critical role in the performance and scalability of multi-GPU accelerated applications. With the ever-increasing methods and types of communication, it is often hard for the programmer to know the exact amount and type of communication taking place in an application. Though there are prior works that detect communication in distributed systems for MPI and multi-threaded applications on shared memory systems, to our knowledge, none of these works identify intra-node GPU communication. We propose a tool, ComScribe, that identifies and categorizes types of communication among all GPU-GPU and CPU-GPU pairs in a node. Built on top of NVIDIA’s profiler nvprof, ComScribe visualizes data movement as a communication matrix or bar chart for explicit communication primitives, Unified Memory operations, and Zero-copy Memory transfers. To validate our tool on 16 GPUs, we present communication patterns of 8 micro- and 3 macro-benchmarks from the NVIDIA, Comm|Scope, and MGBench benchmark suites. To demonstrate the tool’s capabilities in real-life applications, we also present insightful communication matrices of two deep neural network models. All in all, ComScribe can guide the programmer in identifying which groups of GPUs communicate, in what volume, and using which primitives. This offers avenues to detect performance bottlenecks and, more importantly, communication bugs in an application. © 2021, Springer Nature Switzerland AG.
  • Publication
    A computational-graph partitioning method for training memory-constrained DNNs
    (Elsevier, 2021) Wahib, Mohamed; Dikbayir, Doga; Belviranli, Mehmet Esat; Qararyah, Fareed Mohammad (PhD Student, Graduate School of Sciences and Engineering); Erten, Didem Unat (Faculty Member, Department of Computer Engineering, College of Engineering; 219274)
    Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of a DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN. It requires no modification at either the model level or the systems-level implementation of its operation kernels. ParDNN partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving superlinear scaling for both the batch size and training throughput. ParDNN either outperforms or qualitatively improves upon the related work.