Multi-GPU communication schemes for iterative solvers: when CPUs are not in charge

dc.contributor.authorid	0000-0002-2351-0770
dc.contributor.authorid	0000-0002-9603-2466
dc.contributor.authorid	0000-0001-7235-6418
dc.contributor.authorid	N/A
dc.contributor.coauthor	Wahib, Mohamed
dc.contributor.department	Department of Computer Engineering
dc.contributor.department	N/A
dc.contributor.department	N/A
dc.contributor.department	N/A
dc.contributor.kuauthor	Erten, Didem Unat
dc.contributor.kuauthor	Sağbili, Doğan
dc.contributor.kuauthor	Baydamirli, Javid
dc.contributor.kuauthor	Ismayilov, Ismayil
dc.contributor.kuprofile	Faculty Member
dc.contributor.kuprofile	PhD Student
dc.contributor.kuprofile	PhD Student
dc.contributor.kuprofile	Master Student
dc.contributor.schoolcollegeinstitute	College of Engineering
dc.contributor.schoolcollegeinstitute	Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute	Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute	Graduate School of Sciences and Engineering
dc.contributor.yokid	219274
dc.contributor.yokid	N/A
dc.contributor.yokid	N/A
dc.contributor.yokid	N/A
dc.date.accessioned	2025-01-19T10:31:47Z
dc.date.issued	2023
dc.description.abstract	This paper proposes a fully autonomous execution model for multi-GPU applications that completely excludes the involvement of the CPU beyond the initial kernel launch. In a typical multi-GPU application, the host serves as the orchestrator of execution by directly launching kernels, issuing communication calls, and acting as a synchronizer for devices. We argue that this orchestration, or control flow path, causes undue overhead and can be delegated entirely to devices to improve performance in applications that require communication among peers. For the proposed CPU-free execution model, we leverage existing techniques such as persistent kernels, thread block specialization, device-side barriers, and device-initiated communication routines to write fully autonomous multi-GPU code and achieve significantly reduced communication overheads. We demonstrate our proposed model on two broadly used iterative solvers, 2D/3D Jacobi stencil and Conjugate Gradient (CG). Compared to the CPU-controlled baselines, the CPU-free model can improve 3D stencil communication latency by 58.8% and provide a 1.63x speedup for CG on 8 NVIDIA A100 GPUs. The project code is available at https://github.com/ParCoreLab/CPU-Free-model. © 2023 Owner/Author(s).
dc.description.indexedby	Scopus
dc.description.openaccess	All Open Access; Bronze Open Access
dc.description.publisherscope	International
dc.description.sponsors	This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 949587).
dc.identifier.doi	10.1145/3577193.3593713
dc.identifier.isbn	979-840070056-9
dc.identifier.quartile	N/A
dc.identifier.scopus	2-s2.0-85168413592
dc.identifier.uri	https://doi.org/10.1145/3577193.3593713
dc.identifier.uri	https://hdl.handle.net/20.500.14288/26289
dc.keywords	GPU-initiated communication
dc.keywords	Iterative solvers
dc.keywords	Multi-GPU
dc.keywords	NVSHMEM
dc.keywords	Persistent kernels
dc.language	en
dc.publisher	Association for Computing Machinery
dc.relation.grantno	Horizon 2020 Framework Programme, H2020, (949587); European Research Council, ERC
dc.source	Proceedings of the International Conference on Supercomputing
dc.subject	Computer science
dc.title	Multi-GPU communication schemes for iterative solvers: when CPUs are not in charge
dc.type	Conference proceeding

Collections

Publications without Fulltext
