Publication: CPU- and GPU-initiated communication strategies for conjugate gradient methods on large GPU clusters
| dc.conference.date | 2025-11-16 through 2025-11-21 | |
| dc.conference.location | St. Louis | |
| dc.contributor.coauthor | Trotter, James D. | |
| dc.contributor.coauthor | Langguth, Johannes | |
| dc.contributor.coauthor | Cai, Xing | |
| dc.contributor.department | Graduate School of Sciences and Engineering | |
| dc.contributor.department | Department of Computer Engineering | |
| dc.contributor.kuauthor | Ekmekçibaşı, Sinan | |
| dc.contributor.kuauthor | Sağbili, Doğan | |
| dc.contributor.kuauthor | Erten, Didem Unat | |
| dc.contributor.schoolcollegeinstitute | GRADUATE SCHOOL OF SCIENCES AND ENGINEERING | |
| dc.contributor.schoolcollegeinstitute | College of Engineering | |
| dc.date.accessioned | 2026-01-16T08:45:46Z | |
| dc.date.available | 2026-01-16 | |
| dc.date.issued | 2025 | |
| dc.description.abstract | The Conjugate Gradient (CG) method is a key building block in numerous applications, yet its low computational intensity and sensitivity to communication overhead make it difficult to scale efficiently on multi-GPU systems. In light of recent advances in multi-GPU communication technologies, we revisit CG parallelization for large-scale GPU clusters. This work presents scalable CG and pipelined CG solvers targeting NVIDIA and AMD GPUs, using GPU-aware MPI, NCCL/RCCL, and NVSHMEM to implement both CPU- and GPU-initiated communication schemes. We also introduce a monolithic variant that offloads the entire CG loop to the GPU, enabling fully device-initiated execution via NVSHMEM. Optimizations across all variants reduce unnecessary data transfers and synchronization overheads; the GPU-initiated variant eliminates CPU involvement altogether. We benchmark our implementations on NVIDIA- and AMD-based supercomputers using SuiteSparse matrices and a real-world finite element application. By avoiding data transfers and synchronization bottlenecks, our single-GPU implementations achieve 8-14% performance gains over state-of-the-art solvers. In strong scaling tests on over 1,000 GPUs, we outperform existing approaches by 5-15%. While CPU-initiated variants remain favorable due to a lack of vendor-supported device-side computational kernels and suboptimal NVSHMEM configurations on the clusters, the strong scaling properties of the GPU-initiated CG variant indicate that it will be highly competitive at even larger GPU counts and with further tuning. | |
| dc.description.fulltext | Yes | |
| dc.description.harvestedfrom | Manual | |
| dc.description.indexedby | Scopus | |
| dc.description.openaccess | Gold OA | |
| dc.description.publisherscope | International | |
| dc.description.readpublish | N/A | |
| dc.description.sponsoredbyTubitakEu | N/A | |
| dc.identifier.doi | 10.1145/3712285.3759774 | |
| dc.identifier.embargo | No | |
| dc.identifier.endpage | 315 | |
| dc.identifier.isbn | 9798400714665 | |
| dc.identifier.quartile | N/A | |
| dc.identifier.scopus | 2-s2.0-105023977636 | |
| dc.identifier.startpage | 298 | |
| dc.identifier.uri | https://doi.org/10.1145/3712285.3759774 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14288/32046 | |
| dc.keywords | Conjugate gradient | |
| dc.keywords | CUDA | |
| dc.keywords | GPU | |
| dc.keywords | GPU-aware MPI | |
| dc.keywords | HIP | |
| dc.keywords | NCCL | |
| dc.keywords | NVSHMEM | |
| dc.keywords | RCCL | |
| dc.language.iso | eng | |
| dc.publisher | Association for Computing Machinery | |
| dc.relation.affiliation | Koç University | |
| dc.relation.collection | Koç University Institutional Repository | |
| dc.relation.ispartof | Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025 | |
| dc.relation.openaccess | Yes | |
| dc.rights | CC BY-NC-ND (Attribution-NonCommercial-NoDerivs) | |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.subject | Computer Engineering | |
| dc.title | CPU- and GPU-initiated communication strategies for conjugate gradient methods on large GPU clusters | |
| dc.type | Conference Proceeding | |
| dspace.entity.type | Publication | |
| person.familyName | Ekmekçibaşı | |
| person.familyName | Sağbili | |
| person.familyName | Erten | |
| person.givenName | Sinan | |
| person.givenName | Doğan | |
| person.givenName | Didem Unat | |
| relation.isOrgUnitOfPublication | 3fc31c89-e803-4eb1-af6b-6258bc42c3d8 | |
| relation.isOrgUnitOfPublication | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 3fc31c89-e803-4eb1-af6b-6258bc42c3d8 | |
| relation.isParentOrgUnitOfPublication | 434c9663-2b11-4e66-9399-c863e2ebae43 | |
| relation.isParentOrgUnitOfPublication | 8e756b23-2d4a-4ce8-b1b3-62c794a8c164 | |
| relation.isParentOrgUnitOfPublication.latestForDiscovery | 434c9663-2b11-4e66-9399-c863e2ebae43 |
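
The abstract above centers on the Conjugate Gradient iteration and where its communication costs arise. As a point of reference, here is a minimal single-process sketch of the unpreconditioned CG method in plain Python. It shows only the mathematical iteration; the paper's actual contribution concerns how the two global reductions (the dot products) and the sparse matrix-vector product are communicated across GPUs via GPU-aware MPI, NCCL/RCCL, or NVSHMEM, none of which is modeled here. All function names below are illustrative, not taken from the paper's code.

```python
def dot(x, y):
    # On a multi-GPU cluster this is the global reduction whose latency
    # the pipelined CG variant discussed in the abstract tries to hide.
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    # Dense stand-in for the sparse matrix-vector product; in the paper
    # this step requires halo exchanges between neighboring GPUs.
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iters=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual r = b - A*x, with x = 0 initially
    p = list(r)          # search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

In the "monolithic" variant described in the abstract, the entire loop above runs on the device, so no CPU round-trip is needed between iterations; the sketch here exists only to make the structure of that loop concrete.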
