Publication: CPU- and GPU-initiated communication strategies for conjugate gradient methods on large GPU clusters
| dc.conference.date | 2025-11-16 through 2025-11-21 | |
| dc.conference.location | St. Louis | |
| dc.contributor.coauthor | Trotter, James D. | |
| dc.contributor.coauthor | Langguth, Johannes | |
| dc.contributor.coauthor | Cai, Xing | |
| dc.contributor.department | Graduate School of Sciences and Engineering | |
| dc.contributor.department | Department of Computer Engineering | |
| dc.contributor.kuauthor | Ekmekçibaşı, Sinan | |
| dc.contributor.kuauthor | Sağbili, Doğan | |
| dc.contributor.kuauthor | Erten, Didem Unat | |
| dc.contributor.schoolcollegeinstitute | GRADUATE SCHOOL OF SCIENCES AND ENGINEERING | |
| dc.contributor.schoolcollegeinstitute | College of Engineering | |
| dc.date.accessioned | 2026-01-16T08:45:46Z | |
| dc.date.available | 2026-01-16 | |
| dc.date.issued | 2025 | |
| dc.description.abstract | The Conjugate Gradient (CG) method is a key building block in numerous applications, yet its low computational intensity and sensitivity to communication overhead make it difficult to scale efficiently on multi-GPU systems. In light of recent advances in multi-GPU communication technologies, we revisit CG parallelization for large-scale GPU clusters. This work presents scalable CG and pipelined CG solvers targeting NVIDIA and AMD GPUs, using GPU-aware MPI, NCCL/RCCL, and NVSHMEM to implement both CPU- and GPU-initiated communication schemes. We also introduce a monolithic variant that offloads the entire CG loop to the GPU, enabling fully device-initiated execution via NVSHMEM. Optimizations across all variants reduce unnecessary data transfers and synchronization overheads; the GPU-initiated variant eliminates CPU involvement altogether. We benchmark our implementations on NVIDIA- and AMD-based supercomputers using SuiteSparse matrices and a real-world finite element application. By avoiding data transfers and synchronization bottlenecks, our single-GPU implementations achieve 8-14% performance gains over state-of-the-art solvers. In strong scaling tests on over 1,000 GPUs, we outperform existing approaches by 5-15%. While CPU-initiated variants remain favorable due to a lack of vendor-supported device-side computational kernels and suboptimal NVSHMEM configurations on the clusters, the strong scaling properties of the GPU-initiated CG variant indicate that it will be highly competitive at even larger GPU counts and with further tuning. | |
| dc.description.fulltext | Yes | |
| dc.description.harvestedfrom | Manual | |
| dc.description.indexedby | Scopus | |
| dc.description.openaccess | Gold OA | |
| dc.description.publisherscope | International | |
| dc.description.readpublish | N/A | |
| dc.description.sponsoredbyTubitakEu | N/A | |
| dc.identifier.doi | 10.1145/3712285.3759774 | |
| dc.identifier.embargo | No | |
| dc.identifier.endpage | 315 | |
| dc.identifier.isbn | 9798400714665 | |
| dc.identifier.quartile | N/A | |
| dc.identifier.scopus | 2-s2.0-105023977636 | |
| dc.identifier.startpage | 298 | |
| dc.identifier.uri | https://doi.org/10.1145/3712285.3759774 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14288/32046 | |
| dc.keywords | Conjugate gradient | |
| dc.keywords | CUDA | |
| dc.keywords | GPU | |
| dc.keywords | GPU-aware MPI | |
| dc.keywords | HIP | |
| dc.keywords | NCCL | |
| dc.keywords | NVSHMEM | |
| dc.keywords | RCCL | |
| dc.language.iso | eng | |
| dc.publisher | Association for Computing Machinery | |
| dc.relation.affiliation | Koç University | |
| dc.relation.collection | Koç University Institutional Repository | |
| dc.relation.ispartof | Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025 | |
| dc.relation.openaccess | Yes | |
| dc.rights | CC BY-NC-ND (Attribution-NonCommercial-NoDerivs) | |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.subject | Computer Engineering | |
| dc.title | CPU- and GPU-initiated communication strategies for conjugate gradient methods on large GPU clusters | |
| dc.type | Conference Proceeding | |
| dspace.entity.type | Publication | |
| person.familyName | Ekmekçibaşı | |
| person.familyName | Sağbili | |
| person.familyName | Erten | |
| person.givenName | Sinan | |
| person.givenName | Doğan | |
| person.givenName | Didem Unat | |
| relation.isOrgUnitOfPublication | 3fc31c89-e803-4eb1-af6b-6258bc42c3d8 | |
| relation.isOrgUnitOfPublication | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 3fc31c89-e803-4eb1-af6b-6258bc42c3d8 | |
| relation.isParentOrgUnitOfPublication | 434c9663-2b11-4e66-9399-c863e2ebae43 | |
| relation.isParentOrgUnitOfPublication | 8e756b23-2d4a-4ce8-b1b3-62c794a8c164 | |
| relation.isParentOrgUnitOfPublication.latestForDiscovery | 434c9663-2b11-4e66-9399-c863e2ebae43 |
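
The abstract above centers on the Conjugate Gradient iteration and where its communication costs arise. As a point of reference, here is a minimal single-process sketch of the unpreconditioned CG method in plain Python. It shows only the mathematical iteration; the paper's actual contribution concerns how the two global reductions (the dot products) and the sparse matrix-vector product are communicated across GPUs via GPU-aware MPI, NCCL/RCCL, or NVSHMEM, none of which is modeled here. All function names below are illustrative, not taken from the paper's code.

```python
def dot(x, y):
    # On a multi-GPU cluster this is the global reduction whose latency
    # the pipelined CG variant discussed in the abstract tries to hide.
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    # Dense stand-in for the sparse matrix-vector product; in the paper
    # this step requires halo exchanges between neighboring GPUs.
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iters=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual r = b - A*x, with x = 0 initially
    p = list(r)          # search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

In the "monolithic" variant described in the abstract, the entire loop above runs on the device, so no CPU round-trip is needed between iterations; the sketch here exists only to make the structure of that loop concrete.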
