Publication:
CPU- and GPU-initiated communication strategies for conjugate gradient methods on large GPU clusters

dc.conference.date: 2025-11-16 through 2025-11-21
dc.conference.location: St. Louis
dc.contributor.coauthor: Trotter, James D.
dc.contributor.coauthor: Langguth, Johannes
dc.contributor.coauthor: Cai, Xing
dc.contributor.department: Graduate School of Sciences and Engineering
dc.contributor.department: Department of Computer Engineering
dc.contributor.kuauthor: Ekmekçibaşı, Sinan
dc.contributor.kuauthor: Sağbili, Doğan
dc.contributor.kuauthor: Erten, Didem Unat
dc.contributor.schoolcollegeinstitute: GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.date.accessioned: 2026-01-16T08:45:46Z
dc.date.available: 2026-01-16
dc.date.issued: 2025
dc.description.abstract: The Conjugate Gradient (CG) method is a key building block in numerous applications, yet its low computational intensity and sensitivity to communication overhead make it difficult to scale efficiently on multi-GPU systems. In light of recent advances in multi-GPU communication technologies, we revisit CG parallelization for large-scale GPU clusters. This work presents scalable CG and pipelined CG solvers targeting NVIDIA and AMD GPUs, using GPU-aware MPI, NCCL/RCCL and NVSHMEM to implement both CPU- and GPU-initiated communication schemes. We also introduce a monolithic variant that offloads the entire CG loop to the GPU, enabling fully device-initiated execution via NVSHMEM. Optimizations across all variants reduce unnecessary data transfers and synchronization overheads; the GPU-initiated variant eliminates CPU involvement altogether. We benchmark our implementations on NVIDIA- and AMD-based supercomputers using SuiteSparse matrices and a real-world finite element application. By avoiding data transfers and synchronization bottlenecks, our single-GPU implementations achieve 8-14% performance gains over state-of-the-art solvers. In strong scaling tests on over 1,000 GPUs, we outperform existing approaches by 5-15%. While CPU-initiated variants remain favorable due to a lack of vendor-supported device-side computational kernels and suboptimal NVSHMEM configurations at the clusters, the strong scaling properties of the GPU-initiated CG variant indicate that it will be highly competitive at even larger GPU counts and with further tuning.
dc.description.fulltext: Yes
dc.description.harvestedfrom: Manual
dc.description.indexedby: Scopus
dc.description.openaccess: Gold OA
dc.description.publisherscope: International
dc.description.readpublish: N/A
dc.description.sponsoredbyTubitakEu: N/A
dc.identifier.doi: 10.1145/3712285.3759774
dc.identifier.embargo: No
dc.identifier.endpage: 315
dc.identifier.isbn: 9798400714665
dc.identifier.quartile: N/A
dc.identifier.scopus: 2-s2.0-105023977636
dc.identifier.startpage: 298
dc.identifier.uri: https://doi.org/10.1145/3712285.3759774
dc.identifier.uri: https://hdl.handle.net/20.500.14288/32046
dc.keywords: Conjugate gradient
dc.keywords: CUDA
dc.keywords: GPU
dc.keywords: GPU-aware MPI
dc.keywords: HIP
dc.keywords: NCCL
dc.keywords: NVSHMEM
dc.keywords: RCCL
dc.language.iso: eng
dc.publisher: Association for Computing Machinery
dc.relation.affiliation: Koç University
dc.relation.collection: Koç University Institutional Repository
dc.relation.ispartof: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
dc.relation.openaccess: Yes
dc.rights: CC BY-NC-ND (Attribution-NonCommercial-NoDerivs)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Computer Engineering
dc.title: CPU- and GPU-initiated communication strategies for conjugate gradient methods on large GPU clusters
dc.type: Conference Proceeding
dspace.entity.type: Publication
person.familyName: Ekmekçibaşı
person.familyName: Sağbili
person.familyName: Erten
person.givenName: Sinan
person.givenName: Doğan
person.givenName: Didem Unat
relation.isOrgUnitOfPublication: 3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery: 3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isParentOrgUnitOfPublication: 434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery: 434c9663-2b11-4e66-9399-c863e2ebae43
