Publication:
A device-side execution model for multi-GPU task graphs

dc.conference.dateJUN 8-11, 2025
dc.conference.locationSalt Lake City
dc.conference.organizerACM SIGARCH
dc.conference.organizerACM SIGHPC
dc.contributor.coauthorWahib, Mohamed
dc.contributor.departmentGraduate School of Sciences and Engineering
dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.kuauthorTurimbetov, İlyas
dc.contributor.kuauthorErten, Didem Unat
dc.contributor.schoolcollegeinstituteGRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.date.accessioned2025-12-31T08:18:44Z
dc.date.available2025-12-31
dc.date.issued2025
dc.description.abstractExecuting task graphs on multi-GPU systems presents challenges typically managed by CPU-side runtimes, which handle memory management, track dependencies, and balance load. However, the interplay of runtime components, CPU-driven kernel initialization, and dynamic task graph construction creates significant overhead. For static graphs, recent advancements have enabled GPU-side execution, demonstrating substantial performance gains in single-GPU scenarios. However, multi-GPU execution still lags behind in both usability and performance. In particular, no GPU-side solution exists for executing task graphs on multiple nodes. In this work, we introduce Mustard, a multi-GPU execution model that shifts execution of static task graphs entirely to the devices, drastically reducing overhead. Mustard offers a clean solution for executing CUDA graphs across multiple GPUs on multiple nodes without requiring modifications to GPU kernel code or the adoption of new runtime mechanisms or APIs. By transforming the task graph, Mustard enables precise tracking of task dependencies and load balancing directly on the GPU, eliminating the need for host CPU involvement. We evaluate our approach using generated graphs, as well as LU and Cholesky decomposition graphs. In a multi-node scenario with 64 GPUs, Mustard achieves an average 5.83× speedup over the linear algebra library SLATE. On a single node, compared to the best-performing baseline, Mustard delivers an average 1.66× speedup for LU and 1.29× for Cholesky.
dc.description.fulltextYes
dc.description.harvestedfromManual
dc.description.indexedbyWOS
dc.description.indexedbyScopus
dc.description.openaccessGold OA
dc.description.publisherscopeInternational
dc.description.readpublishN/A
dc.description.sponsoredbyTubitakEuEU
dc.description.sponsorshipEuropean Research Council, ERC; Horizon 2020 Framework Programme
dc.description.versionPublished Version
dc.identifier.doi10.1145/3721145.3730426
dc.identifier.embargoNo
dc.identifier.endpage396
dc.identifier.filenameinventorynoIR06635
dc.identifier.grantno949587
dc.identifier.isbn9798400715372
dc.identifier.quartileN/A
dc.identifier.scopus2-s2.0-105021478838
dc.identifier.startpage384
dc.identifier.urihttps://doi.org/10.1145/3721145.3730426
dc.identifier.urihttps://hdl.handle.net/20.500.14288/31392
dc.identifier.volumePart of 213821
dc.identifier.wos001576259700027
dc.language.isoeng
dc.publisherAssociation for Computing Machinery (ACM)
dc.relation.affiliationKoç University
dc.relation.collectionKoç University Institutional Repository
dc.relation.ispartof39th ACM International Conference on Supercomputing, ICS 2025
dc.relation.openaccessYes
dc.rightsCC BY (Attribution)
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectComputer engineering
dc.titleA device-side execution model for multi-GPU task graphs
dc.typeConference Proceeding
dspace.entity.typePublication
person.familyNameTurimbetov
person.familyNameErten
person.givenNameİlyas
person.givenNameDidem Unat
relation.isOrgUnitOfPublication3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isParentOrgUnitOfPublication434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery434c9663-2b11-4e66-9399-c863e2ebae43

Files

Original bundle

Name: IR06635.pdf
Size: 2.67 MB
Format: Adobe Portable Document Format