Publication:
A Device-Side Execution Model for Multi-GPU Task Graphs

dc.conference.date: 2025-06-08 through 2025-06-11
dc.conference.location: Salt Lake City
dc.contributor.coauthor: Turimbetov, Ilyas (57211567866)
dc.contributor.coauthor: Wahib, Mohamed (60172528700)
dc.contributor.coauthor: Unat, Didem (27868216500)
dc.date.accessioned: 2025-12-31T08:18:44Z
dc.date.available: 2025-12-31
dc.date.issued: 2025
dc.description.abstract: Executing task graphs on multi-GPU systems presents challenges typically managed by CPU-side runtimes, which handle memory management, track dependencies, and balance load. However, the interplay of runtime components, CPU-driven kernel initialization, and dynamic task graph construction creates significant overhead. For static graphs, recent advancements have enabled GPU-side execution, demonstrating substantial performance gains in single-GPU scenarios. However, multi-GPU execution still lags behind in both usability and performance. In particular, no GPU-side solution exists for executing task graphs on multiple nodes. In this work, we introduce Mustard, a multi-GPU execution model that shifts execution of static task graphs entirely to the devices, drastically reducing overhead. Mustard offers a clean solution for executing CUDA graphs across multiple GPUs on multiple nodes without requiring modifications to GPU kernel code or the adoption of new runtime mechanisms or APIs. By transforming the task graph, Mustard enables precise tracking of task dependencies and load balancing directly on the GPU, eliminating the need for host CPU involvement. We evaluate our approach using generated graphs, as well as LU and Cholesky decomposition graphs. In a multi-node scenario with 64 GPUs, Mustard achieves an average 5.83× speedup over the linear algebra library SLATE. On a single node, compared to the best-performing baseline, Mustard delivers an average 1.66× speedup for LU and 1.29× for Cholesky. © 2025 Copyright held by the owner/author(s).
dc.description.fulltext: No
dc.description.harvestedfrom: Manual
dc.description.indexedby: Scopus
dc.description.openaccess: All Open Access; Gold Open Access
dc.description.publisherscope: International
dc.description.readpublish: N/A
dc.description.sponsoredbyTubitakEu: EU
dc.description.sponsorship: European Research Council, ERC; Horizon 2020 Framework Programme, (949587); Horizon 2020 Framework Programme
dc.identifier.doi: 10.1145/3721145.3730426
dc.identifier.embargo: No
dc.identifier.endpage: 396
dc.identifier.isbn: 9798400715372
dc.identifier.quartile: N/A
dc.identifier.scopus: 2-s2.0-105021478838
dc.identifier.startpage: 384
dc.identifier.uri: https://doi.org/10.1145/3721145.3730426
dc.identifier.uri: https://hdl.handle.net/20.500.14288/31392
dc.identifier.volume: Part of 213821
dc.language.iso: eng
dc.publisher: Association for Computing Machinery
dc.relation.affiliation: Koç University
dc.relation.collection: Koç University Institutional Repository
dc.relation.ispartof: 39th ACM International Conference on Supercomputing, ICS 2025
dc.relation.openaccess: No
dc.rights: Copyrighted
dc.title: A Device-Side Execution Model for Multi-GPU Task Graphs
dc.type: Conference Proceeding
dspace.entity.type: Publication
