Publication: A Device-Side Execution Model for Multi-GPU Task Graphs
| dc.conference.date | 2025-06-08 through 2025-06-11 | |
| dc.conference.location | Salt Lake City | |
| dc.contributor.coauthor | Turimbetov, Ilyas (57211567866) | |
| dc.contributor.coauthor | Wahib, Mohamed (60172528700) | |
| dc.contributor.coauthor | Unat, Didem (27868216500) | |
| dc.date.accessioned | 2025-12-31T08:18:44Z | |
| dc.date.available | 2025-12-31 | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Executing task graphs on multi-GPU systems presents challenges typically managed by CPU-side runtimes, which handle memory management, track dependencies, and balance load. However, the interplay of runtime components, CPU-driven kernel initialization, and dynamic task graph construction creates significant overhead. For static graphs, recent advancements have enabled GPU-side execution, demonstrating substantial performance gains in single-GPU scenarios. However, multi-GPU execution still lags behind in both usability and performance. In particular, no GPU-side solution exists for executing task graphs on multiple nodes. In this work, we introduce Mustard, a multi-GPU execution model that shifts execution of static task graphs entirely to the devices, drastically reducing overhead. Mustard offers a clean solution for executing CUDA graphs across multiple GPUs on multiple nodes without requiring modifications to GPU kernel code or the adoption of new runtime mechanisms or APIs. By transforming the task graph, Mustard enables precise tracking of task dependencies and load balancing directly on the GPU, eliminating the need for host CPU involvement. We evaluate our approach using generated graphs, as well as LU and Cholesky decomposition graphs. In a multi-node scenario with 64 GPUs, Mustard achieves an average 5.83× speedup over the linear algebra library SLATE. On a single node, compared to the best-performing baseline, Mustard delivers an average 1.66× speedup for LU and 1.29× for Cholesky. © 2025 Copyright held by the owner/author(s). | |
| dc.description.fulltext | No | |
| dc.description.harvestedfrom | Manual | |
| dc.description.indexedby | Scopus | |
| dc.description.openaccess | All Open Access; Gold Open Access | |
| dc.description.publisherscope | International | |
| dc.description.readpublish | N/A | |
| dc.description.sponsoredbyTubitakEu | EU | |
| dc.description.sponsorship | European Research Council, ERC; Horizon 2020 Framework Programme, (949587); Horizon 2020 Framework Programme | |
| dc.identifier.doi | 10.1145/3721145.3730426 | |
| dc.identifier.embargo | No | |
| dc.identifier.endpage | 396 | |
| dc.identifier.isbn | 9798400715372 | |
| dc.identifier.quartile | N/A | |
| dc.identifier.scopus | 2-s2.0-105021478838 | |
| dc.identifier.startpage | 384 | |
| dc.identifier.uri | https://doi.org/10.1145/3721145.3730426 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14288/31392 | |
| dc.identifier.volume | Part of 213821 | |
| dc.language.iso | eng | |
| dc.publisher | Association for Computing Machinery | |
| dc.relation.affiliation | Koç University | |
| dc.relation.collection | Koç University Institutional Repository | |
| dc.relation.ispartof | 39th ACM International Conference on Supercomputing, ICS 2025 | |
| dc.relation.openaccess | No | |
| dc.rights | Copyrighted | |
| dc.title | A Device-Side Execution Model for Multi-GPU Task Graphs | |
| dc.type | Conference Proceeding | |
| dspace.entity.type | Publication |
