A device-side execution model for multi-GPU task graphs

dc.conference.date	JUN 8-11, 2025
dc.conference.location	Salt Lake City
dc.conference.organizer	ACM SIGARCH
dc.conference.organizer	ACM SIGHPC
dc.contributor.coauthor	Wahib, Mohamed
dc.contributor.department	Graduate School of Sciences and Engineering
dc.contributor.department	Department of Computer Engineering
dc.contributor.kuauthor	Turimbetov, İlyas
dc.contributor.kuauthor	Erten, Didem Unat
dc.contributor.schoolcollegeinstitute	GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstitute	College of Engineering
dc.date.accessioned	2025-12-31T08:18:44Z
dc.date.available	2025-12-31
dc.date.issued	2025
dc.description.abstract	Executing task graphs on multi-GPU systems presents challenges typically managed by CPU-side runtimes, which handle memory management, track dependencies, and balance load. However, the interplay of runtime components, CPU-driven kernel initialization, and dynamic task graph construction creates significant overhead. For static graphs, recent advancements have enabled GPU-side execution, demonstrating substantial performance gains in single-GPU scenarios. However, multi-GPU execution still lags behind in both usability and performance. In particular, no GPU-side solution exists for executing task graphs on multiple nodes. In this work, we introduce Mustard, a multi-GPU execution model that shifts execution of static task graphs entirely to the devices, drastically reducing overhead. Mustard offers a clean solution for executing CUDA graphs across multiple GPUs on multiple nodes without requiring modifications to GPU kernel code or the adoption of new runtime mechanisms or APIs. By transforming the task graph, Mustard enables precise tracking of task dependencies and load balancing directly on the GPU, eliminating the need for host CPU involvement. We evaluate our approach using generated graphs, as well as LU and Cholesky decomposition graphs. In a multi-node scenario with 64 GPUs, Mustard achieves an average 5.83× speedup over the linear algebra library SLATE. On a single node, compared to the best-performing baseline, Mustard delivers an average 1.66× speedup for LU and 1.29× for Cholesky.
dc.description.fulltext	Yes
dc.description.harvestedfrom	Manual
dc.description.indexedby	WOS
dc.description.indexedby	Scopus
dc.description.openaccess	Gold OA
dc.description.publisherscope	International
dc.description.readpublish	N/A
dc.description.sponsoredbyTubitakEu	EU
dc.description.sponsorship	European Research Council, ERC; Horizon 2020 Framework Programme
dc.description.version	Published Version
dc.identifier.doi	10.1145/3721145.3730426
dc.identifier.embargo	No
dc.identifier.endpage	396
dc.identifier.filenameinventoryno	IR06635
dc.identifier.grantno	949587
dc.identifier.isbn	9798400715372
dc.identifier.quartile	N/A
dc.identifier.scopus	2-s2.0-105021478838
dc.identifier.startpage	384
dc.identifier.uri	https://doi.org/10.1145/3721145.3730426
dc.identifier.uri	https://hdl.handle.net/20.500.14288/31392
dc.identifier.volume	Part of 213821
dc.identifier.wos	001576259700027
dc.language.iso	eng
dc.publisher	Association for Computing Machinery (ACM)
dc.relation.affiliation	Koç University
dc.relation.collection	Koç University Institutional Repository
dc.relation.ispartof	39th ACM International Conference on Supercomputing, ICS 2025
dc.relation.openaccess	Yes
dc.rights	CC BY (Attribution)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Computer engineering
dc.title	A device-side execution model for multi-GPU task graphs
dc.type	Conference Proceeding
dspace.entity.type	Publication
person.familyName	Turimbetov
person.familyName	Erten
person.givenName	İlyas
person.givenName	Didem Unat
relation.isOrgUnitOfPublication	3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery	3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isParentOrgUnitOfPublication	434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication	8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery	434c9663-2b11-4e66-9399-c863e2ebae43