Publication:
Balanced and elastic end-to-end training of dynamic LLMs

dc.conference.date: 2025-11-16 through 2025-11-21
dc.conference.location: St. Louis
dc.contributor.coauthor: Wahib, Mohamed
dc.contributor.department: Department of Computer Engineering
dc.contributor.department: GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.kuauthor: Soytürk, Muhammet Abdullah
dc.contributor.kuauthor: Erten, Didem Unat
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.date.accessioned: 2026-01-16T08:45:37Z
dc.date.available: 2026-01-16
dc.date.issued: 2025
dc.description.abstract: To reduce the computational and memory overhead of Large Language Models, various approaches have been proposed. These include a) Mixture of Experts (MoEs), where token routing affects compute balance; b) gradual pruning of model parameters; c) dynamically freezing layers; d) dynamic sparse attention mechanisms; e) early exit of tokens as they pass through model layers; and f) Mixture of Depths (MoDs), where tokens bypass certain blocks. While these approaches are effective in reducing overall computation, they often introduce significant workload imbalance across workers. In many cases, this imbalance is severe enough to render the techniques impractical for large-scale distributed training, limiting their applicability to toy models due to poor efficiency. We propose an autonomous dynamic load balancing solution, DynMo, which provably achieves maximum reduction in workload imbalance and adaptively equalizes compute loads across workers in pipeline-parallel training. In addition, DynMo dynamically consolidates computation onto fewer workers without sacrificing training throughput, allowing idle workers to be released back to the job manager. DynMo supports both single-node multi-GPU systems and multi-node GPU clusters, and can be used in practical deployment. Compared to static distributed training solutions such as Megatron-LM and DeepSpeed, DynMo accelerates the end-to-end training of dynamic GPT models by up to 1.23x for MoEs, 3.18x for parameter pruning, 2.23x for layer freezing, 4.02x for sparse attention, 4.52x for early exit, and 1.17x for MoDs. © 2025 Copyright held by the owner/author(s).
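As a rough illustration of the generic idea named in the abstract (not code from the paper): given measured per-layer compute costs, which shift as layers are frozen, pruned, or bypassed, contiguous layers can be repartitioned across pipeline stages so that the heaviest stage is as light as possible. The sketch below uses a binary search over the stage capacity; the function name, the formulation, and the example costs are illustrative assumptions, not DynMo's actual algorithm.

```python
# Minimal illustrative sketch (NOT the paper's algorithm): rebalance a pipeline
# by repartitioning contiguous layers across stages so that the heaviest stage
# carries as little measured work as possible. All names are hypothetical.

def balance_pipeline(layer_costs, num_stages):
    """Split layers into at most `num_stages` contiguous ranges, minimizing
    the maximum per-stage cost via binary search on the stage capacity."""
    if num_stages <= 0 or not layer_costs:
        raise ValueError("need at least one stage and one layer")

    def stages_needed(cap):
        # Greedy count of stages if no stage may exceed `cap`.
        count, load = 1, 0.0
        for cost in layer_costs:
            if load + cost > cap:
                count, load = count + 1, cost
            else:
                load += cost
        return count

    lo, hi = float(max(layer_costs)), float(sum(layer_costs))
    for _ in range(60):  # converge on the minimal feasible capacity
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid

    # Recover the stage boundaries for the found capacity.
    ranges, start, load = [], 0, 0.0
    for i, cost in enumerate(layer_costs):
        if load + cost > hi and i > start:
            ranges.append((start, i))  # this stage owns layers [start, i)
            start, load = i, cost
        else:
            load += cost
    ranges.append((start, len(layer_costs)))
    return ranges


if __name__ == "__main__":
    # E.g. 8 layers whose costs became uneven after freezing/pruning, 4 stages.
    print(balance_pipeline([4, 1, 1, 6, 2, 2, 3, 5], 4))
    # -> [(0, 3), (3, 4), (4, 7), (7, 8)]  (stage loads 6, 6, 7, 5)
```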
dc.description.fulltext: Yes
dc.description.harvestedfrom: Manual
dc.description.indexedby: Scopus
dc.description.openaccess: Gold OA
dc.description.publisherscope: International
dc.description.readpublish: N/A
dc.description.sponsoredbyTubitakEu: N/A
dc.identifier.doi: 10.1145/3712285.3759775
dc.identifier.embargo: No
dc.identifier.endpage: 1367
dc.identifier.isbn: 9798400714665
dc.identifier.quartile: N/A
dc.identifier.scopus: 2-s2.0-105023989396
dc.identifier.startpage: 1351
dc.identifier.uri: https://doi.org/10.1145/3712285.3759775
dc.identifier.uri: https://hdl.handle.net/20.500.14288/32029
dc.keywords: Large language models
dc.keywords: Load balancing
dc.keywords: Pipeline parallelism
dc.language.iso: eng
dc.publisher: Association for Computing Machinery
dc.relation.affiliation: Koç University
dc.relation.collection: Koç University Institutional Repository
dc.relation.ispartof: 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
dc.relation.openaccess: Yes
dc.rights: CC BY (Attribution)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Computer Science
dc.title: Balanced and elastic end-to-end training of dynamic LLMs
dc.type: Conference Proceeding
dspace.entity.type: Publication
person.familyName: Soytürk
person.familyName: Erten
person.givenName: Muhammet Abdullah
person.givenName: Didem Unat
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication: 434c9663-2b11-4e66-9399-c863e2ebae43
relation.isOrgUnitOfPublication.latestForDiscovery: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication: 434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication.latestForDiscovery: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
