Publication:
Fast multidimensional reduction and broadcast operations on GPU for machine learning

dc.contributor.department: N/A
dc.contributor.department: N/A
dc.contributor.department: N/A
dc.contributor.department: Department of Computer Engineering
dc.contributor.department: Department of Computer Engineering
dc.contributor.kuauthor: Dikbayır, Doğa
dc.contributor.kuauthor: Çoban, Enis Berk
dc.contributor.kuauthor: Kesen, İlker
dc.contributor.kuauthor: Yüret, Deniz
dc.contributor.kuauthor: Erten, Didem Unat
dc.contributor.kuprofile: Master Student
dc.contributor.kuprofile: Master Student
dc.contributor.kuprofile: PhD Student
dc.contributor.kuprofile: Faculty Member
dc.contributor.kuprofile: Faculty Member
dc.contributor.other: Department of Computer Engineering
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.yokid: N/A
dc.contributor.yokid: N/A
dc.contributor.yokid: N/A
dc.contributor.yokid: 179996
dc.contributor.yokid: 219274
dc.date.accessioned: 2024-11-09T22:45:40Z
dc.date.issued: 2018
dc.description.abstract: Reduction and broadcast operations are commonly used in machine learning algorithms for different purposes. They appear widely in the calculation of the gradient values of a loss function, one of the core computations of neural networks. Both operations are implemented naively in many libraries, usually only for scalar reduction or broadcast; however, to our knowledge, no optimized multidimensional implementations are available. This limits the performance of machine learning models that require these operations to be performed on tensors. In this work, we address the problem and propose two new strategies that extend the existing implementations to operate on tensors. We introduce formal definitions of both operations using tensor notation, investigate their mathematical properties, and exploit these properties to provide an efficient solution for each. We implement our parallel strategies and test them on a CUDA-enabled Tesla K40m GPU accelerator. Our implementations achieve up to 75% of the peak device memory bandwidth across different tensor sizes and dimensions, and significant speedups over the implementations available in the Knet deep learning framework are also achieved for both operations.
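For context, the sketch below shows what naive multidimensional reduction and broadcast kernels look like on the GPU; it only illustrates the operations the abstract refers to and is not the paper's optimized implementation. The kernel names, the fixed 3D shape (d0, d1, d2), the sum operator, and the row-major layout are assumptions made for this example.

// Minimal sketch (not the paper's method): sum-reduce a row-major 3D tensor
// A of shape (d0, d1, d2) along dimension 1, producing B of shape (d0, d2).
// One thread computes one output element.
__global__ void reduce_dim1(const float* A, float* B, int d0, int d1, int d2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flat index into B
    if (idx >= d0 * d2) return;
    int i = idx / d2;                                 // index along dim 0
    int k = idx % d2;                                 // index along dim 2
    float acc = 0.0f;
    for (int j = 0; j < d1; ++j)                      // walk the reduced dim
        acc += A[(i * d1 + j) * d2 + k];
    B[i * d2 + k] = acc;
}

// Broadcast sketch: add a vector v of length d1 to A along dimension 1,
// i.e. every (i, k) fiber of A receives the same v. One thread per element of A.
__global__ void broadcast_add_dim1(float* A, const float* v, int d0, int d1, int d2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flat index into A
    if (idx >= d0 * d1 * d2) return;
    int j = (idx / d2) % d1;                          // recover the dim-1 index
    A[idx] += v[j];
}

Both kernels read the reduced or broadcast dimension with a stride of d2, so their memory behavior depends on which dimension is chosen; handling arbitrary dimensions and operators efficiently is the problem the paper addresses.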
dc.description.indexedby: WoS
dc.description.indexedby: Scopus
dc.description.issue: 21
dc.description.openaccess: NO
dc.description.volume: 30
dc.identifier.doi: 10.1002/cpe.4691
dc.identifier.eissn: 1532-0634
dc.identifier.issn: 1532-0626
dc.identifier.scopus: 2-s2.0-85047664332
dc.identifier.uri: http://dx.doi.org/10.1002/cpe.4691
dc.identifier.uri: https://hdl.handle.net/20.500.14288/6138
dc.identifier.wos: 447267900004
dc.keywords: Broadcast
dc.keywords: CUDA
dc.keywords: GPU
dc.keywords: Machine learning
dc.keywords: Multidimensional arrays
dc.keywords: Reduction
dc.keywords: Tensor
dc.language: English
dc.publisher: Wiley
dc.source: Concurrency and Computation: Practice and Experience
dc.subject: Computer science
dc.subject: Software engineering
dc.subject: Computer science
dc.subject: Theory methods
dc.title: Fast multidimensional reduction and broadcast operations on GPU for machine learning
dc.type: Journal Article
dspace.entity.type: Publication
local.contributor.authorid: 0000-0003-4673-9612
local.contributor.authorid: 0009-0004-1418-6966
local.contributor.authorid: N/A
local.contributor.authorid: 0000-0002-7039-0046
local.contributor.authorid: 0000-0002-2351-0770
local.contributor.kuauthor: Dikbayır, Doğa
local.contributor.kuauthor: Çoban, Enis Berk
local.contributor.kuauthor: Kesen, İlker
local.contributor.kuauthor: Yüret, Deniz
local.contributor.kuauthor: Erten, Didem Unat
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
