Publication:
The use of class imbalanced learning methods on ULSAM data to predict the case-control status in genome-wide association studies

dc.contributor.coauthor: Morris, Andrew P.
dc.contributor.coauthor: Tasdelen, Bahar
dc.contributor.department: KUTTAM (Koç University Research Center for Translational Medicine)
dc.contributor.department: School of Medicine
dc.contributor.kuauthor: Syed, Hamzah
dc.contributor.kuauthor: Öztornacı, Ragıp Onur
dc.contributor.schoolcollegeinstitute: Research Center
dc.contributor.schoolcollegeinstitute: SCHOOL OF MEDICINE
dc.date.accessioned: 2025-01-19T10:33:36Z
dc.date.issued: 2023
dc.description.abstract: Machine learning (ML) methods are increasingly used in genetic research to uncover single nucleotide polymorphisms (SNPs) in genome-wide association study (GWAS) data that can predict disease outcomes. Two issues with the use of ML models are finding the correct method for dealing with imbalanced data and training on the data. This article compares three ML models for identifying SNPs that predict type 2 diabetes (T2D) status, using support vector machine SMOTE (SVM SMOTE), the Adaptive Synthetic Sampling Approach (ADASYN), and random undersampling (RUS) on GWAS data from elderly male participants (165 cases and 951 controls) from the Uppsala Longitudinal Study of Adult Men (ULSAM). SMOTE, SVM SMOTE, ADASYN, and RUS were also applied to SNPs selected by the clumping method. The analysis was performed using three different ML models: (i) support vector machine (SVM), (ii) multilayer perceptron (MLP), and (iii) random forests (RF). The accuracy of the case-control classification was compared between these three models. The best-performing combination was MLP with SMOTE (97% accuracy). Both RF and SVM achieved accuracies of over 90%. Overall, the methods used to address class imbalance were found to improve prediction accuracy for all three ML algorithms.
dc.description.indexedby: WOS
dc.description.indexedby: Scopus
dc.description.issue: 1
dc.description.openaccess: gold, Green Submitted
dc.description.publisherscope: International
dc.description.sponsoredbyTubitakEu: N/A
dc.description.volume: 10
dc.identifier.doi: 10.1186/s40537-023-00853-x
dc.identifier.eissn: 2196-1115
dc.identifier.quartile: Q1
dc.identifier.scopus: 2-s2.0-85178237353
dc.identifier.uri: https://doi.org/10.1186/s40537-023-00853-x
dc.identifier.uri: https://hdl.handle.net/20.500.14288/26635
dc.identifier.wos: 1111246900001
dc.keywords: Machine learning
dc.keywords: Class imbalanced methods
dc.keywords: GWAS
dc.keywords: ULSAM study
dc.language.iso: eng
dc.publisher: Springer Nature
dc.relation.ispartof: Journal of Big Data
dc.subject: Computer science
dc.title: The use of class imbalanced learning methods on ULSAM data to predict the case-control status in genome-wide association studies
dc.type: Journal Article
dspace.entity.type: Publication
local.contributor.kuauthor: Öztornacı, Ragıp Onur
local.contributor.kuauthor: Syed, Hamzah
local.publication.orgunit1: SCHOOL OF MEDICINE
local.publication.orgunit1: Research Center
local.publication.orgunit2: KUTTAM (Koç University Research Center for Translational Medicine)
local.publication.orgunit2: School of Medicine
relation.isOrgUnitOfPublication: 91bbe15d-017f-446b-b102-ce755523d939
relation.isOrgUnitOfPublication: d02929e1-2a70-44f0-ae17-7819f587bedd
relation.isOrgUnitOfPublication.latestForDiscovery: 91bbe15d-017f-446b-b102-ce755523d939
relation.isParentOrgUnitOfPublication: d437580f-9309-4ecb-864a-4af58309d287
relation.isParentOrgUnitOfPublication: 17f2dc8e-6e54-4fa8-b5e0-d6415123a93e
relation.isParentOrgUnitOfPublication.latestForDiscovery: d437580f-9309-4ecb-864a-4af58309d287
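
Illustrative note on the abstract: random undersampling (RUS), one of the class-imbalance methods it names, can be sketched in a few lines. The following is a minimal sketch using only the Python standard library. It is not the authors' pipeline; the function name `random_undersample`, the seed, and the placeholder label vector are assumptions made for illustration. Only the class counts (165 cases, 951 controls) come from the abstract.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted sample indices after random undersampling.

    Every sample of the smallest class is kept; each larger class is
    reduced to the same size by sampling without replacement.
    (Hypothetical helper, written for illustration only.)
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Size of the minority class sets the target size for every class.
    n_min = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        if len(indices) == n_min:
            kept.extend(indices)
        else:
            kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Placeholder labels: 1 = case, 0 = control. Only the counts are taken
# from the abstract; no real genotype or phenotype data is used here.
labels = [1] * 165 + [0] * 951

kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced)  # both classes now have 165 samples
```

In practice the retained indices would be used to subset the genotype matrix before fitting a classifier. Undersampling discards most of the controls, which is one reason oversampling methods such as SMOTE or ADASYN are often compared against it.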

Files

Original bundle

Name: IR05610.pdf
Size: 1.69 MB
Format: Adobe Portable Document Format