The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies

Öztornaci, R. Onur; Syed, Hamzah; Morris, Andrew P.; Taşdelen, Bahar


Date

2024-01-01

Institution Author

Öztornaci, R. Onur

Syed, Hamzah

Morris, Andrew P.

Taşdelen, Bahar

Type

Collection

Abstract

Machine learning (ML) methods are increasingly used in genetic research to uncover single nucleotide polymorphisms (SNPs) in genome-wide association study (GWAS) data that can predict disease outcomes. Two issues with the use of ML models are finding the correct method for dealing with imbalanced data and data training. This article compares three ML models to identify SNPs that predict type 2 diabetes (T2D) status, using support vector machine SMOTE (SVM SMOTE), the adaptive synthetic sampling approach (ADASYN), and random under-sampling (RUS) on GWAS data from elderly male participants (165 cases and 951 controls) from the Uppsala Longitudinal Study of Adult Men (ULSAM). SMOTE, SVM SMOTE, ADASYN, and RUS were also applied to SNPs selected by the clumping method. The analysis was performed using three different ML models: (i) support vector machine (SVM), (ii) multilayer perceptron (MLP), and (iii) random forests (RF). The accuracy of the case–control classification was compared between these three methods. The best-performing combination was MLP with SMOTE (97% accuracy). RF and SVM both achieved accuracy above 90%. Overall, the methods used to address class imbalance improved prediction accuracy for all three ML algorithms.
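As a rough illustration of the class imbalance described in the abstract, the sketch below applies random under-sampling (RUS) to a label vector with the same class counts (165 cases, 951 controls). It is a minimal standard-library sketch and not the authors' pipeline: the function name `random_under_sample`, the 0/1 label coding, and the seed are assumptions made for this example, and no genotype data are involved.

```python
import random


def random_under_sample(labels, seed=0):
    """Hypothetical helper: return sorted indices that keep every minority-class
    sample and an equally sized random subset of the majority class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    if len(by_class) != 2:
        raise ValueError("expected exactly two classes")

    # Size of the smaller class sets the target size for both classes.
    n_min = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        if len(indices) == n_min:
            kept.extend(indices)                      # keep the minority class whole
        else:
            kept.extend(rng.sample(indices, n_min))   # sample majority without replacement
    return sorted(kept)


# Class counts taken from the abstract; coding cases as 1 and controls as 0
# is an assumption of this sketch.
labels = [1] * 165 + [0] * 951

# Accuracy of a trivial classifier that always predicts "control".
majority_baseline = labels.count(0) / len(labels)

balanced_idx = random_under_sample(labels, seed=42)
balanced_labels = [labels[i] for i in balanced_idx]

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"samples after RUS: {len(balanced_labels)}")
```

Because controls make up about 85% of the participants, a classifier that always predicts "control" would already reach roughly 85% accuracy on the original data. Under-sampling removes that shortcut at the cost of discarding most controls, which is the trade-off that over-sampling approaches such as SMOTE and ADASYN avoid by synthesizing minority samples instead.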

Publisher

figshare

DOI

10.6084/m9.figshare.c.6958146.v1

URI

https://hdl.handle.net/20.500.14288/31236

Rights

OPEN

Collections

Research Data
