RNA sequencing data with machine learning and deep learning usage: A methodological study

Files

Primary IR04730.pdf (1.31 MB)

Departments

Organizational Unit

KUTTAM (Koç University Research Center for Translational Medicine)

School / College / Institute

Organizational Unit

Research Center

KU-Authors

Öztornacı, Ragıp Onur

Date

2024

Type

Journal Article

Alternative Title

RNA sekanslama verileri ile makine öğrenimi ve derin öğrenme kullanımı: Metodolojik bir çalışma

Abstract

Objective: This study offers a different perspective on the analysis of RNA sequencing data by applying popular machine learning and deep learning methods rather than classical statistical approaches. It also aims to introduce the underlying machine learning and deep learning concepts.

Material and Methods: Two distinct raw data sets pertaining to asthma and kidney transplantation (GSE85567 and GSE129166) were retrieved from the National Center for Biotechnology Information database and passed through the required quality control and alignment procedures. Random forest (RF), support vector machine (SVM), and deep neural network (DNN) models were applied to distinguish patients from controls. To prevent overfitting, each data set was divided into 67.5% training, 10% test, and 22.5% validation data, and 10-fold cross-validation was used during model training. Python was used for machine learning and deep learning, and the Unix operating system (AT&T Bell Laboratories, USA) was used for data processing.

Results: In the GSE129166 data set, the RF model reached an accuracy of 0.89 on the validation set, with a precision of 0.88 and a recall of 0.92. The SVM model reached an accuracy of 0.88 on the validation set and 0.87 on the test set. In the GSE85567 data set, validation accuracy was 0.73 for RF, 0.70 for SVM, and 0.75 for DNN.

Conclusion: The study conducted on the GSE85567 data set demonstrates that the RF and SVM models exhibit high accuracy and performance. The DNN model, on the other hand, has a more balanced precision and recall and is a significant alternative. The DNN model also performs effectively on the GSE129166 data set, where a high accuracy and balanced precision and recall were observed on the validation set. It is concluded that all three models are suitable for patient-control classification in RNA-seq data.
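The split proportions and the reported metrics can be illustrated with a minimal Python sketch. It assumes, without confirmation from the abstract, that the 67.5% / 22.5% / 10% partition arises from first holding out 10% of the samples as a test set and then dividing the remaining 90% into 75% training and 25% validation (0.90 × 0.75 = 0.675 and 0.90 × 0.25 = 0.225). The function names `three_way_split` and `binary_metrics`, the seed, and the toy labels are hypothetical and are not taken from the study.

```python
import random


def three_way_split(n_samples, test_frac=0.10, val_frac_of_rest=0.25, seed=0):
    """Shuffle sample indices and split them into train/validation/test lists.

    Assumption: the test set is held out first, then the remainder is
    divided into training and validation parts.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_frac)
    n_val = round((n_samples - n_test) * val_frac_of_rest)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a patient-control labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical cohort of 200 samples: 135 train, 45 validation, 20 test.
train, val, test = three_way_split(200)
print(len(train), len(val), len(test))

# Toy labels (1 = patient, 0 = control), purely illustrative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Precision here is the share of predicted patients who are true patients, and recall is the share of true patients the model recovers, which is how the 0.88 and 0.92 values reported for the RF model should be read. The 10-fold cross-validation mentioned in the methods would be applied within the training portion only, leaving the validation and test sets untouched.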

Publisher

Türkiye Klinikleri Yayınevi

Subject

Biostatistics/Biyoistatistik

Source

Türkiye Klinikleri Biyoistatistik Dergisi

DOI

10.5336/biostatic.2023-100186

URI

https://doi.org/10.5336/biostatic.2023-100186
https://hdl.handle.net/20.500.14288/23209
