MLSeq: Machine learning interface for RNA sequencing data

D Göksülük, G Zararsız, S Korkmaz, V Eldem, B Klaus, A Öztürk, AE Karaağaoğlu

Abstract

Background and Objective: In the last decade, RNA-sequencing technology has become the method of choice and is preferred to microarray technology for gene-expression-based classification and differential expression analysis, since it produces less noisy data. Although many algorithms have been proposed for microarray data, the number of algorithms and programs available for the classification of RNA-sequencing data is limited. For this reason, we developed MLSeq to bring together not only frequently used classification algorithms but also novel approaches, and to make them available for the classification of RNA-sequencing data. The package is developed in the R language environment and distributed through the Bioconductor network.

Methods: Classification of RNA-sequencing data is not straightforward, since raw data must be pre-processed before downstream analysis. With the MLSeq package, researchers can easily pre-process (normalization, filtering, transformation, etc.) and classify raw RNA-sequencing data using two strategies: (i) apply algorithms that were proposed directly for the RNA-sequencing data structure, or (ii) transform the RNA-sequencing data to bring it distributionally closer to the microarray data structure, and then apply algorithms that were developed for microarray data. Moreover, through MLSeq we proposed novel algorithms based on voom (an acronym for variance modelling at the observational level), such as voom-based nearest shrunken centroids (voomNSC) and voom-based diagonal linear discriminant analysis (voomDLDA).

Materials: A real RNA-sequencing data set including gene expression levels of 714 miRNAs obtained from 29 healthy and 29 diseased subjects with cervical cancer is used to evaluate model performance. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as RNA-sequencing-based algorithms, and voomNSC and support vector machines (SVM) were selected as microarray-based algorithms for model comparisons. The algorithms are compared by their classification accuracy on an independent test set.

Results: Our voomNSC algorithm and the RNA-sequencing-based algorithms achieved a test set accuracy of 94.4%, while selected microarray-based algorithms achieved 88.9%. Furthermore, voomNSC was able to select the best subset of features (or genes) through its built-in variable selection criteria. Although NBLDA and voomNSC perform similarly, voomNSC included only 2.2% of all features, while NBLDA included all features in the model.

Conclusion: MLSeq is a comprehensive and easy-to-use interface for the classification of gene expression data. It allows researchers to perform both pre-processing and classification tasks through a single platform. With this property, MLSeq can be considered a pipeline for the classification of RNA-sequencing data.

Type

Journal article

Publication

Computer Methods and Programs in Biomedicine

Date

April, 2019

Links

Code Project

Full Citation: Göksülük D, Zararsız G, Korkmaz S, Eldem V, Klaus B, Öztürk A, Karaağaoğlu AE. MLSeq: Machine learning interface for RNA sequencing data. Computer Methods and Programs in Biomedicine. (Accepted/In press).