Machine learning-based cancer prediction using large scale clinical data
Abstract
This study addresses the challenge of cancer prediction by applying machine learning (ML) to large-scale clinical data. Conventional diagnostic techniques frequently result in delays and low sensitivity, underscoring the need for novel strategies. The work focuses on miRNA expression patterns from the Genomic Data Commons (GDC) and aims to provide insights into improving early cancer detection and risk assessment. Despite inherent limitations such as variability in data quality and computational constraints, it aims to examine ML methodologies for cancer prediction rigorously, with implications for future research and practical applications.
Using Python in a Jupyter notebook, miRNA expression data were retrieved from the GDC via its API and preprocessed through cleaning and normalization to ensure data quality. Feature selection based on mutual information scores was then applied to enhance model interpretability and performance. Five machine learning methods (k-nearest neighbors, random forest, logistic regression, support vector machine, and gradient boosting) were employed for cancer type classification, alongside a neural network built with the Keras Sequential API for multi-class cancer classification. Accuracy, precision, recall, and F1-score were computed to assess each model's discriminative capability. The genomic data revealed diverse gene expression profiles across different cancer types.
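The classical part of the pipeline described above (mutual-information feature selection followed by the five classifiers) might look roughly like the sketch below. It is illustrative only: the abstract does not name the library used, so scikit-learn is an assumption, as are the synthetic data standing in for the miRNA expression matrix, the choice of `k=20` retained features, the 75/25 split, and every hyperparameter. The helper `build_pipeline` is a hypothetical name.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples-by-miRNA expression matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The five classifier families named in the abstract; settings are assumptions.
classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(kernel="rbf"),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}


def build_pipeline(clf, k=20):
    """Scale features, keep the k with the highest mutual information, then classify."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=k)),
        ("clf", clf),
    ])


# Fit each pipeline on the training split and record held-out accuracy.
scores = {}
for name, clf in classifiers.items():
    model = build_pipeline(clf).fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Placing the selector inside the pipeline means the mutual information scores are computed from the training split only, which avoids leaking test labels into feature selection. Whether the study ordered the steps this way is not stated.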
Preprocessing and feature selection, which also addressed class imbalances, yielded a dataset of 2,507 samples and 1,881 features. The support vector machine achieved notable performance, with an accuracy of 99.04%, precision of 96%, recall of 98%, and F1-score of 98.98% across all cancer types. Random forest achieved a precision of 98% and an F1-score of 98.98%. Logistic regression performed robustly across metrics, gradient boosting showed strong accuracy and precision, and k-nearest neighbors exhibited moderate accuracy and precision. The neural network performed consistently well across metrics, converging rapidly and reaching high accuracy on unseen data. Confusion matrices and ROC curves corroborated the predictions, highlighting the potential of machine learning for precise cancer classification and early intervention. In conclusion, the five classifiers robustly distinguished between cancer types with minimal false positives, and the support vector machine stood out with an accuracy score of 0.9904, reaffirming its efficacy in cancer classification tasks.
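For readers unfamiliar with how such figures are derived in a multi-class setting, the following minimal sketch computes accuracy, precision, recall, F1-score, and a confusion matrix with scikit-learn. The labels are invented toy values, not the study's predictions, and macro averaging is an assumption, since the abstract does not say how per-class scores were aggregated.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy labels for three hypothetical cancer types (not data from the study).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging (assumption): compute each metric per class, then take the
# unweighted mean so that rare classes count as much as common ones.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(cm)
```

Off-diagonal entries in the confusion matrix show which classes are mistaken for one another, which is the information a single accuracy figure hides.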