Machine learning-based cancer prediction using large scale clinical data

Osisami, Oladotun

Files

OladotunOsisami2024.pdf (847.51 KB)

Date

2024

Authors

Osisami, Oladotun

Publisher

Kansas State University

Abstract

This study addresses the urgent challenge of cancer prediction by applying machine learning (ML) to large-scale clinical data. Conventional diagnostic techniques frequently suffer from delays and low sensitivity, underscoring the need for novel strategies. The work focuses on miRNA expression patterns from the Genomic Data Commons (GDC). It investigates five machine learning approaches, builds a neural network model using the Keras Sequential API, and applies feature selection strategies to improve model interpretability, with the goal of providing insights into early cancer detection and risk assessment. Despite inherent limitations such as variable data quality and computational constraints, the study aims to rigorously examine ML methodologies in cancer prediction, with implications for future research and practical applications. Using Python in a Jupyter notebook, miRNA expression data were gathered from the GDC via its API, and data quality was ensured through preprocessing techniques such as cleaning and normalization. Feature selection based on mutual information scores was then applied to enhance model interpretability and performance. Subsequently, five machine learning methods (k-nearest neighbors, random forest, logistic regression, support vector machine, and gradient boosting) were employed for cancer type classification, alongside a neural network model built with the Keras Sequential API for multi-class cancer classification. Accuracy, precision, recall, and F1-score were computed to assess each model's discriminative capability. The analysis revealed diverse gene expression profiles across different cancer types.
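As a rough illustration of what scoring features by mutual information involves, the sketch below ranks a few discrete (binned) features by their mutual information with a class label and keeps the top k. The abstract does not state which estimator or library routine the report used; for continuous expression values, scikit-learn's `mutual_info_classif` is one commonly used option. The function names, feature names, class labels, and toy data here are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of each (feature value, label) pair
    px = Counter(feature)                  # marginal counts of feature values
    py = Counter(labels)                   # marginal counts of labels
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)), written in terms of raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest mutual information."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: binned expression levels for three made-up features.
labels = ["type_a", "type_a", "type_b", "type_b", "type_c", "type_c"]
columns = {
    "mir_a": ["hi", "hi", "lo", "lo", "mid", "mid"],  # determines the label exactly
    "mir_b": ["hi", "lo", "hi", "lo", "hi", "lo"],    # independent of the label
    "mir_c": ["hi", "hi", "lo", "lo", "lo", "lo"],    # separates only type_a
}

selected = top_k_features(columns, labels, k=2)
print(selected)
```

In this toy case the feature that is independent of the label scores zero and is dropped, while the two informative features are retained.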
Preprocessing and feature selection resulted in a dataset of 2,507 samples and 1,881 features, addressing class imbalances. The support vector machine achieved notable performance, with high accuracy (99.04%), precision (96%), recall (98%), and F1-score (98.98%) across all cancer types. Random forest reached precision scores of 98% and an F1-score of 98.98%. Logistic regression displayed robust performance across metrics, while gradient boosting showed strong accuracy and precision. K-nearest neighbors exhibited moderate accuracy and precision. The neural network performed consistently well across metrics, with rapid convergence and high accuracy on unseen data. Confusion matrices and ROC curves validated accurate predictions, highlighting the potential of machine learning in precise cancer classification and early intervention. In conclusion, the five classifiers distinguished between cancer types robustly and with minimal false positives. The support vector machine stood out with an accuracy score of 0.9904, reaffirming its efficacy in cancer classification tasks.
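For readers unfamiliar with how the reported metrics relate to a confusion matrix, the following minimal sketch derives accuracy and macro-averaged precision, recall, and F1-score from a multi-class confusion matrix. The abstract does not say how the report averaged per-class scores, so macro averaging is an assumption here, and the counts are invented for illustration rather than taken from the report.

```python
def classification_metrics(confusion):
    """Accuracy and macro-averaged precision, recall, and F1 from a square matrix.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n_classes))  # column sum
        actual = sum(confusion[c])                                  # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        # Per-class F1 is the harmonic mean of precision and recall.
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": correct / total,
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1s) / n_classes,
    }


# Hypothetical three-class confusion matrix (rows: true class, columns: predicted).
confusion = [
    [48, 2, 0],
    [1, 47, 2],
    [0, 1, 49],
]

metrics = classification_metrics(confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because each per-class F1 is a harmonic mean, it always lies between that class's precision and recall.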

Keywords

Cancer prediction, Machine learning, Clinical data

Graduation Month

May

Degree

Master of Science

Department

Department of Chemical Engineering

Major Professor

Davood B. Pourkargar

Type

Report

URI

https://hdl.handle.net/2097/44300

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -
