A Machine Learning-Based Classification Framework for Aiding the Prediction of Colon Cancer
Abstract
Cancer remains one of the leading causes of death globally, with a diverse array of types and underlying genetic factors. Among these, colorectal cancer (CRC) is the third most commonly diagnosed cancer, making it a critical public health concern. The complexity of colorectal cancer necessitates early screening and effective detection methods to improve survival rates and reduce mortality. Health practitioners advocate regular screenings, such as colonoscopies, because these procedures can identify precancerous lesions and early-stage cancers, ultimately improving treatment options and outcomes. However, the challenges of accurately diagnosing colorectal cancer in a timely manner persist, underscoring the need for innovative approaches in medical diagnostics. Machine learning, a subset of artificial intelligence, is emerging as a transformative tool in healthcare, particularly in disease classification and prediction tasks. It leverages algorithms that learn from data, facilitating the classification of patients into health categories based on a multitude of factors. In addition, machine learning techniques enable healthcare professionals to distinguish between patients diagnosed with CRC and those who are healthy, significantly improving diagnostic accuracy. This study investigates the application of machine learning to enhance the classification of colorectal cancer patients. Specifically, three classification approaches were used to categorize patients into colorectal and healthy (non-colorectal) classes: decision trees, support vector machines, and random forests.
These techniques were combined in an ensemble model using bagging (bootstrap aggregating), which improves predictive accuracy by aggregating the outputs of multiple models, thereby reducing overfitting and improving the generalizability of the findings. Data were sourced from clinicaltrials.gov (identifier NCT030322874) and comprised a cohort of 362 patients. The data were derived from a study conducted from January 2014 to July 2016 across three hospitals in Nigeria: Obafemi Awolowo University Teaching Hospital (OAUTHC), Ile-Ife, Osun State; University College Hospital (UCH); and University of Ilorin Teaching Hospital (UITH). Patients were comprehensively evaluated through a structured questionnaire and a colonoscopy, ensuring robust data collection that captures essential clinical and demographic information. In addition to clinical data, dietary information was obtained from the American Institute of Cancer Research, as diet plays a significant role in cancer risk and progression. Integrating dietary data with clinical assessments provides a more comprehensive understanding of the factors influencing colorectal cancer, facilitating better-targeted interventions. To ensure the reliability and validity of the results, the dataset underwent standardization, which involved converting raw data into percentages and calculating medians within defined ranges. This standardization enhances the comparability of results across the dataset. Overfitting was reduced by random sampling in Python, using a random seed to generate diverse samples from the dataset.
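The bagging step described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the decision stump is a simple stand-in for the decision tree, support vector machine, and random forest learners, the data are synthetic rather than the 362-patient cohort, and all function names (`make_toy_data`, `bootstrap_sample`, `fit_stump`, `fit_bagging`, `predict_bagging`) are hypothetical. It shows only how seeded random sampling with replacement yields diverse training sets whose models are then combined by majority vote.

```python
import random
from collections import Counter


def make_toy_data(n=120, seed=0):
    """Synthetic two-class data (assumption: stands in for real patient records)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)  # 0 = healthy, 1 = CRC (illustrative labels)
        x = (rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0))
        rows.append((x, y))
    return rows


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one bootstrap replicate."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a one-feature threshold classifier by minimizing training errors."""
    best = None
    for j in range(len(rows[0][0])):
        for threshold in sorted({x[j] for x, _ in rows}):
            for above in (0, 1):  # class predicted when x[j] > threshold
                errors = sum(
                    (above if x[j] > threshold else 1 - above) != y
                    for x, y in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, above)
    _, j, threshold, above = best
    return lambda x: above if x[j] > threshold else 1 - above


def fit_bagging(rows, n_estimators=25, seed=42):
    """Train one base model per bootstrap replicate, using a fixed random seed."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]


def predict_bagging(models, x):
    """Aggregate the base models' outputs by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = make_toy_data()
    models = fit_bagging(data)
    accuracy = sum(predict_bagging(models, x) == y for x, y in data) / len(data)
    print(f"training accuracy of the bagged ensemble: {accuracy:.2f}")
```

Fixing the seed makes the bootstrap replicates, and therefore the ensemble's predictions, reproducible across runs. An odd number of estimators avoids tied votes in the two-class case.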
Authors: V.I. Ainoko, O.Y. Ogunlola, A.O. Oronti, O.O. Abereowo, O.D. Alowolodu, B.K. Alese
Published in: International Conference for Internet Technology and Secured Transactions (ICITST-2024)
- Date of Conference: 4-6 November 2024
- DOI: 10.20533/ICITST.2024.0012
- ISBN: 978-1-913572-76-1
- Conference Location: St Anne’s College, Oxford University, UK