Speaker
Description
This study presents a statistical and machine learning analysis of breast cancer data from 213 patients treated at the University of Calabar Teaching Hospital between January 2019 and August 2021. The dataset comprises key demographic and clinical variables, including age, menopause status, tumor size, invasive lymph nodes, metastasis, breast quadrant, personal/family history of breast disease, and diagnosis outcome (benign or malignant). The objective was to identify significant predictors of malignancy and evaluate the effectiveness of machine learning models in diagnostic classification.
Descriptive analysis showed more benign diagnoses (117) than malignant cases (90), with a peak age range of 45–55 years. Tumor size, lymph node involvement, and metastasis were highly right-skewed, indicating that most patients presented with early-stage characteristics. However, malignant tumors were typically larger and occurred in older women. Density plots and inferential tests (Chi-square, t-test, ANOVA) revealed statistically significant differences in age, tumor size, metastasis, and lymph node involvement between benign and malignant groups (p < 0.05). Menopause status was also significantly associated with tumor size, suggesting a possible influence of hormonal transitions on cancer development.
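As a rough illustration of the kind of Chi-square test mentioned above, the following minimal sketch computes a Pearson Chi-square test of independence for a 2x2 table. The function name `chi_square_2x2` and the counts are hypothetical and are not taken from the study's data; the sketch assumes no continuity correction and relies on the fact that, with one degree of freedom, the p-value can be obtained from the complementary error function.

```python
import math


def chi_square_2x2(table):
    """Pearson Chi-square test of independence for a 2x2 table.

    Illustrative helper (not from the study); no continuity correction.
    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the statistic is a squared standard normal,
    # so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only.
# Rows: a binary clinical feature (absent, present); columns: benign, malignant.
table = [[70, 30], [40, 60]]
chi2, p_value = chi_square_2x2(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.2e}")
```

With these made-up counts the p-value falls well below 0.05, which is the threshold the abstract uses to call a difference between benign and malignant groups significant.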
A Random Forest classifier was trained to predict malignancy based on the available features. The model achieved an accuracy exceeding 90% and a kappa statistic of 0.8454 (84.54%), indicating strong agreement between predicted and observed diagnoses beyond what chance alone would produce. Feature importance analysis identified tumor size, invasive nodes, metastasis, and age as the top contributors to predictive performance. While variables such as breast quadrant, menopause status, and family history had lower influence, they provided complementary diagnostic information.
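To make the two reported metrics concrete, here is a minimal sketch of how accuracy and Cohen's kappa can be computed from a confusion matrix. It assumes the reported kappa is Cohen's kappa, which the abstract does not state explicitly; the function name `accuracy_and_kappa` and the confusion matrix are hypothetical and do not reproduce the study's results.

```python
def accuracy_and_kappa(confusion):
    """Accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] = number of cases with true class i predicted as class j.
    Illustrative helper (not from the study).
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)

    # Observed agreement: fraction of cases on the diagonal (the accuracy)
    observed = sum(confusion[i][i] for i in range(k)) / n

    # Agreement expected by chance, from the row and column marginals
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix, for illustration only.
# Rows: true benign, true malignant; columns: predicted benign, predicted malignant.
confusion = [[45, 5], [5, 45]]
accuracy, kappa = accuracy_and_kappa(confusion)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than accuracy because it discounts the agreement a classifier would reach by chance given the class proportions, which is why it is a useful companion metric when benign and malignant cases are not equally frequent.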
This study demonstrates that a combination of classical statistical methods and ensemble machine learning can provide actionable insights for early breast cancer detection and risk stratification. The findings reaffirm the clinical value of tumor size and lymph node status, and underscore the potential of data-driven models to support diagnostic decisions. However, limitations include the presence of missing values and the dataset’s confinement to a single institution. Further work with larger, multicenter datasets is recommended to enhance generalizability and refine predictive accuracy.
The results contribute to the growing body of evidence supporting frameworks that integrate statistical and machine learning methods in oncological research, with implications for screening, prognosis, and personalized care pathways.
Keywords
Breast Cancer, Random Forest, Machine Learning, Predictive Modeling, Statistical Analysis
| Registration ID | OHS25-164 |
|---|---|
| Professional Status of the Speaker | PhD Student |
| Junior Scientist Status | Yes, I am a Junior Scientist. |
Author
Co-author