Abstract
Heart disease is a major cause of death worldwide, and early prediction is vital for prevention and treatment. This project applies machine learning methods to the Framingham Heart Study dataset for the early prediction of Coronary Heart Disease (CHD). The dataset is highly imbalanced, with only 16% of cases being CHD, which degrades model performance. To overcome this, the data augmentation techniques SMOTE (Synthetic Minority Over-sampling Technique) and cGAN (conditional generative adversarial network) are applied to create synthetic CHD cases. Four machine learning algorithms are compared: Random Forest, XGBoost, SVM, and MLP. XGBoost achieves the highest AUC-ROC, 0.973, with cGAN-augmented data, and cGAN augmentation significantly improves recall and overall model performance. This study demonstrates the potential of combining machine learning with data augmentation to improve CHD prediction.
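To make the augmentation step concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python. It is an illustration under stated assumptions, not the study's implementation: the function name `smote`, its parameters, and the toy feature values are hypothetical, and the base sample is chosen at random rather than by iterating over every minority sample as in the original algorithm.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified SMOTE-style sketch (hypothetical helper, not a library API).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point,
        # excluding the base point itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data only: two made-up features per record, not Framingham values.
chd_cases = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_cases = smote(chd_cases, n_new=6, k=2)
```

Each synthetic record lies on a line segment between two real minority records, so the added CHD cases stay within the region of feature space already occupied by observed cases. A cGAN, by contrast, learns a generative model conditioned on the class label and is not sketched here.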