The single most significant data-quality risk in my project is class imbalance. In the diabetes dataset, only ~8.5% of rows are labeled “diabetic”, while the remaining ~91.5% are “non-diabetic”. This imbalance matters because a naive model could achieve roughly 91.5% accuracy simply by always predicting the majority class, while failing to detect the very cases that matter most. In a healthcare context, that failure translates into missed early warnings for at-risk patients, undermining the project’s primary goal of proactive intervention.
To mitigate this, I plan to apply a combination of class weighting and resampling techniques (e.g., SMOTE for oversampling minority cases). During model evaluation, I will go beyond accuracy and focus on metrics like recall, precision, F1-score, and ROC-AUC, ensuring the model captures the minority class effectively. I will also monitor subgroup performance (by age, BMI, gender) to avoid hidden biases.
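To make this concrete, here is a minimal sketch of the mitigation plan. It uses synthetic data as a stand-in for the diabetes table; the model choice and all parameters are illustrative. SMOTE runs inside an imbalanced-learn Pipeline so oversampling touches only the training data, and class weighting adds an extra penalty for missed minority cases:

```python
# Sketch: class weighting + SMOTE in a leakage-safe pipeline, evaluated
# with recall/precision/F1 and ROC-AUC rather than accuracy alone.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in mimicking the ~91.5% / ~8.5% split.
X, y = make_classification(n_samples=5000, weights=[0.915], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),          # oversamples training data only
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Per-class recall/precision/F1 plus ROC-AUC, not raw accuracy.
print(classification_report(y_test, pipe.predict(X_test)))
print("ROC-AUC:", roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))
```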
Finally, I will include threshold tuning as part of the deployment strategy: shifting the probability cutoff to favor higher recall in the diabetic class. This makes the system more sensitive to positive cases, which is critical in a preventative healthcare setting.
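Continuing from the pipeline sketch above, threshold tuning can look like the following. The 85% recall target is an illustrative choice, not a project requirement:

```python
# Sketch: lower the decision cutoff so recall on the diabetic class meets
# a target, at a precision cost we can inspect on the curve.
from sklearn.metrics import precision_recall_curve

probs = pipe.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

target_recall = 0.85
# recall[:-1] aligns element-wise with thresholds; keep the highest cutoff
# that still meets the recall target (fall back to 0.5 if none does).
valid = recall[:-1] >= target_recall
cutoff = thresholds[valid][-1] if valid.any() else 0.5
y_pred = (probs >= cutoff).astype(int)
```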
One of the most surprising EDA insights came from analyzing the BMI distribution. The visualization revealed extreme outliers (BMI > 70), values that are vanishingly rare in real populations and more likely to reflect data-entry errors. This led me to design a cleaning step to remove or cap unrealistic values, ensuring the model does not overfit to noise.
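A minimal sketch of that cleaning step is below; the column name "bmi" and the exact cutoffs (10 and 70) are assumptions for illustration, not clinical standards:

```python
import pandas as pd

def cap_bmi(df: pd.DataFrame, lower: float = 10.0, upper: float = 70.0) -> pd.DataFrame:
    """Winsorize implausible BMI values instead of dropping whole rows."""
    out = df.copy()
    out["bmi"] = out["bmi"].clip(lower=lower, upper=upper)
    return out
```

Capping rather than deleting keeps the rest of each row available for training, which matters when positive cases are already scarce.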
This insight directly influenced my feature engineering: I derived BMI categories (underweight, healthy, overweight, obese) and an age×BMI interaction term. These engineered features make risk patterns more interpretable and clinically relevant.
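A sketch of those engineered features follows. The WHO-style BMI bins and the column names ("age", "bmi") are assumptions about this dataset's schema:

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Clinically familiar BMI buckets.
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "healthy", "overweight", "obese"],
    )
    # Interaction term capturing compounding risk of age and BMI together.
    out["age_x_bmi"] = out["age"] * out["bmi"]
    return out
```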
An external resource that supported my learning was a tutorial on imbalanced classification strategies. It explained practical ways to evaluate recall and precision in health data, reinforcing why accuracy alone is insufficient. This shaped how I approached both EDA and model validation.
The engineered feature that added the greatest predictive value was the age×BMI interaction. It captured compounding risk that neither feature showed in isolation, significantly improving model discrimination. The most surprising transformation was simplifying smoking history into just four categories (never, past, current, unknown). Despite the simplification, it retained signal and reduced noise.
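The smoking-history collapse can be expressed as a simple mapping, assuming df is the cleaned DataFrame. The raw labels on the left are an assumption about how the source column is coded; anything unmapped falls back to "unknown":

```python
# Hypothetical raw labels -> four simplified buckets.
SMOKING_MAP = {
    "never": "never",
    "former": "past",
    "ever": "past",
    "not current": "past",
    "current": "current",
    "No Info": "unknown",
}
df["smoking_simple"] = df["smoking_history"].map(SMOKING_MAP).fillna("unknown")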
The longest debugging step involved aligning categorical encoding across train/test splits to avoid unseen-category errors. Fitting OneHotEncoder(handle_unknown="ignore") on the training split alone, then reusing that fitted encoder on the test split, solved this.
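A sketch of that fix, assuming scikit-learn >= 1.2 (for sparse_output) and illustrative column and DataFrame names:

```python
from sklearn.preprocessing import OneHotEncoder

cat_cols = ["smoking_simple", "bmi_category"]  # illustrative column names
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Fit on train only; categories seen only at test time become all-zero rows
# instead of raising an error.
X_train_cat = enc.fit_transform(train_df[cat_cols])
X_test_cat = enc.transform(test_df[cat_cols])
```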
These feature engineering choices enhance interpretability: clinicians can see not only raw features but also clinically intuitive groupings (e.g., BMI categories). However, interaction terms may be less transparent without explanation tools (e.g., SHAP).
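A minimal SHAP sketch for explaining the interaction term, assuming a fitted tree-based model (model) and the engineered feature matrix X_test as a DataFrame:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# The summary plot shows how features such as age_x_bmi push individual
# predictions up or down, making the interaction term inspectable.
shap.summary_plot(shap_values, X_test)
```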
Next time, I would invest earlier in feature importance analysis to drop low-value engineered features sooner. This would streamline the pipeline and reduce debugging overhead.
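A sketch of the importance pass I would run earlier, reusing the fitted pipeline from above; feature_names is assumed to list columns in the same order as X_test:

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    pipe, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)
ranked = sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name}: {score:+.4f}")  # near-zero or negative scores flag drop candidates
```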