<aside> 🖥️

https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

Data files © Original Authors

Electronic Health Records (EHRs) are the primary source of data for the Diabetes Prediction dataset. EHRs are digital versions of patient health records that contain information about their medical history, diagnosis, treatment, and outcomes. The data in EHRs is collected and stored by healthcare providers, such as hospitals and clinics, as part of their routine clinical practice.

To create the Diabetes Prediction dataset, EHRs were collected from multiple healthcare providers and aggregated into a single dataset. The data was then cleaned and preprocessed to ensure consistency and remove any irrelevant or incomplete information.

The use of EHRs as a data source for the Diabetes Prediction dataset has several advantages. First, EHRs contain a large amount of patient data, including demographic and clinical information, which can be used to develop accurate machine learning models. Second, EHRs provide a longitudinal view of a patient's health over time, which can be used to identify patterns and trends in their health status. Finally, EHRs are widely used in clinical practice, making the Diabetes Prediction dataset relevant to real-world healthcare settings.

The collection methodology for the Diabetes Prediction dataset involves gathering medical and demographic data from patients who have been diagnosed with diabetes or are at risk of developing it. The data is typically collected through surveys, medical records, and laboratory tests, and includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. It is then processed and cleaned to remove errors and inconsistencies. Beyond building prediction models, the dataset can be used in research to identify potential risk factors for diabetes and to develop prevention and treatment strategies.

</aside>
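
The description above says the data was cleaned to remove incomplete or inconsistent records, but it does not say how. As a rough illustration only, the sketch below assumes that cleaning means dropping rows with empty fields and exact duplicate rows. The helper name `clean_rows` and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io


def clean_rows(rows):
    """Drop rows with empty fields and exact duplicate rows, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        values = tuple(row.values())
        # Assumption: a row is "incomplete" if any field is missing or blank.
        if any(v is None or str(v).strip() == "" for v in values):
            continue
        # Assumption: a repeated identical row is a duplicate record.
        if values in seen:
            continue
        seen.add(values)
        cleaned.append(row)
    return cleaned


# Made-up rows for illustration only; they are not real dataset records.
SAMPLE_CSV = """gender,age,bmi,diabetes
Female,54,27.3,0
Female,54,27.3,0
Male,,31.0,1
Male,40,24.1,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_rows(rows)
print(len(rows), "->", len(cleaned))  # prints: 4 -> 2
```

With pandas, `DataFrame.dropna()` followed by `DataFrame.drop_duplicates()` would express the same two steps. Whether dropping duplicates is appropriate depends on whether identical rows can be distinct patients, which the source does not state.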

| Column name | Explanation |
| --- | --- |
| gender | Biological sex of the patient (e.g., male, female, or other categories) |
| age | Age of the patient in years |
| hypertension | Indicator of whether the patient has high blood pressure (0 = no, 1 = yes) |
| heart_disease | Indicator of whether the patient has heart disease (0 = no, 1 = yes) |
| smoking_history | Categorical variable describing smoking habits (never, current, former, etc.) |
| bmi | Body Mass Index, a measure of body weight relative to height (kg/m²) |
| HbA1c_level | Hemoglobin A1c, which shows the average blood sugar level over the last 2–3 months. Possible leakage, since it is used to diagnose diabetes |
| blood_glucose_level | Blood glucose measured at a single point in time. Possible leakage, since it directly reflects a diabetes diagnosis |
| diabetes | Target column indicating diabetes status (0 = non-diabetic, 1 = diabetic) |
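
The leakage notes in the column list suggest comparing a model trained on all features with one trained without the two flagged columns. A minimal sketch of that split is below. The column names come from the list above, while the helper `select_features` and the constant names are hypothetical.

```python
from typing import List

# Column names as listed in the data dictionary above.
TARGET = "diabetes"
ALL_FEATURES = [
    "gender", "age", "hypertension", "heart_disease",
    "smoking_history", "bmi", "HbA1c_level", "blood_glucose_level",
]
# Columns flagged above as possible leakage; this grouping reflects the
# notes in this document, not a field provided by the dataset itself.
LEAKAGE_SUSPECTS = ["HbA1c_level", "blood_glucose_level"]


def select_features(include_suspects: bool) -> List[str]:
    """Return feature names, optionally dropping the leakage-suspect columns."""
    if include_suspects:
        return list(ALL_FEATURES)
    return [c for c in ALL_FEATURES if c not in LEAKAGE_SUSPECTS]


full_features = select_features(include_suspects=True)
safe_features = select_features(include_suspects=False)
print(len(full_features), len(safe_features))  # prints: 8 6
```

If performance drops sharply when only `safe_features` are used, that would be consistent with the flagged columns carrying most of the signal. It would not by itself prove leakage.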