<aside> 🖥️
https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
Data files © Original Authors
Electronic Health Records (EHRs) are the primary source of data for the Diabetes Prediction dataset. EHRs are digital versions of patient health records that contain information about their medical history, diagnosis, treatment, and outcomes. The data in EHRs is collected and stored by healthcare providers, such as hospitals and clinics, as part of their routine clinical practice.
To create the Diabetes Prediction dataset, EHRs were collected from multiple healthcare providers and aggregated into a single dataset. The data was then cleaned and preprocessed to ensure consistency and remove any irrelevant or incomplete information.
The use of EHRs as a data source for the Diabetes Prediction dataset has several advantages. First, EHRs contain a large amount of patient data, including demographic and clinical information, which can be used to develop accurate machine learning models. Second, EHRs provide a longitudinal view of a patient's health over time, which can be used to identify patterns and trends in their health status. Finally, EHRs are widely used in clinical practice, making the Diabetes Prediction dataset relevant to real-world healthcare settings.
The collection methodology for the diabetes prediction dataset involves gathering medical and demographic data from patients who have been diagnosed with diabetes or are at risk of developing it. The data is typically collected through surveys, medical records, and laboratory tests. It includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The data is then processed and cleaned to remove errors and inconsistencies. Beyond building prediction models, the dataset can be used in research to identify potential risk factors for diabetes and to develop prevention and treatment strategies.
</aside>
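
The description above says the data was cleaned to remove incomplete or inconsistent records, but it does not say how. As a rough illustration only, the sketch below shows one way such a check could look using the Python standard library. The function name `load_clean_rows` and the rule "drop any row with a missing or badly typed value" are assumptions made for this example, not the dataset authors' documented procedure. Column names follow the column table on this page.

```python
import csv

# Column groups taken from the column table on this page.
NUMERIC_COLUMNS = ("age", "bmi", "HbA1c_level", "blood_glucose_level")
BINARY_COLUMNS = ("hypertension", "heart_disease", "diabetes")
TEXT_COLUMNS = ("gender", "smoking_history")


def load_clean_rows(lines):
    """Parse CSV lines and keep only complete, well-typed rows.

    Hypothetical helper: returns (kept_rows, dropped_count).
    `lines` is any iterable of CSV lines, e.g. an open file.
    """
    kept, dropped = [], 0
    for raw in csv.DictReader(lines):
        try:
            # Categorical columns must be present and non-empty.
            row = {col: raw[col].strip() for col in TEXT_COLUMNS}
            if any(value == "" for value in row.values()):
                raise ValueError("missing categorical value")
            # Numeric columns must parse as floats.
            for col in NUMERIC_COLUMNS:
                row[col] = float(raw[col])
            # Indicator columns must be exactly 0 or 1.
            for col in BINARY_COLUMNS:
                value = float(raw[col])
                if value not in (0.0, 1.0):
                    raise ValueError(f"{col} must be 0 or 1")
                row[col] = int(value)
        except (KeyError, TypeError, ValueError, AttributeError):
            # Missing column, short row, or unparseable value: drop the row.
            dropped += 1
            continue
        kept.append(row)
    return kept, dropped


# Usage (the file name is an assumption about the downloaded CSV):
# with open("diabetes_prediction_dataset.csv", newline="") as handle:
#     rows, dropped = load_clean_rows(handle)
```

Dropping rows is the simplest policy; imputing missing values would be an alternative if too many rows are lost.
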
| Column name | Explanation |
|---|---|
| gender | Biological sex of the patient (e.g., male, female, or other categories) |
| age | Age of the patient in years |
| hypertension | Indicator if the patient has high blood pressure (0 = no, 1 = yes) |
| heart_disease | Indicator if the patient has heart disease (0 = no, 1 = yes) |
| smoking_history | Categorical variable describing smoking habits (never, current, former, etc.) |
| bmi | Body Mass Index, a measure of body weight relative to height (kg/m²) |
| HbA1c_level | Hemoglobin A1c level, reflecting the average blood sugar over the last 2–3 months → possible target leakage, since HbA1c is itself a diagnostic criterion for diabetes |
| blood_glucose_level | Blood glucose measured at a single point in time → possible target leakage, since blood glucose is also a diagnostic criterion for diabetes |
| diabetes | Target column: diabetes status (0 = non-diabetic, 1 = diabetic) |
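
The leakage concern noted for `HbA1c_level` and `blood_glucose_level` can be probed with a single-feature threshold rule: if a rule such as `HbA1c_level >= 6.5` already reproduces the target almost perfectly, a model trained on that column may be relearning the diagnostic criterion instead of predicting risk. The sketch below is a minimal, standard-library version of that check. The function name `threshold_rule_report` is hypothetical, and the 6.5 % cut-off is the commonly cited clinical HbA1c threshold. Whether this dataset's labels were assigned that way is an assumption to test, not a documented fact.

```python
def threshold_rule_report(rows, feature, threshold, target="diabetes"):
    """Score the rule `row[feature] >= threshold` against the target column.

    `rows` is a list of dicts with numeric `feature` values and a 0/1 target.
    Returns accuracy, precision, and recall of the single-feature rule.
    """
    tp = fp = fn = tn = 0
    for row in rows:
        predicted = row[feature] >= threshold
        actual = row[target] == 1
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy rows made up for illustration; they are not taken from the dataset.
toy_rows = [
    {"HbA1c_level": 7.0, "diabetes": 1},
    {"HbA1c_level": 6.6, "diabetes": 0},
    {"HbA1c_level": 5.0, "diabetes": 0},
    {"HbA1c_level": 6.0, "diabetes": 1},
]
report = threshold_rule_report(toy_rows, "HbA1c_level", 6.5)
```

On the real data, scores close to 1.0 for either column would support treating it as leaky. In that case it is worth comparing model performance with and without the column before drawing conclusions.
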