Lab 1. Diabetes Prediction lab

OIDD 2550, Professor Tambe

Learning Objectives for this lab

Use a familiar tool, Excel, to predict a target variable.
Start to develop a common understanding of the tradeoffs that arise when using machine learning models.
Understand the concept of error when using prediction models.

Data Context

The setting for the exercise is healthcare. Machine learning models are becoming widespread in healthcare diagnostics, and using machine learning for diabetes prediction is becoming fairly common. [1, 2]

This lab uses a widely used machine learning model, “logistic regression,” to predict outcomes from patient data. Given other indicators of patient health that may be readily available or easy to collect, the goal is to predict whether or not a patient has diabetes (or will have it soon).

We have provided an Excel spreadsheet with a logistic regression model already built in for the PIMA Diabetes data. For a large sample of patients, these data contain a variety of health indicators as well as whether or not each patient has diabetes, denoted by a 1 or a 0. If you would like to read more about the diabetes and health issues associated with the PIMA people (a community of Native Americans from Arizona and Northwestern Mexico), read here (optional).

Modeling

It is not essential that you understand ML concepts for this exercise; that comes later. This lab is meant to give you hands-on experience with some key concepts before we formally cover them in class.

Regression and prediction

It is important to know, however, that different types of models can be used to make predictions. You may have come across “regressions” in the context of fitting a line (or another shape) to a set of known data points. Once these models are fitted, they allow modelers to take new data and predict outcomes for unknown data points. In other words, we can use a set of known data points (x’s) with outcomes (y’s) to fit or “train” a regression model, and then use the fitted model to predict outcomes for data points (x’s) where we do not know the outcomes.
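The fit-then-predict idea above can be sketched in a few lines of Python. This is only an illustration, not part of the lab spreadsheet: the data points are made up, and the helper names `fit_line` and `predict` are hypothetical. It fits a straight line to known (x, y) pairs by ordinary least squares and then uses the fitted line on a new x whose outcome is unknown.

```python
# Made-up training data for illustration: known x's with known outcomes y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Use the fitted line to predict the outcome for a new x."""
    return slope * x + intercept


# "Train" on the known points, then predict for a point with unknown outcome.
slope, intercept = fit_line(xs, ys)
new_x = 6.0
print(f"Predicted y for x = {new_x}: {predict(new_x, slope, intercept):.2f}")
```

The same two steps, fitting on known outcomes and then predicting for new inputs, carry over to other model types; only the shape being fitted changes.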

In the diabetes context, we can use data on patient health indicators and known diabetes outcomes to find how those indicators can be combined to predict whether or not a patient is likely to have diabetes. We can then use this fitted model to predict whether new patients, for whom we do not know a diabetes diagnosis, might have diabetes.
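As a rough sketch of what this looks like for a yes/no outcome, the Python below fits a one-indicator logistic regression by plain gradient descent. Everything in it is an assumption made for illustration: the glucose values and outcomes are invented (they are not from the PIMA data), the function names are hypothetical, the 0.5 cutoff is an arbitrary choice, and the lab spreadsheet may fit its model differently.

```python
import math

# Invented training data (NOT from the PIMA dataset): one health indicator
# per patient and a known outcome, coded here as 1 = diabetes, 0 = no diabetes.
glucose = [85.0, 90.0, 100.0, 110.0, 130.0, 150.0, 160.0, 180.0]
has_diabetes = [0, 0, 0, 1, 0, 1, 1, 1]

# Standardize the indicator so gradient descent behaves well.
mean_g = sum(glucose) / len(glucose)
std_g = (sum((g - mean_g) ** 2 for g in glucose) / len(glucose)) ** 0.5


def scale(g):
    return (g - mean_g) / std_g


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, learning_rate=0.1, steps=5000):
    """Fit weight w and bias b by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train([scale(g) for g in glucose], has_diabetes)


def predict_probability(g):
    """Estimated probability of diabetes for a new patient's glucose value."""
    return sigmoid(w * scale(g) + b)


# Predict for new patients whose diagnosis is unknown.
for new_glucose in (80.0, 200.0):
    p = predict_probability(new_glucose)
    label = 1 if p >= 0.5 else 0  # 0.5 cutoff is an assumption, not a rule
    print(f"glucose = {new_glucose}: probability = {p:.2f}, predicted = {label}")
```

Note that the model outputs a probability rather than a hard yes/no; turning it into a prediction requires choosing a cutoff, and that choice is one source of the tradeoffs this lab explores.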
