Understanding model performance

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2de9e79c-86cf-4a20-b914-cd0f003c27a8/Screen_Shot_2021-01-27_at_8.55.28_pm.png

When measuring model performance, we want sensible metrics that indicate how well the model is performing. These should capture the model's ability to predict each of the training classes and give us insight into any biases that may exist in the model. With clear metrics in hand, we can get a good idea of how the model will perform when deployed to production, which ultimately drives the performance of the product and its impact in the market.

Evaluation Data

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b2a0261e-6108-49d8-8efa-f67bd7d4076b/Screen_Shot_2021-01-27_at_8.57.55_pm.png

In order to measure a model's performance, we use data in much the same way we use data to train the model. From the total set of labelled data, we generally use 80% for training and the remaining 20% for testing the model. It is important to note that the training data and test data should both be as balanced as possible. Similar to the issues we saw when training the model, unbalanced test data may skew our perspective of the model's performance. For example, if the test set is made up mostly of a single class, we will not have robust enough test cases to determine whether the model performs adequately across all classes.
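As a rough sketch of this 80/20 split, assuming a scikit-learn style workflow and hypothetical images and labels arrays (neither is defined in the original material), a stratified split keeps both portions balanced:

```python
# A minimal sketch of the 80/20 split, assuming scikit-learn and
# hypothetical `images` and `labels` arrays.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    images,
    labels,
    test_size=0.2,     # hold out 20% of the labelled data for testing
    stratify=labels,   # keep the class proportions balanced in both splits
    random_state=42,   # make the split reproducible
)
```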

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e7a1ee15-5806-4226-8487-3b6e67a8a824/Screen_Shot_2021-01-27_at_8.59.15_pm.png

In addition to test data, validation data is sometimes added as another split. The validation data is used during training to help inform updates to the model's parameters. It is different from the test data, because the test data is never seen by the model until after training is complete, simulating a new set of data the model has never seen. This way we can evaluate how the model would perform on data that is completely separate from the data used during training.
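Continuing the sketch above, one common approach (again assuming scikit-learn) is to carve the validation split out of the training portion, so the test set stays untouched until training is finished:

```python
# Split the 80% training portion again to obtain a validation set,
# leaving the held-out test set unseen until training is complete.
X_train, X_val, y_train, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.25,    # 25% of the 80% training data, i.e. 20% of the total
    stratify=y_train,  # keep the validation split balanced as well
    random_state=42,
)
```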

The way we evaluate these models is to start with our labelled test data. We feed this data through the network, which produces a predicted label for each data point. By comparing the predicted labels with the actual labels, we start to get an idea of where the model is performing well and where it needs improvement.
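As a sketch of that comparison, assuming a trained model object with a predict method and the test split from above, a confusion matrix lays out the predicted labels against the actual ones:

```python
# Hypothetical sketch: run the trained model on the held-out test data
# and compare its predictions against the true labels.
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)   # `model` is assumed to be trained already

# Rows are the true classes, columns are the predicted classes;
# the diagonal holds correct predictions, everything else is a mistake.
print(confusion_matrix(y_test, y_pred))
```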

As a measure of performance, we could simply count the number of correct and incorrect predictions. However, we want to understand the performance of each class, which requires more than a simple count of correct versus incorrect. Two common lenses we use when evaluating a model are precision and recall. Both of these measure the model in a way that lets us understand how it performs for an individual class, as well as how it performs across classes.
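A minimal sketch of those per-class measurements, assuming the y_test and y_pred arrays from the previous snippet and illustrative class names, might look like this:

```python
# Per-class precision and recall; the class names here are illustrative.
from sklearn.metrics import precision_score, recall_score

classes = ["cat", "dog", "gerbil"]

precision = precision_score(y_test, y_pred, labels=classes, average=None)
recall = recall_score(y_test, y_pred, labels=classes, average=None)

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```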

Model Precision

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/a36e214d-37d5-4ce2-ba13-2942b1dea0b7/Screen_Shot_2021-01-27_at_9.02.21_pm.png

Looking at this graphic, we see the model's predictions inside the circle. The left half of the graphic represents the ground-truth positive data, and the right half represents the negative, or non-existent, data. To give an example, if we had three images of cats and only two of them were identified as cats, we would see the two correctly predicted cats on the left half of the inner circle, and the missed cat on the left half of the outer square, representing a false negative prediction. Conversely, if we misclassified a dog as a cat, that would fall on the right half of the inner circle as a false positive.

Model precision answers the question: when the model makes a prediction, how likely is that prediction to be correct? To calculate the precision, we take the number of true positives, i.e. the number of correct predictions, and divide it by the total number of predictions. This tells us what percentage of all the predictions were correct.

We start with the predicted labels for each of the data classes and calculate the precision for each class. For cats, we take the number of correctly predicted cats and divide it by the total number of cat predictions. In this case, we have one correctly predicted cat out of two cat predictions: one for the actual cat, and one for the misclassified gerbil. This gives us a precision of 0.5.
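The arithmetic from this example can be written out directly; the function below is just a sketch of the precision formula, true positives divided by total predictions:

```python
# Precision = true positives / total predictions for the class.
def precision(true_positives, total_predictions):
    return true_positives / total_predictions

# One correctly predicted cat out of two cat predictions (the real cat
# plus the misclassified gerbil) gives a precision of 0.5.
print(precision(true_positives=1, total_predictions=2))  # 0.5
```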