7 Minute Read

Published Dec 2022



Before we start, we should create a train, test, and validation set.

The training set is used to fit the model to the data: the model's parameters are optimized by minimizing the loss calculated on these examples, and hyperparameter choices are guided by how the fitted model then performs on the validation set. The training set is typically the largest of the three, often around 60-80% of the data.

The validation set is a subset of the data that is used to evaluate the model's performance during training. It is used to tune the model's hyperparameters and to select the best model among several alternatives. The validation set is usually smaller than the training set, and it is used to provide an estimate of the model's performance on unseen data.
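To make "select the best model" concrete, here is a minimal sketch of validation-based model selection. The synthetic data, the two candidate models, and the names (`candidates`, `best`) are my own assumptions for illustration, not from this post:

```python
# Hypothetical sketch: pick between two candidate models using validation
# accuracy, keeping the test set untouched for the final evaluation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(69)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

X_trn, y_trn = X[:300], y[:300]          # fit on training data only
X_val, y_val = X[300:400], y[300:400]    # compare candidates on validation data

candidates = {
    "logreg": LogisticRegression(),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=69),
}
scores = {
    name: accuracy_score(y_val, model.fit(X_trn, y_trn).predict(X_val))
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)  # chosen on validation, not test, performance
```

Only after this choice is made should the winning model be scored once on the held-out test set.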

The test set is a subset of the data that is used to evaluate the model's performance after training is complete. It is used to assess the generalization ability of the model, that is, its ability to make predictions on new, unseen data, and it provides an unbiased estimate of the model's performance. It is usually comparable in size to the validation set. For temporal data, the test set should be chosen based on time so that it mimics predicting the future, not sampled at random.
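The time-based point deserves a sketch: for temporal data you hold out the most recent rows rather than a random sample. The `date` and `value` column names below are assumptions for illustration:

```python
# Hypothetical sketch: a time-ordered test split, where the last 20% of the
# timeline (the most recent rows) forms the test set.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=10, freq="D"),
    "value": range(10),
})

df = df.sort_values("date")          # ensure chronological order
cutoff = int(len(df) * 0.8)          # first 80% for training
train_df, test_df = df.iloc[:cutoff], df.iloc[cutoff:]
```

Every row in the test set now comes strictly after every training row, so the evaluation reflects predicting forward in time.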

It is important to keep the training, validation, and test sets separate and distinct, as using the same data for multiple purposes can lead to overfitting, where the model is too closely tailored to the training data and does not generalize well to new data. Splitting the data into separate sets allows for an objective evaluation of the model's performance.

# create validation and test training sets
from sklearn.model_selection import train_test_split

# random_state makes the splits reproducible (random.seed alone does not
# affect scikit-learn's NumPy-based shuffling)
large_df, test_df = train_test_split(df, test_size=0.2, random_state=69)
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split
trn_df, val_df = train_test_split(large_df, test_size=0.25, random_state=69)

# we don't need the id column as it adds nothing to the dataset for training
trn_df = trn_df.drop('id', axis=1)
val_df = val_df.drop('id', axis=1)

print(trn_df.shape, val_df.shape, test_df.shape)

OUT: (3066, 11) (1022, 11) (1022, 12)

OneR

Tree

Random forest

XGBoost

Cutting edge